EECE476: Computer Architecture Lecture 19: Pipelining Reducing Control Hazard Penalty Chapter 6.6 The University of British Columbia EECE 476 © 2005 Guy Lemieux

Post on 20-Dec-2015


EECE476: Computer Architecture

Lecture 19: Pipelining
Reducing Control Hazard Penalty

Chapter 6.6

The University of British Columbia EECE 476 © 2005 Guy Lemieux

2

Reminders

• Midterm 1: Tomorrow!
  – 50 minutes, not open book, calculator OK
  – Based on your assignments + lecture material
  – Covers EVERYTHING up to end of week 5
  – No Verilog, Altera tools on midterm
  – Try Lecture 14 study problems
  – Try extra study problems on web

• Partner signup deadline: Friday!
  – MUST work in pairs
    • EMAIL ME IMMEDIATELY IF YOU’RE STILL ALONE
  – MUST email to project TA: 2 names, stud #s, emails
  – Late penalty: 5% of final grade

3

Directly Reducing Penalty of Control Hazards

• Control hazards demand solutions:
  – Stalling
  – Nullify
  – Branch delay slots
  – All of these negatively impact performance (in various ways)

• Alternative
  – Can we directly reduce the negative impact of control hazards?

• Yes!
  – Execute branch/jump instruction earlier in pipeline
  – Outcome known sooner
  – Fewer instructions enter pipeline after branch (before outcome known)
  – For BEQ, we must detect “equals” earlier

4

BEQ in “X” Stage Instead?

• Move logic gates from “M” stage into “X” stage
  – “AND” gate, PCSrc mux
  – These logic gates depend on results *after* ALU

• Benefit
  – Only 2 instructions follow BEQ into pipeline
  – Improves only BEQ-if-taken performance

• But…
  – ALU may be slowest part of pipeline
  – Causes longer delay path after ALU, “X” stage slower
  – Clock rate may be affected
  – This may negatively affect ALL instructions, be careful!
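The trade-off above can be checked with a rough calculation. The sketch below is illustrative only: the branch frequency, taken fraction, and clock periods are hypothetical numbers, not figures from the lecture. It assumes a base CPI of 1, that only taken branches pay a penalty, and that resolving BEQ in “M” lets 3 instructions follow it into the pipeline (one more than the 2 stated for “X”).

```python
def time_per_instruction(clock_ns, branch_frac, taken_frac, penalty_cycles):
    """Average time per instruction, assuming a base CPI of 1 and that
    only taken branches pay the penalty (wrong-path work is discarded)."""
    cpi = 1.0 + branch_frac * taken_frac * penalty_cycles
    return cpi * clock_ns

# Hypothetical workload: 20% branches, 60% of them taken.
BRANCH_FRAC, TAKEN_FRAC = 0.20, 0.60

# BEQ resolved in M: 3 following instructions, 10 ns clock (assumed).
t_m = time_per_instruction(10.0, BRANCH_FRAC, TAKEN_FRAC, 3)
# BEQ resolved in X: 2 following instructions, clock unchanged.
t_x_same = time_per_instruction(10.0, BRANCH_FRAC, TAKEN_FRAC, 2)
# BEQ resolved in X, but the extra gates after the ALU stretch the
# clock to 11 ns (assumed).
t_x_slow = time_per_instruction(11.0, BRANCH_FRAC, TAKEN_FRAC, 2)

print(f"M stage:            {t_m:.2f} ns")
print(f"X stage, 10 ns clk: {t_x_same:.2f} ns")
print(f"X stage, 11 ns clk: {t_x_slow:.2f} ns")
```

With these assumed numbers, moving BEQ into “X” helps if the clock is unchanged, but a 10% slower clock makes it slightly worse than leaving BEQ in “M”, because the slower clock is paid by every instruction.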

5

Moving BEQ into “X” Stage

Red shows extra signal delay after ALU

6

BEQ in “D” Stage Instead?

• Move logic gates from “M” stage into “D” stage
  – But we need ALU to compute (Rs – Rt) and Zero
  – How can we do this?

• Key benefit
  – Only 1 instruction follows “BEQ” into pipeline

• Notice
  – “EQ” can be computed efficiently in “D” stage: (Rs XOR Rt) == 32’b0
    • Simple bitwise XOR
    • “== 32’b0” is simple: wide 32-input NOR gate, single output
  – No need for subtraction
    • Simple logic, no carry chain
  – Move the “+ SgnExt(Imm16)” logic (branch target address) into “D” stage as well
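The equality trick above can be modelled in a few lines. This is a software sketch of the logic, not the course's hardware description: the function names are made up for illustration, and the target-address formula uses the standard MIPS convention of (PC + 4) plus the sign-extended immediate shifted left by 2, which the slide abbreviates as “+ SgnExt(Imm16)”.

```python
MASK32 = 0xFFFFFFFF

def eq_by_xor(rs, rt):
    """(Rs XOR Rt) == 0: bitwise XOR, then a wide NOR of the 32 result bits."""
    diff = (rs ^ rt) & MASK32
    return diff == 0

def eq_by_subtract(rs, rt):
    """ALU-style check: compute Rs - Rt (needs a carry chain), then test Zero."""
    return ((rs - rt) & MASK32) == 0

def sign_extend16(imm16):
    """Sign-extend a 16-bit immediate to a Python int."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def branch_target(pc, imm16):
    """Assumed MIPS convention: target = (PC + 4) + (SgnExt(Imm16) << 2)."""
    return (pc + 4 + (sign_extend16(imm16) << 2)) & MASK32

# Both equality checks agree; the XOR version just avoids the subtractor.
for a, b in [(5, 5), (5, 6), (0, MASK32), (MASK32, MASK32)]:
    assert eq_by_xor(a, b) == eq_by_subtract(a, b)

print(hex(branch_target(0x1000, 7)))  # BEQ with offset 7, fetched at 0x1000
```

Both checks give the same answer; the point of the XOR form is that each bit is compared independently, so no carry has to ripple across 32 bits before the result is known.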

7

Detecting “EQ” Earlier

8

Reducing Branch Penalty

• Reducing branch penalty
  – Compute (Rs == Rt) and target address in “D” stage
  – Reduces branch delay to 1 cycle
  – Works well

• But, this introduces a forwarding error!
  – Suppose

      ADD $1, $2, $3    previous instruction
      BEQ $1, $2, 7     RAW hazard: needs new $1, forwarding?
      NOP               delay slot

  – Dependency causes data hazard

• Solutions?
  – Option 1: Stall until writeback of dependent instruction
  – Option 2A: Forward as much as possible (stall 1 cycle for LW)
  – Option 2B: Forward a bit less (stall 2 cycles for LW, 1 cycle for others)
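The dependency in the ADD/BEQ example is the condition a Hazard Detection Unit would have to recognise. A minimal sketch of that check follows; the `Instr` record and `raw_hazard` helper are hypothetical names introduced here for illustration, not part of the course material.

```python
from collections import namedtuple

# Hypothetical instruction record: destination register (or None) and
# the registers the instruction reads.
Instr = namedtuple("Instr", ["op", "dest", "srcs"])

def raw_hazard(producer, consumer):
    """True if consumer reads a register that producer writes.
    Register $0 is skipped because it is hard-wired to zero in MIPS."""
    return producer.dest not in (None, 0) and producer.dest in consumer.srcs

add = Instr("ADD", dest=1, srcs=(2, 3))     # ADD $1, $2, $3
beq = Instr("BEQ", dest=None, srcs=(1, 2))  # BEQ $1, $2, 7
nop = Instr("NOP", dest=None, srcs=())      # delay slot

print(raw_hazard(add, beq))  # BEQ reads $1, which ADD writes
print(raw_hazard(add, nop))  # NOP reads nothing
```

Detecting the hazard is the same for all three options; they differ only in how many cycles BEQ waits in “D” once the hazard is found.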

9

Data Hazard with “BEQ”

• Example: Clock cycle 1

1  ADD $1,$2,$3    I D X M W
   BEQ $1,$2, 7    I D X M W
   NOP             I D X M W

[Figure: datapath stage labels I D X M W]

10

Data Hazard with “BEQ”

• Clock cycle 2

1  ADD $1,$2,$3    I D X M W
2  BEQ $1,$2, 7    I D X M W
   NOP             I D X M W

[Figure: datapath stage labels I D X M W]

11

Data Hazard with “BEQ”

• Clock cycle 3: normal forwarding into X doesn’t work, arrives too late!

1  ADD $1,$2,$3    I D X M W
2  BEQ $1,$2, 7    I D X M W
3  NOP             I D X M W

[Figure: datapath stage labels I D X M W]

12

Forwarding with “BEQ”

• Clock cycle 4, Option 2B: insert “bubble”, forward to D

1  ADD $1,$2,$3    I D X M W
2  ?               I D X ? ?
3  BEQ $1,$2, 7    I D D X M W
4  NOP             I D X M W

[Figure: datapath stage labels I D ? M W]

13

Cause of the Error?

• Moved “BEQ” execution from X to D
  – PROBLEM: pipeline currently only forwards data into X stage

• To resolve, we have two options:
  – Option 1: Low performance, add Hazard Detection Unit (HDU) condition
    • BEQ depends on earlier instruction
    • Stall >= 1 cycle until dependent instruction finishes writeback
    • If not told otherwise, assume this approach
  – Option 2: Higher performance, add Forward Detection Unit (FDU) and muxes
    • Option 2A: Forward data from ALU out, DataMem out, W result into D stage
      – Avoids most stalls (no HDU needed, except for LW case again).
      – Longer delay, will probably slow clock and affect all instructions.
    • Option 2B: Forward data from M, W stages into D stage; stall if the dependence is on the instruction in X
      – HDU must now stall 1 cycle when BEQ depends on immediately prior R-type instruction (2 cycles for an immediately prior LW).
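The stall counts for the three options can be tabulated with a small model. This is a sketch under stated assumptions, not the course's reference design: it assumes the register file is written in the first half of a cycle and read in the second half (so BEQ in “D” can read a value while its producer is in “W”), and that Option 2B forwards from the pipeline registers at the start of “M” and “W”. The `READY_STAGE` table and function name are invented for illustration.

```python
STAGES = ["I", "D", "X", "M", "W"]

# Stage the producer must have reached for BEQ (sitting in D) to use its
# result in that same cycle. The entries encode this sketch's assumptions.
READY_STAGE = {
    "1":  {"R": "W", "LW": "W"},  # wait for writeback
    "2A": {"R": "X", "LW": "M"},  # forward ALU out / DataMem out
    "2B": {"R": "M", "LW": "W"},  # forward from M and W stages only
}

def beq_stall_cycles(option, producer_kind, distance):
    """Cycles BEQ waits in D. distance=1 means the producer is the
    instruction immediately before BEQ."""
    ready = STAGES.index(READY_STAGE[option][producer_kind])
    # When BEQ first reaches D, a producer `distance` instructions ahead
    # is that many stages further down the pipeline.
    producer_now = STAGES.index("D") + distance
    return max(0, ready - producer_now)

for option in ("1", "2A", "2B"):
    for kind in ("R", "LW"):
        print(option, kind, beq_stall_cycles(option, kind, distance=1))
```

For an immediately preceding producer this reproduces the figures on the slides: Option 2A stalls only for LW (1 cycle), and Option 2B stalls 1 cycle for R-type and 2 cycles for LW. Under the split-cycle register file assumption, Option 1 stalls 2 cycles in both cases.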

14

Control Hazards Summary

• Branches/jumps cause interruptions to control flow
  – This affects the stream of instructions entering pipeline after the branch/jump

• These interruptions cause a utilization problem
  – We may fetch the wrong instruction(s) after branch/jump
    • Option 1: stall after every branch/jump
    • Option 2: nullify-if-branch-taken (small performance improvement)
    • Option 3: declare as a “delay slot”, always-execute (bad idea for future ISAs)
  – Default: assume MIPS behaviour (Option 3 with 1 delay slot)

• To reduce the utilization problem, move branch/jump to D stage
  – This introduces new data hazards if the branch depends upon recent instructions

• These hazards introduce a new forwarding problem
  – Branch/jump may depend on result of recent instruction(s)
    • Option 1: HDU forces stall until writeback (multiple cycles)
    • Option 2: minimal HDU stall, forward data when dependence can be resolved
      – 2A: more forwarding needed, stall only for LW, may slow down clock speed
      – 2B: less forwarding needed, stall for LW and R-type until forwarding can be used
  – Default: assume FORWARDING OPTION 1 for this course