-
Lecture 3 Slide 1 EECS 470
EECS 470
Lecture 3
Pipelining & Hazards I
Jon Beaumont
Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Mudge, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, University of Pennsylvania, and University of Wisconsin.
-
Lecture 3 Slide 2 EECS 470
Announcements
• Reminder: Lab #1 due Friday by end of lab (12:30 pm)
  - Get checked off during GSI/IA OH
• Verilog assignment #1 due Friday evening
  - Submit to the autograder by 11:59 pm
  - 25 submissions so far (30% of class)
• HW #1 due Thursday 2/4
  - Submit through Gradescope by 11:59 pm
• Adding more staff OH
  - Also experimenting with different formats to see what works best
  - Check the description on the Google calendar for each session for details
  - We'll try to come up with a consistent format to avoid confusion
-
Lecture 3 Slide 3 EECS 470
Project 1
-
Lecture 3 Slide 4 EECS 470
Project 1
-
Lecture 3 Slide 5 EECS 470
Last Time
• Quantifying performance
  - Latency vs. throughput
  - Different averaging techniques (arithmetic, harmonic, geometric)
• Power and Energy
-
Lecture 3 Slide 6 EECS 470
Today
• Baseline processor discussion
  - Review 5-stage pipeline from EECS 370
  - Introduce hazards
-
Lecture 3 Slide 7 EECS 470
Lingering Questions
• "Could you give a few more examples of computing applications where we care more about throughput??"
• Remember, you can submit lingering questions to cover next lecture at: https://bit.ly/3oSr5FD
Latency:
- Real-time systems (self-driving cars, drones, etc.)
- Web search
- Processing audio/video

Throughput:
- Scientific computing (e.g. simulations)
- Autograding a class's projects
- Training machine learning models
-
Lecture 3 Slide 8 EECS 470
Capacitive Power dissipation
Power ≈ ½·C·V²·A·f
  - Capacitance (C): function of wire length, transistor size
  - Supply voltage (V): has been dropping with successive fab generations
  - Clock frequency (f): increasing…
  - Activity factor (A): how often, on average, do wires switch?
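As a quick numeric sketch of the formula above (the capacitance and activity values here are made up for illustration):

```python
# Dynamic (switching) power: P = 1/2 * C * V^2 * A * f
def dynamic_power(c, v, a, f):
    """Capacitive switching power in watts."""
    return 0.5 * c * v**2 * a * f

# Hypothetical chip: 1 nF effective switched capacitance,
# 1.5 V supply, activity factor 0.1, 1 GHz clock
print(dynamic_power(1e-9, 1.5, 0.1, 1e9))  # ~0.11 W
```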
What uses power in a chip?
-
Lecture 3 Slide 9 EECS 470
Voltage Scaling
• Scenario: 80 W, 1 BIPS, 1.5 V, 1 GHz
• Cache optimization: IPC decreases by 10%, reduces power by 20%
  => Final processor: 900 MIPS, 64 W
• What if we just adjust frequency/voltage on the processor? How to reduce power by 20%?
  P ∝ CV²f, and f scales with V, so P ∝ V³
  => Drop voltage (and frequency) by 7%: 0.93 × 0.93 × 0.93 ≈ 0.8x
So for equal power (64 W):
  Cache optimization = 900 MIPS
  Simple voltage/frequency scaling = 930 MIPS
-
Lecture 3 Slide 10 EECS 470
Multicore: Solution to power-constrained design?

Power scales roughly cubically with frequency:
• Scale clock frequency to 80%
• Now add a second core
• Same power budget, but 1.6x performance!
• But: must parallelize the application. Remember Amdahl's Law!
[Chart: performance vs. power for one fast core vs. two slower cores]
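A back-of-the-envelope sketch of the multicore trade, assuming P ∝ f³ (voltage scales with frequency) and per-core performance ∝ f:

```python
# Fixed power budget: trade clock frequency for a second core.
f_scale = 0.8
power_per_core = f_scale ** 3     # 0.512x the original core's power
total_power = 2 * power_per_core  # ~1.02x: roughly the same budget
performance = 2 * f_scale         # 1.6x, if the app parallelizes perfectly
print(total_power, performance)
```

In practice Amdahl's Law caps the 1.6x: any serial fraction of the application runs on a single, now-slower core.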
-
Lecture 3 Slide 11 EECS 470
The Execution Core: Pipelining
-
Lecture 3 Slide 12 EECS 470
Outline for next several lectures: Understanding the Execution Core
High-level design feature -> actual microarchitecture example:
1. Pipelining: 370's 5-stage pipeline (review)
2. Dynamic scheduling: Scoreboard (CDC 6600)
3. Register renaming: Tomasulo's algorithm (IBM 360/91)
4. Precise interrupts with Reorder Buffer: P6, MIPS R10K
-
Lecture 3 Slide 13 EECS 470
Before there was pipelining…
Basic datapath: fetch, decode, execute
• Single-cycle control:
  + Low CPI (1)
  – Long clock period (to accommodate slowest instruction)
• Multi-cycle control:
  + Short clock period
  – High CPI
  + Potentially better overall latency if designed well (could it ever be worse?)

Single-cycle: insn0.fetch,dec,exec | insn1.fetch,dec,exec
Multi-cycle:  insn0.fetch | insn0.dec | insn0.exec | insn1.fetch | insn1.dec | insn1.exec
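The "could it ever be worse?" question can be answered with a quick calculation; the stage latencies below are hypothetical:

```python
# Hypothetical per-stage latencies in ns: fetch, decode, execute
stages = [6, 2, 9]
n_insns = 100

# Single-cycle: the clock period must cover the whole instruction
single_period = sum(stages)                        # 17 ns
single_time = n_insns * single_period              # 1700 ns

# Multi-cycle: the period covers only the slowest stage, but every
# instruction now takes one cycle per stage
multi_period = max(stages)                         # 9 ns
multi_time = n_insns * len(stages) * multi_period  # 2700 ns

print(single_time, multi_time)
```

Yes, it can be worse: when stages are unbalanced, every stage pays for the slowest one, so multi-cycle only wins if shorter instructions can skip stages or the stages are well balanced.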
-
Lecture 3 Slide 14 EECS 470
Speeding Up
Remember, three ways to speed up a process:
• Reduce number of tasks (possible?)
• Decrease latency of tasks (what would that include?)
• Parallelize

How do we parallelize this pipeline?

[Timing diagram repeated: single-cycle vs. multi-cycle execution of insn0 and insn1]
-
Lecture 3 Slide 15 EECS 470
Parallelize
Duplicate pipeline (superscalar)
• Effective, but expensive (~2x hardware overhead)
• Discuss more later in semester
• Or… pipeline!

[Diagram: duplicated fetch/decode/execute datapath running insn0–insn3, two instructions per stage]
-
Lecture 3 Slide 16 EECS 470
Pipelining
• Important performance technique: improves throughput at the expense of latency
  - Why does latency go up?
• Begin with multi-cycle design
  - When instruction advances from stage 1 to 2, allow the next instruction to enter stage 1
  - Each instruction still passes through all stages
  + But instructions enter and leave at a much faster rate
  + More instructions executed in parallel
• Not much hardware overhead (what needs to be added?)

[Timing diagram: multi-cycle vs. pipelined execution of insn0 and insn1]
-
Lecture 3 Slide 17 EECS 470
Pipeline Illustrated:
[Diagram: combinational logic of n gate delays, progressively divided by latches (L) into 2 and then 3 stages; each latch adds delay, and bandwidth grows from BW ≈ 1/n to BW ≈ 2/n to BW ≈ 3/n as stages are added]
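The bandwidth trend in this slide can be sketched numerically; the total logic delay G and latch overhead L below are hypothetical values:

```python
# Clock period for an n-stage pipeline: G/n of combinational logic
# plus one latch delay L per stage; bandwidth is its reciprocal.
def clock_period(G, L, n):
    return G / n + L

def bandwidth(G, L, n):
    return 1.0 / clock_period(G, L, n)

G, L = 30.0, 1.0  # hypothetical: 30 units of logic, 1 unit per latch
for n in (1, 2, 3, 10, 30):
    print(n, round(bandwidth(G, L, n), 3))
# Bandwidth grows roughly linearly for shallow pipelines, but latch
# overhead caps it below 1/L as n grows.
```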
-
Lecture 3 Slide 18 EECS 470
370 Processor Pipeline Review
[5-stage datapath: PC and +1 index the I-cache (Fetch); register file read (Decode); ALU (Execute); D-cache (Memory); write-back to the register file]

Tpipeline = Tbase / 5
-
Lecture 3 Slide 19 EECS 470
Stage 1: Fetch
Fetch an instruction from memory every cycle:
- Use PC to index memory
- Increment PC (assume no branches for now)

Write state to the pipeline register (IF/ID):
- The next stage will read this pipeline register
- Note that the pipeline register must be edge-triggered
-
20
[Stage 1 datapath: a MUX selects the next PC; the PC indexes the instruction memory/cache; the instruction bits and PC+1 are written to the IF/ID pipeline register, which feeds the rest of the pipelined datapath]
-
Lecture 3 Slide 21 EECS 470
Stage 2: Decode
Decode opcode bits:
- May set up control signals for later stages

Read input operands from the register file:
- Specified by the regA and regB fields of the instruction bits

Write state to the pipeline register (ID/EX):
- Opcode
- Register contents
- Offset & destination fields
- PC+1 (even though decode didn't use it)
-
22
[Stage 2 datapath: the IF/ID register's instruction bits index the register file (read ports regA and regB; write port takes data and destReg); the contents of regA and regB, the offset/destination fields, control signals, and PC+1 are written to the ID/EX pipeline register]
-
Lecture 3 Slide 23 EECS 470
Stage 3: Execute
Perform ALU operation:
- Input operands can be the contents of regA or regB, or the offset field of the instruction
- Branches: calculate PC+1+offset

Write state to the pipeline register (EX/Mem):
- ALU result, contents of regB, and PC+1+offset
- Instruction bits for opcode and destReg specifiers
-
24
[Stage 3 datapath: a MUX selects the ALU inputs from the ID/EX register (contents of regA/regB or the offset field); the ALU result, PC+1+offset, contents of regB, and control signals are written to the EX/Mem pipeline register]
-
Lecture 3 Slide 25 EECS 470
Stage 4: Memory Operation
Perform data cache access for memory ops:
- ALU result contains the address for ld and st
- Opcode bits control the memory R/W and enable signals

Write state to the pipeline register (Mem/WB):
- ALU result and MemData
- Instruction bits for opcode and destReg specifiers
-
26
[Stage 4 datapath: the EX/Mem register's ALU result addresses the data memory (en and R/W driven by control signals); memory read data, the ALU result, and control signals are written to the Mem/WB pipeline register; PC+1+offset and the MUX control for the PC input go back to the MUX before the PC in stage 1]
-
Lecture 3 Slide 27 EECS 470
Stage 5: Write back
Write result to the register file (if required):
- Write MemData to destReg for ld instructions
- Write ALU result to destReg for arithmetic instructions
- Opcode bits control the register write enable signal
-
28
[Stage 5 datapath: a MUX selects memory read data or the ALU result from the Mem/WB register and drives it back to the data input of the register file; a second MUX (instruction bits 0-2 vs. bits 16-18) selects the destination register specifier; control signals drive the register write enable]
-
Lecture 3 Slide 29 EECS 470
Sample Code (Simple)
Run the following code on a pipelined datapath:
  add  1 2 3   ; reg3 = reg1 + reg2
  nand 4 5 6   ; reg6 = ~(reg4 & reg5)
  lw   2 4 20  ; reg4 = Mem[reg2+20]
  add  2 5 5   ; reg5 = reg2 + reg5
  sw   3 7 10  ; Mem[reg3+10] = reg7
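As a sanity check on the cycle-by-cycle slides that follow, here is a minimal behavioral (non-pipelined) simulation of the sample program. The initial register values and Mem[29] = 99 are taken from those slides:

```python
# Behavioral simulation: architectural result only, no pipeline timing.
regs = {0: 0, 1: 36, 2: 9, 3: 12, 4: 18, 5: 7, 6: 41, 7: 22}
mem = {29: 99}

regs[3] = regs[1] + regs[2]     # add  1 2 3  -> R3 = 36 + 9 = 45
regs[6] = ~(regs[4] & regs[5])  # nand 4 5 6  -> R6 = ~(18 & 7) = -3
regs[4] = mem[regs[2] + 20]     # lw   2 4 20 -> R4 = Mem[29] = 99
regs[5] = regs[2] + regs[5]     # add  2 5 5  -> R5 = 9 + 7 = 16
mem[regs[3] + 10] = regs[7]     # sw   3 7 10 -> Mem[55] = 22

print(regs[3], regs[4], regs[5], regs[6], mem[55])  # 45 99 16 -3 22
```

Note this code has no data hazards, so the pipelined datapath reaches the same final state; the slides ahead show it doing so cycle by cycle.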
-
30
[Full pipelined datapath: PC and +1 feed the instruction memory into IF/ID (instruction, PC+1); register file (regA, regB read ports; data/dest write port) into ID/EX (op, dest, offset, valA, valB, PC+1); ALU and operand MUXes into EX/Mem (op, dest, valB, ALU result, target, eq?); data memory into Mem/WB (op, dest, ALU result, mdata); write-back MUXes select the result and the destination specifier (bits 0-2 vs. bits 16-18)]
-
31
Initial State

[Datapath snapshot at time 0: PC = 0; all pipeline registers hold noops/zeros; register file holds R0=0, R1=36, R2=9, R3=12, R4=18, R5=7, R6=41, R7=22]
-
32
Time: 1
Fetch: add 1 2 3

[Datapath snapshot: add 1 2 3 enters IF/ID; PC advances to 1; all later pipeline registers still hold noops; register file unchanged]
-
33
Time: 2
Fetch: nand 4 5 6

[Datapath snapshot: nand 4 5 6 enters IF/ID; add 1 2 3 decodes, reading R1=36 and R2=9 into ID/EX (op=add, dest=3); later pipeline registers still hold noops]
-
34
Time: 3
Fetch: lw 2 4 20

[Datapath snapshot: lw 2 4 20 enters IF/ID; nand 4 5 6 decodes (reads R4=18, R5=7); add 1 2 3 executes: ALU computes 36+9=45 into EX/Mem (dest=3)]
-
35
Time: 4
Fetch: add 2 5 5

[Datapath snapshot: add 2 5 5 enters IF/ID; lw 2 4 20 decodes (reads R2=9); nand executes: ~(18 & 7) = -3 into EX/Mem (dest=6); add's result 45 moves to Mem/WB (dest=3)]
-
36
Time: 5
Fetch: sw 3 7 10

[Datapath snapshot: sw 3 7 10 enters IF/ID; add 2 5 5 decodes (reads R2=9, R5=7); lw executes: address 9+20=29; nand's -3 moves to Mem/WB (dest=6); the first add writes back R3=45]
-
37
Time: 6
No more instructions

[Datapath snapshot: sw decodes (reads R3=45, R7=22); add 2 5 5 executes: 9+7=16; lw accesses memory: Mem[29]=99; nand writes back R6=-3]
-
38
Time: 7
No more instructions

[Datapath snapshot: sw executes: address 45+10=55; add's result 16 moves to Mem/WB (dest=5); lw writes back R4=99]
-
39
Time: 8
No more instructions

[Datapath snapshot: sw is in Memory, writing R7's value 22 to Mem[55]; add writes back R5=16]
-
40
Time: 9
No more instructions

[Datapath snapshot: sw completes write-back (no register write); the pipeline drains]
-
Lecture 3 Slide 41 EECS 470
Time graphs
Time:  1  2  3  4  5  6  7  8  9
add    F  D  E  M  W
nand      F  D  E  M  W
lw           F  D  E  M  W
add             F  D  E  M  W
sw                 F  D  E  M  W

(F = fetch, D = decode, E = execute, M = memory, W = writeback)
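The time graph above can be generated programmatically; this sketch assumes five independent, single-issue instructions:

```python
# Generate the staircase time graph for 5 independent instructions.
STAGES = ["F", "D", "E", "M", "W"]  # fetch/decode/execute/memory/writeback
insns = ["add", "nand", "lw", "add", "sw"]

def finish_cycle(i):
    """1-based cycle in which instruction i (0-based) writes back."""
    return (i + 1) + len(STAGES) - 1

print(f"{'Time:':5}" + " ".join(str(t) for t in range(1, 10)))
for i, name in enumerate(insns):
    print(f"{name:5}" + " ".join([" "] * i + STAGES))

print("last writeback: cycle", finish_cycle(len(insns) - 1))
```

With no hazards, n instructions finish in n + (stages - 1) cycles: 5 + 4 = 9 here, matching the graph.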
-
Lecture 3 Slide 42 EECS 470
Balancing Pipeline Stages
IF | ID | EX | MEM | WB
TIF = 6 units, TID = 2 units, TEX = 9 units, TMEM = 5 units, TWB = 8 units

What is the speedup of the pipelined processor over the single-cycle processor (assuming no hazards)?
a) 2/30
b) 6/30
c) 9/30
d) 30/9
e) 30/6
f) 30/2
g) No idea

Can we do better in terms of either performance or efficiency?
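Working the numbers (a sketch; the key fact is that the slowest stage sets the pipelined clock):

```python
# Speedup of pipelined over single-cycle, ignoring hazards.
stage_times = {"IF": 6, "ID": 2, "EX": 9, "MEM": 5, "WB": 8}

single_cycle_period = sum(stage_times.values())  # 30 units: fit everything
pipelined_period = max(stage_times.values())     # 9 units: slowest stage

speedup = single_cycle_period / pipelined_period
print(speedup)  # 30/9, i.e. answer (d)
```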
-
Lecture 3 Slide 43 EECS 470
Balancing Pipeline Stages
Two methods for stage quantization:
- Merging multiple stages
- Further subdividing a stage

Recent trends:
- Deeper pipelines (more and more stages)
  Pipeline depth growing more slowly since Pentium 4. Why?
- Multiple pipelines
- Pipelined memory/cache accesses (tricky)
-
Lecture 3 Slide 44 EECS 470
The Cost of Deeper Pipelines
Instruction pipelines are not ideal, i.e., instructions in different stages can have dependencies.

Suppose:
  add  1 2 3
  nand 3 4 5

[Timing diagrams: two independent instructions (Inst0, Inst1) overlap perfectly, F D E M W offset by one cycle; here nand reads R3, which add writes, so nand must stall in Decode until add's result is available. RAW!! (read-after-write dependency)]
-
Lecture 3 Slide 45 EECS 470
Types of Dependencies and Hazards
Data dependence (both memory and register):
- True dependence (RAW): instruction must wait for all required input operands
- Anti-dependence (WAR): later write must not clobber a still-pending earlier read
- Output dependence (WAW): earlier write must not clobber an already-completed later write
  (WAR/WAW: not an issue now, but stay tuned)

Control dependence (aka procedural dependence):
- Conditional branches may change the instruction sequence
- Instructions after a cond. branch depend on the outcome (more exact definition later)
-
Lecture 3 Slide 46 EECS 470
Terminology
Pipeline hazards: potential violations of program dependences
- Must ensure program dependences are not violated

Hazard resolution:
- Static method: performed at compile time in software
- Dynamic method: performed at run time using hardware

Pipeline interlock: hardware mechanism for dynamic hazard resolution
- Must detect and enforce dependences at run time
-
Lecture 3 Slide 48 EECS 470
Handling Data Hazards
Avoidance (static):
- Make sure there are no hazards in the code

Detect and stall (dynamic):
- Stall until earlier instructions finish

Detect and forward (dynamic):
- Get the correct value from elsewhere in the pipeline
-
Lecture 3 Slide 49 EECS 470
Handling Data Hazards: Avoidance
Programmer/compiler must know implementation details:
- Insert noops between dependent instructions

  add  1 2 3    ; write R3 in cycle 5
  noop
  noop
  nand 3 4 5    ; read R3 in cycle 6
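A toy sketch of what such a compiler pass might look like; the tuple encoding, the `insert_noops` helper, and the `distance` parameter are all hypothetical, chosen to reproduce the two-noop example above:

```python
# Toy static-avoidance pass: pad with noops so that any reader is at
# least `distance` + 1 instructions behind a writer of its sources.
NOOP = ("noop", [], None)

def insert_noops(program, distance=2):
    """program: list of (op, src_regs, dest_reg) tuples."""
    out = []
    for op, srcs, dest in program:
        # Look back at the most recent instructions for a hazard.
        for back, (_, _, prev_dest) in enumerate(reversed(out), start=1):
            if back > distance:
                break
            if prev_dest is not None and prev_dest in srcs:
                out.extend([NOOP] * (distance - back + 1))
                break
        out.append((op, srcs, dest))
    return out

prog = [("add", [1, 2], 3), ("nand", [3, 4], 5)]
padded = insert_noops(prog)
print([op for op, _, _ in padded])  # ['add', 'noop', 'noop', 'nand']
```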
-
Lecture 3 Slide 50 EECS 470
Problems with Avoidance
Binary compatibility:
- New implementations may require more noops

Code size:
- Higher instruction cache footprint, longer binary load times
- Worse in machines that execute multiple instructions / cycle
- Intel Itanium: 25-40% of instructions are noops

Slower execution:
- CPI = 1, but many instructions are noops
-
Lecture 3 Slide 51 EECS 470
Handling Data Hazards: Detect & Stall
Detection:
- Compare regA & regB with destReg of preceding insns
- 3-bit comparators

Stall:
- Do not advance the pipeline registers for Fetch/Decode
- Pass a noop to Execute
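A behavioral sketch of the detection comparison; the `must_stall` helper and its arguments are illustrative, not the actual comparator hardware:

```python
# Compare the source register specifiers of the instruction in Decode
# against the destination registers still in flight in EX, MEM, and WB.
def must_stall(decode_srcs, inflight_dests):
    """decode_srcs: regA/regB of the insn in Decode.
    inflight_dests: destReg of insns in EX, MEM, WB (None = no write)."""
    return any(d is not None and d in decode_srcs for d in inflight_dests)

# nand 3 4 5 in Decode while add 1 2 3 is in Execute -> stall
print(must_stall({3, 4}, [3, None, None]))    # True
print(must_stall({3, 4}, [None, None, None])) # False
```

In hardware this is just a handful of 3-bit equality comparators whose outputs are ORed into the stall signal.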
-
Lecture 3 Slide 52 EECS 470
Next Time
• Continue 5-stage review
  - Discuss detect-and-forward in depth
• Lingering questions / feedback? I'll include an anonymous form at the end of every lecture: https://bit.ly/3oSr5FD