pipelined mips microarchitecture · pipelined mips microarchitecture. carnegie mellon 2 what will...
TRANSCRIPT
Carnegie Mellon
1
Design of Digital Circuits 2014Srdjan CapkunFrank K. Gürkaynak
Adapted from Digital Design and Computer Architecture, David Money Harris & Sarah L. Harris ©2007 Elsevier
http://www.syssec.ethz.ch/education/Digitaltechnik_14
Pipelined MIPS Microarchitecture
Carnegie Mellon
2
What Will We Learn
How to do more per unit time
Parallelism
Pipelining
Single Cycle vs Pipelined
Pipelined MIPS architecture
Hazards and how to solve them
Data hazards
Control hazards
Performance of the Pipelined architecture
Carnegie Mellon
3
Parallelism
Two types of parallelism:
Spatial parallelism
duplicate hardware performs multiple tasks at once
Temporal parallelism
task is broken into multiple stages
also called pipelining
for example, an assembly line
Carnegie Mellon
4
Parallelism Definitions
Some definitions:
Token: A group of inputs processed to produce a group of outputs
Latency: Time for one token to pass from start to end
Throughput: The number of tokens that can be produced per unit time
Parallelism increases throughput.
Carnegie Mellon
5
Parallelism Example
Example:
Ben Bitdiddle is baking cookies to celebrate the installation of his traffic light controller. It takes 5 minutes to roll the cookies and 15 minutes to bake them. After finishing one batch he immediately starts the next batch. What is the latency and throughput if Ben doesn’t use parallelism?
Latency =
Throughput =
Carnegie Mellon
6
Parallelism Example
Example:
Ben Bitdiddle is baking cookies to celebrate the installation of his traffic light controller. It takes 5 minutes to roll the cookies and 15 minutes to bake them. After finishing one batch he immediately starts the next batch. What is the latency and throughput if Ben doesn’t use parallelism?
Latency = 5 + 15 = 20 minutes = 1/3 hour
Throughput = 1 tray/ 1/3 hour = 3 trays/hour
Carnegie Mellon
7
Parallelism Example
What is the latency and throughput if Ben uses parallelism?
Spatial parallelism: Ben asks Allysa P. Hacker to help, using her own oven
Temporal parallelism: Ben breaks the task into two stages: roll and baking. He uses two trays. While the first batch is baking he rolls the second batch, and so on.
Carnegie Mellon
8
Spatial ParallelismS
pa
tial
Para
lleli
sm Roll
Bake
Ben 1 Ben 1
Alyssa 1 Alyssa 1
Ben 2 Ben 2
Alyssa 2 Alyssa 2
Time
0 5 10 15 20 25 30 35 40 45 50
Tray 1
Tray 2
Tray 3
Tray 4
Latency:
time to
first tray
Legend
Latency =
Throughput=
Carnegie Mellon
9
Spatial ParallelismS
pa
tial
Para
lleli
sm Roll
Bake
Ben 1 Ben 1
Alyssa 1 Alyssa 1
Ben 2 Ben 2
Alyssa 2 Alyssa 2
Time
0 5 10 15 20 25 30 35 40 45 50
Tray 1
Tray 2
Tray 3
Tray 4
Latency:
time to
first tray
Legend
Latency = 5 + 15 = 20 minutes = 1/3 hour
Throughput= 2 trays/ 1/3 hour = 6 trays/hour
Carnegie Mellon
10
Temporal ParallelismT
em
po
ral
Para
lleli
sm Ben 1 Ben 1
Ben 2 Ben 2
Ben 3 Ben 3
Time
0 5 10 15 20 25 30 35 40 45 50
Latency:
time to
first tray
Tray 1
Tray 2
Tray 3
Latency =
Throughput=
Carnegie Mellon
11
Temporal ParallelismT
em
po
ral
Para
lleli
sm Ben 1 Ben 1
Ben 2 Ben 2
Ben 3 Ben 3
Time
0 5 10 15 20 25 30 35 40 45 50
Latency:
time to
first tray
Tray 1
Tray 2
Tray 3
Latency = 5 + 15 = 20 minutes = 1/3 hour
Throughput= 1 trays/ 1/4 hour = 4 trays/hour
Using both techniques, the throughput would be 8 trays/hour
Carnegie Mellon
12
Pipelined MIPS Processor
Temporal parallelism
Divide single-cycle processor into 5 stages:
Fetch
Decode
Execute
Memory
Writeback
Add pipeline registers between stages
Carnegie Mellon
13
Single-Cycle vs. Pipelined Performance
Time (ps)Instr
Fetch
InstructionDecodeRead Reg
Execute
ALU
Memory
Read / Write
Write
Reg1
2
0 100 200 300 400 500 600 700 800 900 1100 1200 1300 1400 1500 1600 1700 1800 19001000
Instr
1
2
3
Fetch
InstructionDecodeRead Reg
Execute
ALU
Memory
Read / Write
Write
Reg
Fetch
InstructionDecodeRead Reg
Execute
ALU
Memory
Read/Write
Write
Reg
Fetch
InstructionDecodeRead Reg
Execute
ALU
Memory
Read/Write
Write
Reg
Fetch
InstructionDecodeRead Reg
Execute
ALU
Memory
Read/Write
Write
Reg
Single-Cycle
Pipelined
Carnegie Mellon
14
Pipelining Abstraction
Time (cycles)
lw $s2, 40($0) RF 40
$0
RF$s2
+ DM
RF $t2
$t1
RF$s3
+ DM
RF $s5
$s1
RF$s4
- DM
RF $t6
$t5
RF$s5
& DM
RF 20
$s1
RF$s6
+ DM
RF $t4
$t3
RF$s7
| DM
add $s3, $t1, $t2
sub $s4, $s1, $s5
and $s5, $t5, $t6
sw $s6, 20($s1)
or $s7, $t3, $t4
1 2 3 4 5 6 7 8 9 10
add
IM
IM
IM
IM
IM
IMlw
sub
and
sw
or
Carnegie Mellon
15
Single-Cycle and Pipelined Datapath
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PCF0
1
PC' InstrD25:21
20:16
15:0
SrcBE
20:16
15:11
RtE
RdE
<<2+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
ResultW
PCPlus4EPCPlus4F
ZeroM
CLK CLK
ALU
WriteRegE4:0
CLK
CLK
CLK
SignImm
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PC0
1
PC' Instr25:21
20:16
15:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
Zero
CLK
ALU
Fetch Decode Execute Memory Writeback
Carnegie Mellon
16
Corrected Pipelined Datapath
WriteReg must arrive at the same time as Result
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PCF0
1
PC' InstrD25:21
20:16
15:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4EPCPlus4F
ZeroM
CLK CLK
WriteRegW4:0
ALU
WriteRegE4:0
CLK
CLK
CLK
Fetch Decode Execute Memory Writeback
Carnegie Mellon
17
Pipelined Control
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE0
1
PCF0
1
PC' InstrD25:21
20:16
15:0
5:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4EPCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD
ALUSrcD
RegWriteD
Op
Funct
Control
Unit
ZeroM
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
BranchE BranchM
RegDstE
ALUSrcE
WriteRegE4:0
Same control unit as single-cycle processorControl delayed to proper pipeline stage
Carnegie Mellon
18
Pipeline Hazard
Occurs when an instruction depends on results from previous instruction that hasn’t completed.
Types of hazards:
Data hazard: register value not written back to register file yet
Control hazard: next instruction not decided yet (caused by branches)
Carnegie Mellon
19
Data Hazard
The register file can be read and written in the same cycle:
write takes place during the 1st half of the cycle
read takes place during the 2nd half of the cycle => no hazard !!!
However operations that involve register file have only half a clock cycle to complete the operation!!
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2
RF$s0
+ DM
RF $s1
$s0
RF$t0
& DM
RF $s0
$s4
RF$t1
| DM
RF $s5
$s0
RF$t2
- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMadd
or
sub
Carnegie Mellon
20
Data Hazard
One instruction writes a register ($s0) and next instructions read this register => read after write (RAW) hazard.
add writes into $s0 in the first half of cycle 5
and reads $s0 on cycle 3, obtaining the wrong value
or reads $s0 on cycle 4, again obtaining the wrong value.
sub reads $s0 in the second half of cycle 5, obtaining the correct value
subsequent instructions read the correct value of $s0
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2
RF$s0
+ DM
RF $s1
$s0
RF$t0
& DM
RF $s0
$s4
RF$t1
| DM
RF $s5
$s0
RF$t2
- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMadd
or
sub
Carnegie Mellon
21
How Can You Handle Data Hazards?
Insert “NOP”s (No OPeration) in code at compile time
Rearrange code at compile time
Forward data at run time
Stall the processor at run time
Carnegie Mellon
22
Compile-Time Hazard Elimination
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2
RF$s0
+ DM
RF $s1
$s0
RF$t0
& DM
RF $s0
$s4
RF$t1
|DM
RF $s5
$s0
RF$t2
-DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMadd
or
sub
nop
nop
RF RFDMnopIM
RF RFDMnopIM
9 10
Insert enough NOPs for result to be ready
Or (if you can) move independent useful instructions forward
Carnegie Mellon
23
Data Forwarding
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2
RF$s0
+ DM
RF $s1
$s0
RF$t0
& DM
RF $s0
$s4
RF$t1
| DM
RF $s5
$s0
RF$t2
- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMadd
or
sub
Carnegie Mellon
24
Data Forwarding
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign
Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE
1
0
PCF0
1
PC' InstrD25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
Control
Unit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
Forw
ard
AE
Forw
ard
BE
20:16RtE
RsD
RdD
RtD
RegW
rite
M
RegW
rite
W
Hazard Unit
PCPlus4E
BranchE BranchM
ZeroM
Carnegie Mellon
27
Stalling
Time (cycles)
lw $s0, 40($0) RF 40
$0
RF$s0
+ DM
RF $s1
$s0
RF$t0
& DM
RF $s0
$s4
RF$t1
| DM
RF $s5
$s0
RF$t2
- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMlw
or
sub
Trouble!
Forwarding is sufficient to solve RAW data hazards
but …
Carnegie Mellon
28
Stalling
Time (cycles)
lw $s0, 40($0) RF 40
$0
RF$s0
+ DM
RF $s1
$s0
RF$t0
& DM
RF $s0
$s4
RF$t1
| DM
RF $s5
$s0
RF$t2
- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMlw
or
sub
Trouble!
The lw instruction does not finish reading data until the end of the Memory stage, so its result cannot be forwarded to the Execute stage of the next instruction.
Carnegie Mellon
29
Stalling
Time (cycles)
lw $s0, 40($0) RF 40
$0
RF$s0
+ DM
RF $s1
$s0
RF$t0
& DM
RF $s0
$s4
RF$t1
| DM
RF $s5
$s0
RF$t2
- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMlw
or
sub
Trouble!
The lw instruction has a two-cycle latency, therefore a dependent instruction cannot use its result until two cycles later.
The lw instruction receives data from memory at the end of cycle 4. But the and instruction needs that data as a source operand at the beginning of cycle 4. There is no way to solve this hazard with forwarding.
Carnegie Mellon
30
Stalling
Time (cycles)
lw $s0, 40($0) RF 40
$0
RF$s0
+ DM
RF $s1
$s0
RF$t0
& DM
RF $s0
$s4
RF$t1
| DM
RF $s5
$s0
RF$t2
- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMlw
or
sub
9
RF $s1
$s0
IMor
Stall
Carnegie Mellon
31
Stalling Hardware
Stalls are supported by:
adding enable inputs (EN) to the Fetch and Decode pipeline registers
and a synchronous reset/clear (CLR) input to the Execute pipeline register.
When a lw stall occurs
StallD and StallF are asserted to force the Decode and Fetch stage pipeline registers to hold their old values.
FlushE is also asserted to clear the contents of the Execute stage pipeline register, introducing a bubble
Carnegie Mellon
32
Stalling Hardware
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign
Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE
1
0
PCF0
1
PC' InstrD25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
Control
Unit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
Sta
llF
Sta
llD
Forw
ard
AE
Forw
ard
BE
20:16RtE
RsD
RdD
RtD
RegW
rite
M
RegW
rite
W
Mem
toR
egE
Hazard Unit
Flu
shE
PCPlus4E
BranchE BranchM
ZeroM
EN
EN
CLR
Carnegie Mellon
33
Control Hazards
beq:
branch is not determined until the fourth stage of the pipeline
Instructions after the branch are fetched before branch occurs
These instructions must be flushed if the branch happens
Branch misprediction penalty
number of instruction flushed when branch is taken
May be reduced by determining branch earlier
Carnegie Mellon
34
Control Hazards: Original Pipeline
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign
Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE
1
0
PCF0
1
PC' InstrD25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
Control
Unit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
Sta
llF
Sta
llD
Forw
ard
AE
Forw
ard
BE
20:16RtE
RsD
RdD
RtD
RegW
rite
M
RegW
rite
W
Mem
toR
egE
Hazard Unit
Flu
shE
PCPlus4E
BranchE BranchM
ZeroM
EN
EN
CLR
Carnegie Mellon
35
Control Hazards
Time (cycles)
beq $t1, $t2, 40 RF $t2
$t1
RF- DM
RF $s1
$s0
RF& DM
RF $s0
$s4
RF| DM
RF $s5
$s0
RF- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMlw
or
sub
20
24
28
2C
30
...
...
9
Flush
these
instructions
64 slt $t3, $s2, $s3 RF $s3
$s2
RF$t3s
lt
DMIMslt
Carnegie Mellon
36
Control Hazards: Early Branch Resolution
EqualD
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign
Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE
1
0
PCF0
1
PC' InstrD25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchD
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
Control
Unit
PCSrcD
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
=
SignImmD
Sta
llF
Sta
llD
Forw
ard
AE
Forw
ard
BE
20:16RtE
RsD
RdE
RtD
RegW
rite
M
RegW
rite
W
Mem
toR
egE
Hazard Unit
Flu
shE
EN
EN
CLR
CLR
Introduced another data hazard in Decode stage.. this is another story
Carnegie Mellon
37
Early Branch Resolution
Time (cycles)
beq $t1, $t2, 40 RF $t2
$t1
RF- DM
RF $s1
$s0
RF& DMand $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
andIM
IMlw
20
24
28
2C
30
...
...
9
Flush
this
instruction
64 slt $t3, $s2, $s3 RF $s3
$s2
RF$t3s
lt
DMIMslt
Carnegie Mellon
38
Handling Data and Control Hazards
EqualD
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign
Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE
1
0
PCF0
1
PC' InstrD25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchD
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
Control
Unit
PCSrcD
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
0
1
0
1
=
SignImmD
Sta
llF
Sta
llD
Forw
ard
AE
Forw
ard
BE
Forw
ard
AD
Forw
ard
BD
20:16RtE
RsD
RdD
RtD
RegW
rite
E
RegW
rite
M
RegW
rite
W
Mem
toR
egE
Bra
nchD
Hazard Unit
Flu
shE
EN
EN
CLR
CLR
Possible solution to data hazard in Decode stage.
Carnegie Mellon
40
Branch Prediction
Guess whether branch will be taken
Backward branches are usually taken (loops)
Perhaps consider history of whether branch was previously taken to improve the guess
Good prediction reduces the fraction of branches requiring a flush
Carnegie Mellon
41
Pipelined Performance Example
SPECINT2000 benchmark:
25% loads
10% stores
11% branches
2% jumps
52% R-type
Suppose:
40% of loads used by next instruction
25% of branches mispredicted
All jumps flush next instruction
What is the average CPI?
Carnegie Mellon
42
Pipelined Performance Example Solution
Load/Branch CPI = 1 when no stalling, 2 when stalling.Thus:
CPIlw = 1(0.6) + 2(0.4) = 1.4 Average CPI for load
CPIbeq = 1(0.75) + 2(0.25) = 1.25 Average CPI for branch
And
Average CPI =
Carnegie Mellon
43
Pipelined Performance Example Solution
Load/Branch CPI = 1 when no stalling, 2 when stalling.Thus:
CPIlw = 1(0.6) + 2(0.4) = 1.4 Average CPI for load
CPIbeq = 1(0.75) + 2(0.25) = 1.25 Average CPI for branch
And
Average CPI = (0.25)(1.4) + load(0.1)(1) + store(0.11)(1.25) + beq(0.02)(2) + jump(0.52)(1) r-type
= 1.15
Carnegie Mellon
44
Pipelined Performance
There are 5 stages, and 5 different timing paths:
Tc = max {tpcq + tmem + tsetup fetch
2(tRFread + tmux + teq + tAND + tmux + tsetup ) decode
tpcq + tmux + tmux + tALU + tsetup execute
tpcq + tmemwrite + tsetup memory
2(tpcq + tmux + tRFwrite) writeback
}
The operation speed depends on the slowest operation
Decode and Writeback use register file and have only half aclock cycle to complete, that is why there is a 2 in front of them
Carnegie Mellon
45
Pipelined Performance Example
Element Parameter Delay (ps)
Register clock-to-Q tpcq_PC 30
Register setup tsetup 20
Multiplexer tmux 25
ALU tALU 200
Memory read tmem 250
Register file read tRFread 150
Register file setup tRFsetup 20
Equality comparator teq 40
AND gate tAND 15
Memory write Tmemwrite 220
Register file write tRFwrite 100
Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup )= 2[150 + 25 + 40 + 15 + 25 + 20] ps= 550 ps
Carnegie Mellon
46
Pipelined Performance Example
For a program with 100 billion instructions executing on a pipelined MIPS processor:
CPI = 1.15
Tc = 550 ps
Execution Time = (# instructions) × CPI × Tc
= (100 × 109)(1.15)(550 × 10-12)= 63 seconds
Carnegie Mellon
47
Performance Summary for MIPS arch.
Processor
Execution Time
(seconds)
Speedup
(single-cycle is baseline)
Single-cycle 95 1
Multicycle 133 0.71
Pipelined 63 1.51
Fastest of the three MIPS architectures is Pipelined.
However, even though we have 5 fold pipelining, it is not 5 times faster than single cycle.
Carnegie Mellon
48
What Did We Learn?
How to design a pipelined architecture
Break down long combinational path by registers
Shortens the clock period
You can start processing next instruction once the first part is complete
Problems of the pipelined architecture
Data you need for the next instruction may not be ready (Data Hazard)
You may not yet know the next instruction (Control Hazard)
Solutions to Hazards
Performance of pipelined MIPS architecture