fundamentals of computer systems - a pipelined mips processor
TRANSCRIPT
Fundamentals of Computer SystemsA Pipelined MIPS Processor
Stephen A. Edwards and Martha A. Kim
Columbia University
Spring 2012
Technical Illustrations Copyright ©2007 Elsevier
Sequential Laundry
Alice
Bob
Cindy
Time
Pipelined Laundry
Alice
Bob
Cindy
Time
Why let the washer ordryer sit idle when bothcan be running?
Single-Cycle Datapath Timing
�������
�
� �
����������
���
�
�
��
��
� �
� �
� ����
��
�
�����������
��������
����
�
�
�
�
� �
����
���
�
���
�
���
�
��� ����������
�����
����
����
�����
�����
!!�
�
�"�&�$(� �&' '�'
����& '�'
����
���($��
����'��)
����&�&����
�&�$(�
*&�+
�
�"
Clk
PC
Instruction Memory
Register File
ALU
Data Memory
Single-Cycle vs. Pipelined Datapath
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE0
1
PCF0
1PC' InstrD
25:21
20:16
15:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
ResultW
PCPlus4EPCPlus4F
ZeroM
CLK CLK
ALU
WriteRegE4:0
CLK
CLK
CLK
SignImm
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE0
1
PC0
1PC' Instr
25:21
20:16
15:0
SrcB
20:16
15:11
<<2
+
ALUResult ReadData
WriteData
SrcA
PCPlus4
PCBranch
WriteReg4:0
Result
Zero
CLK
ALU
Fetch Decode Execute Memory Writeback
Corrected Pipelined DatapathThe register number to write (WriteReg) must staysynchronized with the result.
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE0
1
PCF0
1PC' InstrD
25:21
20:16
15:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4EPCPlus4F
ZeroM
CLK CLK
WriteRegW4:0
ALU
WriteRegE4:0
CLK
CLK
CLK
Fetch Decode Execute Memory Writeback
Pipeline ControlSame control unit as the single-cycle processor;Control signals delayed across pipeline stages
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign Extend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE0
1
PCF0
1PC' InstrD
25:21
20:16
15:0
5:0
SrcBE
20:16
15:11
RtE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4EPCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
ZeroM
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
BranchE BranchM
RegDstE
ALUSrcE
WriteRegE4:0
Single-Cycle vs. Pipeline Timing
Time (ps)Instr
FetchInstruction
DecodeRead Reg
ExecuteALU
MemoryRead / Write
WriteReg
1
2
0 100 200 300 400 500 600 700 800 900 1100 1200 1300 1400 1500 1600 1700 1800 19001000
Instr
1
2
3
FetchInstruction
DecodeRead Reg
ExecuteALU
MemoryRead / Write
WriteReg
FetchInstruction
DecodeRead Reg
ExecuteALU
MemoryRead/Write
WriteReg
FetchInstruction
DecodeRead Reg
ExecuteALU
MemoryRead/Write
WriteReg
FetchInstruction
DecodeRead Reg
ExecuteALU
MemoryRead/Write
WriteReg
Single-Cycle
Pipelined
Pipelining Abstraction
Time (cycles)
lw $s2, 40($0) RF 40
$0RF
$s2+ DM
RF $t2
$t1RF
$s3+ DM
RF $s5
$s1RF
$s4- DM
RF $t6
$t5RF
$s5& DM
RF 20
$s1RF
$s6+ DM
RF $t4
$t3RF
$s7| DM
add $s3, $t1, $t2
sub $s4, $s1, $s5
and $s5, $t5, $t6
sw $s6, 20($s1)
or $s7, $t3, $t4
1 2 3 4 5 6 7 8 9 10
add
IM
IM
IM
IM
IM
IMlw
sub
and
sw
or
Data Hazard
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2RF
$s0+ DM
RF $s1
$s0RF
$t0& DM
RF $s0
$s4RF
$t1| DM
RF $s5
$s0RF
$t2- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMadd
or
sub
The first instruction produces a result (in $s0) that laterinstructions need.
The ALU has computed the value in cycle 3, but it won’tbe written to the register file until cycle 5.
Bioh
azard
Water
Hazard
Hazard
Lights
Dukes
ofH
azzard
Eliminating Hazards at Compile-Time
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2RF
$s0+ DM
RF $s1
$s0RF
$t0& DM
RF $s0
$s4RF
$t1| DM
RF $s5
$s0RF
$t2- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IM add
or
sub
nop
nop
RF RFDMnopIM
RF RFDMnopIM
9 10
Insert nops to delay later instructions; sometimespossible to put useful work in those slots.
Eliminating Data Hazards through Forwarding
Time (cycles)
add $s0, $s2, $s3 RF $s3
$s2RF
$s0+ DM
RF $s1
$s0RF
$t0& DM
RF $s0
$s4RF
$t1| DM
RF $s5
$s0RF
$t2- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMadd
or
sub
Add logic to send data “between” instructions; registerfile eventually written.
Here, the result is available at the end of cycle 3 andneeded in cycles 4 and 5.
Datapath with Data Forwarding
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE
1
0
PCF0
1PC' InstrD
25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
ForwardAE
ForwardBE
20:16RtE
RsD
RdD
RtD
RegW
riteM
RegWriteW
Hazard Unit
PCPlus4E
BranchE BranchM
ZeroM
ForwardAE =
10 if rsE 6= 0∧ rsE = WriteRegM∧RegWriteM,01 if rsE 6= 0∧ rsE = WriteRegW∧RegWriteW,00 otherwise.
Data Hazard that Demands a Stall
Time (cycles)
lw $s0, 40($0) RF 40
$0RF
$s0+ DM
RF $s1
$s0RF
$t0& DM
RF $s0
$s4RF
$t1| DM
RF $s5
$s0RF
$t2- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMlw
or
sub
Trouble!
This data hazard can’t be solved with forwardingbecause the value is only available at the end ofcycle 4, yet is needed at the beginning.
Stalling a Pipeline
Time (cycles)
lw $s0, 40($0) RF 40
$0RF
$s0+ DM
RF $s1
$s0RF
$t0& DM
RF $s0
$s4RF
$t1| DM
RF $s5
$s0RF
$t2- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IM lw
or
sub
9
RF $s1
$s0
IM or
Stall
A stall tells an instruction to wait for a cycle beforeproceeding.
Stalling Hardware
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE
1
0
PCF0
1PC' InstrD
25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchM
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcM
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
SignImmD
StallF
StallD
ForwardA
E
ForwardB
E
20:16RtE
RsD
RdD
RtD
RegWriteM
RegWriteW
Mem
toRegE
Hazard Unit
FlushE
PCPlus4E
BranchE BranchM
ZeroM
EN
EN
CLR
lwstall = MemToRegE∧�
(rsD = rtE)∨ (rtD = rtE)�
StallF,StallD,FlushE = lwstall
Control Hazards
Time (cycles)
beq $t1, $t2, 40 RF $t2
$t1RF- DM
RF $s1
$s0RF& DM
RF $s0
$s4RF| DM
RF $s5
$s0RF- DM
and $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
and
IM
IM
IM
IMlw
or
sub
20
24
28
2C
30
...
...
9
Flushthese
instructions
64 slt $t3, $s2, $s3 RF $s3
$s2RF
$t3slt DMIM
slt
Whether to branch isn’t determined until the beginningof the fourth cycle; three instructions would beexecuted erroneously if the branch were taken.
Early Branch Resolution
EqualD
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE
1
0
PCF0
1PC' InstrD
25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchD
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcD
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
=
SignImmD
Sta
llF
Sta
llD
For
war
dAE
For
war
dBE
20:16RtE
RsD
RdE
RtD
Reg
Writ
eM
Re
gWrit
eW
Mem
toR
egE
Hazard Unit
Flu
shE
EN
EN
CLR
CLR
Introduced another data hazard in the decode stage
Control Hazards w/ Early Branch Resolution
Time (cycles)
beq $t1, $t2, 40 RF $t2
$t1RF- DM
RF $s1
$s0RF& DMand $t0, $s0, $s1
or $t1, $s4, $s0
sub $t2, $s0, $s5
1 2 3 4 5 6 7 8
andIM
IMlw20
24
28
2C
30
...
...
9
Flushthis
instruction
64 slt $t3, $s2, $s3 RF $s3
$s2RF
$t3slt DMIM
slt
Handling Data and Control Hazards
EqualD
SignImmE
CLK
A RD
InstructionMemory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
SignExtend
RegisterFile
0
1
0
1
A RD
DataMemory
WD
WE
1
0
PCF0
1PC' InstrD
25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchD
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
ControlUnit
PCSrcD
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
0
1
0
1
=
SignImmD
Sta
llF
Sta
llD
For
war
dAE
For
war
dBE
For
war
dAD
For
war
dBD
20:16RtE
RsD
RdD
RtD
Reg
Writ
eE
Reg
Writ
eM
Reg
Writ
eW
Mem
toR
egE
Bra
nchD
Hazard Unit
Flu
shE
EN
EN
CLR
CLR
Forwarding and Stalling Logic
“Forward ALU result to branch-if-equal comparator if we would readits destination register”
ForwardAD = (rsD 6= 0∧ rsD = WriteRegM∧RegWriteM)
ForwardBD = (rtD 6= 0∧ rtD = WriteRegM∧RegWriteM)
“Stall if the branch would test the result of an ALU operation or amemory read”
branchstall =(BranchD∧RegWriteE∧ (WriteRegE = rsD∨WriteRegE = rtD))∨(BranchD∧MemToRegM∧ (WriteRegM = rsD∨WriteRegM = rtD))
“Stall if we need to read the result of a memory read or of a branch”
StallF,StallD,FlushE = lwstall∨ branchstall
Pipeline Performance CPI Example
Ideal CPI = 1; stalls reduce this, but how much?
Instructions in SPECINT2000 benchmark:
52% R-type25% Loads10% Stores11% Branches
2% Jumps
If 40% of loads are used by the next instruction and 25% ofbranches are mispredicted, what is the average CPI?
Load CPI = 2 when next instruction uses; 1 otherwise
Branch CPI = 2 when mispredicted; 1 otherwise
Jump CPI = 2
Average CPI = 1 · 0.52+ R-type(2 · 0.40+ 1 · 0.60) · 0.25+ Loads1 · 0.10+ Stores(2 · 0.25+ 1 · 0.75) · 0.11+ Branches2 · 0.02 Jumps
= 1.1475
Pipeline Performance CPI Example
Ideal CPI = 1; stalls reduce this, but how much?
Instructions in SPECINT2000 benchmark:
52% R-type25% Loads10% Stores11% Branches
2% Jumps
If 40% of loads are used by the next instruction and 25% ofbranches are mispredicted, what is the average CPI?
Load CPI = 2 when next instruction uses; 1 otherwise
Branch CPI = 2 when mispredicted; 1 otherwise
Jump CPI = 2
Average CPI = 1 · 0.52+ R-type(2 · 0.40+ 1 · 0.60) · 0.25+ Loads1 · 0.10+ Stores(2 · 0.25+ 1 · 0.75) · 0.11+ Branches2 · 0.02 Jumps
= 1.1475
Fully Bypassed Processor
EqualD
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign
Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE
1
0
PCF0
1
PC' InstrD25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchD
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
Control
Unit
PCSrcD
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
0
1
0
1
=
SignImmD
20:16 RtE
RsD
RdD
RtD
Hazard Unit
Sta
llF
Sta
llD
Forw
ard
AE
Forw
ard
BE
Forw
ard
AD
Forw
ard
BD
RegW
riteE
RegW
riteM
RegW
riteW
Mem
toR
egE
Bra
nchD
Flu
shE
EN
CLRENCLR
Pipelined Processor Critical Path
EqualD
SignImmE
CLK
A RD
Instruction
Memory
+
4
A1
A3
WD3
RD2
RD1WE3
A2
CLK
Sign
Extend
Register
File
0
1
0
1
A RD
Data
Memory
WD
WE
1
0
PCF0
1
PC' InstrD25:21
20:16
15:0
5:0
SrcBE
25:21
15:11
RsE
RdE
<<2
+
ALUOutM
ALUOutW
ReadDataW
WriteDataE WriteDataM
SrcAE
PCPlus4D
PCBranchD
WriteRegM4:0
ResultW
PCPlus4F
31:26
RegDstD
BranchD
MemWriteD
MemtoRegD
ALUControlD2:0
ALUSrcD
RegWriteD
Op
Funct
Control
Unit
PCSrcD
CLK CLK CLK
CLK CLK
WriteRegW4:0
ALUControlE2:0
ALU
RegWriteE RegWriteM RegWriteW
MemtoRegE MemtoRegM MemtoRegW
MemWriteE MemWriteM
RegDstE
ALUSrcE
WriteRegE4:0
000110
000110
0
1
0
1
=
SignImmD
20:16 RtE
RsD
RdD
RtD
Hazard Unit
Sta
llF
Sta
llD
Forw
ard
AE
Forw
ard
BE
Forw
ard
AD
Forw
ard
BD
RegW
riteE
RegW
riteM
RegW
riteW
Mem
toR
egE
Bra
nchD
Flu
shE
EN
CLRENCLR
Element Delay
Register clk-to-Q tpcq 30 psRegister setup tsetup 20Multiplexer tmux 25ALU tALU 200Memory Read tmemread 250Register file read tRFread 150Register file setup tRFsetup 20Equality teq 40AND gate tAND 15Memory Write tmemwrite 220Register file write tRFwrite 100
Tc=max
tpcq + tmemread + tsetup Fetch2(tRFread + tmux + teq + tAND + tmux + tRFsetup) Decodetpcq + tmux + tmux + tALU + tsetup Executetpcq + tmemwrite + tsetup Memory2(tpcq + tmux + tRFwrite) Writeback
=2(150+ 25+ 40+ 15+ 25+ 20) ps = 550 ps
Why 2? We assume it takes half a cycle for a newly writtenregister’s value (WD3) to propagate to RD1 or RD2, i.e., when anearlier instruction writes a register used by the current one.
Pipelined Processor Performance
For a 100 billion-instruction task on our pipelinedprocessor, each instruction takes 1.15 cycles onaverage. With a 550 ps clock period,
time = 100× 109 × 1.15× 550 ps = 63 seconds
Processor Execution Time Speedup
Single-Cycle 92.5 s 1.00 (by definition)Multi-Cycle 133.9 0.70Pipelined 63.25 1.46