pipelining the mips datapath - cs 233 · pipelining the mips datapath today: pipelining monday:...
TRANSCRIPT
Pipelining the MIPS Datapath
Today: PipeliningMonday: Forwarding values in the pipeline
Wednesday: Exam 4 review session (optional)Friday: Exam 5 review session (optional)
Today’s lecture Pipeline implementation Single‐cycle Datapath Pipelining performance Pipelined datapath Example
Refresher: The Full Single-cycle Datapath
wr_enable
clk
resetalu_op[2:0]
A[31:0]
B[31:0]
out[31:0]0
1 s
overflowzero
negative
alu_src2
alu_op[2:0]write_enable
alu_src2rd_src
3
inst[31:0]
inst[25:21]
inst[20:16]
inst[15:11]
inst[20:16]
rd_src
rs
rt
rd
rt
5
5
5
16
6
6
32
inst[15:0]
inst[31:26]
inst[5:0]
imm16
32
32
32
3
32
3
ADD
432
32
1
30
PC[31:0]
nextPC[31:0]
PC[31:2]
except
A_addr
B_addr
W_addr
A_data
B_data
W_en
W_data
reset
25x32 Register File
MIPS instruction decoderalu_op[2:0]
write_enable
alu_src2rd_src
except
opcode[5:0]
funct[5:0]
A
B
out
0
1s
Instruction Memory
addr[29:0]data[31:0]
PC RegisterD[31:0]
Q[31:0]resetenable
3ADD
branch offset
32ALU
0 1 2 3
PC+4[31:28] 2'b0
ALU
control_type
inst[25:0] 26
4
32
3232
32
32
32
control_type[1:0]2
control_type[1:0]
zero
32<<2in[29:0] out[31:0]
Sign Extenderin[15:0] out[31:0]
branch offset
30'b0
10 slt
16'b0
16
lui 1 0
luiluisltslt
word_webyte_we
out[1:0]
data_o
ut[31:0]
data_o
ut[31:24]
data_o
ut[23:16]
data_o
ut[15:8]
data_o
ut[7:0]
mem_readbyte_load
32
1
0
addr[31:0]
data_out[31:0]
data_in[31:0]
word_webyte_we
Data Memory
reset
0123
1
0 24'b0
32
32
32
32
32
byte_loadbyte_loadword_weword_webyte_webyte_wemem_readmem_read
We will use a simplified implementation of MIPS to create a pipelined version
Arithmetic: add sub and or slt
Data Transfer:
lw sw
Control: beq
lw $t0, –4($sp)
Sign Extend
4
ADD
PCSrc
RegWrite
rd1
0
RegDst
MIPS Decoder
0
1
ALUSrc
ALUOp
<< 2
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToReg
0
1
Read Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rs
rt
MemReadMemWrite
PC
EN
beq $at, $0, offset
A) ALUSrc=0 B) ALUSrc=1
Sign Extend
4
ADD
PCSrc
RegWrite
rd1
0
RegDst
MIPS Decoder
0
1
ALUSrc
ALUOp
<< 2
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToReg
0
1
Read Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rs
rt
MemReadMemWrite
PC
EN
Sign Extend
4
ADD
PCSrc
RegWrite
rd1
0
RegDst
MIPS Decoder
0
1
ALUSrc
ALUOp
<< 2
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToReg
0
1
Read Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rs
rt
MemReadMemWrite
PC
EN
2ns
2ns 2ns1ns
2ns 2ns
Worst-case delay from register-read to register-write determines clock speed
What is the worst‐case delay for an instruction?a) 6 nsb) 7 nsc) 8 nsd) 10 nse) 11 ns
1ns
i>clickerIF ID EX MEM WB
IF ID EX MEM WB
2 ns 1 ns 2 ns 2 ns 1 ns
What is the latency to execute one instruction on the single‐cycle datapath?A) 1 nsB) 2 nsC) 8 nsD) 10 nsE) 12 ns
IF ID EX MEM WB
IF ID EX MEM WB
0ns 2 3 5 7 8 10 11 13 15 16 ns
clk
Single-cycle datapath completes one instruction per clock cycle
Add pipeline registers in between stages to increase clock speed
Approximate as one big pipeline register between each stage. The registers are named for the stages they connect.
IF/ID ID/EX EX/MEM MEM/WB
No pipeline register after the WB stage, because write is to the register file.
Paths from register-read to register-write are shorter, clock cycle shorter
Sign Extend
4
ADD
PCSrc
RegWrite
rd1
0
RegDst
MIPS Decoder
0
1
ALUSrcALUOp
<< 2
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToReg
IF/ID ID/EX
WBMEX
EX/MEM
WBM
Mem/WB
WB
0
1
Read Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rd
rt
rs
rt
MemReadMemWrite
PC
EN
IF EX MEM WBID
IF EX MEM WBID
i>clicker2 ns 2 ns 2 ns 2 ns 2 ns
What is the latency to execute one instruction on the pipelined datapath?A) 1 nsB) 2 nsC) 8 nsD) 10 nsE) 12 ns
IF EX MEM WB
0ns 2 4 6 8 10 12 14 16 ns
ID
IF EX MEM WBID
IF EX MEM WBID
IF EX MEM WBID
clk
Pipeline datapath completes one stage per clock cycle
Pipelining improves throughput at the cost of increased latency
How long does it take to run 1000 instructions if the clock period is 8 ns?
Throughput: Time to run N instructions on the single-cycle datapath is N x clock period
Ideal pipeline performance is time to fill the pipeline + one cycle per instruction
How long does it take to run 1000 instructions if the clock period is 2 ns?a) 1008 ns b) 2000 ns
i>clicker
c) 2008 nsd) 8000 ns
e) 8008 ns
For large N, this 5-stage pipeline quadruples performanceSingle‐cycle throughput performance
= N instructions / N * 8 ns
Pipeline throughput performance = N instructions / N * 2 ns
Why did the 5-stage pipeline not achieve the maximum throughput increase?because it is lame
Oh god did I sleep through lecture?
Why did the 5-stage pipeline not achieve the maximum throughput increase?because of waiting
The individual component latencies were not the same, and a pipelined system can only be as fast as the slowest stage
The different stages take different amounts of time.
time to fill and drain the pipeline adds to the total time and decreases throughput a bit
Data values required in later stages must be propagated forwardthrough the pipeline registers.
Sign Extend
RegWrite
rd1
0
RegDst
0
1
ALUSrcALUOp
Write Data
Read Data
Data Memory
1
0
MemToRegRead Address
Instruction[31:0]
Instruction Memory
BWriteAddr
WriteData
ReadData2
Register File
rt
imm
rd
rt
EN
Note – We cannot keep values like destination register in the “IF/ID register”
Sign Extend
4
ADD
PCSrc
RegWrite
rd1
0
RegDst
MIPS Decoder
0
1
ALUSrcALUOp
<< 2
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToReg
IF/ID ID/EX
WBMEX
EX/MEM
WBM
Mem/WB
WB
0
1
Read Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rd
rt
rs
rt
MemReadMemWrite
PC
EN
zero
Sign Extend
4
ADD
PCSrc
RegWrite
rd1
0
RegDst
MIPS Decoder
0
1
ALUSrcALUOp
<< 2
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToReg
IF/ID ID/EX
WBMEX
EX/MEM
WBM
Mem/WB
WB
0
1
Read Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rd
rt
rs
rt
MemReadMemWrite
PC
EN
zero
Control signals are generated in the decode stage and are propagated across stages
Categorize control signals by the pipeline stage that uses them
Stage Control signals neededEX ALUSrc ALUOp RegDst PCSrcMEM MemRead MemWriteWB RegWrite MemToReg
The pipeline registers and program counter update every clock cycle, so they do not have write enable controls
Sign Extend
4
ADD
PCSrc
RegWrite
rd1
0
RegDst
MIPS Decoder
0
1
ALUSrcALUOp
<< 2
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToReg
IF/ID ID/EX
WBMEX
EX/MEM
WBM
Mem/WB
WB
0
1
Read Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rd
rt
rs
rt
MemReadMemWrite
PC
EN
1000: lw $8, 4($29)1004: sub $2, $4, $51008: and $9, $10, $111012: or $16, $17, $181016: add $13, $14, $0
ASSUMPTIONS Each register contains its number plus 100. Example: R[8] == 108, R[29] == 129 Every data memory location contains 99. Example: M[8] == 99, M[29] == 99
CONVENTIONS X indicates values that are not important, Example: Imm16 for R-type. Question marks ??? indicate values we do not know, usually resulting from
instructions coming before and after the ones in our example.
An example execution sequence
addresses in decimal
Cycle 1 (filling)IF: lw $8, 4($29) MEM: ??? WB: ???EX: ???ID: ???
Sign Extend
ADD
RegWrite
rd1
0
RegDst
0
1
ALUSrcALUOp
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToRegRead Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rd
rt
rs
rt
MemReadMemWrite
PC
EN
1000
??
??
???
?
? ??
?
?
???
? ?
?
?
?
?
?
Cycle 2ID: lw $8, 4($29)IF: sub $2, $4, $5 MEM: ??? WB: ???EX: ???
Sign Extend
ADD
RegWrite
rd1
0
RegDst
0
1
ALUSrcALUOp
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToRegRead Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rd
rt
rs
rt
MemReadMemWrite
PC
EN
1004
____
??
______
__
__?
??
?
???
? ?
?
?
?
?
?
Cycle 3ID: sub $2, $4, $5IF: and $9, $10, $11 EX: lw $8, 4($29) MEM: ??? WB: ???
Sign Extend
ADD
RegWrite
rd1
0
RegDst
0
1
ALUSrcALUOp
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToRegRead Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rd
rt
rs
rt
MemReadMemWrite
PC
EN
1008
45
??
x25
104
105__
?
__
__
__
____
__ ?
?
?
?
?
?
__
__ __
__
Cycle 4ID: and $9, $10, $11IF: or $16, $17, $18 EX: sub $2, $4, $5 MEM: lw $8, 4($29) WB: ???
Sign Extend
ADD
RegWrite
rd1
0
RegDst
0
1
ALUSrcALUOp
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToRegRead Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rd
rt
rs
rt
MemReadMemWrite
PC
EN
1012
1011
??
x910
110
111‐1
__104
105
x
25
2 __
__
?
?
?
?
105
0 SUB
1
__
Cycle 5 (full)ID: or $16, $17, $18IF: add $13, $14, $0 EX: and $9, $10, $11 MEM: sub $2, $4, $5 WB: lw $8, 4($29)
Sign Extend
ADD
RegWrite
rd1
0
RegDst
0
1
ALUSrcALUOp
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToRegRead Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rd
rt
rs
rt
MemReadMemWrite
PC
EN
1016
1718
____
x1618
117
118110
‐1110
111
x
911
9 2
x
__
__
__
__
111
0 AND
1
105__
__
Sign Extend
ADD
RegWrite
rd1
0
RegDst
0
1
ALUSrcALUOp
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToRegRead Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rd
rt
rs
rt
MemReadMemWrite
PC
EN
ID: add $13, $14, $0IF: ??? EX: or $16, $17, $18 MEM: and $9, $10, $11 WB: sub $2, $4, $5
Cycle 6 (emptying)
00
x x
117
111
00
x
Cycle 7ID: ???IF: ??? EX: add $13, $14, $0 MEM: or $16, $17, $18 WB: and $9, $10,
$11
Sign Extend
ADD
RegWrite
rd1
0
RegDst
0
1
ALUSrcALUOp
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToRegRead Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rd
rt
rs
rt
MemReadMemWrite
PC
EN
Cycle 8IF: ??? EX: ??? MEM: add $13, $14, $0 WB: or $16, $17,
$18
Sign Extend
ADD
RegWrite
rd1
0
RegDst
0
1
ALUSrcALUOp
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToRegRead Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rd
rt
rs
rt
MemReadMemWrite
PC
EN
ID: ???
Cycle 9MEM: ??? WB: add $13, $14, $0IF: ??? EX: ???ID: ???
Sign Extend
ADD
RegWrite
rd1
0
RegDst
0
1
ALUSrcALUOp
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToRegRead Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rd
rt
rs
rt
MemReadMemWrite
PC
EN
Things to notice from the last nine slides
Instruction executions overlap Each functional unit is used by a different instruction in each cycle. In clock cycle 5, all of the hardware units are used (the pipeline is full).
This is the ideal situation, and what makes pipelined processors so fast Similar example in the book available at the end of Section 6.3.
Clock cycle1 2 3 4 5 6 7 8 9
add $sp, $sp, -4 IF ID EX NOP WBsub $v0, $a0, $a1 IF ID EX NOP WBlw $t0, 4($sp) IF ID EX MEM WBor $s0, $s1, $s2 IF ID EX NOP WBlw $t1, 8($sp) IF ID EX MEM WB
MIPs ISA makes pipelining “easy” Instruction formats are the same length and uniform Addressing modes are simple Each instruction takes only one cycle
Everything goes left to right, except …Next time: We will discuss Data Hazards
Sign Extend
4
ADD
PCSrc
RegWrite
rd1
0
RegDst
MIPS Decoder
0
1
ALUSrcALUOp
<< 2
ADD
Address
Write Data
Read Data
Data Memory
1
0
MemToReg
IF/ID ID/EX
WBMEX
EX/MEM
WBM
Mem/WB
WB
0
1
Read Address
Instruction[31:0]
Instruction Memory
A
B
ReadAddr1
ReadAddr2
WriteAddr
WriteData
ReadData1
ReadData2
Register File
rt
imm
rd
rt
rs
rt
MemReadMemWrite
PC
EN
zero
Some instructions do not require all five stages, can we skip stages?
R-type IF ID EX WB
Can we skip a stage?a)yesb)no
Some instructions do not require all five stages, can we skip stages?
Clock cycle1 2 3 4 5 6 7 8 9
add$sp, $sp, -4 IF ID EX WBsub $v0, $a0, $a1 IF ID EX WBlw $t0, 4($sp) IF ID EX MEM WBor $s0, $s1, $s2 IF ID EX WBlw $t1, 8($sp) IF ID EX MEM WB
Trying to use the single stage for multiple instructions creates a structural hazard Each functional unit can only be used once per instruction
Clock cycle1 2 3 4 5 6 7 8 9
add$sp, $sp, -4 IF ID EX WBsub $v0, $a0, $a1 IF ID EX WBlw $t0, 4($sp) IF ID EX MEM WBor $s0, $s1, $s2 IF ID EX WBlw $t1, 8($sp) IF ID EX MEM WB
Insert NOP (No OPeration) stages to avoid structural hazards
Stores and Branches have NOP stages, too…
Clock cycle1 2 3 4 5 6 7 8 9
add $sp, $sp, -4 IF ID EX NOP WBsub $v0, $a0, $a1 IF ID EX NOP WBlw $t0, 4($sp) IF ID EX MEM WBor $s0, $s1, $s2 IF ID EX NOP WBlw $t1, 8($sp) IF ID EX MEM WB
R-type IF ID EX NOP WB
store IF ID EX MEM NOPbranch IF ID EX NOP NOP