computer science 104: pipelining - duke...
TRANSCRIPT
– 1 –! CS:APP!
Alvin R. Lebeck!
Computer Science 104:!Pipelining!
Slides based on those from Randy Bryant, G. Kedem
– 2 –! CS:APP!
Administrative!Homework #5 !!Due Nov 2, in class!
Midterm II, Monday Nov 7 in class, closed book/notes!Project:!!16-bit processor implementationGroups of 2-3 (not individual)Due near end of semester More specifics to come!
Reading 4.4 (4.5 if you want more details)!
– 3 –! CS:APP!
SEQ Problems!n Stages occur in sequence!n One operation in process
at a time!n Very slow, but works!!
Instructionmemory
Instructionmemory
PCincrement
PCincrement
CCCC ALUALU
Datamemory
Datamemory
NewPC
rB
dstE dstM
ALUA
ALUB
Mem.control
Addr
srcA srcB
read
write
ALUfun.
Fetch
Decode
Execute
Memory
Write back
data out
Registerfile
Registerfile
A BM
E
Registerfile
Registerfile
A BM
E
Bch
dstE dstM srcA srcB
icode ifun rA
PC
valC valP
valBvalA
Data
valE
valM
PC
newPC
– 4 –! CS:APP!
Understanding Performance!• Execution time = Inst/prog * cycles/inst * seconds/cycle!• Inst/prog: ~ O(n)…take algorithms!• Cycles/inst: for SEQ = 1!• Seconds/cycle: for SEQ assume 5ns!
• Problem with SEQ: every instruction takes full 5ns!• E.g., ADD doesn’t need to use data memory!
– 5 –! CS:APP!
Overview!General Principles of Pipelining!
n Goal!n Difficulties!
Creating a Pipelined Y86 Processor!n Rearranging SEQ!n Inserting pipeline registers!n Problems with data and control hazards!
– 6 –! CS:APP!
Real-World Pipelines: Car Washes!
Idea!n Divide process into
independent stages!n Move objects through stages
in sequence!n At any given time, multiple
objects being processed!
Sequential! Parallel!
Pipelined!
– 7 –! CS:APP!
Computational Example!
System!n Computation requires total of 300 picoseconds!n Additional 20 picoseconds to save result in register!n Can must have clock cycle of at least 320 ps!
Combinational logic
R e g
300 ps 20 ps
Clock
Delay = 320 ps Throughput = 3.12 GOPS
– 8 –! CS:APP!
3-Way Pipelined Version!
System!n Divide combinational logic into 3 blocks of 100 ps each!n Can begin new operation as soon as previous one passes
through stage A.!l Begin new operation every 120 ps!
n Overall latency increases!l 360 ps from start to finish!
R e g
Clock
Comb. logic
A
R e g
Comb. logic
B
R e g
Comb. logic
C
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
Delay = 360 ps Throughput = 8.33 GOPS
– 9 –! CS:APP!
Pipeline Diagrams!Unpipelined!!!!
n Cannot start new operation until previous one completes!
3-Way Pipelined!!!!
n Up to 3 operations in process simultaneously!
Time
OP1!OP2!OP3!
Time
A! B! C!A! B! C!
A! B! C!
OP1!OP2!OP3!
– 10 –! CS:APP!
Operating a Pipeline!
Time
OP1!OP2!OP3!
A! B! C!A! B! C!
A! B! C!
0! 120! 240! 360! 480! 640!
Clock!
R e g
Clock
Comb. logic
A
R e g
Comb. logic
B
R e g
Comb. logic
C
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
239!
R e g
Clock
Comb. logic
A
R e g
Comb. logic
B
R e g
Comb. logic
C
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
241!
R e g
R e g
R e g
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
Comb. logic
A
Comb. logic
B
Comb. logic
C
Clock
300!
R e g
Clock
Comb. logic
A
R e g
Comb. logic
B
R e g
Comb. logic
C
100 ps 20 ps 100 ps 20 ps 100 ps 20 ps
359!
– 11 –! CS:APP!
Limitations: Nonuniform Delays!
n Throughput limited by slowest stage!n Other stages sit idle for much of the time!n Challenging to partition system into balanced stages!
R e g
Clock
R e g
Comb. logic
B
R e g
Comb. logic
C
50 ps 20 ps 150 ps 20 ps 100 ps 20 ps
Delay = 510 ps Throughput = 5.88 GOPS
Comb. logic A
Time
OP1!OP2!OP3!
A! B! C!A! B! C!
A! B! C!
– 12 –! CS:APP!
Limitations: Register Overhead!
n As try to deepen pipeline, overhead of loading registers becomes more significant!
n Percentage of clock cycle spent loading register:!l 1-stage pipeline: !6.25% !l 3-stage pipeline: !16.67% !l 6-stage pipeline: !28.57%!
n High speeds of modern processor designs obtained through very deep pipelining!
Delay = 420 ps, Throughput = 14.29 GOPS Clock
R e g
Comb. logic
50 ps 20 ps
R e g
Comb. logic
50 ps 20 ps
R e g
Comb. logic
50 ps 20 ps
R e g
Comb. logic
50 ps 20 ps
R e g
Comb. logic
50 ps 20 ps
R e g
Comb. logic
50 ps 20 ps
– 13 –! CS:APP!
SEQ Hardware!n Stages occur in sequence!n One operation in process
at a time!
Instructionmemory
Instructionmemory
PCincrement
PCincrement
CCCC ALUALU
Datamemory
Datamemory
NewPC
rB
dstE dstM
ALUA
ALUB
Mem.control
Addr
srcA srcB
read
write
ALUfun.
Fetch
Decode
Execute
Memory
Write back
data out
Registerfile
Registerfile
A BM
E
Registerfile
Registerfile
A BM
E
Bch
dstE dstM srcA srcB
icode ifun rA
PC
valC valP
valBvalA
Data
valE
valM
PC
newPC
– 14 –! CS:APP!
Instructionmemory
Instructionmemory
PCincrement
PCincrement
CCCC ALUALU
Datamemory
Datamemory
PC
rB
dstE dstM
ALUA
ALUB
Mem.control
Addr
srcA srcB
read
write
ALUfun.
Fetch
Decode
Execute
Memory
Write back
data out
Registerfile
Registerfile
A BM
E
Registerfile
Registerfile
A BM
E
Bch
dstE dstM srcA srcB
icode ifun rA
pBch pValM pValC pValPpIcode
PC
valC valP
valBvalA
Data
valE
valM
PC
SEQ+ Hardware!n Still sequential
implementation!n Reorder PC stage to put at
beginning!
PC Stage!n Task is to select PC for
current instruction!n Based on results
computed by previous instruction!
Processor State!n PC is no longer stored in
register!n But, can determine PC
based on other stored information!
– 15 –! CS:APP!
Instructionmemory
Instructionmemory
PCincrement
PCincrement
CCCCALUALU
Datamemory
Datamemory
Fetch
Decode
Execute
Memory
Write back
icode, ifunrA, rB
valC
Registerfile
Registerfile
A BM
E
Registerfile
Registerfile
A BM
E
pState
valP
srcA, srcBdstA, dstB
valA, valB
aluA, aluB
Bch
valE
Addr, Data
valM
PC
valE, valM
valM
icode, valCvalP
PC
Adding Pipeline Registers!
PCincrement
PCincrement
CCCCALUALU
Datamemory
Datamemory
Fetch
Decode
Execute
Memory
Write back
Registerfile
Registerfile
A BM
E
Registerfile
Registerfile
A BM
E
valP
d_srcA, d_srcB
valA, valB
aluA, aluB
Bch valE
Addr, Data
valM
PC
W_valE, W_valM, W_dstE, W_dstMW_icode, W_valM
icode, ifun,rA, rB, valC
E
M
W
F
D
valP
f_PC
predPC
Instructionmemory
Instructionmemory
M_icode, M_Bch, M_valA
– 16 –! CS:APP!
Pipeline Stages!Fetch!
n Select current PC!n Read instruction!n Compute incremented PC!
Decode!n Read program registers!
Execute!n Operate ALU!
Memory!n Read or write data memory!
Write Back!n Update register file!
PCincrement
PCincrement
CCCCALUALU
Datamemory
Datamemory
Fetch
Decode
Execute
Memory
Write back
Registerfile
Registerfile
A BM
E
Registerfile
Registerfile
A BM
E
valP
d_srcA, d_srcB
valA, valB
aluA, aluB
Bch valE
Addr, Data
valM
PC
W_valE, W_valM, W_dstE, W_dstMW_icode, W_valM
icode, ifun,rA, rB, valC
E
M
W
F
D
valP
f_PC
predPC
Instructionmemory
Instructionmemory
M_icode, M_Bch, M_valA
– 17 –! CS:APP!
The Five Stages of mrmovl!
Ifetch: Instruction Fetch!n Fetch the instruction from the Instruction Memory!
Reg/Dec: Registers Fetch and Instruction Decode!Exec: Calculate the memory address!Mem: Read the data from the Data Memory!WrB: Write the data back to the register file!
Cycle 1!Cycle 2! Cycle 3!Cycle 4! Cycle 5!
Ifetch! Reg/Dec! Exec! Mem! WrB!mrmovl!
– 18 –! CS:APP!
Key Ideas Behind Instruction Execution Pipelining!
Overlap execution of instructions!
The mrmovl instruction has 5 stages: I-fetch, Reg- Fetch / I-Decode, Execute, Memory-Access, Register Write-Back. !
n Five independent functional units to work on each stage!
l Each functional unit is used only once!
n The 2nd mrmovl can start as soon as the 1st finishes its Ifetch stage!
n Each mrmovl still takes five cycles to complete. latency is still 5 cycles!
n The throughput is much higher; CPI is 1 with ~1/5 cycle time.!
n Instructions start before the previous ones are completed.!
– 19 –! CS:APP!
Pipelining the mrmovl Instruction!
The five independent functional units in the pipeline datapath are:!n Instruction Memory for the Ifetch stage!n Register File’s Read ports for the Reg/Dec stage!n ALU for the Exec stage!n Data Memory for the Mem stage!n Register File’s Write port for the WrB stage!
One instruction enters the pipeline every cycle!n One instruction comes out of the pipeline (completed) every cycle!n The “Effective” Cycles per Instruction (CPI) is 1; ~1/5 cycle time!
Clock!
Cycle 1!Cycle 2! Cycle 3!Cycle 4! Cycle 5! Cycle 6! Cycle 7!
Ifetch! Reg/Dec! Exec! Mem! WrB!1st mrmovl!
Ifetch! Reg/Dec! Exec! Mem! WrB!2nd mrmovl!
Ifetch! Reg/Dec! Exec! Mem! WrB!3rd mrmovl!
– 20 –! CS:APP!
The Four Stages of Int Op!
Ifetch: Instruction Fetch!n Fetch the instruction from the Instruction Memory!
Reg/Dec: Register access and Instruction Decode!Exec: ALU operates on the two register operands!WrB: Write the ALU output back to the register file!
Cycle 1! Cycle 2! Cycle 3! Cycle 4!
Ifetch! Reg/Dec! Exec! WrB!R-type!
– 21 –! CS:APP!
Pipelining the IOP and mrmovl Instructions!
We have a problem called a pipeline conflict or resource hazard:!n Two instructions try to write to the register file at the same time!!
Clock!Cycle 1!Cycle 2! Cycle 3!Cycle 4! Cycle 5! Cycle 6! Cycle 7! Cycle 8! Cycle 9!
Ifetch! Reg/Dec! Exec! Wr!IOP!
Ifetch! Reg/Dec! Exec! Wr!IOP!
Ifetch! Reg/Dec! Exec! Mem! Wr!mrmovl!
Ifetch! Reg/Dec! Exec! Wr!IOP!
Ifetch! Reg/Dec! Exec! Wr!IOP!
OOPS! We have a problem!!
– 22 –! CS:APP!
Important Observation!
Each functional unit can only be used once per instruction!Each functional unit must be used in the same stage for all instructions:!
n Mrmovl uses Register File’s Write Port during its 5th stage!!!!n IOP uses Register File’s Write Port during its 4th stage!
Ifetch! Reg/Dec! Exec! Mem! WrB!Load!1! 2! 3! 4! 5!
Ifetch! Reg/Dec! Exec! WrB!R-type!1! 2! 3! 4!
° How to solve this pipeline hazard?
– 23 –! CS:APP!
Solution: Delay IOP’s Write by One Cycle!Delay R-type’s register write by one cycle:!
n Now IOP instructions also use Reg File’s write port at Stage 5!n Mem stage is a NO-OP stage: nothing is being done. Effective CPI?!
Clock!Cycle 1!Cycle 2! Cycle 3!Cycle 4! Cycle 5! Cycle 6! Cycle 7! Cycle 8! Cycle 9!
Ifetch! Reg/Dec! Mem! WrB!IOP!
Ifetch! Reg/Dec! Mem! WrB!IOP!
Ifetch! Reg/Dec! Exec! Mem! WrB!mrmovl!
Ifetch! Reg/Dec! Mem! WrB!IOP!
Ifetch! Reg/Dec! Mem! WrB!IOP!
Ifetch! Reg/Dec! Exec! Wr!IOP! Mem!
Exec!
Exec!
Exec!
Exec!
1! 2! 3! 4! 5!
– 24 –! CS:APP!
Data Dependencies!
System!n Each operation depends on result from preceding one!
Clock
Combinational logic
R e g
Time
OP1!OP2!OP3!
– 25 –! CS:APP!
Data Dependencies in Programs!
n Result from one instruction used as operand for another!l Read-after-write (RAW) dependency!
n Very common in actual programs!n Must make sure our pipeline handles these properly!
l Get correct results!l Minimize performance impact!
1 irmovl $50, %eax
2 addl %eax , %ebx
3 mrmovl 100( %ebx ), %edx
– 26 –! CS:APP!
Data Hazards!Example:! irmovl $50, %eax addl %eax, %ebx # depends on irmovl mrmovl 100(%ebx), %edx # depends on addl
Clock!Cycle 1!Cycle 2! Cycle 3!Cycle 4! Cycle 5! Cycle 6! Cycle 7! Cycle 8!
Ifetch! Reg/Dec! Exec! Mem! WrB!irmovl
Ifetch! Reg/Dec! Exec! Mem! WrB!addl
Ifetch! Reg/Dec! Exec! Mem! WrB!mrmovl
n Result does not feed back around in time for next operation!n Pipelining has changed behavior of system!n How can we avoid data hazards (i.e., ensure we get correct
answer)?!
– 27 –! CS:APP!
Solving Data Hazards!!irmovl $50, %eax addl %eax, %ebx # depends on irmovl mrmovl 100(%ebx), %edx # depends on addl
Three techniques to solve data hazards!n Insert nops, or !n Stall (i.e., hold addl in Fetch until irmovl is finished), or!n Modify datapath to pass/forward value!
!
– 28 –! CS:APP!
What else can go wrong!!!irmovl $50, %eax addl %eax, %ebx jne foo mrmovl 100(%ebx), %edx addl %ebx, %edx
foo: irmovl $220, %edx subl %ebx, %ecx
!
– 29 –! CS:APP!
Control Dependency!!irmovl $50, %eax addl %eax, %ebx jne foo mrmovl 100(%ebx), %edx # control dependent on jne addl %ebx, %edx
foo: irmovl $220, %edx subl %ebx, %ecx
• We don’t know next PC! • What should we do?
!
– 30 –! CS:APP!
Predicting the PC!
n Start fetch of new instruction after current one has completed fetch stage!l Not enough time to reliably determine next instruction!
n Guess which instruction will follow!l Recover if prediction was incorrect!
– 31 –! CS:APP!
One Prediction Strategy!Instructions that Don’t Transfer Control!
n Predict next PC to be valP!n Always reliable!
Call and Unconditional Jumps!n Predict next PC to be valC (destination)!n Always reliable!
Conditional Jumps!n Predict next PC to be valC (destination)!n Only correct if branch is taken!
l Typically right 60% of time!
Return Instruction!n Don’t try to predict!
What if we guess wrong?!
– 32 –! CS:APP!
Recovering from Branch Misprediction!
Clock!Cycle 1!Cycle 2! Cycle 3!Cycle 4! Cycle 5! Cycle 6! Cycle 7! Cycle 8! Cycle 9!
Ifetch! Reg/Dec! Exec! Mem!irmovl!
Ifetch! Reg/Dec! Exec! Mem!addl!
Ifetch! Reg/Dec! Exec! Mem! Wr!jne!
Ifetch! Reg/Dec! Exec!irmovl!
Ifetch! Reg/Dec! Exec!addl!
OOPS! Wrong direction!Stop “bad” instructions!
From updating state!
Wr!
Wr!
Mem! Wr!
Mem! Wr!
Ifetch! Reg/Dec! Exec!mrmovl! Mem! W
– 33 –! CS:APP!
Pipeline Summary!Concept!
n Break instruction execution into 5 stages!n Run instructions through in pipelined mode!n Latency of one instruction is ~ same, but throughput
increases!
Limitations!n Can’t handle dependencies between instructions when
instructions follow too closely!n Data dependencies!
l One instruction writes register, later one reads it!n Control dependency!
l Instruction sets PC in way that pipeline did not predict correctly!l Mispredicted branch and return!