computer science 104: pipelining - duke...

– 1 –! CS:APP!

Alvin R. Lebeck!

Computer Science 104:!Pipelining!

Slides based on those from Randy Bryant, G. Kedem

– 2 –! CS:APP!

Administrative!Homework #5 !!Due Nov 2, in class!

Midterm II, Monday Nov 7 in class, closed book/notes!Project:!!16-bit processor implementationGroups of 2-3 (not individual)Due near end of semester More specifics to come!

Reading 4.4 (4.5 if you want more details)!

– 3 –! CS:APP!

SEQ Problems!n  Stages occur in sequence!n  One operation in process

at a time!n  Very slow, but works!!

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCC ALUALU

Datamemory

Datamemory

NewPC

rB

dstE dstM

ALUA

ALUB

Mem.control

Addr

srcA srcB

read

write

ALUfun.

Fetch

Decode

Execute

Memory

Write back

data out

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

Bch

dstE dstM srcA srcB

icode ifun rA

PC

valC valP

valBvalA

Data

valE

valM

PC

newPC

– 4 –! CS:APP!

Understanding Performance!•  Execution time = Inst/prog * cycles/inst * seconds/cycle!•  Inst/prog: ~ O(n)…take algorithms!•  Cycles/inst: for SEQ = 1!•  Seconds/cycle: for SEQ assume 5ns!

•  Problem with SEQ: every instruction takes full 5ns!•  E.g., ADD doesn’t need to use data memory!

– 5 –! CS:APP!

Overview!General Principles of Pipelining!

n  Goal!n  Difficulties!

Creating a Pipelined Y86 Processor!n  Rearranging SEQ!n  Inserting pipeline registers!n  Problems with data and control hazards!

– 6 –! CS:APP!

Real-World Pipelines: Car Washes!

Idea!n  Divide process into

independent stages!n  Move objects through stages

in sequence!n  At any given time, multiple

objects being processed!

Sequential! Parallel!

Pipelined!

– 7 –! CS:APP!

Computational Example!

System!n  Computation requires total of 300 picoseconds!n  Additional 20 picoseconds to save result in register!n  Can must have clock cycle of at least 320 ps!

Combinational logic

R e g

300 ps 20 ps

Clock

Delay = 320 ps Throughput = 3.12 GOPS

– 8 –! CS:APP!

3-Way Pipelined Version!

System!n  Divide combinational logic into 3 blocks of 100 ps each!n  Can begin new operation as soon as previous one passes

through stage A.!l  Begin new operation every 120 ps!

n  Overall latency increases!l  360 ps from start to finish!

R e g

Clock

Comb. logic

A

R e g

Comb. logic

B

R e g

Comb. logic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps


– 9 –! CS:APP!

Pipeline Diagrams!Unpipelined!!!!

n  Cannot start new operation until previous one completes!

3-Way Pipelined!!!!

n  Up to 3 operations in process simultaneously!

Time

OP1!OP2!OP3!

Time

A! B! C!A! B! C!

A! B! C!

OP1!OP2!OP3!

– 10 –! CS:APP!

Operating a Pipeline!

Time

OP1!OP2!OP3!

A! B! C!A! B! C!

A! B! C!

0! 120! 240! 360! 480! 640!

Clock!

R e g

Clock

Comb. logic

A

R e g

Comb. logic

B

R e g

Comb. logic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

239!

R e g

Clock

Comb. logic

A

R e g

Comb. logic

B

R e g

Comb. logic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

241!

R e g

R e g

R e g

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

Comb. logic

A

Comb. logic

B

Comb. logic

C

Clock

300!

R e g

Clock

Comb. logic

A

R e g

Comb. logic

B

R e g

Comb. logic

C

100 ps 20 ps 100 ps 20 ps 100 ps 20 ps

359!

– 11 –! CS:APP!

Limitations: Nonuniform Delays!

n  Throughput limited by slowest stage!n  Other stages sit idle for much of the time!n  Challenging to partition system into balanced stages!

R e g

Clock

R e g

Comb. logic

B

R e g

Comb. logic

C

50 ps 20 ps 150 ps 20 ps 100 ps 20 ps


Comb. logic A

Time

OP1!OP2!OP3!

A! B! C!A! B! C!

A! B! C!

– 12 –! CS:APP!

Limitations: Register Overhead!

n  As try to deepen pipeline, overhead of loading registers becomes more significant!

n  Percentage of clock cycle spent loading register:!l  1-stage pipeline: !6.25% !l  3-stage pipeline: !16.67% !l  6-stage pipeline: !28.57%!

n  High speeds of modern processor designs obtained through very deep pipelining!

Delay = 420 ps, Throughput = 14.29 GOPS Clock

R e g

Comb. logic

50 ps 20 ps

R e g

Comb. logic

50 ps 20 ps

R e g

Comb. logic

50 ps 20 ps

R e g

Comb. logic

50 ps 20 ps

R e g

Comb. logic

50 ps 20 ps

R e g

Comb. logic

50 ps 20 ps

– 13 –! CS:APP!

SEQ Hardware!n  Stages occur in sequence!n  One operation in process

at a time!

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCC ALUALU

Datamemory

Datamemory

NewPC

rB

dstE dstM

ALUA

ALUB

Mem.control

Addr

srcA srcB

read

write

ALUfun.

Fetch

Decode

Execute

Memory

Write back

data out

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

Bch

dstE dstM srcA srcB

icode ifun rA

PC

valC valP

valBvalA

Data

valE

valM

PC

newPC

– 14 –! CS:APP!

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCC ALUALU

Datamemory

Datamemory

PC

rB

dstE dstM

ALUA

ALUB

Mem.control

Addr

srcA srcB

read

write

ALUfun.

Fetch

Decode

Execute

Memory

Write back

data out

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

Bch

dstE dstM srcA srcB

icode ifun rA

pBch pValM pValC pValPpIcode

PC

valC valP

valBvalA

Data

valE

valM

PC

SEQ+ Hardware!n  Still sequential

implementation!n  Reorder PC stage to put at

beginning!

PC Stage!n  Task is to select PC for

current instruction!n  Based on results

computed by previous instruction!

Processor State!n  PC is no longer stored in

register!n  But, can determine PC

based on other stored information!

– 15 –! CS:APP!

Instructionmemory

Instructionmemory

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

icode, ifunrA, rB

valC

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

pState

valP

srcA, srcBdstA, dstB

valA, valB

aluA, aluB

Bch

valE

Addr, Data

valM

PC

valE, valM

valM

icode, valCvalP

PC

Adding Pipeline Registers!

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

valP

d_srcA, d_srcB

valA, valB

aluA, aluB

Bch valE

Addr, Data

valM

PC

W_valE, W_valM, W_dstE, W_dstMW_icode, W_valM

icode, ifun,rA, rB, valC

E

M

W

F

D

valP

f_PC

predPC

Instructionmemory

Instructionmemory

M_icode, M_Bch, M_valA

– 16 –! CS:APP!

Pipeline Stages!Fetch!

n  Select current PC!n  Read instruction!n  Compute incremented PC!

Decode!n  Read program registers!

Execute!n  Operate ALU!

Memory!n  Read or write data memory!

Write Back!n  Update register file!

PCincrement

PCincrement

CCCCALUALU

Datamemory

Datamemory

Fetch

Decode

Execute

Memory

Write back

Registerfile

Registerfile

A BM

E

Registerfile

Registerfile

A BM

E

valP

d_srcA, d_srcB

valA, valB

aluA, aluB

Bch valE

Addr, Data

valM

PC

W_valE, W_valM, W_dstE, W_dstMW_icode, W_valM

icode, ifun,rA, rB, valC

E

M

W

F

D

valP

f_PC

predPC

Instructionmemory

Instructionmemory

M_icode, M_Bch, M_valA

– 17 –! CS:APP!

The Five Stages of mrmovl!

Ifetch: Instruction Fetch!n  Fetch the instruction from the Instruction Memory!

Reg/Dec: Registers Fetch and Instruction Decode!Exec: Calculate the memory address!Mem: Read the data from the Data Memory!WrB: Write the data back to the register file!

Cycle 1!Cycle 2! Cycle 3!Cycle 4! Cycle 5!

Ifetch! Reg/Dec! Exec! Mem! WrB!mrmovl!

– 18 –! CS:APP!

Key Ideas Behind Instruction Execution Pipelining!

Overlap execution of instructions!

The mrmovl instruction has 5 stages: I-fetch, Reg- Fetch / I-Decode, Execute, Memory-Access, Register Write-Back. !

n  Five independent functional units to work on each stage!

l Each functional unit is used only once!

n  The 2nd mrmovl can start as soon as the 1st finishes its Ifetch stage!

n  Each mrmovl still takes five cycles to complete. latency is still 5 cycles!

n  The throughput is much higher; CPI is 1 with ~1/5 cycle time.!

n  Instructions start before the previous ones are completed.!

– 19 –! CS:APP!

Pipelining the mrmovl Instruction!

The five independent functional units in the pipeline datapath are:!n  Instruction Memory for the Ifetch stage!n  Register File’s Read ports for the Reg/Dec stage!n  ALU for the Exec stage!n  Data Memory for the Mem stage!n  Register File’s Write port for the WrB stage!

One instruction enters the pipeline every cycle!n  One instruction comes out of the pipeline (completed) every cycle!n  The “Effective” Cycles per Instruction (CPI) is 1; ~1/5 cycle time!

Clock!

Cycle 1!Cycle 2! Cycle 3!Cycle 4! Cycle 5! Cycle 6! Cycle 7!

Ifetch! Reg/Dec! Exec! Mem! WrB!1st mrmovl!

Ifetch! Reg/Dec! Exec! Mem! WrB!2nd mrmovl!

Ifetch! Reg/Dec! Exec! Mem! WrB!3rd mrmovl!

– 20 –! CS:APP!

The Four Stages of Int Op!

Ifetch: Instruction Fetch!n  Fetch the instruction from the Instruction Memory!

Reg/Dec: Register access and Instruction Decode!Exec: ALU operates on the two register operands!WrB: Write the ALU output back to the register file!

Cycle 1! Cycle 2! Cycle 3! Cycle 4!

Ifetch! Reg/Dec! Exec! WrB!R-type!

– 21 –! CS:APP!

Pipelining the IOP and mrmovl Instructions!

We have a problem called a pipeline conflict or resource hazard:!n  Two instructions try to write to the register file at the same time!!

Clock!Cycle 1!Cycle 2! Cycle 3!Cycle 4! Cycle 5! Cycle 6! Cycle 7! Cycle 8! Cycle 9!

Ifetch! Reg/Dec! Exec! Wr!IOP!


Ifetch! Reg/Dec! Exec! Mem! Wr!mrmovl!



OOPS! We have a problem!!

– 22 –! CS:APP!

Important Observation!

Each functional unit can only be used once per instruction!Each functional unit must be used in the same stage for all instructions:!

n  Mrmovl uses Register File’s Write Port during its 5th stage!!!!n  IOP uses Register File’s Write Port during its 4th stage!

Ifetch! Reg/Dec! Exec! Mem! WrB!Load!1! 2! 3! 4! 5!

Ifetch! Reg/Dec! Exec! WrB!R-type!1! 2! 3! 4!

°  How to solve this pipeline hazard?

– 23 –! CS:APP!

Solution: Delay IOP’s Write by One Cycle!Delay R-type’s register write by one cycle:!

n  Now IOP instructions also use Reg File’s write port at Stage 5!n  Mem stage is a NO-OP stage: nothing is being done. Effective CPI?!


Ifetch! Reg/Dec! Mem! WrB!IOP!


Ifetch! Reg/Dec! Exec! Mem! WrB!mrmovl!



Ifetch! Reg/Dec! Exec! Wr!IOP! Mem!

Exec!

Exec!

Exec!

Exec!

1! 2! 3! 4! 5!

– 24 –! CS:APP!

Data Dependencies!

System!n  Each operation depends on result from preceding one!

Clock

Combinational logic

R e g

Time

OP1!OP2!OP3!

– 25 –! CS:APP!

Data Dependencies in Programs!

n  Result from one instruction used as operand for another!l  Read-after-write (RAW) dependency!

n  Very common in actual programs!n  Must make sure our pipeline handles these properly!

l  Get correct results!l  Minimize performance impact!

1 irmovl $50, %eax

2 addl %eax , %ebx

3 mrmovl 100( %ebx ), %edx

– 26 –! CS:APP!

Data Hazards!Example:! irmovl $50, %eax addl %eax, %ebx # depends on irmovl mrmovl 100(%ebx), %edx # depends on addl

Clock!Cycle 1!Cycle 2! Cycle 3!Cycle 4! Cycle 5! Cycle 6! Cycle 7! Cycle 8!

Ifetch! Reg/Dec! Exec! Mem! WrB!irmovl

Ifetch! Reg/Dec! Exec! Mem! WrB!addl

Ifetch! Reg/Dec! Exec! Mem! WrB!mrmovl

n  Result does not feed back around in time for next operation!n  Pipelining has changed behavior of system!n  How can we avoid data hazards (i.e., ensure we get correct

answer)?!

– 27 –! CS:APP!

Solving Data Hazards!!irmovl $50, %eax addl %eax, %ebx # depends on irmovl mrmovl 100(%ebx), %edx # depends on addl

Three techniques to solve data hazards!n  Insert nops, or !n  Stall (i.e., hold addl in Fetch until irmovl is finished), or!n  Modify datapath to pass/forward value!

!

– 28 –! CS:APP!

What else can go wrong!!!irmovl $50, %eax addl %eax, %ebx jne foo mrmovl 100(%ebx), %edx addl %ebx, %edx

foo: irmovl $220, %edx subl %ebx, %ecx

!

– 29 –! CS:APP!

Control Dependency!!irmovl $50, %eax addl %eax, %ebx jne foo mrmovl 100(%ebx), %edx # control dependent on jne addl %ebx, %edx

foo: irmovl $220, %edx subl %ebx, %ecx

•  We don’t know next PC! •  What should we do?

!

– 30 –! CS:APP!

Predicting the PC!

n  Start fetch of new instruction after current one has completed fetch stage!l  Not enough time to reliably determine next instruction!

n  Guess which instruction will follow!l  Recover if prediction was incorrect!

– 31 –! CS:APP!

One Prediction Strategy!Instructions that Don’t Transfer Control!

n  Predict next PC to be valP!n  Always reliable!

Call and Unconditional Jumps!n  Predict next PC to be valC (destination)!n  Always reliable!

Conditional Jumps!n  Predict next PC to be valC (destination)!n  Only correct if branch is taken!

l  Typically right 60% of time!

Return Instruction!n  Don’t try to predict!

What if we guess wrong?!

– 32 –! CS:APP!

Recovering from Branch Misprediction!


Ifetch! Reg/Dec! Exec! Mem!irmovl!

Ifetch! Reg/Dec! Exec! Mem!addl!

Ifetch! Reg/Dec! Exec! Mem! Wr!jne!

Ifetch! Reg/Dec! Exec!irmovl!

Ifetch! Reg/Dec! Exec!addl!

OOPS! Wrong direction!Stop “bad” instructions!

From updating state!

Wr!

Wr!

Mem! Wr!

Mem! Wr!

Ifetch! Reg/Dec! Exec!mrmovl! Mem! W

– 33 –! CS:APP!

Pipeline Summary!Concept!

n  Break instruction execution into 5 stages!n  Run instructions through in pipelined mode!n  Latency of one instruction is ~ same, but throughput

increases!

Limitations!n  Can’t handle dependencies between instructions when

instructions follow too closely!n  Data dependencies!

l  One instruction writes register, later one reads it!n  Control dependency!

l  Instruction sets PC in way that pipeline did not predict correctly!l  Mispredicted branch and return!

computer science 104: pipelining - duke...

Documents