fundamentals of computer systems - a pipelined mips processor

Fundamentals of Computer SystemsA Pipelined MIPS Processor

Stephen A. Edwards and Martha A. Kim

Columbia University

Spring 2012

Technical Illustrations Copyright ©2007 Elsevier

Sequential Laundry

Alice

Bob

Cindy

Time

Pipelined Laundry

Alice

Bob

Cindy

Time

Why let the washer ordryer sit idle when bothcan be running?

Single-Cycle Datapath Timing

��

�

� �

��

��

�

�

��

��

� �

� �

� ��

��

�

��

��

��

�

�

�

�

� �

��

��

�

��

�

��

�

��

��

��

��

��

��

!!�

�

�"�&�$(� �&' '�'

��& '�'

��

��($��

��'��)

��&�&��

�&�$(�

*&�+

�

�"

Clk

PC

Instruction Memory

Register File

ALU

Data Memory

Single-Cycle vs. Pipelined Datapath

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PCF0

1PC' InstrD

25:21

20:16

15:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW

WriteDataE WriteDataM

SrcAE

PCPlus4D

PCBranchM

ResultW

PCPlus4EPCPlus4F

ZeroM

CLK CLK

ALU

WriteRegE4:0

CLK

CLK

CLK

SignImm

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PC0

1PC' Instr

25:21

20:16

15:0

SrcB

20:16

15:11

<<2

+

ALUResult ReadData

WriteData

SrcA

PCPlus4

PCBranch

WriteReg4:0

Result

Zero

CLK

ALU

Fetch Decode Execute Memory Writeback

Corrected Pipelined DatapathThe register number to write (WriteReg) must staysynchronized with the result.

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PCF0

1PC' InstrD

25:21

20:16

15:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4EPCPlus4F

ZeroM

CLK CLK

WriteRegW4:0

ALU

WriteRegE4:0

CLK

CLK

CLK

Fetch Decode Execute Memory Writeback

Pipeline ControlSame control unit as the single-cycle processor;Control signals delayed across pipeline stages

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign Extend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE0

1

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

20:16

15:11

RtE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4EPCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

ZeroM

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU

RegWriteE RegWriteM RegWriteW

MemtoRegE MemtoRegM MemtoRegW

MemWriteE MemWriteM

BranchE BranchM

RegDstE

ALUSrcE

WriteRegE4:0

Single-Cycle vs. Pipeline Timing

Time (ps)Instr

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead / Write

WriteReg

1

2

0 100 200 300 400 500 600 700 800 900 1100 1200 1300 1400 1500 1600 1700 1800 19001000

Instr

1

2

3

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead / Write

WriteReg

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead/Write

WriteReg

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead/Write

WriteReg

FetchInstruction

DecodeRead Reg

ExecuteALU

MemoryRead/Write

WriteReg

Single-Cycle

Pipelined

Pipelining Abstraction

Time (cycles)

lw $s2, 40($0) RF 40

$0RF

$s2+ DM

RF $t2

$t1RF

$s3+ DM

RF $s5

$s1RF

$s4- DM

RF $t6

$t5RF

$s5& DM

RF 20

$s1RF

$s6+ DM

RF $t4

$t3RF

$s7| DM

add $s3, $t1, $t2

sub $s4, $s1, $s5

and $s5, $t5, $t6

sw $s6, 20($s1)

or $s7, $t3, $t4

1 2 3 4 5 6 7 8 9 10

add

IM

IM

IM

IM

IM

IMlw

sub

and

sw

or

Data Hazard

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMadd

or

sub

The first instruction produces a result (in $s0) that laterinstructions need.

The ALU has computed the value in cycle 3, but it won’tbe written to the register file until cycle 5.

Bioh

azard

Water

Hazard

Hazard

Lights

Dukes

ofH

azzard

Eliminating Hazards at Compile-Time

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IM add

or

sub

nop

nop

RF RFDMnopIM

RF RFDMnopIM

9 10

Insert nops to delay later instructions; sometimespossible to put useful work in those slots.

Eliminating Data Hazards through Forwarding

Time (cycles)

add $s0, $s2, $s3 RF $s3

$s2RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMadd

or

sub

Add logic to send data “between” instructions; registerfile eventually written.

Here, the result is available at the end of cycle 3 andneeded in cycles 4 and 5.

Datapath with Data Forwarding

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

ForwardAE

ForwardBE

20:16RtE

RsD

RdD

RtD

RegW

riteM

RegWriteW

Hazard Unit

PCPlus4E

BranchE BranchM

ZeroM

ForwardAE =

10 if rsE 6= 0∧ rsE = WriteRegM∧RegWriteM,01 if rsE 6= 0∧ rsE = WriteRegW∧RegWriteW,00 otherwise.

Data Hazard that Demands a Stall

Time (cycles)

lw $s0, 40($0) RF 40

$0RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMlw

or

sub

Trouble!

This data hazard can’t be solved with forwardingbecause the value is only available at the end ofcycle 4, yet is needed at the beginning.

Stalling a Pipeline

Time (cycles)

lw $s0, 40($0) RF 40

$0RF

$s0+ DM

RF $s1

$s0RF

$t0& DM

RF $s0

$s4RF

$t1| DM

RF $s5

$s0RF

$t2- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IM lw

or

sub

9

RF $s1

$s0

IM or

Stall

A stall tells an instruction to wait for a cycle beforeproceeding.

Stalling Hardware

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchM

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcM

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

SignImmD

StallF

StallD

ForwardA

E

ForwardB

E

20:16RtE

RsD

RdD

RtD

RegWriteM

RegWriteW

Mem

toRegE

Hazard Unit

FlushE

PCPlus4E

BranchE BranchM

ZeroM

EN

EN

CLR

lwstall = MemToRegE∧�

(rsD = rtE)∨ (rtD = rtE)�

StallF,StallD,FlushE = lwstall

Control Hazards

Time (cycles)

beq $t1, $t2, 40 RF $t2

$t1RF- DM

RF $s1

$s0RF& DM

RF $s0

$s4RF| DM

RF $s5

$s0RF- DM

and $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

and

IM

IM

IM

IMlw

or

sub

20

24

28

2C

30

...

...

9

Flushthese

instructions

64 slt $t3, $s2, $s3 RF $s3

$s2RF

$t3slt DMIM

slt

Whether to branch isn’t determined until the beginningof the fourth cycle; three instructions would beexecuted erroneously if the branch were taken.

Early Branch Resolution

EqualD

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchD

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcD

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

=

SignImmD

Sta

llF

Sta

llD

For

war

dAE

For

war

dBE

20:16RtE

RsD

RdE

RtD

Reg

Writ

eM

Re

gWrit

eW

Mem

toR

egE

Hazard Unit

Flu

shE

EN

EN

CLR

CLR

Introduced another data hazard in the decode stage

Control Hazards w/ Early Branch Resolution

Time (cycles)

beq $t1, $t2, 40 RF $t2

$t1RF- DM

RF $s1

$s0RF& DMand $t0, $s0, $s1

or $t1, $s4, $s0

sub $t2, $s0, $s5

1 2 3 4 5 6 7 8

andIM

IMlw20

24

28

2C

30

...

...

9

Flushthis

instruction

64 slt $t3, $s2, $s3 RF $s3

$s2RF

$t3slt DMIM

slt

Handling Data and Control Hazards

EqualD

SignImmE

CLK

A RD

InstructionMemory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

SignExtend

RegisterFile

0

1

0

1

A RD

DataMemory

WD

WE

1

0

PCF0

1PC' InstrD

25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchD

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

ControlUnit

PCSrcD

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

0

1

0

1

=

SignImmD

Sta

llF

Sta

llD

For

war

dAE

For

war

dBE

For

war

dAD

For

war

dBD

20:16RtE

RsD

RdD

RtD

Reg

Writ

eE

Reg

Writ

eM

Reg

Writ

eW

Mem

toR

egE

Bra

nchD

Hazard Unit

Flu

shE

EN

EN

CLR

CLR

Forwarding and Stalling Logic

“Forward ALU result to branch-if-equal comparator if we would readits destination register”

ForwardAD = (rsD 6= 0∧ rsD = WriteRegM∧RegWriteM)

ForwardBD = (rtD 6= 0∧ rtD = WriteRegM∧RegWriteM)

“Stall if the branch would test the result of an ALU operation or amemory read”

branchstall =(BranchD∧RegWriteE∧ (WriteRegE = rsD∨WriteRegE = rtD))∨(BranchD∧MemToRegM∧ (WriteRegM = rsD∨WriteRegM = rtD))

“Stall if we need to read the result of a memory read or of a branch”

StallF,StallD,FlushE = lwstall∨ branchstall

Pipeline Performance CPI Example

Ideal CPI = 1; stalls reduce this, but how much?

Instructions in SPECINT2000 benchmark:

52% R-type25% Loads10% Stores11% Branches

2% Jumps

If 40% of loads are used by the next instruction and 25% ofbranches are mispredicted, what is the average CPI?

Load CPI = 2 when next instruction uses; 1 otherwise

Branch CPI = 2 when mispredicted; 1 otherwise

Jump CPI = 2

Average CPI = 1 · 0.52+ R-type(2 · 0.40+ 1 · 0.60) · 0.25+ Loads1 · 0.10+ Stores(2 · 0.25+ 1 · 0.75) · 0.11+ Branches2 · 0.02 Jumps

= 1.1475

Fully Bypassed Processor

EqualD

SignImmE

CLK

A RD

Instruction

Memory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign

Extend

Register

File

0

1

0

1

A RD

Data

Memory

WD

WE

1

0

PCF0

1

PC' InstrD25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchD

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

Control

Unit

PCSrcD

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

0

1

0

1

=

SignImmD

20:16 RtE

RsD

RdD

RtD

Hazard Unit

Sta

llF

Sta

llD

Forw

ard

AE

Forw

ard

BE

Forw

ard

AD

Forw

ard

BD

RegW

riteE

RegW

riteM

RegW

riteW

Mem

toR

egE

Bra

nchD

Flu

shE

EN

CLRENCLR

Pipelined Processor Critical Path

EqualD

SignImmE

CLK

A RD

Instruction

Memory

+

4

A1

A3

WD3

RD2

RD1WE3

A2

CLK

Sign

Extend

Register

File

0

1

0

1

A RD

Data

Memory

WD

WE

1

0

PCF0

1

PC' InstrD25:21

20:16

15:0

5:0

SrcBE

25:21

15:11

RsE

RdE

<<2

+

ALUOutM

ALUOutW

ReadDataW


SrcAE

PCPlus4D

PCBranchD

WriteRegM4:0

ResultW

PCPlus4F

31:26

RegDstD

BranchD

MemWriteD

MemtoRegD

ALUControlD2:0

ALUSrcD

RegWriteD

Op

Funct

Control

Unit

PCSrcD

CLK CLK CLK

CLK CLK

WriteRegW4:0

ALUControlE2:0

ALU



MemWriteE MemWriteM

RegDstE

ALUSrcE

WriteRegE4:0

000110

000110

0

1

0

1

=

SignImmD

20:16 RtE

RsD

RdD

RtD

Hazard Unit

Sta

llF

Sta

llD

Forw

ard

AE

Forw

ard

BE

Forw

ard

AD

Forw

ard

BD

RegW

riteE

RegW

riteM

RegW

riteW

Mem

toR

egE

Bra

nchD

Flu

shE

EN

CLRENCLR

Element Delay

Register clk-to-Q tpcq 30 psRegister setup tsetup 20Multiplexer tmux 25ALU tALU 200Memory Read tmemread 250Register file read tRFread 150Register file setup tRFsetup 20Equality teq 40AND gate tAND 15Memory Write tmemwrite 220Register file write tRFwrite 100

Tc=max

tpcq + tmemread + tsetup Fetch2(tRFread + tmux + teq + tAND + tmux + tRFsetup) Decodetpcq + tmux + tmux + tALU + tsetup Executetpcq + tmemwrite + tsetup Memory2(tpcq + tmux + tRFwrite) Writeback

=2(150+ 25+ 40+ 15+ 25+ 20) ps = 550 ps

Why 2? We assume it takes half a cycle for a newly writtenregister’s value (WD3) to propagate to RD1 or RD2, i.e., when anearlier instruction writes a register used by the current one.

Pipelined Processor Performance

For a 100 billion-instruction task on our pipelinedprocessor, each instruction takes 1.15 cycles onaverage. With a 550 ps clock period,

time = 100× 109 × 1.15× 550 ps = 63 seconds

Processor Execution Time Speedup

Single-Cycle 92.5 s 1.00 (by definition)Multi-Cycle 133.9 0.70Pipelined 63.25 1.46

fundamentals of computer systems - a pipelined mips processor

Documents