week 11 introduction to pipelined...

Spring 2005331 W11.1

14:332:331Computer Architecture and Assembly Language

Spring 2005

Week 11Introduction to Pipelined Datapath

[Adapted from Dave Patterson’s UCB CS152 slides and

Mary Jane Irwin’s PSU CSE331 slides]

Spring 2005331 W11.2

Head’s UpReminders

Pipelined datapath and controlHW#6 will be handed out soon

Spring 2005331 W11.3

Review: Multicycle Data and Control Path

Address

Read Data(Instr. or Data)

Memory

PC

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

ReadData 1

ReadData 2

ALU

Write Data

IRM

DR

AB

ALU

out

SignExtend

Shiftleft 2 ALU

control

Shiftleft 2

ALUOpControl

FSM

IRWriteMemtoReg

MemWriteMemRead

IorDPCWrite

PCWriteCond

RegDstRegWrite

ALUSrcAALUSrcB

zero

PCSource

1

1

1

1

1

10

0

0

0

0

0

2

2

3

4

Instr[5-0]

Instr[25-0]

PC[31-28]

Instr[15-0]

Instr[31-26]

32

28

Spring 2005331 W11.4

Review: RTL SummaryStep R-type Mem Ref Branch Jump

Instrfetch

IR = Memory[PC]; PC = PC + 4;

Decode A = Reg[IR[25-21]];B = Reg[IR[20-16]];

ALUOut = PC +(sign-extend(IR[15-0])<< 2);

Execute ALUOut = A op B;

ALUOut = A + sign-extend (IR[15-0]);

if (A==B) PC =

ALUOut;

PC = PC[31-28] ||(IR[25-0]

<< 2);

Memory access

Reg[IR[15-11]] = ALUOut;

MDR = Memory[ALUOut];

orMemory[ALUOut]

= B;

Write-back

Reg[IR[20-16]] = MDR;

Spring 2005331 W11.5

Review: Multicycle Datapath FSM

Start

Instr Fetch Decode

Write Back

Memory Access

Execute

(Op = R-

type)

(Op =

beq)

(Op = lw or sw) (Op = j)

(Op = lw)(Op = sw)

0 1

2

3

4

5

6

7

8 9

Unless otherwise assigned

PCWrite,IRWrite,MemWrite,RegWrite=0others=X

IorD=0MemRead;IRWrite

ALUSrcA=0ALUsrcB=01

PCSource,ALUOp=00PCWrite

ALUSrcA=0ALUSrcB=11ALUOp=00

PCWriteCond=0


PCWriteCond=0


PCWriteCond=0


PCSource=01PCWriteCond

PCSource=10PCWrite

MemReadIorD=1

PCWriteCond=0

MemWriteIorD=1

PCWriteCond=0

RegDst=1RegWriteMemtoReg=0

PCWriteCond=0

RegDst=0RegWriteMemtoReg=1

PCWriteCond=0

Spring 2005331 W11.6

Review: FSM Implementation

Combinationalcontrol logic

State RegNextState

Inst[31-26]

Inputs

Out

puts

Op0

Op1

Op2

Op3

Op4

Op5

PCWritePCWriteCondIorDMemReadMemWriteIRWriteMemtoRegPCSourceALUOpALUSourceBALUSourceARegWriteRegDst

System Clock

Spring 2005331 W11.7

Single Cycle Disadvantages & AdvantagesUses the clock cycle inefficiently – the clock cycle must be timed to accommodate the slowestinstruction

Is wasteful of area since some functional units must (e.g., adders) be duplicated since they can not be shared during a clock cycle

but

Is simple and easy to understand

Clk

Single Cycle Implementation:

lw sw Waste

Cycle 1 Cycle 2

Spring 2005331 W11.8

Multicycle Advantages & DisadvantagesUses the clock cycle efficiently – the clock cycle is timed to accommodate the slowest instruction step

balance the amount of work to be done in each steprestrict each step to use only one major functional unit

Multicycle implementations allowfunctional units to be used more than once per instruction as long as they are used on different clock cyclesfaster clock ratesdifferent instructions to take a different number of clock cycles

but

Requires additional internal state registers, muxes, and more complicated (FSM) control

Spring 2005331 W11.9

The Five Stages of Load InstructionCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

IFetch Dec Exec Mem WBlw

IFetch: Instruction Fetch and Update PC

Dec: Registers Fetch and Instruction Decode

Exec: Execute R-type; calculate memory address

Mem: Read/write the data from/to the Data Memory

WB: Write the data back to the register file

Spring 2005331 W11.10

Single Cycle vs. Multiple Cycle TimingSingle Cycle Implementation:

Clk Cycle 1

Multiple Cycle Implementation:

IFetch Dec Exec Mem WB

Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10

IFetch Dec Exec Memlw sw

Clk

lw sw Waste

IFetchR-type

Cycle 1 Cycle 2

multicycle clock slower than 1/5th of single cycle clock due to stage flipflop overhead

Spring 2005331 W11.11

Pipelined MIPS ProcessorStart the next instruction while still working on the current one

improves throughput - total amount of work done in a given time

instruction latency (execution time, delay time, response time) is not reduced - time from the start of an instruction to its completion

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

IFetch Dec Exec Mem WBlw

Cycle 7Cycle 6 Cycle 8

sw IFetch Dec Exec Mem WB

R-type IFetch Dec Exec Mem WB

Spring 2005331 W11.12

Single Cycle, Multiple Cycle, vs. Pipeline

Clk

Cycle 1

Multiple Cycle Implementation:


Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10

lw IFetch Dec Exec Mem WB

IFetch Dec Exec Memlw sw

Pipeline Implementation:

IFetch Dec Exec Mem WBsw

Clk

Single Cycle Implementation:

Load Store Waste

IFetchR-type

Cycle 1 Cycle 2

wasted cycle

IFetch Dec Exec Mem WBR-type

Spring 2005331 W11.13

Pipelining the MIPS ISAWhat makes it easy

all instructions are the same length (32 bits)few instruction formats (three) with symmetry across formatsmemory operations can occur only in loads and storesoperands must be aligned in memory so a single data transfer requires only one memory access

What makes it hardstructural hazards: what if we had only one memorycontrol hazards: what about branchesdata hazards: what if an instruction’s input operands depend on the output of a previous instruction

Spring 2005331 W11.14

MIPS Pipeline Datapath ModificationsWhat do we need to add/modify in our MIPS datapath?

State registers between pipeline stages to isolate them

ReadAddress

InstructionMemory

Add

PC

4

0

1

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

ReadData 1

ReadData 2

16 32

ALU

1

0

Shiftleft 2

Add

DataMemoryAddress

Write Data

ReadData

1

0

IFet

ch/D

ec

Dec

/Exe

c

Exec

/Mem

Mem

/WB


System Clock

SignExtend

Spring 2005331 W11.15

MIPS Pipeline Control Path ModificationsAll control signals are determined during Decode

and held in the state registers between pipeline stages

ReadAddress

InstructionMemory

Add

PC

4

0

1

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

ReadData 1

ReadData 2

16 32

ALU

1

0

Shiftleft 2

Add

DataMemoryAddress

Write Data

ReadData

1

0

WB

IFet

ch/D

ec

Dec

/Exe

c

Exec

/Mem

Mem

/WB

IFetch Dec Exec Mem

System Clock

Control

SignExtend

Spring 2005331 W11.16

Graphically Representing MIPS Pipeline

ALUIM Reg DM Reg

Can help with answering questions like:how many cycles does it take to execute this code?what is the ALU doing during cycle 4?is there a hazard, why does it occur, and how can it be fixed?

Spring 2005331 W11.17

Why Pipeline? For Throughput!Time (clock cycles)

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

Once the pipeline is full, one instruction is completed every cycle

Time to fill the pipeline

Inst 0

Inst 1

Inst 2

Inst 4

Inst 3

Instr.

Order

Spring 2005331 W11.18

Can pipelining get us into trouble?Yes: Pipeline Hazards

structural hazards: attempt to use the same resource by two different instructions at the same timedata hazards: attempt to use item before it is ready

- instruction depends on result of prior instruction still in the pipeline

control hazards: attempt to make a decision before condition is evaulated

- branch instructions

Can always resolve hazards by waitingpipeline control must detect the hazardtake action (or delay action) to resolve hazards

Spring 2005331 W11.19

A Unified Memory Would Be a Structural HazardTime (clock cycles)

ALUMem Reg Mem Reg

ALUMem Reg Mem Reg

ALUMem Reg Mem Reg

ALUMem Reg Mem Reg

ALUMem Reg Mem Reg

Reading data from memory

Reading instruction from memory

lw

Inst 1

Inst 2

Inst 4

Inst 3

Instr.

Order

Spring 2005331 W11.20

How About Register File Access?Time (clock cycles)

Instr.

Order

add

Inst 1

Inst 2

Inst 4

add

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

Can fix register file access hazard by doing reads in the second half of the cycle and writes in the first half.

Spring 2005331 W11.21

Branch Instructions Cause Control HazardsDependencies backward in time cause hazards

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

add

beq

lw

Inst 4

Inst 3

Instr.

Order

Spring 2005331 W11.22

One Way to “Fix” a Control Hazard

Instr.

Order

add

beq

ALUIM Reg DM Reg

ALUIM Reg DM Reg

Inst 3

lw

ALUIM Reg DM Reg

ALUIM Reg DM Reg

Can fix branch hazard by waiting –stall – but affects throughput

stall

stall

Spring 2005331 W11.23

Register Usage Can Cause Data HazardsDependencies backward in time cause hazards

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

add r1,r2,r3

sub r4,r1,r5

and r6,r1,r7

xor r4,r1,r5

or r8, r1, r9

Instr.

Order

Spring 2005331 W11.24

One Way to “Fix” a Data Hazard

Instr.

Order

add r1,r2,r3

ALUIM Reg DM Reg

sub r4,r1,r5

and r6,r1,r7

ALUIM Reg DM Reg

ALUIM Reg DM Reg

stall

stall

Can fix data hazard by waiting – stall –but affects throughput

Spring 2005331 W11.25

Loads Can Cause Data HazardsDependencies backward in time cause hazards

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

lw r1,100(r2)

sub r4,r1,r5

and r6,r1,r7

xor r4,r1,r5

or r8, r1, r9

Instr.

Order

Spring 2005331 W11.26

Stores Can Cause Data HazardsDependencies backward in time cause hazards

add r1,r2,r3

sw r1,100(r5)

and r6,r1,r7

xor r4,r1,r5

or r8, r1, r9A

LUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

ALUIM Reg DM Reg

Instr.

Order

Spring 2005331 W11.27

Other Pipeline Structures Are PossibleWhat about (slow) multiply operation?

let it take two cycles

What if the data memory access is twice as slow as the instruction memory?

make the clock twice as slow or …let data memory access take two cycles (and keep the same clock rate)

ALUIM Reg DM Reg

MUL

ALUIM Reg DM1 RegDM2

Spring 2005331 W11.28

Sample Pipeline AlternativesARM7

StrongARM-1

XScale

IM Reg EX

PC updateIM access

decodereg

access

ALU opDM accessshift/rotatecommit result

(write back)

ALUIM Reg DM Reg

ALUIM1 IM2 DM1 Reg

DM2Reg SHFT

PC updateBTB access

start IM access

IM access

decodereg 1 access

shift/rotatereg 2 access

ALU op

start DM accessexception

DM writereg write

Spring 2005331 W11.29

SummaryAll modern day processors use pipelining

Pipelining doesn’t help latency of single task, it helps throughput of entire workload

Multiple tasks operating simultaneously using different resources

Potential speedup = Number of pipe stages

Pipeline rate limited by slowest pipeline stageUnbalanced lengths of pipe stages reduces speedupTime to “fill” pipeline and time to “drain” it reduces speedup

Must detect and resolve hazardsStalling negatively affects throughput

week 11 introduction to pipelined...

Documents