1 pipeline datapath with some slides from: john lazzaro and dan garcia

88
1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

Post on 22-Dec-2015

219 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

1

Pipeline Datapath

With some slides from:

John Lazzaro andDan Garcia

Page 2: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

2

Flip Flops פליפ-לופ

D Q A flip-flop “samples” right before the edge, and then “holds” value.Sampling

circuitHolds value

Which circuit contributes t_setup delay?

Which circuit contributes t_clk-to-Q delay?

Page 3: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

3

D Q

clk-to-Q ?

CLK == 0

Sense D, but Qoutputs old value.

CLK 0->1

Capture D, passvalue to Q

CLK

setup ?

hold ?

clk-to-Q

setup

hold

Page 4: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

4

Performance Equation

Seconds

Program

Instructions

Program=

Seconds

Cycle Instruction

Cycles

Goal is to

optimize execution time,

notindividu

alequationterms.

The CPI of the

program.Reflects

the program’

s instructio

n mix.

Machinesare

optimizedwith

respect to

programworkload

s.

Clockperiod.

Optimizejointlywith

machineCPI.

Page 5: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

5

השעוןHertz=1/sec

של במהירות פנטיום מחשב

מבצע שהוא שעון 2 *10^8פירושו מחזורי.בשניה

לוקח שעון מחזור כל

200MHZ

5*10^-9=5nanosecond

בימינו פקודה לוקחת ?כמה

Page 6: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

6

datapath מבנה ה-P

C

inst

ruct

ion

me

mor

y

+4

rtrs

rd

regi

ste

rs

ALU

Da

tam

em

ory

imm

1. InstructionFetch

2. Decode/ Register

Read

3. Execute 4. Memory5. Write

Back

Page 7: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

7

Gotta Do Laundry° Ann, Brian, Cathy, Dave

each have one load of clothes to wash, dry, fold, and put away

A B C D

° Dryer takes 30 minutes

° “Folder” takes 30 minutes

° “Stasher” takes 30 minutes to put clothes into drawers

° Washer takes 30 minutes

Page 8: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

8

Sequential Laundry

• Sequential laundry takes 8 hours for 4 loads

Task

Order

B

C

D

A

30Time

3030 3030 30 3030 3030 3030 3030 3030

6 PM 7 8 9 10 11 12 1 2 AM

Page 9: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

9

Pipelined Laundry

• Pipelined laundry takes 3.5 hours for 4 loads!

Task

Order

B

C

D

A

12 2 AM6 PM 7 8 9 10 11 1

Time303030 3030 3030

Page 10: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

10

General Definitions

• Latency: time to completely execute a certain task– for example, time to read a sector from disk is

disk access time or disk latency

• Throughput: amount of work that can be done over a period of time

Page 11: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

11

Pipelining Lessons (1/2)

• Pipelining doesn’t help latency of single task, it helps throughput of entire workload

• Multiple tasks operating simultaneously using different resources

• Potential speedup = Number pipe stages

• Time to “fill” pipeline and time to “drain” it reduces speedup:2.3X v. 4X in this example

6 PM 7 8 9

Time

B

C

D

A

303030 3030 3030Task

Order

Page 12: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

12

Pipelining Lessons (2/2)

• Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline?

• Pipeline rate limited by slowest pipeline stage

• Unbalanced lengths of pipe stages also reduces speedup

6 PM 7 8 9

Time

B

C

D

A

303030 3030 3030Task

Order

Page 13: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

13

Inspiration: Automobile assembly lineAssembly line moves on a steady clock.

Each station does the same task on each car.Car body shell

Car chassis

Mergestation

Boltingstation

The clock

Page 14: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

14

Inspiration: Automobile assembly lineSimpler station tasks → more cars per hour.Simple tasks take less time, clock is faster.

Page 15: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

15

Inspiration: Automobile assembly lineLine speed limited by slowest task.

Most efficient if all tasks take same time to do

Page 16: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

16

Inspiration: Automobile assembly lineSimpler tasks, complex car → long line!

These lines go 24 x 7,

and rarely shut down. Why?

Page 17: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

17

Lessons from car assembly lines

Faster line movement yields more cars per hour off the line.

Faster line movement requires more stages, each doing simpler tasks.

To maximize efficiency, all stages should take same amount of time(if not, workers in fast stages are idle)

“Filling”, “flushing”, and “stalling” assembly line are all bad news.

Page 18: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

19

datapath מבנה ה-P

C

inst

ruct

ion

me

mor

y

+4

rtrs

rd

regi

ste

rs

ALU

Da

tam

em

ory

imm

1. InstructionFetch

2. Decode/ Register

Read

3. Execute 4. Memory5. Write

Back

Page 19: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

21

Key Analogy: The instruction is the car

IR IR IR

Instruction Fetch

IR

Pipeline Stage #1

Stage #2

Controlshardware

in stage 2

Stage #3

Controlshardware

in stage 3

Stage #4

Controlshardware

in stage 4

Stage #5

Controlshardware

in stage 5

“Data-stationary control”

Page 20: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

22

Representation #1: Timeline

IR IR

IF (Fetch) ID (Decode) EX (ALU)

IR IR

MEM WB

ADD R4,R3,R2

OR R7,R6,R5

SUB R1,R9,R8XOR R3,R2,R1

AND R6,R5,R4I1:I2:I3:I4:I5:

Sample Program

IF ID

IF

EX

ID

IF

MEM

EX

ID

IF

WB

MEM

EX

IFID

WB

MEM

IDEX

IF

WB

EXMEM

IDMEMWB

EXPipeline is “full”

Good for visualizing pipeline fills.

I1:I2:I3:I4:I5:

t1 t2 t3 t4 t5 t6 t7 t8Time:Inst

I6:

Page 21: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

23

Pipeline is “full”

Good for visualizing pipeline stalls.

Representation #2: Resource Usage

IR IR IR IR

ADD R4,R3,R2

OR R7,R6,R5

SUB R1,R9,R8XOR R3,R2,R1

AND R6,R5,R4I1:I2:I3:I4:I5:

Sample ProgramI1 I2

I1

I3

I2

I1

I4

I3

I2

I1

I5

I4

I3

I1I2

IF:ID:EX:MEM:WB:

t1 t2 t3 t4 t5 t6 t7 t8Time:Stage

IF (Fetch) ID (Decode) EX (ALU) MEM WB

I5

I4

I2I3

I6

I5

I3I4

I6

I7

I4I5

I6

I7

I8

Page 22: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

24

Review: Datapath for MIPS

Stage 1 Stage 2 Stage 3Stage 4 Stage 5

• Use datapath figure to represent pipelineIFtch Dcd Exec Mem WB

AL

U I$ Reg D$ Reg

PC

inst

ruct

ion

me

mor

y+4

rtrs

rd

regi

ste

rs

ALU

Da

tam

em

ory

imm

1. InstructionFetch

2. Decode/ Register Read 3. Execute 4. Memory

5. WriteBack

Page 23: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

25

Graphical Pipeline Representation

Instr.

Order

Load

Add

Store

Sub

Or

I$

Time (clock cycles)

I$

AL

U

Reg

Reg

I$

D$

AL

U

AL

U

Reg

D$

Reg

I$

D$

RegA

LU

Reg Reg

Reg

D$

Reg

D$

AL

U

(In Reg, right half highlight read, left half write)

Reg

I$

Page 24: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

26

Example• Suppose 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write

• Nonpipelined Execution:–lw : IF + Read Reg + ALU + Memory + Write Reg

= 2 + 1 + 2 + 2 + 1 = 8 ns–add: IF + Read Reg + ALU + Write Reg

= 2 + 1 + 2 + 1 = 6 ns

• Pipelined Execution:–Max(IF,Read Reg,ALU,Memory,Write Reg) = 2

ns

Page 25: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

28

לשלבים חלוקה

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Instruction

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

ReaddataAddress

Datamemory

1

ALUresult

Mux

ALUZero

IF: Instruction fetch ID: Instruction decode/register file read

EX: Execute/address calculation

MEM: Memory access WB: Write back

Page 26: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

29

הרגיסטרים הוספת

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Datamemory

Address

Page 27: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

30

In struc t ion

m e m o ry

Add re ss

4

32

0

A ddA d d

re s u lt

S h if t

le f t 2

Ins

tru

ct i

on

IF /I D E X / M E M M E M /W B

Mux

0

1

A d d

P C

0W ri teda ta

Mux

1

R eg iste rs

R e a dd a ta 1

R e a dd a ta 2

R e a dre g is te r 1

R e a dre g is te r 2

16S ig n

e xte nd

W ritere g is te r

W rited ata

R ea dda ta

1

A LUre s u lt

Mux

A LU

Z e ro

ID /E X

In s tr u c t io n fe tc h

lw

A

IF/ID

Page 28: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

31

In struc t ion

m e m o ry

A dd re ss

4

32

0

A ddA dd

res u lt

S h if t

le ft 2

Inst

ruc

tio

n

IF / I D E X / M E M

Mux

0

1

A dd

P C

0W r iteda ta

Mux

1

R eg iste rs

R e a dda ta 1

R e a dda ta 2

R e a dre g is te r 1

R e a dre g is te r 2

16S i gn

e xte nd

W r itere g is te r

W rited ata

R ea dda ta

1

A LUre s u lt

Mux

A LU

Ze ro

ID /E X M E M /W B

I n s t r u c t io n d e c o d e

lw

A d d re s s

D ata

m e m o ry

ID/EX

Page 29: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

32

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Execution

lw

Address

Datamemory

EX/MEM

Page 30: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

33

In s tru c t ion

m e m or y

A dd res s

4

32

0

A d dA dd

r e su lt

S h if t

le ft 2

Ins

tru

ct i

on

I F /ID E X/M E M

Mux

0

1

A d d

P C

0W rit ed a ta

Mux

1

R e g is te rs

R ea dd ata 1

R ea dd ata 2

R e adre g is ter 1

R e adre g is ter 2

16S ig n

e xte nd

W ritere g is ter

W riteda ta

R e add at a

D a ta

m em o ry

1

A L Ures u lt

Mux

A L U

Z er o

ID /E X M E M /W B

M e m o r y

lw

A d dre ss

MEM/WB

Page 31: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

34

In stru ct ion

m em ory

A dd res s

4

32

0

Ad dA dd

r esu lt

S h ift

l e ft 2

Ins

tru

ct io

n

I F /ID E X /M E M

Mux

0

1

A d d

P C

0W rit ed a ta

Mux

1

R eg iste rs

R e add ata 1

R e add ata 2

R e a dre g is ter 1

R e a dre g is ter 2

16S ig n

e xte nd

W rited a ta

R e a ddat aD a ta

m e m o ry

1

A LUre s u lt

Mux

A LU

Ze ro

ID /EX M E M /W B

W r it e b a c k

l w

W ritere g is te r

A d dre s s

Page 32: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

35

A correction !!! תיקון

Keep the right Rd all the way!

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0

Address

Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

Datamemory

1

ALUresult

Mux

ALUZero

ID/EX

Page 33: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

36

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Address

Datamemory

So here is the updated CPU;

Page 34: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

38

Control

PC

Instructionmemory

Address

Inst

ruct

ion

Instruction[20– 16]

MemtoReg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0Registers

Writeregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1Write

data

Read

data Mux

1

ALUcontrol

RegWrite

MemRead

Instruction[15– 11]

6

IF/ID ID/EX EX/MEM MEM/WB

MemWrite

Address

Datamemory

PCSrc

Zero

AddAdd

result

Shiftleft 2

ALUresult

ALU

Zero

Add

0

1

Mux

0

1

Mux

Page 35: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

39

הבקרה קוויExecution/Address Calculation

stage control linesMemory access stage

control lines

Write-back stage control

lines

InstructionReg Dst

ALU Op1

ALU Op0

ALU Src Branch

Mem Read

Mem Write

Reg write

Mem to Reg

R-format 1 1 0 0 0 0 0 1 0lw 0 0 0 1 0 1 0 1 1sw X 0 0 1 0 0 1 0 Xbeq X 0 1 0 1 0 0 0 X

Control

EX

M

WB

M

WB

WB

IF/ID ID/EX EX/MEM MEM/WB

Instruction

Page 36: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

40

PC

Instructionmemory

Inst

ruct

ion

Add

Instruction[20– 16]

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

Writedata

Readdata

Mux

1

ALUcontrol

Shiftleft 2

RegW

rite

MemRead

Control

ALU

Instruction[15– 11]

6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

Mux

0

1

Mem

Writ

e

AddressData

memory

Address

Datapath with Control

Page 37: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

41

דוגמאA demonstration of a sequence of instructions:

Lw $10,20($1)

Sub $11,$2,$3

And $12,$4,$5

Or $13,$6,$7

Add $14,$8,$9

Page 38: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

42

Instructionmemory

Instruction[20– 16]

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

Instruction[15– 0]

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

Re

gWrit

e

MemRead

Control

ALU

Instruction[15– 11]

EX

M

WB

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: before<1> EX: before<2> MEM: before<3> WB: before<4>

MEM/WB

IF: lw $10, 20($1)

000

00

0000

000

00

000

0

00

00

0

0

0

Mux

0

1

Add

PC

0

Datamemory

Address

Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

Mux

0

1

Add Addresult

Writeregister

Writedata

Mux

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

Re

gWrit

e

ALU

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>

MEM/WB

IF: sub $11, $2, $3

010

11

0001

000

00

000

0

00

00

0

0

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

lwControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

X

10

20

X

1

Instruction[20– 16]

Instruction[15– 0] Sign

extend

Instruction[15– 11]

20

$X

$1

10

X

Me

mW

rite

MemReadM

em

Writ

e

Datamemory

Address

Address

Address

Clock 2

Clock 1

Page 39: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

43

Instructionmemory

Address

Instruction[20– 16]

Mem

toR

eg

Branch

ALUSrc

4

Instruction[15– 0]

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

Re

gWrit

e

MemRead

Control

ALU

Instruction[15– 11]

EX

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>

MEM/WB

IF: and $12, $4, $5

000

10

1100

010

11

000

1

00

00

0

0

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

Writeregister

Writedata 1

ALUresult

ALUcontrol

Shiftleft 2

Re

gWrit

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: and $12, $2, $3 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>

MEM/WB

IF: or $13, $6, $7

000

10

1100

000

10

101

0

11

10

0

0

0

Mux

0

1

Add

PC

0Writedata

Mux

1

andControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

12

X

X

5

4

Instruction[20– 16]

Instruction[15– 0]

Instruction[15– 11]

X

$5

$4

X

12

Me

mW

rite

MemReadM

em

Writ

e

sub

11

X

X

3

2

X

$3

$2

X

11

$1

20

10

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

$3

$2

11

Mux

Mux

ALUAddress Read

dataData

memory

10

WB

Zero

Zero

Signextend

Signextend

Datamemory

Address

Clock 3

Clock 4

ID: and $12, $4, $5

Page 40: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

44

Instructionmemory

Address

Instruction[20– 16]

Branch

ALUSrc

4

Instruction[15– 0]

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

Re

gWrit

e

MemRead

Control

ALU

Instruction[15– 11]

EX

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: or $13, $6, $7 EX: and $12, . . . MEM: sub $11, . . . WB: lw $10, . . .

MEM/WB

IF: add $14, $8, $9

000

10

1100

000

10

101

0

10

00

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

1

ALUresult

ALUcontrol

Shiftleft 2

Re

gWrit

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: add $14, $8, $9 EX: or $13, . . . MEM: and $12, . . . WB: sub $11, . . .

MEM/WB

IF: after<1>

000

10

1100

000

10

101

0

10

00

0

1

0

Mux

0

1

Add

PC

0Writedata

Mux

1

addControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

14

X

X

9

8

Instruction[20– 16]

Instruction[15– 0]

Instruction[15– 11]

X

$9

$8

X

14

Me

mW

rite

MemReadM

em

Writ

e

or

13

X

X

7

6

X

$7

$6

X

13

$4

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

$7

$6

13

Mux

Mux

ALUReaddata

12

WB

11 10

10$5

12

WB

Mem

toR

eg

1

1

11

11

Writeregister

Writedata

Zero

Zero

Datamemory

Address

Datamemory

Address

Signextend

Signextend

Clock 5

Clock 6

Page 41: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

45

Instructionmemory

Address

Instruction[20– 16]

Branch

ALUSrc

4

Instruction[15– 0]

0

1

Add Addresult

RegistersWriteregister

Writedata

ALUresult

Shiftleft 2

Re

gWrit

e

MemRead

Control

ALU

Instruction[15– 11]

Signextend

EX

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: after<1> EX: add $14, . . . MEM: or $13, . . . WB: and $12, . . .

MEM/WB

IF: after<2>

000

00

0000

000

10

101

0

10

00

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

Re

gWrit

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: after<2> EX: after<1> MEM: add $14, . . . WB: or $13, . . .

MEM/WB

IF: after<3>

000

00

0000

000

00

000

0

10

00

0

1

0

Mux

0

1

Add

PC

0Writedata

Mux

1

Control

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Instruction[20– 16]

Instruction[15– 0] Sign

extend

Instruction[15– 11]

Me

mW

rite

MemReadM

em

Writ

e

$8

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

Mux

Mux

ALUReaddata

14

WB

13 12

12$9

14

WB

Mem

toR

eg

1

0

13

13

Writeregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2 Zero

Datamemory

Address

Datamemory

Address

Clock 7

Clock 8

Page 42: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

46

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

1

ALUresult

Zero

ALUcontrol

Shiftleft 2R

egW

r ite

M

WB

Inst

r uct

ion

IF/ID EX/MEMID/EX

ID: after<3> EX: after<2> MEM: after<1> WB: add $14, . . .

MEM/ WB

IF: after<4>

000

00

0000

000

00

000

0

00

00

0

1

0

Mux

0

1

Add

PC

0Writedata

Mux

1

Control

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Instruction[20– 16]

Instruction[15– 0] Sign

extend

Instru ction[15– 11]

MemRead

Mem

Wr i

te

Mux

Mux

ALUReaddata

WB

14

14

Writeregister

Writedata

Datamemory

Address

Clock 9

Page 43: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

48

Problems for Computers• Limits to pipelining: Hazards prevent next

instruction from executing during its designated clock cycle– Structural hazards: HW cannot support this combination

of instructions (single person to fold and put clothes away)

– Control hazards: Pipelining of branches & other instructions stall the pipeline until the hazard “bubbles” in the pipeline

– Data hazards: Instruction depends on result of prior instruction still in the pipeline

Page 44: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

49

An example for data hazards:

sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Page 45: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

50

An example for data hazards:

sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

An example for data hazards:Register $2 is updated only at the WB phase, i.e., the 5th clock cycle (actually at the end of the 5th clock cycle). However, we try to use it at the 3rd clock cycle when we read $2 at the decode phase of the and instruction

Page 46: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

51

Graphic representation of data hazards:

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecutionorder(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2:

DM Reg

Reg

Reg

Reg

DM

Page 47: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

52

sub $2, $1, $3

nop

nop

nop

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Solving data hazards by adding nops

Page 48: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

53

Solving data hazards by adding nops

Programexecutionorder(in instructions)

and $12, $2, $

or $13, $6, $ 2

add $14, $2, $

sw $15, 100( $2

5

2

)g

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Re

Reg

Reg

Reg

DM

IM Reg DMReg

IM Reg DM Reg

IM Reg DMReg

IM Reg DMReg

sub $2, $1, $ 3

nop

nop

nop

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

Value of register $2:

CC 10 CC 11 CC 12

– 20 – 20 – 20

Page 49: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

54

The internal structure of the Register File

32

32

3232

32

32

32

32

32

32

32

Read data 2

write data

Read data 1

5

5

5

Rd reg 2 (= Rt)

Rd reg 1 (= Rs)

RegWrite

Wr reg (= Rd) 32

32

E

שונים רגיסטרים שני של ערכים בוזמנית היציאות משתי קוראים) הבאה ) השעון בעליית האחרים הרגיסטרים לאחד כותבים

Page 50: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

55

We could earn 1 ck cycle if GPR is “transparent”

Programexecutionorder(in instructions)

and $12, $2, $

or $13, $6, $ 2

add $14, $2, $

sw $15, 100( $2

5

2

) g

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Re

Reg

Reg

Reg

DM

IM Reg DMReg

IM Reg DM Reg

IM Reg DMReg

sub $2, $1, $ 3

nop

nop

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

Value of register $2:

CC 10 CC 11 CC 12

– 20 – 20 – 20

We could earn 1 ck cycle if GPR is “transparent”, i.e, we could see the write data to the GPR at the GPR outputs (if the write address equals the read address), i.e., during Ck #5.

Page 51: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

56

The internal structure of the modified Register File. We ‘bypass” the input data (the write data) to the read data1 output whenever Rs=Rd/Rt (i.e., whenever read reg1=write reg but not zero). We “bypass” the input data (the write data) to the read data2 output whenever Rt=Rd/Rt (i.e., whenever read reg2=write reg, but not zero).

32

32

3232

32

32

32

32

32

32

32

Read data 2

write data

Read data 1

5

5

5

Rd reg 2 (= Rt)

Rd reg 1 (= Rs)

RegWrite

Wr reg (= Rd) 32

32

E

שונים רגיסטרים שני של ערכים בוזמנית היציאות משתי קוראים) הבאה ) השעון בעליית האחרים הרגיסטרים לאחד כותבים

032

032

32

3232

write data32

write data

Wr reg 5

5Wr reg

Page 52: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

57

sub $2, $1, $3

nop

nop

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

After doing that change we only need 2 nops

After the change the WB of an early instruction can happen at the same time with the read reg (decode) phase of a newer instruction (3 with two other instructions in between). In case we have a data hazard, we need to add only two nop instructions.

Unfortunately, this happens too often. We need a better solution!

Page 53: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

58

Graphic representation of data hazards:

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecutionorder(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2:

DM Reg

Reg

Reg

Reg

DM

Page 54: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

59

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecutionorder(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2:

DM Reg

Reg

Reg

Reg

DM

Page 55: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

60

Forwarding – הערכים גניבת

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecution order(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2 :

DM Reg

Reg

Reg

Reg

X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :

DM

Page 56: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

61

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecution order(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2 :

DM Reg

Reg

Reg

Reg

X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :

DM

Page 57: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

62

Forwarding (done at the execute phase)

PCInstruction

memory

Registers

Mux

Mux

Control

ALU

EX

M

WB

M

WB

WB

ID/EX

EX/MEM

MEM/WB

Datamemory

Mux

Forwardingunit

IF/ID

Inst

ruct

ion

Mux

RdEX/MEM.RegisterRd

MEM/WB.RegisterRd

Rt

Rt

Rs

IF/ID.RegisterRd

IF/ID.RegisterRt

IF/ID.RegisterRt

IF/ID.RegisterRs

If ID/EX.Rs=EX/MEM.Rd, i.e., the Rd of the previous instruction equals the Rs of the current instruction (which is in the “decode” phase), then we use the “ALUout” of the previous instruction

instead of the output of the GPR. If ID/EX.Rs=MEM/WB.Rd, i.e., the Rd of the previous instruction equals the Rs of the current instruction (which is in the “decode” phase), then we use the “ALUout” of the previous instruction instead of the output of the GPR. [ similarly, compare also ID/EX.Rt to MEM/WB.Rd ]

Similarly, compare also ID/EX.Rt to EX/MEM.Rd and to MEM/WB.Rd

Page 58: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

65

An example for forwarding דוגמא

Sub $2, $1, $3

And $4, $2, $5 needs forwarding from the previous instruction

Or $4, $4, $2 needs forwarding from two instructions back

Add $9, $4, $2 needs forwarding from 3 instructions back (thru the “transparent” GPR)

Here we discuss the $2 register only

(The first two cases are handled in the execute phase, the last one, in the decode phase).

Page 59: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

66

An example for forwarding דוגמא

Sub $2, $1, $3

And $4, $2, $5

Or $4, $4, $2 needs forwarding from the previous instruction

Add $9, $4, $2 needs forwarding from the previous instruction

Here we discuss the $4 register and there are two case (the 2nd one in purple)

Page 60: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

67

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

and $4, $2, $5 sub $2, $1, $3

ID/EX

before<1>

EX/MEM

before<2>

MEM/WB

or $4, $4, $2

Clock 3

2

5

10 10

$2

$5

5

2

4

$1

$3

3

1

2

Control

ALU

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

or $4, $4, $2 and $4, $2, $5

ID/EX

sub $2, . . .

EX/MEM

before<1>

MEM/WB

add $9, $4, $2

Clock 4

4

6

10 10

$4

$2

6

2

4

$2

$5

5

2

4

Control

ALU

10

2

WB

M

WB

Since Rs=2 and Rd of previous inst. was 2, we use ALUout instead of Rs

Sub $2, $1, $3

And $4, $2, $5

Or $4, $4, $2

Add $9, $4, $2

Page 61: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

68

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

add $9, $4, $2 or $4, $4, $2

ID/EX

and $4, . . .

EX/MEM

sub $2, . . .

MEM/WB

after<1>

Clock 5

4

2

10 10

$4

$2

2

4

9

$4

$2

4

2

24

Control

ALU

10

WB

2

1

PCInstruction

memory

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

after<1>after<2> add $9, $4, $2 or $4, . . .

EX/MEM

and $4, . . .

MEM/WB

Clock 6

10

$4

$2

2

4

9

ALU

10

4

4

WB

4

1

Registers

Inst

ruct

ion

IF/ID

ID/EX

4

Control

In blue we see forwarding from two instructions back (Mem->Exec.), in red, from previous instruction (WB->Exec.), in purple, from 3 instructions back (WB->Decode).

Page 62: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

69

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

add $9, $4, $2 or $4, $4, $2

ID/EX

and $4, . . .

EX/MEM

sub $2, . . .

MEM/WB

after<1>

Clock 5

4

2

10 10

$4

$2

2

4

9

$4

$2

4

2

24

Control

ALU

10

WB

2

1

PCInstruction

memory

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

after<1>after<2> add $9, $4, $2 or $4, . . .

EX/MEM

and $4, . . .

MEM/WB

Clock 6

10

$4

$2

2

4

9

ALU

10

4

4

WB

4

1

Registers

Inst

ruct

ion

IF/ID

ID/EX

4

Control

Page 63: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

70

The solution does not work for lw - הפתרון תמיד לאעובד

(in lw we do not have the data in the pipe!, it comes from the data memory!)

Reg

IM

Reg

Reg

IM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

DM Reg

Reg

Reg

DM

If the previous instruction was lw to a register and we try to use the register in the current instruction, we have a problem, since we cannot go back in time!

One solution is to avoid such cases by adding a nop (by the Assembler) whenever Rt of the lw is equal to Rs or Rt of the following instruction.

Page 64: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

71

Another h/w solution is to add Bubbles,i.e., add nop by hardware

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

Reg

IM

Reg

Reg

IM DM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)

IM Reg DM RegIM

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9 CC 10

DM Reg

RegReg

Reg

bubble

“nop”

We need to hold IF/ID for one ck cycle and insert a “nop: into ID/EX. This is equal to adding a nop instruction by the Assembler.

Page 65: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

72

Hazard detection unit

PCInstruction

memory

Registers

Mux

Mux

Mux

Control

ALU

EX

M

WB

M

WB

WB

ID/EX

EX/MEM

MEM/WB

Datamemory

Mux

Hazarddetection

unit

Forwardingunit

0

Mux

IF/ID

Inst

ruct

ion

ID/EX.MemRead

IF/I

DW

rite

PC

Wri

te

ID/EX.RegisterRt

IF/ID.RegisterRd

IF/ID.RegisterRt

IF/ID.RegisterRt

IF/ID.RegisterRs

RtRs

Rd

Rt EX/MEM.RegisterRd

MEM/WB.RegisterRd

We need to hold the IF/ID and PC for one ck cycle and insert a “nop: into ID/EX. This is equal to adding a nop instruction by the Assembler.

If (ID/EX.MemRd)&& ( (ID/EX.Rt= =IF/ID.Rs) || (ID/EX.Rt= =IF/ID.Rt) ) we must “stall” the pipeline! This means that prev. inst was lw and it was to the current Rs or Rt. (of course if one of them is not used, don’t stall)

Holding means”freeze” the IF/ID and the PC for 1 clock cycle

Hold the IF/ID by not giving a IF/IDWrire signal and do not increment the PC (which already points at the nex instruction) by not giving the PCWrite signal. Inserting a nop is by clearing all control signals.

Rt from prev. inst.

Rs, Rt of current inst.

identifies lw

Page 66: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

73

An example for lw hazard detection דוגמא

lw $2, 20($1)

And $4, $2, $5

Or $4, $4, $2

Add $9, $4, $2

Page 67: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

74

Hazarddetection

unit

0

MuxIF

/ID

Wri

te

PC

Wri

te

ID/EX.RegisterRt

lw $2, 20($1)

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

and $4, $2, $5

ID/EX

before<1>

EX/MEM

before<2>

MEM/WB

or $4, $4, $2

Clock 3

2

5

2

500 11

$2

$5

5

2

4

$1

$X

X

1

2

Control

ALU

M

WB

Hazarddetection

unit

0

MuxIF

/ID

Wri

te

PC

Wri

te

ID/EX.RegisterRt

ID/EX.MemRead

ID/EX.MemRead

M

WB

$1

$X

X

1

2

before<3>

PCInstruction

memory

Registers

Mux

Mux

Mux

EX WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

ID/EX

EX/MEM

MEM/WB

and $4, $2, $5 lw $2, 20($1) before<1> before<2>

Clock 2

1

1

X

X11

Control

ALU

M

WB

Page 68: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

75

Hazarddetection

unit

0

MuxIF

/IDW

rite

PC

Writ

e

ID/EX.RegisterRt

$2

$5

5

2

2

2

4

WB

Hazarddetection

unit

0

MuxIF

/IDW

rite

PC

Writ

e

ID/EX.RegisterRt

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

Datamemory

Mux

Inst

ruct

ion

IF/ID

and $4, $2, $5 bubble

ID/EX

lw $2, . . .

EX/MEM

before<1>

MEM/WB

Clock 4

2

2

5

510

11

00

$2

$5

5

2

4

Control

ALU

M

WB

bubble lw $2, . . .

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Forwardingunit

Inst

ruct

ion

IF/ID

and $4, $2, $5

ID/EX

EX/MEM

MEM/WB

add $9, $4, $2

Clock 5

2

210 10

11

$4

$2

2

4

4

4

2

4

$2

$5

5

2

4

Control

ALU

0

WB

ID/EX.MemRead

ID/EX.MemRead

or $4, $4, $2

or $4, $4, $2

The lw instruction is in the WB phase. $2 is “being written”. We can use $2 in the Execute phase of the and instruction, with the help of forwarding.

Page 69: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

76

Registers

Inst

ruct

ion

ID/EX

4

Control

PCInstruction

memory

PCInstruction

memory

Hazarddetection

unit

0

Mux

IF/I

DW

rite

PC

Wri

te

IF/I

DW

rite

PC

Wri

te

ID/EX.RegisterRt

bubble

Registers

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Inst

ruct

ion

IF/ID

add $9, $4, $2

ID/EX

and $4, . . .

EX/MEM

MEM/WB

Clock 6

4

4

2

210 10

$4

$2

2

4

49

$2

2

Control

ALU

10

WB0

add $9, $4, $2 or $4, . . . and $4, . . .after<2> after<1>

after<1>

Clock 7

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Forwardingunit

EX/MEM

MEM/WB

10 10

$4

$4

$2

2

4

4

9

ALU

10

WB

44

4

1

Hazarddetection

unit

0

Mux

ID/EX.RegisterRt

or $4, $4, $2

ID/EX.MemRead

ID/EX.MemRead

Mux

IF/ID

Page 70: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

77

PC

Instructionmemory

Inst

ruct

ion

Add

Instruction[20– 16]

Me

mto

Re

g

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

Writedata

Readdata

Mux

1

ALUcontrol

Shiftleft 2

Re

gWrit

e

MemRead

Control

ALU

Instruction[15– 11]

6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

Mux

0

1

Me

mW

rite

AddressData

memory

Address

Just to remind us how branch is handled we show again the Datapath with Control

Page 71: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

78

Branch Hazards

Reg

Reg

CC 1

Time (in clock cycles)

40 beq $1, $3, 7

Programexecutionorder(in instructions)

IM Reg

IM DM

IM DM

IM DM

DM

DM Reg

Reg Reg

Reg

Reg

RegIM

44 and $12, $2, $5

48 or $13, $6, $2

52 add $14, $2, $2

72 lw $4, 50($7)

CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9

Reg

Here we calc.Rs-Rt

Here we decide to branch (switching the address to the PC and issuing PCWrite Cond)

These 3 instructions should be “killed” before they do harm, I.e., change any register.

In cc5 we already use the new PC calculated by the branch. (PC=72)

Page 72: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

79

Control Hazard: Branching (1/7)

Where do we do the compare for the branch?

I$

beq

Instr 1

Instr 2

Instr 3

Instr 4A

LU I$ Reg D$ Reg

AL

U I$ Reg D$ Reg

AL

U I$ Reg D$ RegA

LUReg D$ Reg

AL

U I$ Reg D$ Reg

Instr.

Order

Time (clock cycles)

Page 73: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

80

Control Hazard: Branching (2/7)• We put branch decision-making hardware in

ALU stage– therefore two more instructions after the branch

will always be fetched, whether or not the branch is taken

• Desired functionality of a branch– if we do not take the branch, don’t waste any

time and continue executing normally– if we take the branch, don’t execute any

instructions after the branch, just go to the desired label

Page 74: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

81

Control Hazard: Branching (3/7)

• Initial Solution: Stall until decision is made– insert “no-op” instructions: those that

accomplish nothing, just take time– Drawback: branches take 3 clock cycles each

(assuming comparator is put in ALU stage)

Page 75: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

82

Control Hazard: Branching (4/7)• Optimization #1:

– move asynchronous comparator up to Stage 2– as soon as instruction is decoded (Opcode

identifies is as a branch), immediately make a decision and set the value of the PC (if necessary)

– Benefit: since branch is complete in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed

– Side Note: This means that branches are idle in Stages 3, 4 and 5.

Page 76: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

83

PC Instructionmemory

4

Registers

Mux

Mux

Mux

ALU

EX

M

WB

M

WB

WB

ID/EX

0

EX/MEM

MEM/WB

Datamemory

Mux

Hazarddetection

unit

Forwardingunit

IF.Flush

IF/ID

Signextend

Control

Mux

=

Shiftleft 2

Mux

Page 77: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

84

• Insert a single no-op (bubble)

Control Hazard: Branching (5/7)

add

beq

lw

AL

U I$ Reg D$ Reg

AL

U I$ Reg D$ RegA

LUReg D$ Reg I$

Instr.

Order

Time (clock cycles)

bubble

• Impact: 2 clock cycles per branch instruction slow

Page 78: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

85

Control Hazard: Branching (6/7)

• Optimization #2: Redefine branches– Old definition: if we take the branch, none of

the instructions after the branch get executed by accident

– New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot)

Page 79: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

86

Control Hazard: Branching (7/7)• Notes on Branch-Delay Slot

– Worst-Case Scenario: can always put a no-op in the branch-delay slot

– Better Case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting flow of the program

• re-ordering instructions is a common method of speeding up programs

• compiler must be very smart in order to find instructions to do this

• usually can find such an instruction at least 50% of the time

• Jumps also have a delay slot…

Page 80: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

87

Example: Nondelayed vs. Delayed Branch

add $1 ,$2,$3

sub $4, $5,$6

beq $1, $4, Exit

or $8, $9 ,$10

xor $10, $1,$11

Nondelayed Branch

add $1 ,$2,$3

sub $4, $5,$6

beq $1, $4, Exit

or $8, $9 ,$10

xor $10, $1,$11

Delayed Branch

Exit: Exit:

Page 81: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

88

Question (1/2)

Assume 1 instr/clock, delayed branch, 5 stage pipeline, forwarding, interlock on unresolved load hazards (after 103 loops, so pipeline full)Loop: lw $t0, 0($s1)

addu $t0, $t0, $s2sw $t0, 0($s1)addiu $s1, $s1, -4bne $s1, $zero, Loopnop

•How many pipeline stages (clock cycles) per loop iteration to execute this code?

12345678910

Page 82: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

89

Answer (1/2)• Assume 1 instr/clock, delayed branch, 5 stage

pipeline, forwarding, interlock on unresolved load hazards. 103 iterations, so pipeline full.

Loop: lw $t0, 0($s1)addu $t0, $t0, $s2sw $t0, 0($s1)addiu $s1, $s1, -4bne $s1, $zero, Loopnop

• How many pipeline stages (clock cycles) per loop iteration to execute this code?

1. 2. (data hazard so stall)

3.4.5.6.

(delayed branch so exec. nop)7.

1 2 3 4 5 6 7 8 9 10

Page 83: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

90

Question (2/2)

Assume 1 instr/clock, delayed branch, 5 stage pipeline, forwarding, interlock on unresolved load hazards (after 103 loops, so pipeline full). Rewrite this code to reduce pipeline stages (clock cycles) per loop to as few as possible. Loop: lw $t0, 0($s1)

addu $t0, $t0, $s2sw $t0, 0($s1)addiu $s1, $s1, -4bne $s1, $zero, Loopnop

•How many pipeline stages (clock cycles) per loop iteration to execute this code?

12345678910

Page 84: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

91

A (2/2) How long to execute?

• How many pipeline stages (clock cycles) per loop iteration to execute your revised code? (assume pipeline is full)

• Rewrite this code to reduce clock cycles per loop to as few as possible:

Loop: lw $t0, 0($s1)addiu $s1, $s1, -4 addu $t0, $t0, $s2bne $s1, $zero, Loopsw $t0, +4($s1)

(no hazard since extra cycle)

1.

3.

4.5.

2.

(modified sw to put past addiu)

1 2 3 4 5 6 7 8 9 10

Page 85: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

97

PCInstruction

memory

4

Registers

Signextend

Mux

Mux

Control

EX

M

WB

M

WB

WB

Mux

Hazarddetection

unit

Forwardingunit

Mux

IF.Flush

IF/ID

and $12, $2, $5 beq $1, $3, 7 sub $10, $4, $8

MEM/WB

EX/MEM

ID/EX

Clock 3

72 44

48 44

28

7

$1

$3

10

48

72

72

0

Mux

0

$4

$8

ALUData

memory

bubble (nop)lw $4, 50($7)

Clock 4

Mux

Shiftleft 2

before<1>

beq $1, $3, 7 sub $10, . . . before<1>

before<2>

=

PC Instructionmemory

4

Registers

Signextend

Mux

Mux

Control

EX

M

WB

M

WB

WB

Mux

Hazarddetection

unit

Forwardingunit

IF.Flush

IF/ID

MEM/WB

EX/MEM

ID/EX

76 72

76 72

$1

$3

10

76

ALUData

memory

Mux

Shiftleft 2

=

sub $10, $4, $8

beq $1, $3, 7

and $12, $2, $5

lw $4, 50($7)

Page 86: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

98

Summary of hazardsData hazards:

* Forward from previous instruction

* Forward from two instructions ago

* (Forward thru “transparent”GPR = from 3 instructions ago)

* If we cannot forward, (after lw) we stall the pipe by inserting a nop and freezing IF/ID and PC for 1 ck cycle

Control hazards:

* If branch is successful we flush the instruction following the branch (which is at the IF/ID register. We

just clear the register)

Notes:

In the real MIPS CPU, no flush was employed. This give the compiler the opportunity to put useful instructions following the branch. This explains why the simulator always performs the instruction following the branch.this is called a delayed branch.

Also, in the real MIPS CPU no lw stall was used. Again this give some freedom to the compiler to

choose whether to put a nop following lw or some useful instruction. This is called a delayed load.

Page 87: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

99

Other slides

Page 88: 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

100

Handling exceptions