1 pipeline datapath with some slides from: john lazzaro and dan garcia

1

Pipeline Datapath

With some slides from:

John Lazzaro andDan Garcia

2

Flip Flops פליפ-לופ

D Q A flip-flop “samples” right before the edge, and then “holds” value.Sampling

circuitHolds value

Which circuit contributes t_setup delay?

Which circuit contributes t_clk-to-Q delay?

3

D Q

clk-to-Q ?

CLK == 0

Sense D, but Qoutputs old value.

CLK 0->1

Capture D, passvalue to Q

CLK

setup ?

hold ?

clk-to-Q

setup

hold

4

Performance Equation

Seconds

Program

Instructions

Program=

Seconds

Cycle Instruction

Cycles

Goal is to

optimize execution time,

notindividu

alequationterms.

The CPI of the

program.Reflects

the program’

s instructio

n mix.

Machinesare

optimizedwith

respect to

programworkload

s.

Clockperiod.

Optimizejointlywith

machineCPI.

5

השעוןHertz=1/sec

של במהירות פנטיום מחשב

מבצע שהוא שעון 2 *10^8פירושו מחזורי.בשניה

לוקח שעון מחזור כל

200MHZ

5*10^-9=5nanosecond

בימינו פקודה לוקחת ?כמה

6

datapath מבנה ה-P

C

inst

ruct

ion

me

mor

y

+4

rtrs

rd

regi

ste

rs

ALU

Da

tam

em

ory

imm

1. InstructionFetch

2. Decode/ Register

Read

3. Execute 4. Memory5. Write

Back

7

Gotta Do Laundry° Ann, Brian, Cathy, Dave

each have one load of clothes to wash, dry, fold, and put away

A B C D

° Dryer takes 30 minutes

° “Folder” takes 30 minutes

° “Stasher” takes 30 minutes to put clothes into drawers

° Washer takes 30 minutes

8

Sequential Laundry

• Sequential laundry takes 8 hours for 4 loads

Task

Order

B

C

D

A

30Time

3030 3030 30 3030 3030 3030 3030 3030

6 PM 7 8 9 10 11 12 1 2 AM

9

Pipelined Laundry

• Pipelined laundry takes 3.5 hours for 4 loads!

Task

Order

B

C

D

A

12 2 AM6 PM 7 8 9 10 11 1

Time303030 3030 3030

10

General Definitions

• Latency: time to completely execute a certain task– for example, time to read a sector from disk is

disk access time or disk latency

• Throughput: amount of work that can be done over a period of time

11

Pipelining Lessons (1/2)

• Pipelining doesn’t help latency of single task, it helps throughput of entire workload

• Multiple tasks operating simultaneously using different resources

• Potential speedup = Number pipe stages

• Time to “fill” pipeline and time to “drain” it reduces speedup:2.3X v. 4X in this example

6 PM 7 8 9

Time

B

C

D

A

303030 3030 3030Task

Order

12

Pipelining Lessons (2/2)

• Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline?

• Pipeline rate limited by slowest pipeline stage

• Unbalanced lengths of pipe stages also reduces speedup

6 PM 7 8 9

Time

B

C

D

A

303030 3030 3030Task

Order

13

Inspiration: Automobile assembly lineAssembly line moves on a steady clock.

Each station does the same task on each car.Car body shell

Car chassis

Mergestation

Boltingstation

The clock

14

Inspiration: Automobile assembly lineSimpler station tasks → more cars per hour.Simple tasks take less time, clock is faster.

15

Inspiration: Automobile assembly lineLine speed limited by slowest task.

Most efficient if all tasks take same time to do

16

Inspiration: Automobile assembly lineSimpler tasks, complex car → long line!

These lines go 24 x 7,

and rarely shut down. Why?

17

Lessons from car assembly lines

Faster line movement yields more cars per hour off the line.

Faster line movement requires more stages, each doing simpler tasks.

To maximize efficiency, all stages should take same amount of time(if not, workers in fast stages are idle)

“Filling”, “flushing”, and “stalling” assembly line are all bad news.

19

datapath מבנה ה-P

C

inst

ruct

ion

me

mor

y

+4

rtrs

rd

regi

ste

rs

ALU

Da

tam

em

ory

imm

1. InstructionFetch

2. Decode/ Register

Read

3. Execute 4. Memory5. Write

Back

21

Key Analogy: The instruction is the car

IR IR IR

Instruction Fetch

IR

Pipeline Stage #1

Stage #2

Controlshardware

in stage 2

Stage #3

Controlshardware

in stage 3

Stage #4

Controlshardware

in stage 4

Stage #5

Controlshardware

in stage 5

“Data-stationary control”

22

Representation #1: Timeline

IR IR

IF (Fetch) ID (Decode) EX (ALU)

IR IR

MEM WB

ADD R4,R3,R2

OR R7,R6,R5

SUB R1,R9,R8XOR R3,R2,R1

AND R6,R5,R4I1:I2:I3:I4:I5:

Sample Program

IF ID

IF

EX

ID

IF

MEM

EX

ID

IF

WB

MEM

EX

IFID

WB

MEM

IDEX

IF

WB

EXMEM

IDMEMWB

EXPipeline is “full”

Good for visualizing pipeline fills.

I1:I2:I3:I4:I5:

t1 t2 t3 t4 t5 t6 t7 t8Time:Inst

I6:

23

Pipeline is “full”

Good for visualizing pipeline stalls.

Representation #2: Resource Usage

IR IR IR IR

ADD R4,R3,R2

OR R7,R6,R5

SUB R1,R9,R8XOR R3,R2,R1

AND R6,R5,R4I1:I2:I3:I4:I5:

Sample ProgramI1 I2

I1

I3

I2

I1

I4

I3

I2

I1

I5

I4

I3

I1I2

IF:ID:EX:MEM:WB:

t1 t2 t3 t4 t5 t6 t7 t8Time:Stage

IF (Fetch) ID (Decode) EX (ALU) MEM WB

I5

I4

I2I3

I6

I5

I3I4

I6

I7

I4I5

I6

I7

I8

24

Review: Datapath for MIPS

Stage 1 Stage 2 Stage 3Stage 4 Stage 5

• Use datapath figure to represent pipelineIFtch Dcd Exec Mem WB

AL

U I$ Reg D$ Reg

PC

inst

ruct

ion

me

mor

y+4

rtrs

rd

regi

ste

rs

ALU

Da

tam

em

ory

imm

1. InstructionFetch

2. Decode/ Register Read 3. Execute 4. Memory

5. WriteBack

25

Graphical Pipeline Representation

Instr.

Order

Load

Add

Store

Sub

Or

I$

Time (clock cycles)

I$

AL

U

Reg

Reg

I$

D$

AL

U

AL

U

Reg

D$

Reg

I$

D$

RegA

LU

Reg Reg

Reg

D$

Reg

D$

AL

U

(In Reg, right half highlight read, left half write)

Reg

I$

26

Example• Suppose 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write

• Nonpipelined Execution:–lw : IF + Read Reg + ALU + Memory + Write Reg

= 2 + 1 + 2 + 2 + 1 = 8 ns–add: IF + Read Reg + ALU + Write Reg

= 2 + 1 + 2 + 1 = 6 ns

• Pipelined Execution:–Max(IF,Read Reg,ALU,Memory,Write Reg) = 2

ns

28

לשלבים חלוקה

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Instruction

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

ReaddataAddress

Datamemory

1

ALUresult

Mux

ALUZero

IF: Instruction fetch ID: Instruction decode/register file read

EX: Execute/address calculation

MEM: Memory access WB: Write back

29

הרגיסטרים הוספת

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Datamemory

Address

30

In struc t ion

m e m o ry

Add re ss

4

32

0

A ddA d d

re s u lt

S h if t

le f t 2

Ins

tru

ct i

on

IF /I D E X / M E M M E M /W B

Mux

0

1

A d d

P C

0W ri teda ta

Mux

1

R eg iste rs

R e a dd a ta 1

R e a dd a ta 2

R e a dre g is te r 1


16S ig n

e xte nd

W ritere g is te r

W rited ata

R ea dda ta

1

A LUre s u lt

Mux

A LU

Z e ro

ID /E X

In s tr u c t io n fe tc h

lw

A

IF/ID

31

In struc t ion

m e m o ry

A dd re ss

4

32

0

A ddA dd

res u lt

S h if t

le ft 2

Inst

ruc

tio

n

IF / I D E X / M E M

Mux

0

1

A dd

P C

0W r iteda ta

Mux

1

R eg iste rs

R e a dda ta 1

R e a dda ta 2



16S i gn

e xte nd

W r itere g is te r

W rited ata

R ea dda ta

1

A LUre s u lt

Mux

A LU

Ze ro

ID /E X M E M /W B

I n s t r u c t io n d e c o d e

lw

A d d re s s

D ata

m e m o ry

ID/EX

32

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX MEM/WB

Execution

lw

Address

Datamemory

EX/MEM

33

In s tru c t ion

m e m or y

A dd res s

4

32

0

A d dA dd

r e su lt

S h if t

le ft 2

Ins

tru

ct i

on

I F /ID E X/M E M

Mux

0

1

A d d

P C

0W rit ed a ta

Mux

1

R e g is te rs

R ea dd ata 1

R ea dd ata 2

R e adre g is ter 1

R e adre g is ter 2

16S ig n

e xte nd

W ritere g is ter

W riteda ta

R e add at a

D a ta

m em o ry

1

A L Ures u lt

Mux

A L U

Z er o

ID /E X M E M /W B

M e m o r y

lw

A d dre ss

MEM/WB

34

In stru ct ion

m em ory

A dd res s

4

32

0

Ad dA dd

r esu lt

S h ift

l e ft 2

Ins

tru

ct io

n

I F /ID E X /M E M

Mux

0

1

A d d

P C

0W rit ed a ta

Mux

1

R eg iste rs

R e add ata 1

R e add ata 2

R e a dre g is ter 1

R e a dre g is ter 2

16S ig n

e xte nd

W rited a ta

R e a ddat aD a ta

m e m o ry

1

A LUre s u lt

Mux

A LU

Ze ro

ID /EX M E M /W B

W r it e b a c k

l w

W ritere g is te r

A d dre s s

35

A correction !!! תיקון

Keep the right Rd all the way!

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0

Address

Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

Datamemory

1

ALUresult

Mux

ALUZero

ID/EX

36

Instructionmemory

Address

4

32

0

Add Addresult

Shiftleft 2

Inst

ruct

ion

IF/ID EX/MEM MEM/WB

Mux

0

1

Add

PC

0Writedata

Mux

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

1

ALUresult

Mux

ALUZero

ID/EX

Address

Datamemory

So here is the updated CPU;

38

Control

PC

Instructionmemory

Address

Inst

ruct

ion

Instruction[20– 16]

MemtoReg

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0Registers

Writeregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1Write

data

Read

data Mux

1

ALUcontrol

RegWrite

MemRead


6

IF/ID ID/EX EX/MEM MEM/WB

MemWrite

Address

Datamemory

PCSrc

Zero

AddAdd

result

Shiftleft 2

ALUresult

ALU

Zero

Add

0

1

Mux

0

1

Mux

39

הבקרה קוויExecution/Address Calculation

stage control linesMemory access stage

control lines

Write-back stage control

lines

InstructionReg Dst

ALU Op1

ALU Op0

ALU Src Branch

Mem Read

Mem Write

Reg write

Mem to Reg

R-format 1 1 0 0 0 0 0 1 0lw 0 0 0 1 0 1 0 1 1sw X 0 0 1 0 0 1 0 Xbeq X 0 1 0 1 0 0 0 X

Control

EX

M

WB

M

WB

WB

IF/ID ID/EX EX/MEM MEM/WB

Instruction

40

PC

Instructionmemory

Inst

ruct

ion

Add


Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4


0

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

Writedata

Readdata

Mux

1

ALUcontrol

Shiftleft 2

RegW

rite

MemRead

Control

ALU


6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

Mux

0

1

Mem

Writ

e

AddressData

memory

Address

Datapath with Control

41

דוגמאA demonstration of a sequence of instructions:

Lw $10,20($1)

Sub $11,$2,$3

And $12,$4,$5

Or $13,$6,$7

Add $14,$8,$9

42

Instructionmemory


Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4


0

Mux

0

1

Add Addresult


Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

Re

gWrit

e

MemRead

Control

ALU


EX

M

WB

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: before<1> EX: before<2> MEM: before<3> WB: before<4>

MEM/WB

IF: lw $10, 20($1)

000

00

0000

000

00

000

0

00

00

0

0

0

Mux

0

1

Add

PC

0

Datamemory

Address

Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

Mux

0

1

Add Addresult

Writeregister

Writedata

Mux

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

Re

gWrit

e

ALU

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>

MEM/WB

IF: sub $11, $2, $3

010

11

0001

000

00

000

0

00

00

0

0

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

lwControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

X

10

20

X

1


Instruction[15– 0] Sign

extend


20

$X

$1

10

X

Me

mW

rite

MemReadM

em

Writ

e

Datamemory

Address

Address

Address

Clock 2

Clock 1

43

Instructionmemory

Address


Mem

toR

eg

Branch

ALUSrc

4


0

1

Add Addresult


Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

Re

gWrit

e

MemRead

Control

ALU


EX

M

WB

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>

MEM/WB

IF: and $12, $4, $5

000

10

1100

010

11

000

1

00

00

0

0

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

Writeregister

Writedata 1

ALUresult

ALUcontrol

Shiftleft 2

Re

gWrit

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: and $12, $2, $3 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>

MEM/WB

IF: or $13, $6, $7

000

10

1100

000

10

101

0

11

10

0

0

0

Mux

0

1

Add

PC

0Writedata

Mux

1

andControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

12

X

X

5

4




X

$5

$4

X

12

Me

mW

rite

MemReadM

em

Writ

e

sub

11

X

X

3

2

X

$3

$2

X

11

$1

20

10

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

$3

$2

11

Mux

Mux

ALUAddress Read

dataData

memory

10

WB

Zero

Zero

Signextend

Signextend

Datamemory

Address

Clock 3

Clock 4

ID: and $12, $4, $5

44

Instructionmemory

Address


Branch

ALUSrc

4


0

1

Add Addresult


Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

Re

gWrit

e

MemRead

Control

ALU


EX

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: or $13, $6, $7 EX: and $12, . . . MEM: sub $11, . . . WB: lw $10, . . .

MEM/WB

IF: add $14, $8, $9

000

10

1100

000

10

101

0

10

00

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

1

ALUresult

ALUcontrol

Shiftleft 2

Re

gWrit

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: add $14, $8, $9 EX: or $13, . . . MEM: and $12, . . . WB: sub $11, . . .

MEM/WB

IF: after<1>

000

10

1100

000

10

101

0

10

00

0

1

0

Mux

0

1

Add

PC

0Writedata

Mux

1

addControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

14

X

X

9

8




X

$9

$8

X

14

Me

mW

rite

MemReadM

em

Writ

e

or

13

X

X

7

6

X

$7

$6

X

13

$4

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

$7

$6

13

Mux

Mux

ALUReaddata

12

WB

11 10

10$5

12

WB

Mem

toR

eg

1

1

11

11

Writeregister

Writedata

Zero

Zero

Datamemory

Address

Datamemory

Address

Signextend

Signextend

Clock 5

Clock 6

45

Instructionmemory

Address


Branch

ALUSrc

4


0

1

Add Addresult


Writedata

ALUresult

Shiftleft 2

Re

gWrit

e

MemRead

Control

ALU


Signextend

EX

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: after<1> EX: add $14, . . . MEM: or $13, . . . WB: and $12, . . .

MEM/WB

IF: after<2>

000

00

0000

000

10

101

0

10

00

0

Mux

0

1

Add

PC

0Writedata

Readdata

Mux

1

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

1

ALUresult

Zero

ALUcontrol

Shiftleft 2

Re

gWrit

e

M

WB

Inst

ruct

ion

IF/ID EX/MEMID/EX

ID: after<2> EX: after<1> MEM: add $14, . . . WB: or $13, . . .

MEM/WB

IF: after<3>

000

00

0000

000

00

000

0

10

00

0

1

0

Mux

0

1

Add

PC

0Writedata

Mux

1

Control

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2



extend


Me

mW

rite

MemReadM

em

Writ

e

$8

Mux

0

Mux

1

ALUOp

RegDst

ALUcontrol

M

WB

Mux

Mux

ALUReaddata

14

WB

13 12

12$9

14

WB

Mem

toR

eg

1

0

13

13

Writeregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2 Zero

Datamemory

Address

Datamemory

Address

Clock 7

Clock 8

46

WB

EX

M

Instructionmemory

Address

Mem

toR

eg

ALUOp

Branch

RegDst

ALUSrc

4

0

0

1

Add Addresult

1

ALUresult

Zero

ALUcontrol

Shiftleft 2R

egW

r ite

M

WB

Inst

r uct

ion

IF/ID EX/MEMID/EX

ID: after<3> EX: after<2> MEM: after<1> WB: add $14, . . .

MEM/ WB

IF: after<4>

000

00

0000

000

00

000

0

00

00

0

1

0

Mux

0

1

Add

PC

0Writedata

Mux

1

Control

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2



extend

Instru ction[15– 11]

MemRead

Mem

Wr i

te

Mux

Mux

ALUReaddata

WB

14

14

Writeregister

Writedata

Datamemory

Address

Clock 9

48

Problems for Computers• Limits to pipelining: Hazards prevent next

instruction from executing during its designated clock cycle– Structural hazards: HW cannot support this combination

of instructions (single person to fold and put clothes away)

– Control hazards: Pipelining of branches & other instructions stall the pipeline until the hazard “bubbles” in the pipeline

– Data hazards: Instruction depends on result of prior instruction still in the pipeline

49

An example for data hazards:

sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

50

An example for data hazards:

sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

An example for data hazards:Register $2 is updated only at the WB phase, i.e., the 5th clock cycle (actually at the end of the 5th clock cycle). However, we try to use it at the 3rd clock cycle when we read $2 at the decode phase of the and instruction

51

Graphic representation of data hazards:

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecutionorder(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2:

DM Reg

Reg

Reg

Reg

DM

52

sub $2, $1, $3

nop

nop

nop

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Solving data hazards by adding nops

53

Solving data hazards by adding nops


and $12, $2, $

or $13, $6, $ 2

add $14, $2, $

sw $15, 100( $2

5

2

)g

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Re

Reg

Reg

Reg

DM

IM Reg DMReg

IM Reg DM Reg

IM Reg DMReg

IM Reg DMReg

sub $2, $1, $ 3

nop

nop

nop



CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20


CC 10 CC 11 CC 12

– 20 – 20 – 20

54

The internal structure of the Register File

32

32

3232

32

32

32

32

32

32

32

Read data 2

write data

Read data 1

5

5

5

Rd reg 2 (= Rt)

Rd reg 1 (= Rs)

RegWrite

Wr reg (= Rd) 32

32

E

שונים רגיסטרים שני של ערכים בוזמנית היציאות משתי קוראים) הבאה ) השעון בעליית האחרים הרגיסטרים לאחד כותבים

55

We could earn 1 ck cycle if GPR is “transparent”


and $12, $2, $

or $13, $6, $ 2

add $14, $2, $

sw $15, 100( $2

5

2

) g

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Re

Reg

Reg

Reg

DM

IM Reg DMReg

IM Reg DM Reg

IM Reg DMReg

sub $2, $1, $ 3

nop

nop



CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20


CC 10 CC 11 CC 12

– 20 – 20 – 20

We could earn 1 ck cycle if GPR is “transparent”, i.e, we could see the write data to the GPR at the GPR outputs (if the write address equals the read address), i.e., during Ck #5.

56

The internal structure of the modified Register File. We ‘bypass” the input data (the write data) to the read data1 output whenever Rs=Rd/Rt (i.e., whenever read reg1=write reg but not zero). We “bypass” the input data (the write data) to the read data2 output whenever Rt=Rd/Rt (i.e., whenever read reg2=write reg, but not zero).

32

32

3232

32

32

32

32

32

32

32

Read data 2

write data

Read data 1

5

5

5

Rd reg 2 (= Rt)

Rd reg 1 (= Rs)

RegWrite

Wr reg (= Rd) 32

32

E

שונים רגיסטרים שני של ערכים בוזמנית היציאות משתי קוראים) הבאה ) השעון בעליית האחרים הרגיסטרים לאחד כותבים

032

032

32

3232

write data32

write data

Wr reg 5

5Wr reg

57

sub $2, $1, $3

nop

nop

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

After doing that change we only need 2 nops

After the change the WB of an early instruction can happen at the same time with the read reg (decode) phase of a newer instruction (3 with two other instructions in between). In case we have a data hazard, we need to add only two nop instructions.

Unfortunately, this happens too often. We need a better solution!

58

Graphic representation of data hazards:

IM Reg

IM Reg



sub $2, $1, $3


and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)


DM Reg

Reg

Reg

Reg

DM

59

IM Reg

IM Reg



sub $2, $1, $3


and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)


DM Reg

Reg

Reg

Reg

DM

60

Forwarding – הערכים גניבת

IM Reg

IM Reg



sub $2, $1, $3

Programexecution order(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2 :

DM Reg

Reg

Reg

Reg

X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :

DM

61

IM Reg

IM Reg



sub $2, $1, $3

Programexecution order(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2 :

DM Reg

Reg

Reg

Reg

X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :

DM

62

Forwarding (done at the execute phase)

PCInstruction

memory

Registers

Mux

Mux

Control

ALU

EX

M

WB

M

WB

WB

ID/EX

EX/MEM

MEM/WB

Datamemory

Mux

Forwardingunit

IF/ID

Inst

ruct

ion

Mux

RdEX/MEM.RegisterRd

MEM/WB.RegisterRd

Rt

Rt

Rs

IF/ID.RegisterRd

IF/ID.RegisterRt

IF/ID.RegisterRt

IF/ID.RegisterRs

If ID/EX.Rs=EX/MEM.Rd, i.e., the Rd of the previous instruction equals the Rs of the current instruction (which is in the “decode” phase), then we use the “ALUout” of the previous instruction

instead of the output of the GPR. If ID/EX.Rs=MEM/WB.Rd, i.e., the Rd of the previous instruction equals the Rs of the current instruction (which is in the “decode” phase), then we use the “ALUout” of the previous instruction instead of the output of the GPR. [ similarly, compare also ID/EX.Rt to MEM/WB.Rd ]

Similarly, compare also ID/EX.Rt to EX/MEM.Rd and to MEM/WB.Rd

65

An example for forwarding דוגמא

Sub $2, $1, $3

And $4, $2, $5 needs forwarding from the previous instruction

Or $4, $4, $2 needs forwarding from two instructions back

Add $9, $4, $2 needs forwarding from 3 instructions back (thru the “transparent” GPR)

Here we discuss the $2 register only

(The first two cases are handled in the execute phase, the last one, in the decode phase).

66

An example for forwarding דוגמא

Sub $2, $1, $3

And $4, $2, $5

Or $4, $4, $2 needs forwarding from the previous instruction

Add $9, $4, $2 needs forwarding from the previous instruction

Here we discuss the $4 register and there are two case (the 2nd one in purple)

67

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

and $4, $2, $5 sub $2, $1, $3

ID/EX

before<1>

EX/MEM

before<2>

MEM/WB

or $4, $4, $2

Clock 3

2

5

10 10

$2

$5

5

2

4

$1

$3

3

1

2

Control

ALU

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

or $4, $4, $2 and $4, $2, $5

ID/EX

sub $2, . . .

EX/MEM

before<1>

MEM/WB

add $9, $4, $2

Clock 4

4

6

10 10

$4

$2

6

2

4

$2

$5

5

2

4

Control

ALU

10

2

WB

M

WB

Since Rs=2 and Rd of previous inst. was 2, we use ALUout instead of Rs

Sub $2, $1, $3

And $4, $2, $5

Or $4, $4, $2

Add $9, $4, $2

68

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

add $9, $4, $2 or $4, $4, $2

ID/EX

and $4, . . .

EX/MEM

sub $2, . . .

MEM/WB

after<1>

Clock 5

4

2

10 10

$4

$2

2

4

9

$4

$2

4

2

24

Control

ALU

10

WB

2

1

PCInstruction

memory

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

after<1>after<2> add $9, $4, $2 or $4, . . .

EX/MEM

and $4, . . .

MEM/WB

Clock 6

10

$4

$2

2

4

9

ALU

10

4

4

WB

4

1

Registers

Inst

ruct

ion

IF/ID

ID/EX

4

Control

In blue we see forwarding from two instructions back (Mem->Exec.), in red, from previous instruction (WB->Exec.), in purple, from 3 instructions back (WB->Decode).

69

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

add $9, $4, $2 or $4, $4, $2

ID/EX

and $4, . . .

EX/MEM

sub $2, . . .

MEM/WB

after<1>

Clock 5

4

2

10 10

$4

$2

2

4

9

$4

$2

4

2

24

Control

ALU

10

WB

2

1

PCInstruction

memory

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

after<1>after<2> add $9, $4, $2 or $4, . . .

EX/MEM

and $4, . . .

MEM/WB

Clock 6

10

$4

$2

2

4

9

ALU

10

4

4

WB

4

1

Registers

Inst

ruct

ion

IF/ID

ID/EX

4

Control

70

The solution does not work for lw - הפתרון תמיד לאעובד

(in lw we do not have the data in the pipe!, it comes from the data memory!)

Reg

IM

Reg

Reg

IM



lw $2, 20($1)


and $4, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

DM Reg

Reg

Reg

DM

If the previous instruction was lw to a register and we try to use the register in the current instruction, we have a problem, since we cannot go back in time!

One solution is to avoid such cases by adding a nop (by the Assembler) whenever Rt of the lw is equal to Rs or Rt of the following instruction.

71

Another h/w solution is to add Bubbles,i.e., add nop by hardware

lw $2, 20($1)


and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

Reg

IM

Reg

Reg

IM DM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)

IM Reg DM RegIM

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9 CC 10

DM Reg

RegReg

Reg

bubble

“nop”

We need to hold IF/ID for one ck cycle and insert a “nop: into ID/EX. This is equal to adding a nop instruction by the Assembler.

72

Hazard detection unit

PCInstruction

memory

Registers

Mux

Mux

Mux

Control

ALU

EX

M

WB

M

WB

WB

ID/EX

EX/MEM

MEM/WB

Datamemory

Mux

Hazarddetection

unit

Forwardingunit

0

Mux

IF/ID

Inst

ruct

ion

ID/EX.MemRead

IF/I

DW

rite

PC

Wri

te

ID/EX.RegisterRt

IF/ID.RegisterRd

IF/ID.RegisterRt

IF/ID.RegisterRt

IF/ID.RegisterRs

RtRs

Rd

Rt EX/MEM.RegisterRd

MEM/WB.RegisterRd

We need to hold the IF/ID and PC for one ck cycle and insert a “nop: into ID/EX. This is equal to adding a nop instruction by the Assembler.

If (ID/EX.MemRd)&& ( (ID/EX.Rt= =IF/ID.Rs) || (ID/EX.Rt= =IF/ID.Rt) ) we must “stall” the pipeline! This means that prev. inst was lw and it was to the current Rs or Rt. (of course if one of them is not used, don’t stall)

Holding means”freeze” the IF/ID and the PC for 1 clock cycle

Hold the IF/ID by not giving a IF/IDWrire signal and do not increment the PC (which already points at the nex instruction) by not giving the PCWrite signal. Inserting a nop is by clearing all control signals.

Rt from prev. inst.

Rs, Rt of current inst.

identifies lw

73

An example for lw hazard detection דוגמא

lw $2, 20($1)

And $4, $2, $5

Or $4, $4, $2

Add $9, $4, $2

74

Hazarddetection

unit

0

MuxIF

/ID

Wri

te

PC

Wri

te

ID/EX.RegisterRt

lw $2, 20($1)

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

and $4, $2, $5

ID/EX

before<1>

EX/MEM

before<2>

MEM/WB

or $4, $4, $2

Clock 3

2

5

2

500 11

$2

$5

5

2

4

$1

$X

X

1

2

Control

ALU

M

WB

Hazarddetection

unit

0

MuxIF

/ID

Wri

te

PC

Wri

te

ID/EX.RegisterRt

ID/EX.MemRead

ID/EX.MemRead

M

WB

$1

$X

X

1

2

before<3>

PCInstruction

memory

Registers

Mux

Mux

Mux

EX WB

Datamemory

Mux

Forwardingunit

Inst

ruct

ion

IF/ID

ID/EX

EX/MEM

MEM/WB

and $4, $2, $5 lw $2, 20($1) before<1> before<2>

Clock 2

1

1

X

X11

Control

ALU

M

WB

75

Hazarddetection

unit

0

MuxIF

/IDW

rite

PC

Writ

e

ID/EX.RegisterRt

$2

$5

5

2

2

2

4

WB

Hazarddetection

unit

0

MuxIF

/IDW

rite

PC

Writ

e

ID/EX.RegisterRt

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

Datamemory

Mux

Inst

ruct

ion

IF/ID

and $4, $2, $5 bubble

ID/EX

lw $2, . . .

EX/MEM

before<1>

MEM/WB

Clock 4

2

2

5

510

11

00

$2

$5

5

2

4

Control

ALU

M

WB

bubble lw $2, . . .

PCInstruction

memory

Registers

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Forwardingunit

Inst

ruct

ion

IF/ID

and $4, $2, $5

ID/EX

EX/MEM

MEM/WB

add $9, $4, $2

Clock 5

2

210 10

11

$4

$2

2

4

4

4

2

4

$2

$5

5

2

4

Control

ALU

0

WB

ID/EX.MemRead

ID/EX.MemRead

or $4, $4, $2

or $4, $4, $2

The lw instruction is in the WB phase. $2 is “being written”. We can use $2 in the Execute phase of the and instruction, with the help of forwarding.

76

Registers

Inst

ruct

ion

ID/EX

4

Control

PCInstruction

memory

PCInstruction

memory

Hazarddetection

unit

0

Mux

IF/I

DW

rite

PC

Wri

te

IF/I

DW

rite

PC

Wri

te

ID/EX.RegisterRt

bubble

Registers

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Inst

ruct

ion

IF/ID

add $9, $4, $2

ID/EX

and $4, . . .

EX/MEM

MEM/WB

Clock 6

4

4

2

210 10

$4

$2

2

4

49

$2

2

Control

ALU

10

WB0

add $9, $4, $2 or $4, . . . and $4, . . .after<2> after<1>

after<1>

Clock 7

Mux

Mux

Mux

EX

M

WB

M

WB

Datamemory

Mux

Forwardingunit

Forwardingunit

EX/MEM

MEM/WB

10 10

$4

$4

$2

2

4

4

9

ALU

10

WB

44

4

1

Hazarddetection

unit

0

Mux

ID/EX.RegisterRt

or $4, $4, $2

ID/EX.MemRead

ID/EX.MemRead

Mux

IF/ID

77

PC

Instructionmemory

Inst

ruct

ion

Add


Me

mto

Re

g

ALUOp

Branch

RegDst

ALUSrc

4


0

0

Mux

0

1

Add Addresult


Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

Writedata

Readdata

Mux

1

ALUcontrol

Shiftleft 2

Re

gWrit

e

MemRead

Control

ALU


6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

Mux

0

1

Me

mW

rite

AddressData

memory

Address

Just to remind us how branch is handled we show again the Datapath with Control

78

Branch Hazards

Reg

Reg

CC 1


40 beq $1, $3, 7


IM Reg

IM DM

IM DM

IM DM

DM

DM Reg

Reg Reg

Reg

Reg

RegIM

44 and $12, $2, $5

48 or $13, $6, $2

52 add $14, $2, $2

72 lw $4, 50($7)

CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9

Reg

Here we calc.Rs-Rt

Here we decide to branch (switching the address to the PC and issuing PCWrite Cond)

These 3 instructions should be “killed” before they do harm, I.e., change any register.

In cc5 we already use the new PC calculated by the branch. (PC=72)

79

Control Hazard: Branching (1/7)

Where do we do the compare for the branch?

I$

beq

Instr 1

Instr 2

Instr 3

Instr 4A

LU I$ Reg D$ Reg

AL

U I$ Reg D$ Reg

AL

U I$ Reg D$ RegA

LUReg D$ Reg

AL

U I$ Reg D$ Reg

Instr.

Order

Time (clock cycles)

80

Control Hazard: Branching (2/7)• We put branch decision-making hardware in

ALU stage– therefore two more instructions after the branch

will always be fetched, whether or not the branch is taken

• Desired functionality of a branch– if we do not take the branch, don’t waste any

time and continue executing normally– if we take the branch, don’t execute any

instructions after the branch, just go to the desired label

81


• Initial Solution: Stall until decision is made– insert “no-op” instructions: those that

accomplish nothing, just take time– Drawback: branches take 3 clock cycles each

(assuming comparator is put in ALU stage)

82

Control Hazard: Branching (4/7)• Optimization #1:

– move asynchronous comparator up to Stage 2– as soon as instruction is decoded (Opcode

identifies is as a branch), immediately make a decision and set the value of the PC (if necessary)

– Benefit: since branch is complete in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed

– Side Note: This means that branches are idle in Stages 3, 4 and 5.

83

PC Instructionmemory

4

Registers

Mux

Mux

Mux

ALU

EX

M

WB

M

WB

WB

ID/EX

0

EX/MEM

MEM/WB

Datamemory

Mux

Hazarddetection

unit

Forwardingunit

IF.Flush

IF/ID

Signextend

Control

Mux

=

Shiftleft 2

Mux

84

• Insert a single no-op (bubble)


add

beq

lw

AL

U I$ Reg D$ Reg

AL

U I$ Reg D$ RegA

LUReg D$ Reg I$

Instr.

Order

Time (clock cycles)

bubble

• Impact: 2 clock cycles per branch instruction slow

85


• Optimization #2: Redefine branches– Old definition: if we take the branch, none of

the instructions after the branch get executed by accident

– New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot)

86

Control Hazard: Branching (7/7)• Notes on Branch-Delay Slot

– Worst-Case Scenario: can always put a no-op in the branch-delay slot

– Better Case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting flow of the program

• re-ordering instructions is a common method of speeding up programs

• compiler must be very smart in order to find instructions to do this

• usually can find such an instruction at least 50% of the time

• Jumps also have a delay slot…

87

Example: Nondelayed vs. Delayed Branch

add $1 ,$2,$3

sub $4, $5,$6

beq $1, $4, Exit

or $8, $9 ,$10

xor $10, $1,$11

Nondelayed Branch

add $1 ,$2,$3

sub $4, $5,$6

beq $1, $4, Exit

or $8, $9 ,$10

xor $10, $1,$11

Delayed Branch

Exit: Exit:

88

Question (1/2)

Assume 1 instr/clock, delayed branch, 5 stage pipeline, forwarding, interlock on unresolved load hazards (after 103 loops, so pipeline full)Loop: lw $t0, 0($s1)

addu $t0, $t0, $s2sw $t0, 0($s1)addiu $s1, $s1, -4bne $s1, $zero, Loopnop

•How many pipeline stages (clock cycles) per loop iteration to execute this code?

12345678910

89

Answer (1/2)• Assume 1 instr/clock, delayed branch, 5 stage

pipeline, forwarding, interlock on unresolved load hazards. 103 iterations, so pipeline full.

Loop: lw $t0, 0($s1)addu $t0, $t0, $s2sw $t0, 0($s1)addiu $s1, $s1, -4bne $s1, $zero, Loopnop

• How many pipeline stages (clock cycles) per loop iteration to execute this code?

1. 2. (data hazard so stall)

3.4.5.6.

(delayed branch so exec. nop)7.

1 2 3 4 5 6 7 8 9 10

90

Question (2/2)

Assume 1 instr/clock, delayed branch, 5 stage pipeline, forwarding, interlock on unresolved load hazards (after 103 loops, so pipeline full). Rewrite this code to reduce pipeline stages (clock cycles) per loop to as few as possible. Loop: lw $t0, 0($s1)

addu $t0, $t0, $s2sw $t0, 0($s1)addiu $s1, $s1, -4bne $s1, $zero, Loopnop

•How many pipeline stages (clock cycles) per loop iteration to execute this code?

12345678910

91

A (2/2) How long to execute?

• How many pipeline stages (clock cycles) per loop iteration to execute your revised code? (assume pipeline is full)

• Rewrite this code to reduce clock cycles per loop to as few as possible:

Loop: lw $t0, 0($s1)addiu $s1, $s1, -4 addu $t0, $t0, $s2bne $s1, $zero, Loopsw $t0, +4($s1)

(no hazard since extra cycle)

1.

3.

4.5.

2.

(modified sw to put past addiu)

1 2 3 4 5 6 7 8 9 10

97

PCInstruction

memory

4

Registers

Signextend

Mux

Mux

Control

EX

M

WB

M

WB

WB

Mux

Hazarddetection

unit

Forwardingunit

Mux

IF.Flush

IF/ID

and $12, $2, $5 beq $1, $3, 7 sub $10, $4, $8

MEM/WB

EX/MEM

ID/EX

Clock 3

72 44

48 44

28

7

$1

$3

10

48

72

72

0

Mux

0

$4

$8

ALUData

memory

bubble (nop)lw $4, 50($7)

Clock 4

Mux

Shiftleft 2

before<1>

beq $1, $3, 7 sub $10, . . . before<1>

before<2>

=

PC Instructionmemory

4

Registers

Signextend

Mux

Mux

Control

EX

M

WB

M

WB

WB

Mux

Hazarddetection

unit

Forwardingunit

IF.Flush

IF/ID

MEM/WB

EX/MEM

ID/EX

76 72

76 72

$1

$3

10

76

ALUData

memory

Mux

Shiftleft 2

=

sub $10, $4, $8

beq $1, $3, 7

and $12, $2, $5

lw $4, 50($7)

98

Summary of hazardsData hazards:

* Forward from previous instruction

* Forward from two instructions ago

* (Forward thru “transparent”GPR = from 3 instructions ago)

* If we cannot forward, (after lw) we stall the pipe by inserting a nop and freezing IF/ID and PC for 1 ck cycle

Control hazards:

* If branch is successful we flush the instruction following the branch (which is at the IF/ID register. We

just clear the register)

Notes:

In the real MIPS CPU, no flush was employed. This give the compiler the opportunity to put useful instructions following the branch. This explains why the simulator always performs the instruction following the branch.this is called a delayed branch.

Also, in the real MIPS CPU no lw stall was used. Again this give some freedom to the compiler to

choose whether to put a nop following lw or some useful instruction. This is called a delayed load.

99

Other slides

100

Handling exceptions

1 pipeline datapath with some slides from: john lazzaro and dan garcia

Documents