
  • Lecture 3 Slide 1 EECS 470

    EECS 470

    Lecture 3

    Pipelining & Hazards I

    Jon Beaumont


    Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Mudge, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, University of Pennsylvania, and University of Wisconsin.

  • Lecture 3 Slide 2 EECS 470

    Announcements

    • Reminder
      Lab #1 due Friday by end of lab (12:30 pm); get checked off during GSI/IA OH
      Verilog assignment #1 due Friday evening; submit to the autograder by 11:59 pm (25 submissions so far, 30% of the class)
      HW #1 due Thursday 2/4; submit through Gradescope by 11:59 pm

    • Adding more staff OH
      Also experimenting with different formats to see what works best
      Check the description on the Google calendar for each session for details
      We'll try to come up with a consistent format to avoid confusion

  • Lecture 3 Slide 3 EECS 470

    Project 1

  • Lecture 3 Slide 4 EECS 470

    Project 1

  • Lecture 3 Slide 5 EECS 470

    Last Time

    • Quantifying performance
      Latency vs. throughput
      Different averaging techniques (arithmetic, harmonic, geometric)

    • Power and energy

  • Lecture 3 Slide 6 EECS 470

    Today

    • Baseline processor discussion
      Review the 5-stage pipeline from EECS 370
      Introduce hazards

  • Lecture 3 Slide 7 EECS 470

    Lingering Questions

    • "Could you give a few more examples of computing applications where we care more about throughput??"

    • Remember, you can submit lingering questions to cover next lecture at: https://bit.ly/3oSr5FD

    Latency: real-time systems (self-driving cars, drones, etc.), web search, processing audio/video

    Throughput: scientific computing (e.g., simulations), autograding a class's projects, training machine learning models

  • Lecture 3 Slide 8 EECS 470

    Capacitive Power dissipation

    Power ≈ ½ · C · V² · A · f

    Capacitance: Function of wire length, transistor size

    Supply Voltage: Has been dropping with successive fab generations

    Clock frequency: Increasing…

    Activity factor: How often, on average, do wires switch?

    (Sidebar banner: What uses power in a chip?)
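
    To make the formula concrete, here is a minimal Python sketch (not from the slides; the parameter values are made up purely for illustration):

        # Dynamic (switching) power: P ~ 1/2 * C * V^2 * A * f
        def dynamic_power(c_farads, v_volts, activity, f_hertz):
            """Average switching power of a CMOS chip."""
            return 0.5 * c_farads * v_volts**2 * activity * f_hertz

        # Hypothetical values: 1 nF of switched capacitance, 1.5 V supply,
        # 10% activity factor, 1 GHz clock.
        print(dynamic_power(1e-9, 1.5, 0.10, 1e9))   # ~0.11 W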

  • Lecture 3 Slide 9 EECS 470

    Voltage Scaling

    • Scenario: 80 W, 1 BIPS, 1.5 V, 1 GHz
      Cache optimization: IPC decreases by 10%, power drops by 20%
      => Final processor: 900 MIPS, 64 W

    • What if we just adjust frequency/voltage on the processor? How do we reduce power by 20%?
      P = CV²f ∝ V³ (since f scales with V) => drop voltage by 7% (and also frequency): 0.93 × 0.93 × 0.93 ≈ 0.8x

      So for equal power (64 W):
      Cache optimization = 900 MIPS
      Simple voltage/frequency scaling = 930 MIPS
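
    A minimal sketch of the arithmetic on this slide (assuming, as the slide does, that frequency scales with voltage, so power scales as V³):

        base_power, base_mips = 80.0, 1000.0      # 80 W, 1 BIPS baseline

        # Option 1: cache optimization -- IPC drops 10%, power drops 20%
        cache_mips, cache_power = base_mips * 0.90, base_power * 0.80

        # Option 2: voltage/frequency scaling -- drop V (and f) by 7%,
        # so power scales by 0.93**3 ~ 0.8 while performance scales by 0.93
        scale = 0.93
        dvfs_mips, dvfs_power = base_mips * scale, base_power * scale**3

        print(f"cache opt:   {cache_mips:.0f} MIPS at {cache_power:.0f} W")
        print(f"V/f scaling: {dvfs_mips:.0f} MIPS at {dvfs_power:.0f} W")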

  • Lecture 3 Slide 10 EECS 470

    Multicore: Solution to Power-constrained design?

    • Power scales roughly cubically with frequency
    • Scale the clock frequency to 80%, then add a second core
      Same power budget, but 1.6x performance!
    • But: must parallelize the application (remember Amdahl's Law!)

    [Chart comparing performance and power for the single-core and dual-core configurations.]
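
    The numbers behind the 1.6x claim, as a small sketch (same cubic power assumption as above; ignores Amdahl's Law):

        freq_scale = 0.8
        power_per_core = freq_scale ** 3       # ~0.51x of the original core
        total_power    = 2 * power_per_core    # ~1.02x: roughly the same budget
        total_perf     = 2 * freq_scale        # 1.6x, if the work parallelizes
        print(total_power, total_perf)         # ~1.02 1.6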

  • Lecture 3 Slide 11 EECS 470

    The Execution Core: Pipelining

  • Lecture 3 Slide 12 EECS 470

    Outline for the next several lectures: Understanding the Execution Core

    High-level design feature -> actual microarchitecture example:
    1. Pipelining: 370's 5-stage pipeline (review)
    2. Dynamic scheduling: Scoreboard (CDC 6600)
    3. Register renaming: Tomasulo's algorithm (IBM 360)
    4. Precise interrupts with a reorder buffer: P6, MIPS R10K

  • Lecture 3 Slide 13 EECS 470

    Before there was pipelining…

    Basic datapath: fetch, decode, execute

    • Single-cycle control:
      + Low CPI (1)
      – Long clock period (to accommodate the slowest instruction)

    • Multi-cycle control:
      + Short clock period
      – High CPI
      + Potentially better overall latency if designed well (could it ever be worse?)

    Single-cycle:  [insn0.fetch, dec, exec] [insn1.fetch, dec, exec]
    Multi-cycle:   [insn0.fetch] [insn0.dec] [insn0.exec] [insn1.fetch] [insn1.dec] [insn1.exec]

  • Lecture 3 Slide 14 EECS 470

    Speeding Up

    Remember, three ways to speed up a process:
    • Reduce the number of tasks (possible?)
    • Decrease the latency of each task (what would that include?)
    • Parallelize

    How do we parallelize this pipeline?

    (Same single-cycle vs. multi-cycle timeline as the previous slide.)

  • Lecture 3 Slide 15 EECS 470

    Parallelize

    Duplicate the pipeline (superscalar):
    • Effective, but expensive (~2x hardware overhead)
    • Discussed more later in the semester

    • Or… pipeline!

    [Timelines: the baseline multi-cycle flow of insn0 and insn1, and a duplicated (superscalar) datapath in which one copy processes insn0 and insn2 while the other processes insn1 and insn3.]

  • Lecture 3 Slide 16 EECS 470

    Pipelining

    • Important performance technique
      Improves throughput at the expense of latency
      Why does latency go up?

    • Begin with the multi-cycle design
      When an instruction advances from stage 1 to stage 2, allow the next instruction to enter stage 1
      Each instruction still passes through all stages
      + But instructions enter and leave at a much faster rate
      + More instructions execute in parallel

    • Not much hardware overhead (what needs to be added?)

    Multi-cycle:  insn0.fetch  insn0.dec  insn0.exec  insn1.fetch  insn1.dec  insn1.exec
    Pipelined:    insn0.fetch  insn0.dec    insn0.exec
                               insn1.fetch  insn1.dec   insn1.exec

  • Lecture 3 Slide 17 EECS 470

    Pipeline Illustrated:

    [Figure: a block of combinational logic with n gate delays has bandwidth ≈ 1/n. Inserting a latch (L) to split it into two stages of ~n/2 gate delays each raises bandwidth to ≈ 2/n; splitting it into three stages of ~n/3 raises it to ≈ 3/n, ignoring latch overhead.]
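
    A quick sketch of the bandwidth math in the figure; the latch delay is a made-up value, just to show that real pipelining falls a bit short of the ideal k/n:

        def bandwidth(n_gate_delays, k_stages, latch_delay=1):
            """Throughput of a combinational block split into k equal stages."""
            clock_period = n_gate_delays / k_stages + latch_delay
            return 1.0 / clock_period

        for k in (1, 2, 3):                       # ideally BW ~ 1/n, 2/n, 3/n
            print(k, round(bandwidth(30, k), 3))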

  • Lecture 3 Slide 18 EECS 470

    370 Processor Pipeline Review

    [Pipeline diagram: PC (with a +1 adder) -> I-cache (Fetch) -> register file (Decode) -> ALU (Execute) -> D-cache (Memory) -> register file write port (Write-back).]

    T_pipeline = T_base / 5

  • Lecture 3 Slide 19 EECS 470

    Stage 1: Fetch

    Fetch an instruction from memory every cycle:
    • Use the PC to index memory
    • Increment the PC (assume no branches for now)

    Write state to the pipeline register (IF/ID):
    • The next stage will read this pipeline register
    • Note that the pipeline register must be edge-triggered
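
    As a behavioral sketch (not the actual hardware), the fetch stage computes the values that get latched into IF/ID at the next clock edge; the instruction memory contents below are placeholders:

        def fetch(pc, imem):
            """Values latched into the IF/ID pipeline register at the clock edge."""
            return {"inst": imem[pc], "pc_plus_1": pc + 1}

        imem = ["add 1 2 3", "nand 4 5 6", "lw 2 4 20"]   # toy instruction memory
        if_id = fetch(0, imem)          # edge-triggered update of IF/ID
        print(if_id)                    # {'inst': 'add 1 2 3', 'pc_plus_1': 1}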

  • Lecture 3 Slide 20 EECS 470

    [Fetch datapath: the PC indexes the instruction memory/cache; an adder computes PC + 1, which feeds a MUX back into the PC; the instruction bits and PC + 1 are written (with enables) into the IF/ID pipeline register for the rest of the pipelined datapath.]

  • Lecture 3 Slide 21 EECS 470

    Stage 2: Decode

    Decode the opcode bits:
    • May set up control signals for later stages

    Read input operands from the register file:
    • Specified by the regA and regB fields of the instruction bits

    Write state to the pipeline register (ID/EX):
    • Opcode
    • Register contents
    • Offset and destination fields
    • PC + 1 (even though decode didn't use it)

  • Lecture 3 Slide 22 EECS 470

    [Decode datapath: the regA and regB fields of the instruction bits index the register file; the contents of regA, the contents of regB, PC + 1, and the decoded control signals are written into the ID/EX pipeline register. The register file's write port (dest reg, data, enable) is driven later by write-back. The Stage 1 fetch datapath sits to the left.]

  • Lecture 3 Slide 23 EECS 470

    Stage 3: Execute

    Perform the ALU operation:
    • Input operands can be the contents of regA or regB, or the offset field of the instruction
    • Branches: calculate PC + 1 + offset

    Write state to the pipeline register (EX/Mem):
    • ALU result, contents of regB, and PC + 1 + offset
    • Instruction bits for the opcode and destReg specifiers

  • Lecture 3 Slide 24 EECS 470

    [Execute datapath: the ALU operates on the contents of regA and, via a MUX, either the contents of regB or the offset field; an adder computes PC + 1 + offset. The ALU result, contents of regB, PC + 1 + offset, and control signals are written into the EX/Mem pipeline register. The Stage 2 decode datapath sits to the left.]

  • Lecture 3 Slide 25 EECS 470

    Stage 4: Memory Operation

    Perform the data cache access for memory operations:
    • The ALU result contains the address for ld and st
    • Opcode bits control the memory R/W and enable signals

    Write state to the pipeline register (Mem/WB):
    • ALU result and MemData
    • Instruction bits for the opcode and destReg specifiers

  • Lecture 3 Slide 26 EECS 470

    [Memory datapath: the ALU result addresses the data memory, whose enable and R/W signals come from the control signals; the memory read data and the ALU result are written into the Mem/WB pipeline register. PC + 1 + offset and the MUX control for the PC input go back to the MUX before the PC in Stage 1. The Stage 3 execute datapath sits to the left.]

  • Lecture 3 Slide 27 EECS 470

    Stage 5: Write back

    Write the result to the register file (if required):
    • Write MemData to destReg for a ld instruction
    • Write the ALU result to destReg for an arithmetic instruction
    • Opcode bits control the register write enable signal

  • Lecture 3 Slide 28 EECS 470

    [Write-back datapath: a MUX selects between the memory read data and the ALU result and drives the data input of the register file; another MUX (bits 0-2 vs. bits 16-18 of the instruction) selects the destination register specifier; control signals drive the register write enable. The Stage 4 memory datapath sits to the left.]

  • Lecture 3 Slide 29 EECS 470

    Sample Code (Simple)

    Run the following code on a pipelined datapath:

        add  1 2 3   ; reg3 = reg1 + reg2
        nand 4 5 6   ; reg6 = ~(reg4 & reg5)
        lw   2 4 20  ; reg4 = Mem[reg2 + 20]
        add  2 5 5   ; reg5 = reg2 + reg5
        sw   3 7 10  ; Mem[reg3 + 10] = reg7
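
    As a cross-check of the walk-through on the following slides, here is the same program executed architecturally (one instruction at a time) in Python, using the initial register and memory values shown on the Initial State slide:

        regs = {0: 0, 1: 36, 2: 9, 3: 12, 4: 18, 5: 7, 6: 41, 7: 22}
        mem  = {29: 99}                      # only the location the lw touches

        regs[3] = regs[1] + regs[2]          # add  1 2 3   -> R3 = 45
        regs[6] = ~(regs[4] & regs[5])       # nand 4 5 6   -> R6 = -3
        regs[4] = mem[regs[2] + 20]          # lw   2 4 20  -> R4 = 99
        regs[5] = regs[2] + regs[5]          # add  2 5 5   -> R5 = 16
        mem[regs[3] + 10] = regs[7]          # sw   3 7 10  -> Mem[55] = 22

        print(regs, mem)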

  • Lecture 3 Slide 30 EECS 470

    [Full pipelined datapath: PC, instruction memory, register file (R0-R7), ALU, data memory, and the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers, whose fields are labeled op, dest, offset, valA, valB, PC+1, target, ALU result, and mdata.]

  • Lecture 3 Slide 31 EECS 470

    [Datapath snapshot, Initial State: all pipeline registers hold noops; the register file holds R0=0, R1=36, R2=9, R3=12, R4=18, R5=7, R6=41, R7=22.]

  • Lecture 3 Slide 32 EECS 470

    [Datapath snapshot, Time 1: add 1 2 3 is being fetched; the rest of the pipeline still holds noops.]

  • Lecture 3 Slide 33 EECS 470

    [Datapath snapshot, Time 2: nand 4 5 6 is being fetched; add 1 2 3 is in Decode, reading R1=36 and R2=9.]

  • Lecture 3 Slide 34 EECS 470

    [Datapath snapshot, Time 3: lw 2 4 20 is being fetched; nand 4 5 6 is in Decode, reading R4=18 and R5=7; add 1 2 3 is in Execute, computing 36 + 9 = 45.]

  • Lecture 3 Slide 35 EECS 470

    [Datapath snapshot, Time 4: add 2 5 5 is being fetched; lw 2 4 20 is in Decode; nand 4 5 6 is in Execute, computing ~(18 & 7) = -3; add 1 2 3 (result 45) is in Memory.]

  • Lecture 3 Slide 36 EECS 470

    [Datapath snapshot, Time 5: sw 3 7 10 is being fetched; add 2 5 5 is in Decode; lw 2 4 20 is in Execute, computing address 9 + 20 = 29; nand 4 5 6 (result -3) is in Memory; add 1 2 3 writes 45 into R3.]

  • Lecture 3 Slide 37 EECS 470

    [Datapath snapshot, Time 6: no more instructions to fetch; sw 3 7 10 is in Decode, reading R3=45 and R7=22; add 2 5 5 is in Execute, computing 9 + 7 = 16; lw 2 4 20 is in Memory, reading Mem[29] = 99; nand 4 5 6 writes -3 into R6.]

  • Lecture 3 Slide 38 EECS 470

    [Datapath snapshot, Time 7: sw 3 7 10 is in Execute, computing address 45 + 10 = 55; add 2 5 5 (result 16) is in Memory; lw 2 4 20 writes 99 into R4.]

  • Lecture 3 Slide 39 EECS 470

    [Datapath snapshot, Time 8: sw 3 7 10 is in Memory, writing 22 into Mem[55]; add 2 5 5 writes 16 into R5.]

  • Lecture 3 Slide 40 EECS 470

    [Datapath snapshot, Time 9: sw 3 7 10 is in Write-back (no register write); the pipeline drains.]

  • Lecture 3 Slide 41 EECS 470

    Time graphs

    Time:   1          2          3          4          5          6          7          8          9
    add     fetch      decode     execute    memory     writeback
    nand               fetch      decode     execute    memory     writeback
    lw                            fetch      decode     execute    memory     writeback
    add                                      fetch      decode     execute    memory     writeback
    sw                                                  fetch      decode     execute    memory     writeback
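
    The table above can be regenerated with a few lines of Python (in-order 5-stage pipeline, one instruction entering fetch per cycle, no hazards):

        STAGES  = ["fetch", "decode", "execute", "memory", "writeback"]
        program = ["add 1 2 3", "nand 4 5 6", "lw 2 4 20", "add 2 5 5", "sw 3 7 10"]

        for i, inst in enumerate(program):
            # Instruction i enters fetch in cycle i+1, then moves one stage per cycle.
            row = [""] * i + STAGES
            print(f"{inst:<12}" + "".join(f"{s:<11}" for s in row))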

  • Lecture 3 Slide 42 EECS 470

    Balancing Pipeline Stages

    IF:  T_IF  = 6 units
    ID:  T_ID  = 2 units
    EX:  T_EX  = 9 units
    MEM: T_MEM = 5 units
    WB:  T_WB  = 8 units

    What is the speedup of the pipelined processor over the single-cycle processor (assuming no hazards)?

    a) 2/30

    b) 6/30

    c) 9/30

    d) 30/9

    e) 30/6

    f) 30/2

    g) No idea

    Can we do better in terms of either performance or efficiency?
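
    The arithmetic the question is asking for, assuming ideal pipelining (clock set by the slowest stage, no pipeline-register overhead, no hazards):

        stage_delays = {"IF": 6, "ID": 2, "EX": 9, "MEM": 5, "WB": 8}   # units

        single_cycle_period = sum(stage_delays.values())   # 30 units
        pipelined_period    = max(stage_delays.values())   # 9 units (EX)
        print(single_cycle_period / pipelined_period)      # speedup ~ 30/9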

  • Lecture 3 Slide 43 EECS 470

    Balancing Pipeline Stages

    Two methods for stage quantization:
    • Merge multiple stages
    • Further subdivide a stage

    Recent trends:
    • Deeper pipelines (more and more stages)
      Pipeline depth has been growing more slowly since the Pentium 4. Why?
    • Multiple pipelines
    • Pipelined memory/cache accesses (tricky)

  • Lecture 3 Slide 44 EECS 470

    The Cost of Deeper Pipelines

    Instruction pipelines are not ideal, i.e., instructions in different stages can have dependencies.

    Suppose:
        add  1 2 3
        nand 3 4 5

    [Timing diagrams: without a dependence, Inst0 and Inst1 flow F D E M W one cycle apart (t0-t5). With the dependence above, the nand reaches Decode before the add has written R3: a RAW (read-after-write) dependency, so the nand must stall in Decode before continuing to E, M, W.]

  • Lecture 3 Slide 45 EECS 470

    Types of Dependencies and Hazards

    Data dependence (both memory and register):
    • True dependence (RAW): an instruction must wait for all required input operands
    • Anti-dependence (WAR): a later write must not clobber a still-pending earlier read
    • Output dependence (WAW): an earlier write must not clobber an already-completed later write

    Control dependence (aka procedural dependence):
    • Conditional branches may change the instruction sequence
    • Instructions after a conditional branch depend on its outcome (more exact definition later)
    • Not an issue now, but stay tuned
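
    A tiny sketch of how the three register dependences can be classified, given each instruction's destination and source registers (a behavioral illustration, not hardware):

        def classify(first, second):
            """first/second are (dest, set_of_sources); returns second's hazards w.r.t. first."""
            dep = []
            if first[0] in second[1]:
                dep.append("RAW")    # true dependence: second reads what first writes
            if second[0] in first[1]:
                dep.append("WAR")    # anti-dependence: second overwrites a source of first
            if second[0] == first[0]:
                dep.append("WAW")    # output dependence: both write the same register
            return dep

        # add 1 2 3 followed by nand 3 4 5: the nand reads R3, which the add writes
        print(classify((3, {1, 2}), (5, {3, 4})))    # ['RAW']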

  • Lecture 3 Slide 46 EECS 470

    Terminology

    Pipeline hazards:
    • Potential violations of program dependences
    • Must ensure program dependences are not violated

    Hazard resolution:
    • Static method: performed at compile time, in software
    • Dynamic method: performed at run time, using hardware

    Pipeline interlock:
    • Hardware mechanisms for dynamic hazard resolution
    • Must detect and enforce dependences at run time

  • Lecture 3 Slide 48 EECS 470

    Handling Data Hazards

    Avoidance (static):
    • Make sure there are no hazards in the code

    Detect and stall (dynamic):
    • Stall until earlier instructions finish

    Detect and forward (dynamic):
    • Get the correct value from elsewhere in the pipeline

  • Lecture 3 Slide 49 EECS 470

    Handling Data Hazards: Avoidance

    The programmer/compiler must know implementation details and insert noops between dependent instructions:

        add  1 2 3    ; writes R3 in cycle 5
        noop
        noop
        nand 3 4 5    ; reads R3 in cycle 6

  • Lecture 3 Slide 50 EECS 470

    Problems with Avoidance

    Binary compatibility:
    • New implementations may require more noops

    Code size:
    • Higher instruction cache footprint
    • Longer binary load times
    • Worse in machines that execute multiple instructions per cycle (Intel Itanium: 25-40% of instructions are noops)

    Slower execution:
    • CPI = 1, but many instructions are noops

  • Lecture 3 Slide 51 EECS 470

    Handling Data Hazards: Detect & Stall

    Detection:
    • Compare regA and regB with the destReg of preceding instructions (3-bit comparators)

    Stall:
    • Do not advance the pipeline registers for Fetch/Decode
    • Pass a noop to Execute
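
    A behavioral sketch of the detection step (the 3-bit comparators become simple equality checks; this is an illustration, not the actual control logic):

        def need_stall(decode_srcs, inflight_dests):
            """True if the instruction in Decode reads a register that an older,
            not-yet-written-back instruction will write."""
            return any(src in inflight_dests for src in decode_srcs)

        # nand 3 4 5 sits in Decode while add 1 2 3 (dest R3) is still in flight:
        print(need_stall(decode_srcs={3, 4}, inflight_dests={3}))   # True -> stall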

  • Lecture 3 Slide 52 EECS 470

    Next Time

    • Continue the 5-stage review
      Discuss detect-and-forward in depth

    • Lingering questions / feedback?
      I'll include an anonymous form at the end of every lecture: https://bit.ly/3oSr5FD