Instruction-Level Parallelism and Its Exploitation (part I and II) ECE 154B Dmitri Strukov


Post on 25-Jul-2020


Page 1: Instruction-Level Parallelism and Its Exploitation (part I ...strukov/ece154bSpring2016/week5.pdf · Exploitation (part I and II) ECE 154B Dmitri Strukov. Introduction •Pipelining

Instruction-Level Parallelism and Its Exploitation (Part I and II)

ECE 154B

Dmitri Strukov


Introduction

• Pipelining became a universal technique in 1985
  – Overlaps execution of instructions
  – Exploits “Instruction Level Parallelism”

• Beyond this, there are two main approaches:
  – Hardware-based dynamic approaches
    • Used in server and desktop processors
    • Not used as extensively in PMP processors
  – Compiler-based static approaches
    • Not as successful outside of scientific applications


Instruction-Level Parallelism

• When exploiting instruction-level parallelism, the goal is to minimize CPI
  – Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
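As a quick numerical illustration of the CPI sum above, here is a minimal sketch. The function name `pipeline_cpi` and the stall rates are hypothetical, chosen only to show how the terms add up; they do not come from the slides.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipeline CPI = ideal CPI plus average stall cycles per instruction."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls

# Hypothetical per-instruction stall averages (illustration only).
cpi = pipeline_cpi(ideal_cpi=1.0,
                   structural_stalls=0.125,
                   data_hazard_stalls=0.5,
                   control_stalls=0.25)
print(cpi)  # 1.875
```

Every technique in the rest of the deck attacks one of the three stall terms; the ideal CPI is the floor.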


ILP techniques


Data Dependence

• How much parallelism exists is a property of the code
• Exploiting as much of it as possible is the goal of ILP techniques
• Challenges: Dependencies and Hazards
  – Data dependency
  – Name dependency
  – Control dependency
• A dependency can result in a hazard


Data Dependence

• Data dependency
  – Instruction j is data dependent on instruction i if
    • Instruction i produces a result that may be used by instruction j, or
    • Instruction j is data dependent on instruction k and instruction k is data dependent on instruction i
• Dependent instructions cannot be executed simultaneously
• Pipeline organization determines if a dependence is detected and if it causes a stall
• Data dependence conveys:
  – Possibility of a hazard
  – Order in which results must be calculated
  – Upper bound on exploitable instruction-level parallelism
• Dependencies that flow through memory locations are difficult to detect


Name Dependence

• Two instructions use the same name but there is no flow of information
  – Not a true data dependence, but is a problem when reordering instructions
  – Antidependence: instruction j writes a register or memory location that instruction i reads
    • Initial ordering (i before j) must be preserved
  – Output dependence: instruction i and instruction j write the same register or memory location
    • Ordering must be preserved
• To resolve, use renaming techniques
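A minimal sketch of what renaming does, under assumptions: instructions are tuples `(op, dest, src1, src2)`, and the names `rename`, `p0`, `p1`, ... are hypothetical, introduced here for illustration. Every write gets a fresh name, so antidependences and output dependences disappear while true data dependences are kept.

```python
from itertools import count

def rename(instrs):
    """Give every destination a fresh name; sources use the current mapping."""
    fresh = count()
    mapping = {}  # architectural register -> current physical name

    def lookup(reg):
        # A register that has not been written yet gets a name on first use.
        if reg not in mapping:
            mapping[reg] = f"p{next(fresh)}"
        return mapping[reg]

    renamed = []
    for op, dest, s1, s2 in instrs:
        a, b = lookup(s1), lookup(s2)       # read sources before remapping dest
        mapping[dest] = f"p{next(fresh)}"   # each write allocates a new name
        renamed.append((op, mapping[dest], a, b))
    return renamed

program = [
    ("add",  "r1", "r2", "r3"),  # reads r2
    ("nand", "r2", "r5", "r6"),  # writes r2: antidependence on the add
    ("sub",  "r1", "r2", "r7"),  # writes r1: output dependence on the add
]
for ins in rename(program):
    print(ins)
```

After renaming, the `nand` no longer writes the name the `add` reads, and the `sub` no longer writes the name the `add` writes; the `sub` still reads the `nand` result, so the true dependence survives.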


Data Hazards

• Data Hazards
  – Read after write (RAW)
    • occurs with a true data dependency
    • add 3 1 2 ; nand 4 3 5
  – Write after write (WAW)
    • output dependency
    • in pipelines that write in more than one pipe stage or with out-of-order execution
    • add 3 1 2 ; nand 3 5 6
  – Write after read (WAR)
    • antidependency
    • add 1 2 3 ; nand 2 5 6
    • can occur with reordering
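The three hazard classes can be checked mechanically. The sketch below assumes the slide's instruction format `op dest src1 src2`; the function name `hazards` is hypothetical.

```python
def hazards(first, second):
    """Classify hazards between two instructions given in program order.

    Each instruction is a tuple (op, dest, src1, src2).
    """
    _, d1, a1, b1 = first
    _, d2, a2, b2 = second
    found = set()
    if d1 in (a2, b2):   # second reads what first writes
        found.add("RAW")
    if d1 == d2:         # both write the same register
        found.add("WAW")
    if d2 in (a1, b1):   # second writes what first reads
        found.add("WAR")
    return found

print(hazards(("add", 3, 1, 2), ("nand", 4, 3, 5)))  # {'RAW'}
print(hazards(("add", 3, 1, 2), ("nand", 3, 5, 6)))  # {'WAW'}
print(hazards(("add", 1, 2, 3), ("nand", 2, 5, 6)))  # {'WAR'}
```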


Compiler Techniques for Exposing ILP

• Pipeline scheduling
  – Separate a dependent instruction from the source instruction by the pipeline latency of the source instruction
• Example:
    for (i=999; i>=0; i=i-1)
        x[i] = x[i] + s;


A More Realistic Pipeline


Pipeline Stalls

Loop: L.D F0,0(R1)

stall

ADD.D F4,F0,F2

stall

stall

S.D F4,0(R1)

DADDUI R1,R1,#-8

stall (assume integer load latency is 1)

BNE R1,R2,Loop


Pipeline Scheduling

Scheduled code:

Loop: L.D F0,0(R1)

DADDUI R1,R1,#-8

ADD.D F4,F0,F2

stall

stall

S.D F4,8(R1)

BNE R1,R2,Loop


Loop unrolling

• Parallelism within a basic block is limited
  – Typical size of a basic block = 3-6 instructions
  – Must optimize across branches
• Loop-Level Parallelism
  – Unroll loop statically or dynamically
  – Use SIMD (vector processors and GPUs)


Loop Unrolling

• Loop unrolling
  – Unroll by a factor of 4 (assume # elements is divisible by 4)
  – Eliminate unnecessary instructions

Loop: L.D F0,0(R1)

ADD.D F4,F0,F2

S.D F4,0(R1) ;drop DADDUI & BNE

L.D F6,-8(R1)

ADD.D F8,F6,F2

S.D F8,-8(R1) ;drop DADDUI & BNE

L.D F10,-16(R1)

ADD.D F12,F10,F2

S.D F12,-16(R1) ;drop DADDUI & BNE

L.D F14,-24(R1)

ADD.D F16,F14,F2

S.D F16,-24(R1)

DADDUI R1,R1,#-32

BNE R1,R2,Loop

note: number of live registers vs. original loop


Loop Unrolling/Pipeline Scheduling

• Pipeline schedule the unrolled loop:

Loop: L.D F0,0(R1)

L.D F6,-8(R1)

L.D F10,-16(R1)

L.D F14,-24(R1)

ADD.D F4,F0,F2

ADD.D F8,F6,F2

ADD.D F12,F10,F2

ADD.D F16,F14,F2

S.D F4,0(R1)

S.D F8,-8(R1)

DADDUI R1,R1,#-32

S.D F12,16(R1)

S.D F16,8(R1)

BNE R1,R2,Loop
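The payoff of scheduling and unrolling can be read off the listings by counting issue slots. The sketch below uses counts taken from these slides: 5 instructions and 4 stalls for the original loop, 5 instructions and 2 stalls for the scheduled loop, and 14 instructions for the unrolled and scheduled loop. Treating the last listing as stall-free is an assumption based on no stalls being shown; `cycles_per_element` is a hypothetical helper name.

```python
def cycles_per_element(instructions, stalls, elements):
    """Cycles spent per array element for one pass through the loop body."""
    return (instructions + stalls) / elements

# Counts read off the listings (branch delay slots ignored, as in the slides).
unscheduled = cycles_per_element(instructions=5, stalls=4, elements=1)
scheduled = cycles_per_element(instructions=5, stalls=2, elements=1)
# Assumption: the unrolled, scheduled loop runs without stalls.
unrolled_scheduled = cycles_per_element(instructions=14, stalls=0, elements=4)

print(unscheduled, scheduled, unrolled_scheduled)  # 9.0 7.0 3.5
```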


Strip Mining

• Unknown number of loop iterations?
  – Number of iterations = n
  – Goal: make k copies of the loop body
  – Generate a pair of loops:
    • First executes the original loop body n mod k times
    • Second executes the unrolled body (k copies) n / k times

• “Strip mining”
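The pair of loops can be sketched in a high-level language. This is an illustration under assumptions, not compiler output: the unroll factor `K = 4` and the function name `add_scalar_strip_mined` are chosen here, and the loop body is the `x[i] = x[i] + s` example used earlier in the deck.

```python
K = 4  # unroll factor (assumed for illustration)

def add_scalar_strip_mined(x, s):
    """Add s to every element of x using a cleanup loop plus a 4x unrolled loop."""
    n = len(x)
    i = 0
    # First loop: the n mod K leftover iterations, one element each.
    for _ in range(n % K):
        x[i] += s
        i += 1
    # Second loop: n // K iterations, each containing K copies of the body.
    for _ in range(n // K):
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += K
    return x

print(add_scalar_strip_mined(list(range(10)), 1))
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Running the cleanup loop first means the unrolled loop always sees a remaining count divisible by K, which is the assumption the unrolled listings rely on.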


Challenges to Loop Unrolling

• Limited benefit for large k (Amdahl's law)

• Code size

• Register pressure
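The first bullet, limited benefit for large k, can be made concrete with a small model. It assumes ideal scheduling with no stalls, three useful instructions per element (L.D, ADD.D, S.D) and two overhead instructions per loop iteration (DADDUI, BNE). The function name and default values are introduced here for illustration.

```python
def cycles_per_element(k, body=3, overhead=2):
    """Idealized cycles per element with the loop unrolled k times (no stalls)."""
    return body + overhead / k

for k in (1, 2, 4, 8, 16):
    print(k, cycles_per_element(k))
```

Only the loop overhead is amortized, so the result approaches 3 cycles per element and never drops below it. For k = 4 the model gives 3.5, which matches the 14-instruction unrolled and scheduled loop. For k = 1 it is optimistic, since the scheduled single-iteration loop still contains stalls.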


Out-of-Order Execution: Scoreboarding


Out-of-Order Execution

• Variable latencies make out-of-order execution desirable
• How do we prevent WAR and WAW hazards?
• How do we deal with variable latency?
  – Forwarding for RAW hazards harder.

Instruction   1   2   3   4   5   6   7   8   9   10  11  12  13  14
add 3 2 1     IF  ID  EX  M   WB
mul 6 5 4         IF  ID  E1  E2  E3  E4  E5  E6  E7  M   WB
div 8 7 6             IF  ID  x   x   x   x   x   x   E1  E2  E3  E4  …
add 7 2 1                 IF  ID  EX  M   WB
sub 8 2 1                     IF  ID  EX  M   WB


Scoreboard: a Bookkeeping Technique

• Out-of-order execution divides the ID stage:
  1. Issue: decode instructions, check for structural hazards
  2. Read operands: wait until no data hazards, then read operands
• Scoreboards date to the CDC 6600 in 1963
• Instructions execute whenever they are not dependent on previous instructions and there are no hazards
• CDC 6600: in-order issue, out-of-order execution, out-of-order commit (or completion)
  – No forwarding!
  – Imprecise interrupt/exception model


Scoreboard Architecture (CDC 6600)

[Figure: block diagram with Registers, Functional Units (FP Mult, FP Mult, FP Divide, FP Add, Integer), Memory, SCOREBOARD, and Fetch/Issue]


Scoreboard Implications

• Out-of-order completion => WAR, WAW hazards?
• Solutions for WAR:
  – Stall writeback until all prior uses of the destination register have been read
  – Read registers only during Read Operands stage
• Solution for WAW:
  – Detect hazard and stall issue of new instruction until other instruction completes
• Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units
• Scoreboard keeps track of dependencies between instructions that have already issued
• Scoreboard replaces ID, EX, M and WB with 4 stages


Four Stages of Scoreboard Control

• Issue: decode instructions & check for structural hazards (ID1)
  – Instructions issued in program order (for hazard checking)
  – Don’t issue if structural hazard
  – Don’t issue if instruction is output dependent on any previously issued but uncompleted instruction (no WAW hazards)
• Read operands: wait until no data hazards, then read operands (ID2)
  – All real dependencies (RAW hazards) resolved in this stage, since we wait for instructions to write back data.
  – No forwarding of data in this model!


Four Stages of Scoreboard Control

• Execution: operate on operands (EX)
  – The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution. This includes memory ops.
• Write result: finish execution (WB)
  – Stall until no WAR hazards with previous instructions:
    Example: DIV  1 2 3
             ADD  3 4 5
             NAND 4 7 6
    CDC 6600 scoreboard would stall NAND until ADD reads operands


Three Parts of the Scoreboard

• Instruction status: Which state the instruction is in
• Functional unit status: Indicates the state of the functional unit (FU); nine fields for each functional unit
  – Busy: Indicates whether the unit is busy or not
  – Op: Operation to perform in the unit (e.g., + or –)
  – Fi: Destination register number
  – Fj,Fk: Source-register numbers
  – Qj,Qk: Functional units producing source registers Fj, Fk
  – Rj,Rk: Flags indicating when Fj, Fk are ready
• Register result status
  – Indicates which functional unit will write each register, if one exists
  – Blank when no pending instructions will write that register
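The three parts map naturally onto three small data structures. The following is only a sketch: the class name, field names, and the `new_scoreboard` helper are hypothetical, and the five functional unit names are the ones used in the worked example in this deck.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional

@dataclass
class FunctionalUnitStatus:
    """The nine per-unit fields listed on the slide."""
    busy: bool = False
    op: Optional[str] = None   # operation to perform
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source register 1
    fk: Optional[str] = None   # source register 2
    qj: Optional[str] = None   # unit producing fj, if still pending
    qk: Optional[str] = None   # unit producing fk, if still pending
    rj: bool = False           # fj ready and not yet read
    rk: bool = False           # fk ready and not yet read

FU_NAMES = ["Integer", "Mult1", "Mult2", "Add", "Divide"]

def new_scoreboard():
    """Return empty (instruction status, functional unit status, register result status)."""
    instruction_status: Dict[int, Dict[str, int]] = {}  # instr index -> stage -> cycle
    fu_status = {name: FunctionalUnitStatus() for name in FU_NAMES}
    register_result: Dict[str, str] = {}  # register -> unit that will write it
    return instruction_status, fu_status, register_result

instr, fus, result = new_scoreboard()
print(len(fields(FunctionalUnitStatus)))  # 9
```

A register missing from `register_result` plays the role of a blank entry: no pending instruction will write it.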


Scoreboard Example

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2
LD     F2      45+  R3
MULTD  F0      F2   F4
SUBD   F8      F6   F2
DIVD   F10     F0   F6
ADDD   F6      F8   F2

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
       FU


Detailed Scoreboard Pipeline Control

Issue
  Wait until:   not Busy(FU) and not Result(D)
  Bookkeeping:  Busy(FU) ← Yes; Op(FU) ← op; Fi(FU) ← 'D'; Fj(FU) ← 'S1'; Fk(FU) ← 'S2';
                Qj ← Result('S1'); Qk ← Result('S2'); Rj ← not Qj; Rk ← not Qk; Result('D') ← FU

Read operands
  Wait until:   Rj and Rk
  Bookkeeping:  Rj ← No; Rk ← No

Execution complete
  Wait until:   functional unit done

Write result
  Wait until:   ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
  Bookkeeping:  ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes);
                Result(Fi(FU)) ← 0; Busy(FU) ← No
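The wait conditions and bookkeeping above translate almost line for line into code. The sketch below is a simplified illustration, not the CDC 6600 logic: all function names are hypothetical, state is kept in plain dictionaries, and clearing Qj/Qk on write-back is an assumption taken from how the worked example tables blank those fields.

```python
IDLE = dict(busy=False, op=None, fi=None, fj=None, fk=None,
            qj=None, qk=None, rj=False, rk=False)

def new_state(fu_names):
    """Return (functional unit status, register result status)."""
    return {name: dict(IDLE) for name in fu_names}, {}

def can_issue(fus, result, fu, dest):
    # Wait until: not Busy(FU) and not Result(D)  (structural and WAW check)
    return not fus[fu]["busy"] and dest not in result

def issue(fus, result, fu, op, dest, s1, s2):
    qj, qk = result.get(s1), result.get(s2)  # units still producing the sources
    fus[fu].update(busy=True, op=op, fi=dest, fj=s1, fk=s2,
                   qj=qj, qk=qk, rj=qj is None, rk=qk is None)
    result[dest] = fu                        # Result(D) <- FU

def can_read_operands(fus, fu):
    # Wait until: Rj and Rk  (RAW check)
    return fus[fu]["rj"] and fus[fu]["rk"]

def read_operands(fus, fu):
    fus[fu].update(rj=False, rk=False)       # Rj <- No; Rk <- No

def can_write_result(fus, fu):
    # Wait until no unit still has to read the old value of Fi(FU)  (WAR check)
    fi = fus[fu]["fi"]
    return all((f["fj"] != fi or not f["rj"]) and (f["fk"] != fi or not f["rk"])
               for f in fus.values())

def write_result(fus, result, fu):
    for f in fus.values():                   # wake up units waiting on this result
        if f["qj"] == fu:
            f.update(qj=None, rj=True)
        if f["qk"] == fu:
            f.update(qk=None, rk=True)
    result.pop(fus[fu]["fi"], None)          # Result(Fi(FU)) <- 0
    fus[fu].update(IDLE)                     # Busy(FU) <- No

fus, result = new_state(["Integer", "Mult1", "Mult2", "Add", "Divide"])
issue(fus, result, "Mult1", "Mult", "F0", "F2", "F4")
issue(fus, result, "Divide", "Div", "F10", "F0", "F6")  # needs F0 from Mult1
issue(fus, result, "Add", "Add", "F6", "F8", "F2")
read_operands(fus, "Add")
print(can_read_operands(fus, "Divide"))  # False: F0 has not been written yet
print(can_write_result(fus, "Add"))      # False: Divide has not read F6 yet (WAR)
```

The last two lines reproduce the situation of the worked example where ADDD has finished executing but must hold its result until DIVD reads F6.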


Scoreboard Example: Cycle 1

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1
LD     F2      45+  R3
MULTD  F0      F2   F4
SUBD   F8      F6   F2
DIVD   F10     F0   F6
ADDD   F6      F8   F2

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6       R2                         Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
1      FU                      Integer


Scoreboard Example: Cycle 2

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2
LD     F2      45+  R3
MULTD  F0      F2   F4
SUBD   F8      F6   F2
DIVD   F10     F0   F6
ADDD   F6      F8   F2

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6       R2                         Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
2      FU                      Integer

• Issue 2nd LD?


Scoreboard Example: Cycle 3

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3
LD     F2      45+  R3
MULTD  F0      F2   F4
SUBD   F8      F6   F2
DIVD   F10     F0   F6
ADDD   F6      F8   F2

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6       R2                         No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
3      FU                      Integer

• Issue MULT?


Scoreboard Example: Cycle 4

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3
MULTD  F0      F2   F4
SUBD   F8      F6   F2
DIVD   F10     F0   F6
ADDD   F6      F8   F2

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
4      FU


Scoreboard Example: Cycle 5

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5
MULTD  F0      F2   F4
SUBD   F8      F6   F2
DIVD   F10     F0   F6
ADDD   F6      F8   F2

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2       R3                         Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
5      FU         Integer


Scoreboard Example: Cycle 6

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6
MULTD  F0      F2   F4   6
SUBD   F8      F6   F2
DIVD   F10     F0   F6
ADDD   F6      F8   F2

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2       R3                         Yes
      Mult1    Yes   Mult  F0   F2  F4  Integer           No   Yes
      Mult2    No
      Add      No
      Divide   No

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
6      FU  Mult1  Integer


Scoreboard Example: Cycle 7

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7
MULTD  F0      F2   F4   6
SUBD   F8      F6   F2   7
DIVD   F10     F0   F6
ADDD   F6      F8   F2

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2       R3                         No
      Mult1    Yes   Mult  F0   F2  F4  Integer           No   Yes
      Mult2    No
      Add      Yes   Sub   F8   F6  F2           Integer  Yes  No
      Divide   No

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
7      FU  Mult1  Integer               Add

• Read multiply operands?


Scoreboard Example: Cycle 8

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6
SUBD   F8      F6   F2   7
DIVD   F10     F0   F6   8
ADDD   F6      F8   F2

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
      Mult1    Yes   Mult  F0   F2  F4                    Yes  Yes
      Mult2    No
      Add      Yes   Sub   F8   F6  F2                    Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
8      FU  Mult1                        Add  Divide


Scoreboard Example: Cycle 9

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9
SUBD   F8      F6   F2   7      9
DIVD   F10     F0   F6   8
ADDD   F6      F8   F2

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
10    Mult1    Yes   Mult  F0   F2  F4                    Yes  Yes
      Mult2    No
2     Add      Yes   Sub   F8   F6  F2                    Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
9      FU  Mult1                        Add  Divide

• Read operands for MULT & SUB? Issue ADDD?

Note: the Time column shows the remaining execution cycles.


Scoreboard Example: Cycle 10

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9
SUBD   F8      F6   F2   7      9
DIVD   F10     F0   F6   8
ADDD   F6      F8   F2

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
9     Mult1    Yes   Mult  F0   F2  F4                    No   No
      Mult2    No
1     Add      Yes   Sub   F8   F6  F2                    No   No
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
10     FU  Mult1                        Add  Divide


Scoreboard Example: Cycle 11

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9
SUBD   F8      F6   F2   7      9          11
DIVD   F10     F0   F6   8
ADDD   F6      F8   F2

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
8     Mult1    Yes   Mult  F0   F2  F4                    No   No
      Mult2    No
0     Add      Yes   Sub   F8   F6  F2                    No   No
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
11     FU  Mult1                        Add  Divide


Scoreboard Example: Cycle 12

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9
SUBD   F8      F6   F2   7      9          11         12
DIVD   F10     F0   F6   8
ADDD   F6      F8   F2

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
7     Mult1    Yes   Mult  F0   F2  F4                    No   No
      Mult2    No
      Add      No
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
12     FU  Mult1                             Divide

• Read operands for DIVD?


Scoreboard Example: Cycle 13

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9
SUBD   F8      F6   F2   7      9          11         12
DIVD   F10     F0   F6   8
ADDD   F6      F8   F2   13

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
6     Mult1    Yes   Mult  F0   F2  F4                    No   No
      Mult2    No
      Add      Yes   Add   F6   F8  F2                    Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
13     FU  Mult1               Add           Divide


Scoreboard Example: Cycle 14

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9
SUBD   F8      F6   F2   7      9          11         12
DIVD   F10     F0   F6   8
ADDD   F6      F8   F2   13     14

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
5     Mult1    Yes   Mult  F0   F2  F4                    No   No
      Mult2    No
2     Add      Yes   Add   F6   F8  F2                    Yes  Yes
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
14     FU  Mult1               Add           Divide


Scoreboard Example: Cycle 15

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9
SUBD   F8      F6   F2   7      9          11         12
DIVD   F10     F0   F6   8
ADDD   F6      F8   F2   13     14

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
4     Mult1    Yes   Mult  F0   F2  F4                    No   No
      Mult2    No
1     Add      Yes   Add   F6   F8  F2                    No   No
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
15     FU  Mult1               Add           Divide


Scoreboard Example: Cycle 16

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9
SUBD   F8      F6   F2   7      9          11         12
DIVD   F10     F0   F6   8
ADDD   F6      F8   F2   13     14         16

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
3     Mult1    Yes   Mult  F0   F2  F4                    No   No
      Mult2    No
0     Add      Yes   Add   F6   F8  F2                    No   No
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
16     FU  Mult1               Add           Divide


Scoreboard Example: Cycle 17

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9
SUBD   F8      F6   F2   7      9          11         12
DIVD   F10     F0   F6   8
ADDD   F6      F8   F2   13     14         16

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
2     Mult1    Yes   Mult  F0   F2  F4                    No   No
      Mult2    No
      Add      Yes   Add   F6   F8  F2                    No   No
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
17     FU  Mult1               Add           Divide

• Why not write result of ADD???

WAR Hazard!


Scoreboard Example: Cycle 18

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9
SUBD   F8      F6   F2   7      9          11         12
DIVD   F10     F0   F6   8
ADDD   F6      F8   F2   13     14         16

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
1     Mult1    Yes   Mult  F0   F2  F4                    No   No
      Mult2    No
      Add      Yes   Add   F6   F8  F2                    No   No
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
18     FU  Mult1               Add           Divide


Scoreboard Example: Cycle 19

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9          19
SUBD   F8      F6   F2   7      9          11         12
DIVD   F10     F0   F6   8
ADDD   F6      F8   F2   13     14         16

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
0     Mult1    Yes   Mult  F0   F2  F4                    No   No
      Mult2    No
      Add      Yes   Add   F6   F8  F2                    No   No
      Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
19     FU  Mult1               Add           Divide


Scoreboard Example: Cycle 20

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9          19         20
SUBD   F8      F6   F2   7      9          11         12
DIVD   F10     F0   F6   8
ADDD   F6      F8   F2   13     14         16

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8  F2                    No   No
      Divide   Yes   Div   F10  F0  F6                    Yes  Yes

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
20     FU                      Add           Divide


Scoreboard Example: Cycle 21

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9          19         20
SUBD   F8      F6   F2   7      9          11         12
DIVD   F10     F0   F6   8      21
ADDD   F6      F8   F2   13     14         16

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8  F2                    No   No
      Divide   Yes   Div   F10  F0  F6                    Yes  Yes

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
21     FU                      Add           Divide

• WAR Hazard is now gone...


Scoreboard Example: Cycle 22

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9          19         20
SUBD   F8      F6   F2   7      9          11         12
DIVD   F10     F0   F6   8      21
ADDD   F6      F8   F2   13     14         16         22

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
39    Divide   Yes   Div   F10  F0  F6                    No   No

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
22     FU                                    Divide


Scoreboard Example: Cycle 61

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9          19         20
SUBD   F8      F6   F2   7      9          11         12
DIVD   F10     F0   F6   8      21         61
ADDD   F6      F8   F2   13     14         16         22

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   Yes   Div   F10  F0  F6                    No   No

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
61     FU                                    Divide


Scoreboard Example: Cycle 62

Instruction status:
Instruction    j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6      34+  R2   1      2          3          4
LD     F2      45+  R3   5      6          7          8
MULTD  F0      F2   F4   6      9          19         20
SUBD   F8      F6   F2   7      9          11         12
DIVD   F10     F0   F6   8      21         61         62
ADDD   F6      F8   F2   13     14         16         22

Functional unit status:
                           dest S1  S2  FU       FU       Fj?  Fk?
Time  Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status:
Clock      F0     F2       F4  F6       F8   F10     F12  ...  F30
62     FU


• In-order issue; out-of-order execute & commit
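The out-of-order completion can be read straight from the final table. The sketch below copies the Issue and Write Result cycles from the Cycle 62 slide and sorts by each; the variable names are chosen here for illustration.

```python
# (instruction, issue cycle, write-result cycle) copied from the Cycle 62 table
timeline = [
    ("LD F6", 1, 4),
    ("LD F2", 5, 8),
    ("MULTD F0", 6, 20),
    ("SUBD F8", 7, 12),
    ("DIVD F10", 8, 62),
    ("ADDD F6", 13, 22),
]

issue_order = [name for name, issue, _ in sorted(timeline, key=lambda t: t[1])]
commit_order = [name for name, _, write in sorted(timeline, key=lambda t: t[2])]

print(issue_order)
print(commit_order)
```

Issue order equals program order, while SUBD writes its result before MULTD and ADDD writes before DIVD.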


CDC 6600 Scoreboard

• Historical context:
  – Speedup 1.7 from compiler; 2.5 by hand
    BUT slow memory (no cache) limits benefit
• Limitations of 6600 scoreboard:
  – No forwarding hardware
  – Limited to instructions in basic block (small window)
  – Small number of functional units (structural hazards), especially integer/load store units
  – Do not issue on structural hazards
  – Wait for WAR hazards
  – Prevent WAW hazards

• Precise interrupts?


Instruction-Level Parallelism and Its Exploitation

(Part II)

ECE 154B

Dmitri Strukov

ILP techniques

[Figure: table of ILP techniques, annotated to show which were covered last week, which are covered this week and next week, and which are not covered]

Scoreboard Technique Review

• Allow out-of-order execution by processing several instructions simultaneously

– In-order issue; out-of-order execution and completion

– Pipe stages: issue, read registers, execute, write back

• Bookkeeping to resolve RAW, WAW, and WAR hazards

– Resolve RAW at the “read registers” stage

– Stall issue for WAW; stall completion (WB) for WAR

Instruction status:
Instruction  j    k    Issue  Read Oper  Exec Comp  Write Result
LD     F6    34+  R2
LD     F2    45+  R3
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Functional unit status (Fi = dest; Fj, Fk = S1, S2; Qj, Qk = FU producing Fj, Fk; Rj, Rk = Fj, Fk ready?):
Time  Name     Busy  Op  Fi  Fj  Fk  Qj  Qk  Rj  Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status:
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
(no entries)

Scoreboard pipeline control, per instruction status:

• Issue

– Wait until: not Busy(FU) and not Result(D)

– Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← `D’; Fj(FU) ← `S1’; Fk(FU) ← `S2’; Qj ← Result(`S1’); Qk ← Result(`S2’); Rj ← not Qj; Rk ← not Qk; Result(`D’) ← FU

• Read operands

– Wait until: Rj and Rk

– Bookkeeping: Rj ← No; Rk ← No

• Execution complete

– Wait until: functional unit done

• Write result

– Wait until: ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))

– Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No

Instr A: r2 ← r4/r5

Instr B: r6 ← r3*r2

Instr C: r3 ← r7+r8

[Figure: time axis with A issued at t, B at t+1, C at t+2; an empty scoreboard table with columns Fi, Fj, Fk, Qj, Qk, Rj, Rk; status labels “Not completed”, “Not completed”, “Completed”; shown beside the scoreboard pipeline control table (Instruction status / Wait until / Bookkeeping) of the preceding slide]

Instr A: r2 ← r4/r5

Instr B: r6 ← r3*r2

Instr C: r3 ← r7+r8

[Figure: same time axis (A at t, B at t+1, C at t+2) and status labels (“Not completed”, “Not completed”, “Completed”) as the preceding slide, with the scoreboard table (columns Fi, Fj, Fk, Qj, Qk, Rj, Rk) partly filled in; visible entries: r3, r2; r2; r3; B, No; shown beside the same scoreboard pipeline control table]
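The write-result condition of the scoreboard can be checked mechanically on the A/B/C example. The sketch below is illustrative only: the `FUEntry` class, the `may_write_result` helper, and the Rj/Rk values chosen for A, B and C are assumptions made for this example, not values read from the slide.

```python
from dataclasses import dataclass


@dataclass
class FUEntry:
    """One functional-unit row of the scoreboard (field names follow the slides)."""
    name: str
    Fi: str    # destination register
    Fj: str    # first source register
    Fk: str    # second source register
    Rj: bool   # True (Yes): Fj is ready and has not been read yet
    Rk: bool   # True (Yes): Fk is ready and has not been read yet


def may_write_result(fu, all_units):
    """Write-result check: no other unit still has to read the old value of Fi(FU)."""
    return all(
        (f.Fj != fu.Fi or not f.Rj) and (f.Fk != fu.Fi or not f.Rk)
        for f in all_units
        if f is not fu
    )


# A: r2 <- r4/r5 (assumed: operands already read, still executing)
a = FUEntry("A", Fi="r2", Fj="r4", Fk="r5", Rj=False, Rk=False)
# B: r6 <- r3*r2 (assumed: r3 is ready but unread; r2 is still being produced by A)
b = FUEntry("B", Fi="r6", Fj="r3", Fk="r2", Rj=True, Rk=False)
# C: r3 <- r7+r8 (assumed: finished executing, wants to write r3)
c = FUEntry("C", Fi="r3", Fj="r7", Fk="r8", Rj=False, Rk=False)

units = [a, b, c]
c_blocked = not may_write_result(c, units)   # B has not read r3 yet (WAR)

# Once B has read its operands, the scoreboard sets Rj and Rk to No.
b.Rj = b.Rk = False
c_released = may_write_result(c, units)
```

Under these assumptions C is held at write result while B still needs the old r3, and is released once B has read its operands, which is the WAR stall described in the review slide.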

Another Dynamic Algorithm: Tomasulo’s Algorithm

• For IBM 360/91, about 3 years after CDC 6600 (1966)

• Goal: high performance without special compilers

• Differences between IBM 360 & CDC 6600 ISA

– IBM has only 2 register specifiers/instr vs. 3 in CDC 6600

– IBM has 4 FP registers vs. 8 in CDC 6600

– IBM has memory-register ops

• Small number of floating-point registers prevented interesting compiler scheduling of operations

– This led Tomasulo to try to figure out how to get more effective registers: renaming in hardware!

• Why study? The descendants of this have flourished!

– Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, …

Tomasulo vs. Scoreboard

• Control & buffers distributed with functional units (FUs) vs. centralized in scoreboard

– FU buffers are called “reservation stations”; they hold pending operands

• Registers in instructions replaced by values or pointers to reservation stations (RS); called register renaming

– Avoids WAR, WAW hazards

– More reservation stations than registers, so can do optimizations compilers can’t

• Results go to FUs from RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs

• Loads and stores treated as FUs with RSs as well
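To see why renaming removes WAR and WAW hazards, the renaming step can be applied by hand to the running example. The following is a rough sketch, not the 360/91 mechanism: the `rename` function and the `RS1`, `RS2`, … tag names are hypothetical, and sources without a pending producer are left as register names where real hardware would copy the register value.

```python
def rename(instructions):
    """Give every destination a fresh tag; sources refer to the latest producer's tag."""
    latest = {}      # architectural register -> tag of its most recent producer
    renamed = []
    for n, (op, dest, src1, src2) in enumerate(instructions, start=1):
        s1 = latest.get(src1, src1)   # pending producer's tag, else the register itself
        s2 = latest.get(src2, src2)
        tag = f"RS{n}"                # hypothetical reservation-station tag
        latest[dest] = tag
        renamed.append((op, tag, s1, s2))
    return renamed


program = [
    ("MULTD", "F0", "F2", "F4"),
    ("SUBD", "F8", "F6", "F2"),
    ("DIVD", "F10", "F0", "F6"),
    ("ADDD", "F6", "F8", "F2"),
]

renamed = rename(program)
for row in renamed:
    print(row)
```

After renaming, ADDD writes its own tag rather than F6, so it no longer conflicts with DIVD, which still reads the old F6; only the true (RAW) dependences through RS1 and RS2 remain.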

Big picture on ILP for now

1) Trying to improve performance by exploiting ILP (overlapping execution of instructions as much as possible)

2) So far, issue one instruction per cycle and overlap execution of many by pipelining

- We will broaden it later to allow issue of multiple instructions per cycle (superscalar processors)

3) Problem: pipelining hazards (WAW, RAW, WAR) and the resulting stalls, which are especially painful with relatively large memory latencies (less of a problem earlier, more of a problem now, a consequence of the memory wall) and FP latencies (was a problem earlier, less of a problem now)

4) Let’s reorder instructions (without affecting program correctness) so that instructions overlap better and hazards are reduced

- Scoreboard allows reordering but deals somewhat efficiently only with RAW, not with WAW and WAR

- WAW and WAR improved in Tomasulo via register renaming (essentially storing the register value instead of the register label, as in the scoreboard). Also more efficient due to forwarding (common data bus)

- However, both scoreboard and Tomasulo are limited in reordering given the typically small number of instructions in basic blocks (i.e. cannot reorder across branches)

Tomasulo Organization

[Figure: Tomasulo organization block diagram, plus register status]

Reservation Station Components

Op: Operation to perform in the unit (e.g., + or –)

Vj, Vk: Values of source operands

– Store buffers have V fields, results to be stored

Qj, Qk: Reservation stations producing source registers (value to be written)

– No ready flags as in Scoreboard; Qj, Qk = 0 => ready

– Store buffers only have Qi for RS producing result

Busy: Indicates reservation station or FU is busy

Register result status: Indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.

Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue

– If reservation station free (no structural hazard), control issues instr & sends operands (renames registers)

2. Execute: operate on operands (EX)

– When both operands ready, then execute; if not ready, watch Common Data Bus for result

3. Write result: finish execution (WB)

– Write on Common Data Bus to all awaiting units; mark reservation station available

• Normal data bus: data + destination (“go to” bus)

• Common data bus: data + source (“come from” bus)

– 64 bits of data + 4 bits of Functional Unit source address

– Write if matches expected Functional Unit (produces result)

– Does the broadcast
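The issue and write-result stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the 360/91 logic: the `ReservationStation` class, the `issue` and `broadcast` helpers, the use of `None` to stand for Qj/Qk = 0, and the register values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    Vj: Optional[float] = None
    Vk: Optional[float] = None
    Qj: Optional[str] = None   # None plays the role of Qj = 0 (operand ready)
    Qk: Optional[str] = None

    def ready(self):
        """Execute stage may start once both operands have arrived."""
        return self.busy and self.Qj is None and self.Qk is None


def issue(rs, op, src_j, src_k, dest, regs, reg_status):
    """Issue: copy ready operand values, or record the tag of the producing station."""
    rs.busy, rs.op = True, op
    for src, v_field, q_field in ((src_j, "Vj", "Qj"), (src_k, "Vk", "Qk")):
        tag = reg_status.get(src)
        if tag is None:
            setattr(rs, v_field, regs[src])
            setattr(rs, q_field, None)
        else:
            setattr(rs, v_field, None)
            setattr(rs, q_field, tag)
    reg_status[dest] = rs.name   # renaming: dest now refers to this station


def broadcast(tag, value, stations, regs, reg_status):
    """Write result: the CDB carries (source tag, value) to every waiting consumer."""
    for rs in stations:
        if rs.Qj == tag:
            rs.Vj, rs.Qj = value, None
        if rs.Qk == tag:
            rs.Vk, rs.Qk = value, None
        if rs.name == tag:
            rs.busy = False      # producing station becomes available again
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            regs[reg] = value
            del reg_status[reg]


# Hypothetical state: F2 is still being loaded by Load2 when MULTD F0,F2,F4 issues.
regs = {"F0": 0.0, "F2": 0.0, "F4": 2.0}
reg_status = {"F2": "Load2"}
mult1 = ReservationStation("Mult1")
issue(mult1, "MULTD", "F2", "F4", "F0", regs, reg_status)
waiting_on = mult1.Qj            # tag that Mult1 watches for on the CDB
broadcast("Load2", 3.0, [mult1], regs, reg_status)
```

The consumer matches on the source tag (“come from”) rather than being addressed by the producer, which is why one broadcast can satisfy any number of waiting stations.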

Tomasulo Example

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2
LD     F2    45+  R3
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
0

Tomasulo Example Cycle 1

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1
LD     F2    45+  R3
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
1                        Load1

Tomasulo Example Cycle 2

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1
LD     F2    45+  R3   2
MULTD  F0    F2   F4
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
2             Load2      Load1

Note: Unlike 6600, can have multiple loads outstanding (this was not an inherent limitation of scoreboarding)

Tomasulo Example Cycle 3

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3
LD     F2    45+  R3   2
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Load buffers:
Name   Busy  Address
Load1  Yes   34+R2
Load2  Yes   45+R3
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  Yes   MULTD           R(F4)    Load2
      Mult2  No

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
3      Mult1  Load2      Load1

• Note: register names are removed (“renamed”) in Reservation Stations; MULTD issued (vs. scoreboard)

• Load1 completing; what is waiting for Load1?

Tomasulo Example Cycle 4

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4
DIVD   F10   F0   F6
ADDD   F6    F8   F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  Yes   45+R3
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   Yes   SUBD   M(A1)                    Load2
      Add2   No
      Add3   No
      Mult1  Yes   MULTD           R(F4)    Load2
      Mult2  No

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
4      Mult1  Load2      M(A1)    Add1

• Load2 completing; what is waiting for Load2?

Tomasulo Example Cycle 5

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
2     Add1   Yes   SUBD   M(A1)    M(A2)
      Add2   No
      Add3   No
10    Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
5      Mult1  M(A2)      M(A1)    Add1   Mult2

Tomasulo Example Cycle 6

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
1     Add1   Yes   SUBD   M(A1)    M(A2)
      Add2   Yes   ADDD            M(A2)    Add1
      Add3   No
9     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
6      Mult1  M(A2)      Add2     Add1   Mult2

• Issue ADDD here?

Tomasulo Example Cycle 7

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4      7
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
0     Add1   Yes   SUBD   M(A1)    M(A2)
      Add2   Yes   ADDD            M(A2)    Add1
      Add3   No
8     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
7      Mult1  M(A2)      Add2     Add1   Mult2

• Add1 completing; what is waiting for it?

Tomasulo Example Cycle 8

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
2     Add2   Yes   ADDD   (M-M)    M(A2)
      Add3   No
7     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
8      Mult1  M(A2)      Add2     (M-M)  Mult2

Tomasulo Example Cycle 9

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
1     Add2   Yes   ADDD   (M-M)    M(A2)
      Add3   No
6     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
9      Mult1  M(A2)      Add2     (M-M)  Mult2

Tomasulo Example Cycle 10

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
0     Add2   Yes   ADDD   (M-M)    M(A2)
      Add3   No
5     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
10     Mult1  M(A2)      Add2     (M-M)  Mult2

• Add2 completing; what is waiting for it?

Tomasulo Example Cycle 11

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
4     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
11     Mult1  M(A2)      (M-M+M)  (M-M)  Mult2

• Write result of ADDD here?

• All quick instructions complete in this cycle!

Tomasulo Example Cycle 12

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
3     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
12     Mult1  M(A2)      (M-M+M)  (M-M)  Mult2

Tomasulo Example Cycle 13

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
2     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
13     Mult1  M(A2)      (M-M+M)  (M-M)  Mult2

Tomasulo Example Cycle 14

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
1     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
14     Mult1  M(A2)      (M-M+M)  (M-M)  Mult2

Tomasulo Example Cycle 15

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3      15
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
0     Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
15     Mult1  M(A2)      (M-M+M)  (M-M)  Mult2

Tomasulo Example Cycle 16

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3      15         16
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
40    Mult2  Yes   DIVD   M*F4     M(A1)

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
16     M*F4   M(A2)      (M-M+M)  (M-M)  Mult2

Faster than light computation (skip a couple of cycles)

Tomasulo Example Cycle 55

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3      15         16
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
1     Mult2  Yes   DIVD   M*F4     M(A1)

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
55     M*F4   M(A2)      (M-M+M)  (M-M)  Mult2

Tomasulo Example Cycle 56

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3      15         16
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5      56
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  Yes   DIVD   M*F4     M(A1)

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
56     M*F4   M(A2)      (M-M+M)  (M-M)  Mult2

• Mult2 is completing; what is waiting for it?

Tomasulo Example Cycle 57

Instruction status:
Instruction  j    k    Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      3          4
LD     F2    45+  R3   2      4          5
MULTD  F0    F2   F4   3      15         16
SUBD   F8    F6   F2   4      7          8
DIVD   F10   F0   F6   5      56         57
ADDD   F6    F8   F2   6      10         11

Load buffers:
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations (Vj, Vk = S1, S2 values; Qj, Qk = producing RS):
Time  Name   Busy  Op     Vj       Vk       Qj     Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  Yes   DIVD   M*F4     M(A1)

Register result status:
Clock  F0     F2     F4  F6       F8     F10    F12  ...  F30
57     M*F4   M(A2)      (M-M+M)  (M-M)  Result

• Once again: In-order issue, out-of-order execution and completion

Compare to Scoreboard Cycle 62

Instruction status:
                       Scoreboard                                 Tomasulo
Instruction  j    k    Issue  Read Oper  Exec Comp  Write Result  Issue  Exec Comp  Write Result
LD     F6    34+  R2   1      2          3          4             1      3          4
LD     F2    45+  R3   5      6          7          8             2      4          5
MULTD  F0    F2   F4   6      9          19         20            3      15         16
SUBD   F8    F6   F2   7      9          11         12            4      7          8
DIVD   F10   F0   F6   8      21         61         62            5      56         57
ADDD   F6    F8   F2   13     14         16         22            6      10         11

• Why take longer on scoreboard/6600?

– Inefficient handling of WAR and WAW hazards

– Structural hazards

– Lack of forwarding

Tomasulo vs. Scoreboard (IBM 360/91 vs. CDC 6600)

Tomasulo (IBM 360/91)               Scoreboard (CDC 6600)
Pipelined functional units          Multiple functional units
(6 load, 3 store, 3 +, 2 x/÷)       (1 load/store, 1 +, 2 x, 1 ÷)
Window size: ≤ 14 instructions      ≤ 5 instructions
No issue on structural hazard       Same
WAR: renaming avoids                Stall completion
WAW: renaming avoids                Stall issue
Broadcast results from FU           Write/read registers
Control: reservation stations       Central scoreboard

Tomasulo Drawbacks

• Complexity

– Delays of 360/91, MIPS 10000, IBM 620?

• Many associative stores (CDB) at high speed

• Performance limited by Common Data Bus

– Each CDB must go to multiple functional units => high capacitance, high wiring density

– Number of functional units that can complete per cycle limited to one!

(Multiple CDBs => more FU logic for parallel assoc stores)

• Non-precise interrupts!

Summary on Tomasulo

• Reservation stations: implicit register renaming to larger set of registers + buffering source operands

– Prevents registers as bottleneck

– Avoids WAR, WAW hazards of Scoreboard

– Allows loop unrolling in HW (when branch can be quickly resolved)

• Helps cache misses as well

• Lasting contributions

– Dynamic scheduling

– Register renaming

– Load/store disambiguation

• 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264

Summary on ILP for now

1) Trying to improve performance by exploiting ILP (overlapping execution of instructions as much as possible)

2) So far, issue one instruction per cycle and overlap execution of many by pipelining

- We will broaden it later to allow issue of multiple instructions per cycle (superscalar processors)

3) Problem: pipelining hazards (WAW, RAW, WAR) and the resulting stalls, which are especially painful with relatively large memory latencies (less of a problem earlier, more of a problem now, a consequence of the memory wall) and FP latencies (was a problem earlier, less of a problem now)

4) Let’s reorder instructions (without affecting program correctness) so that instructions overlap better and hazards are reduced

- Scoreboard allows reordering but deals somewhat efficiently only with RAW, not with WAW and WAR

- WAW and WAR improved in Tomasulo via register renaming (essentially storing the register value instead of the register label, as in the scoreboard). Also more efficient due to forwarding (common data bus)

- However, both scoreboard and Tomasulo are limited in reordering given the typically small number of instructions in basic blocks (i.e. cannot reorder across branches)

5) Solution: execute instructions speculatively so that more instructions are available for reordering

- Need a mechanism to recover from wrong execution

Exploiting More ILP

• Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP (will look at branch prediction implementations next week)

• One way to overcome control dependencies is with speculation

– Make a guess and execute program as if our guess is correct

– Need mechanisms to handle the case when the speculation is incorrect

• Can do some speculation in the compiler

– Reordered / duplicated instructions around branches

Hardware-Based Speculation

• Extends the idea of dynamic scheduling with three key ideas:

1. Dynamic branch prediction

2. Speculation to allow the execution of instructions before control dependencies are resolved

3. Dynamic scheduling to deal with scheduling different combinations of basic blocks

• What we saw earlier was within a basic block

• Modern processors started using speculation around the introduction of the PowerPC 603 and Intel Pentium II, and extend Tomasulo’s approach to support speculation

Speculating with Tomasulo

• Separate execution from completion

– Allow instructions to execute speculatively but do not let instructions update registers or memory until they are no longer speculative

• Instruction commit

– After an instruction is no longer speculative, it is allowed to make register and memory updates

• Allow instructions to execute and complete out of order but force them to commit in order

• Add a hardware buffer, called the reorder buffer (ROB), with registers to hold the result of an instruction between completion and commit

– Acts as a FIFO queue in the order issued


Original Tomasulo Architecture

Tomasulo and Reorder Buffer

• Sits between Execution and Register File

• Source of operands

• In this case integrated with Store buffer

• Reservation stations use ROB slot as a tag

• Instructions commit at head of ROB FIFO queue

– Easy to undo speculated instructions on mispredicted branches or on exceptions

ROB Data Structure

• Instruction Type Field

– Indicates whether the instruction is a branch, store, or register operation

• Destination Field

– Register number for loads, ALU ops, or memory address for stores

• Value Field

– Holds the value of the instruction result until instruction commits

• Ready Field

– Indicates if instruction has completed execution and the value is ready
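The four fields map naturally onto a small record type. The sketch below is one possible encoding, given as an illustration: the names `InstrType`, `ROBEntry` and `complete` are assumptions of this example and do not come from the slides.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class InstrType(Enum):
    BRANCH = "branch"
    STORE = "store"
    REGISTER = "register"   # loads and ALU operations


@dataclass
class ROBEntry:
    instr_type: InstrType                     # instruction type field
    destination: Optional[Union[str, int]]    # register name, or memory address for stores
    value: Optional[float] = None             # result held until commit
    ready: bool = False                       # True once execution has completed

    def complete(self, value):
        """Record the result broadcast for this entry and mark it ready."""
        self.value = value
        self.ready = True


entry = ROBEntry(InstrType.REGISTER, "F0")    # e.g. the slot allocated to a load into F0
ready_before = entry.ready
entry.complete(1.5)                           # hypothetical loaded value
```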

Instruction Execution

1. Issue: Get an instruction from the Instruction Queue

– If the reservation station and the ROB have a free slot (no structural hazard), issue the instruction to the reservation station and the ROB; send operands to the reservation station if available in the register file or the ROB. The allocated ROB slot number is sent to the reservation station to use as a tag when placing data on the CDB.

2. Execution: Operate on operands (EX)

– When both operands are ready, execute; if not ready, watch the CDB for the result

3. Write result: Finish execution (WB)

– Write on the CDB to all awaiting units and to the ROB using the tag; mark reservation station available

4. Commit: Update register or memory with the ROB result

– When an instruction reaches the head of the ROB and results are present, update the register with the result or store to memory and remove the instruction from the ROB

– If an incorrectly predicted branch reaches the head of the ROB, flush the ROB, and restart at the correct successor of the branch
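The commit step is the part that keeps speculative results out of architectural state, and it can be sketched compactly. This is a simplified model under assumptions: ROB entries are plain dictionaries, the `commit` function and the `mispredicted` / `target` keys are hypothetical, and all ready entries at the head are retired in one call.

```python
from collections import deque


def commit(rob, regs, memory):
    """Retire ready entries from the head of the ROB, in order.

    Returns the restart target if a mispredicted branch reaches the head
    (after flushing the ROB); otherwise returns None.
    """
    while rob and rob[0]["ready"]:
        head = rob.popleft()
        if head["type"] == "branch":
            if head["mispredicted"]:
                rob.clear()              # discard all younger, speculative entries
                return head["target"]
        elif head["type"] == "store":
            memory[head["dest"]] = head["value"]
        else:
            regs[head["dest"]] = head["value"]
    return None


regs, memory = {}, {}
rob = deque([
    {"type": "reg", "dest": "F0", "value": 1.0, "ready": True},
    {"type": "branch", "ready": True, "mispredicted": True, "target": "L"},
    {"type": "reg", "dest": "F4", "value": 9.0, "ready": True},   # speculative, wrong path
])
restart_at = commit(rob, regs, memory)

# An entry that is not ready at the head blocks everything behind it.
rob2 = deque([
    {"type": "reg", "dest": "F2", "value": None, "ready": False},
    {"type": "store", "dest": 100, "value": 3.0, "ready": True},
])
blocked = commit(rob2, regs, memory)
```

In the first queue only F0 reaches the register file: the wrong-path write to F4 is discarded even though it had finished executing. In the second, the ready store waits behind the unfinished head, which is what in-order commit means.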

Blue text = Change from Tomasulo

Tomasulo With ROB

[Figure: FP Op Queue, Registers, Reorder Buffer (newest at top, oldest at bottom), Reservation Stations feeding FP adders and FP multipliers, load path from memory, store path to memory]

Reorder Buffer:
Entry       Dest  Instruction        State
ROB7..ROB2  (empty)
ROB1        F0    LD F0, 10(R2)      I

Load buffer (from memory):
Dest  Address
1     10+R2

Page 101

Tomasulo With ROB

Reorder Buffer (newest entry first)
Entry  Dest  Value  Instruction          State
ROB2 to ROB7: empty
ROB1   F0    --     LD F0, 10(R2)        E

Reservation Stations: FP adders empty, FP multipliers empty

Load buffer (From Memory; Dest, Address)
1  10+R2

Page 102

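Tomasulo With ROB

Reorder Buffer (newest entry first)
Entry  Dest  Value  Instruction          State
ROB3 to ROB7: empty
ROB2   F10   --     ADDD F10,F4,F0       I
ROB1   F0    --     LD F0, 10(R2)        E

Reservation Stations (Dest, Op, Operands; R(Fn) is a value read from register Fn, a bare number is the ROB tag of a pending result)
FP adders:      2  ADDD  R(F4),1
FP multipliers: empty

Load buffer (From Memory; Dest, Address)
1  10+R2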

Page 103

Tomasulo With ROB

Reorder Buffer (newest entry first)
Entry  Dest  Value  Instruction          State
ROB4 to ROB7: empty
ROB3   F2    --     MULD F2,F10,F6       I
ROB2   F10   --     ADDD F10,F4,F0       I
ROB1   F0    --     LD F0, 10(R2)        E

Reservation Stations (Dest, Op, Operands)
FP adders:      2  ADDD  R(F4),1
FP multipliers: 3  MULD  2,R(F6)

Load buffer (From Memory; Dest, Address)
1  10+R2

Page 104

Tomasulo With ROB

Reorder Buffer (newest entry first)
Entry  Dest  Value  Instruction          State
ROB7: empty
ROB6   F0    --     ADDD F0,F4,F6        I
ROB5   F4    --     LD F4,0(R3)          E
ROB4   --    --     BNE F0, 0, L         I
ROB3   F2    --     MULD F2,F10,F6       I
ROB2   F10   --     ADDD F10,F4,F0       I
ROB1   F0    --     LD F0, 10(R2)        E

Reservation Stations (Dest, Op, Operands)
FP adders:      2  ADDD  R(F4),1
                6  ADDD  5,R(F6)
FP multipliers: 3  MULD  2,R(F6)

Load buffer (From Memory; Dest, Address)
1  10+R2
5  0+R3

Page 105

Tomasulo With ROB

Reorder Buffer (newest entry first)
Entry  Dest  Value  Instruction          State
ROB7   [R3]  ROB5   ST F4, 0(R3)         I
ROB6   F0    --     ADDD F0,F4,F6        I
ROB5   F4    --     LD F4,0(R3)          E
ROB4   --    --     BNE F0, 0, L         I
ROB3   F2    --     MULD F2,F10,F6       I
ROB2   F10   --     ADDD F10,F4,F0       I
ROB1   F0    --     LD F0, 10(R2)        E

Reservation Stations (Dest, Op, Operands)
FP adders:      2  ADDD  R(F4),1
                6  ADDD  5,R(F6)
FP multipliers: 3  MULD  2,R(F6)

Load buffer (From Memory; Dest, Address)
1  10+R2
5  0+R3

Page 106

Tomasulo With ROB

Reorder Buffer (newest entry first)
Entry  Dest  Value  Instruction          State
ROB7   [R3]  V1     ST F4, 0(R3)         W
ROB6   F0    --     ADDD F0,F4,F6        I
ROB5   F4    V1     LD F4,0(R3)          W
ROB4   --    --     BNE F0, 0, L         I
ROB3   F2    --     MULD F2,F10,F6       I
ROB2   F10   --     ADDD F10,F4,F0       I
ROB1   F0    --     LD F0, 10(R2)        E

Reservation Stations (Dest, Op, Operands)
FP adders:      2  ADDD  R(F4),1
                6  ADDD  V1,R(F6)
FP multipliers: 3  MULD  2,R(F6)

Load buffer (From Memory; Dest, Address)
1  10+R2

Page 107

Tomasulo With ROB

Reorder Buffer (newest entry first)
Entry  Dest  Value  Instruction          State
ROB7   [R3]  V1     ST F4, 0(R3)         W
ROB6   F0    --     ADDD F0,F4,F6        E
ROB5   F4    V1     LD F4,0(R3)          W
ROB4   --    --     BNE F0, 0, L         I
ROB3   F2    --     MULD F2,F10,F6       I
ROB2   F10   --     ADDD F10,F4,F0       I
ROB1   F0    --     LD F0, 10(R2)        E

Reservation Stations (Dest, Op, Operands)
FP adders:      2  ADDD  R(F4),1
                6  ADDD  V1,R(F6)
FP multipliers: 3  MULD  2,R(F6)

Load buffer (From Memory; Dest, Address)
1  10+R2

Page 108

Tomasulo With ROB

Reorder Buffer (newest entry first)
Entry  Dest  Value  Instruction          State
ROB7   [R3]  V1     ST F4, 0(R3)         W
ROB6   F0    V2     ADDD F0,F4,F6        W
ROB5   F4    V1     LD F4,0(R3)          W
ROB4   --    --     BNE F0, 0, L         I
ROB3   F2    --     MULD F2,F10,F6       I
ROB2   F10   --     ADDD F10,F4,F0       I
ROB1   F0    --     LD F0, 10(R2)        E

Reservation Stations (Dest, Op, Operands)
FP adders:      2  ADDD  R(F4),1
FP multipliers: 3  MULD  2,R(F6)

Load buffer (From Memory; Dest, Address)
1  10+R2

Page 109

Tomasulo With ROB

Reorder Buffer (newest entry first)
Entry  Dest  Value  Instruction          State
ROB7   [R3]  V1     ST F4, 0(R3)         W
ROB6   F0    V2     ADDD F0,F4,F6        W
ROB5   F4    V1     LD F4,0(R3)          W
ROB4   --    --     BNE F0, 0, L         I
ROB3   F2    --     MULD F2,F10,F6       I
ROB2   F10   --     ADDD F10,F4,F0       I
ROB1   F0    V3     LD F0, 10(R2)        W

Reservation Stations (Dest, Op, Operands)
FP adders:      2  ADDD  R(F4),V3
FP multipliers: 3  MULD  2,R(F6)

Load buffer (From Memory): empty

Page 110

Tomasulo With ROB

Reorder Buffer (newest entry first)
Entry  Dest  Value  Instruction          State
ROB7   [R3]  V1     ST F4, 0(R3)         W
ROB6   F0    V2     ADDD F0,F4,F6        W
ROB5   F4    V1     LD F4,0(R3)          W
ROB4   --    --     BNE F0, 0, L         E
ROB3   F2    --     MULD F2,F10,F6       I
ROB2   F10   --     ADDD F10,F4,F0       E
ROB1   F0    V3     LD F0, 10(R2)        C

Reservation Stations (Dest, Op, Operands)
FP adders:      2  ADDD  R(F4),V3
FP multipliers: 3  MULD  2,R(F6)

Load buffer (From Memory): empty

Registers: F0=V3

Page 111

Tomasulo With ROB

Reorder Buffer (newest entry first)
Entry  Dest  Value  Instruction          State
ROB7   [R3]  V1     ST F4, 0(R3)         W
ROB6   F0    V2     ADDD F0,F4,F6        W
ROB5   F4    V1     LD F4,0(R3)          W
ROB4   --    --     BNE F0, 0, L         W
ROB3   F2    --     MULD F2,F10,F6       I
ROB2   F10   V4     ADDD F10,F4,F0       W
ROB1   F0    V3     LD F0, 10(R2)        C

Reservation Stations (Dest, Op, Operands)
FP adders:      empty
FP multipliers: 3  MULD  V4,R(F6)

Load buffer (From Memory): empty

Registers: F0=V3

Page 112

Tomasulo With ROB

Reorder Buffer (newest entry first)
Entry  Dest  Value  Instruction          State
ROB7   [R3]  V1     ST F4, 0(R3)         W
ROB6   F0    V2     ADDD F0,F4,F6        W
ROB5   F4    V1     LD F4,0(R3)          W
ROB4   --    --     BNE F0, 0, L         W
ROB3   F2    --     MULD F2,F10,F6       E
ROB2   F10   V4     ADDD F10,F4,F0       C
ROB1   F0    V3     LD F0, 10(R2)        C

Reservation Stations (Dest, Op, Operands)
FP adders:      empty
FP multipliers: 3  MULD  V4,R(F6)

Load buffer (From Memory): empty

Registers: F0=V3, F10=V4

Page 113

Tomasulo With ROB

Reorder Buffer (newest entry first)
Entry  Dest  Value  Instruction          State
ROB7   [R3]  V1     ST F4, 0(R3)         W
ROB6   F0    V2     ADDD F0,F4,F6        W
ROB5   F4    V1     LD F4,0(R3)          W
ROB4   --    --     BNE F0, 0, L         W
ROB3   F2    V5     MULD F2,F10,F6       W
ROB2   F10   V4     ADDD F10,F4,F0       C
ROB1   F0    V3     LD F0, 10(R2)        C

Reservation Stations: FP adders empty, FP multipliers empty

Load buffer (From Memory): empty

Registers: F0=V3, F10=V4

Page 114

Tomasulo With ROB

Reorder Buffer (newest entry first)
Entry  Dest  Value  Instruction          State
ROB7   [R3]  V1     ST F4, 0(R3)         W
ROB6   F0    V2     ADDD F0,F4,F6        W
ROB5   F4    V1     LD F4,0(R3)          W
ROB4   --    --     BNE F0, 0, L         W
ROB3   F2    V5     MULD F2,F10,F6       C
ROB2   F10   V4     ADDD F10,F4,F0       C
ROB1   F0    V3     LD F0, 10(R2)        C

Reservation Stations: FP adders empty, FP multipliers empty

Load buffer (From Memory): empty

Registers: F0=V3, F10=V4, F2=V5

Page 115

Tomasulo With ROB

Reorder Buffer (newest entry first)
Entry  Dest  Value  Instruction          State
ROB7   [R3]  V1     ST F4, 0(R3)         W
ROB6   F0    V2     ADDD F0,F4,F6        W
ROB5   F4    V1     LD F4,0(R3)          W
ROB4   --    --     BNE F0, 0, L         C
ROB3   F2    V5     MULD F2,F10,F6       C
ROB2   F10   V4     ADDD F10,F4,F0       C
ROB1   F0    V3     LD F0, 10(R2)        C

Reservation Stations: FP adders empty, FP multipliers empty

Load buffer (From Memory): empty

Registers: F0=V3, F10=V4, F2=V5

Page 116

Avoiding Memory Hazards

• A store only updates memory when it reaches the head of the ROB
– Otherwise WAW and WAR hazards are possible
– By waiting to reach the head, memory is updated in order and no earlier loads or stores can still be pending

• If a load accesses a memory location written to by an earlier store, it cannot perform the memory access until the store has written the data
– Prevents RAW hazards through memory
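The load rule above can be sketched as a scan over the ROB entries that precede the load. The names `MemOp` and `load_can_access_memory` are hypothetical. The sketch also assumes that a store whose address is not yet known must block the load, which is a conservative choice the slides do not specify.

```python
from typing import List, NamedTuple, Optional


class MemOp(NamedTuple):
    kind: str                # "load", "store", or "other"
    address: Optional[int]   # None if the address has not been computed yet


def load_can_access_memory(rob: List[MemOp], load_index: int) -> bool:
    """Check a load against every earlier store still in the ROB.

    `rob` is ordered oldest first. A store leaves the ROB only when it
    commits, which is also when it writes memory, so any earlier store
    still present has not yet written its data.
    """
    load = rob[load_index]
    for earlier in rob[:load_index]:
        if earlier.kind != "store":
            continue
        # Assumption: an unknown store address is treated as a conflict.
        if earlier.address is None or earlier.address == load.address:
            return False
    return True


rob = [MemOp("store", 100), MemOp("load", 100), MemOp("load", 200)]
blocked = load_can_access_memory(rob, 1)   # same address as the pending store
free = load_can_access_memory(rob, 2)      # different address, may proceed
```

Once the store commits and leaves the ROB, the same check on the remaining entries lets the first load proceed.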

Page 117

Reorder Buffer Implementation

• In practice
– Try to recover as early as possible after a branch is mispredicted rather than wait until the branch reaches the head
– Performance in speculative processors is more sensitive to branch prediction
  • Higher cost of misprediction

• Exceptions (will look into that next week)
– Don't recognize an exception until the instruction is ready to commit
– Could try to handle exceptions as they arise and earlier branches are resolved, but this is more challenging
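One way to picture early recovery is to discard only the ROB entries younger than the mispredicted branch as soon as the misprediction is detected, while older entries continue toward commit. The sketch below is an assumption about how that could look, not the slides' design: `squash_younger` is a hypothetical name, and cleaning up reservation stations and register renaming state is left out.

```python
from collections import deque


def squash_younger(rob: deque, branch_tag: int) -> list:
    """Remove every entry issued after the mispredicted branch.

    `rob` holds dicts with a "tag" key, oldest on the left. Returns the
    tags of the squashed entries, newest first.
    """
    if all(entry["tag"] != branch_tag for entry in rob):
        raise ValueError("branch is not in the ROB")
    squashed = []
    while rob[-1]["tag"] != branch_tag:
        squashed.append(rob.pop()["tag"])
    return squashed


# Seven entries, with the branch in slot 4 found to be mispredicted.
rob = deque({"tag": t} for t in range(1, 8))
squashed = squash_younger(rob, branch_tag=4)
```

Entries 1 through 4 stay in the buffer and commit in order; fetch would restart at the correct successor of the branch.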

Page 118

Acknowledgements

Some of the slides contain material developed and copyrighted by Sally A. McKee and K. Mock (Cornell University) and instructor material for the textbook
