Lecture 8: Advanced Pipeline
Pipeline Complications CS510 Computer Architectures Lecture 8 - 1
Lecture 8: Advanced Pipeline
Extending the DLX to Handle Multi-cycle Operations

[Figure: the DLX pipeline (IF ID EX MEM WB) with three additional unpipelined floating-point functional units alongside the EX integer unit: an EX FP/integer multiplier, an EX FP adder, and an EX FP/integer divider]
Multicycle Operations

[Figure: the DLX EX stage expanded into four functional units between ID and MEM: the integer unit (EX, one stage), the FP/integer multiplier (stages M1 M2 M3 M4 M5 M6 M7), the FP adder (stages A1 A2 A3 A4), and the FP/integer divider (DIV, unpipelined, 24 clock cycles)]
Latency and Initiation Interval

Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses that result. (It runs from the point where the result becomes available to the point where a dependent instruction needs the data.)
Initiation interval: the number of cycles that must elapse between issuing two operations of a given type.

  Functional unit   Latency   Initiation interval
  Integer ALU           0             1
  Load                  1             1
  FP add                3             1
  FP multiply           6             1
  FP divide            24            25

Example (stages from issue to result):

  MULTD  IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
  ADDD   IF ID A1 A2 A3 A4 MEM WB
  LD*    IF ID EX MEM WB
  SD*    IF ID EX MEM WB

  * FP LD and SD are the same as their integer counterparts, since there is a 64-bit path to memory.
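The two definitions can be made concrete with a small sketch (not part of the lecture) that turns the table above into a stall count; the unit names and the `raw_stalls` helper are invented for illustration:

```python
# DLX functional-unit table from this slide: unit -> (latency, initiation interval).
# Latency = intervening cycles needed between producer and dependent consumer.
DLX_FP = {
    "int ALU": (0, 1),
    "load":    (1, 1),
    "FP add":  (3, 1),
    "FP mul":  (6, 1),
    "FP div":  (24, 25),
}

def raw_stalls(producer_unit, gap):
    """Stall cycles seen by a dependent instruction issued `gap`
    cycles after the producer (gap=1 means the very next cycle)."""
    latency, _ = DLX_FP[producer_unit]
    return max(0, latency - (gap - 1))

print(raw_stalls("FP mul", 1))  # MULTD then a dependent ADDD next cycle: 6 stalls
print(raw_stalls("load", 1))    # LD then a dependent FP ALU op: 1 stall
print(raw_stalls("FP add", 4))  # consumer issued 4 cycles later: 0 stalls
```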
Floating-Point Operations (Reality: MIPS R4000)

  FP instruction    Latency   Initiation interval
  Add, subtract         4             3
  Multiply              8             4
  Divide               36            35
  Square root         112           111
  Negate                2             1
  Absolute value        2             1
  FP compare            3             2

  Latency: cycles before the result can be used.
  Initiation interval: cycles before issuing another instruction of the same type.

Floating-point operations have long execution times. A pipelined FP execution unit, however, may initiate new instructions without waiting for the full latency of earlier ones.
Complications Due to FP Operations (in DLX)

– Because the divide unit is not fully pipelined, structural hazards can occur.
– WAW hazards are possible, since instructions no longer reach WB in order. (WAR hazards are not possible, since register reads always occur in ID.)
– Instructions can complete in a different order than they were issued, causing problems with exceptions.
– Because of the longer latency of operations, stalls for RAW hazards are more frequent.
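A hypothetical sketch of the WAW point: with multi-cycle units, a later but shorter instruction can write a register before an earlier, longer one. The completion cycles below are illustrative, chosen from the DLX latencies (divide 24, add 3); the `find_waw` helper is invented for this example:

```python
def find_waw(instrs):
    """instrs: (name, dest_reg, completion_cycle) in program order.
    Returns pairs where a later write to the same register finishes first."""
    hazards = []
    for i, (ni, di, ci) in enumerate(instrs):
        for nj, dj, cj in instrs[i + 1:]:
            if di == dj and cj < ci:
                hazards.append((ni, nj))
    return hazards

stream = [
    ("DIVD F0,F2,F4", "F0", 26),  # issued cycle 0, ~24-cycle latency
    ("ADDD F0,F6,F8", "F0", 5),   # issued cycle 1, finishes long before DIVD
]
print(find_waw(stream))  # the ADDD write would be overwritten by DIVD
```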
Summary of Pipelining Basics

• Hazards limit performance
  – Structural: need more HW resources
  – Data: need forwarding, compiler scheduling
  – Control: early evaluation of PC, delayed branch, prediction
• Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency
• Interrupts and the FP instruction set make pipelining harder
• Compilers reduce the cost of data and control hazards
  – Load delay slots
  – Branch delay slots
  – Branch prediction
Case Study: MIPS R4000 and Introduction to Advanced Pipelining
Case Study: MIPS R4000 Pipeline

8-stage pipeline:
IF - First half of instruction fetch: PC selection, initiation of instruction cache access
IS - Second half of instruction fetch: access to instruction cache
RF - Instruction decode, register fetch, hazard checking, and instruction cache hit detection (tag check)
EX - Execution: effective address calculation, ALU operation, branch target computation and condition evaluation
DF - First half of access to data cache
DS - Second half of access to data cache
TC - Tag check for data cache hit
WB - Write back for loads and register-register operations
The Pipeline Structure of the R4000

  IF IS RF EX DF DS TC WB

[Figure: instruction memory spans IF-IS (the instruction is available at the end of IS), register read in RF, ALU in EX, data memory spans DF-DS (load data available at the end of DS), tag check in TC, register write in WB]
Case Study: MIPS R4000 — LOAD Latency

2-cycle load latency: load data is available (with forwarding) at the end of DS, but a dependent instruction needs it at the start of EX.

  LD  R1,X       IF IS RF EX    DF    DS TC WB
  ...                IF IS RF    EX    DF DS ...
  ADD R3,R1,R2          IF IS    RF    stall stall EX ...

An ADD that needs R1 stalls 2 cycles before entering EX.
Case Study: MIPS R4000 — LOAD Followed by ALU Instructions

2-cycle load latency with a forwarding circuit:

  LW  R1,...   IF IS RF EX    DF    DS    TC WB
  ADD R2,R1       IF IS RF    stall stall EX DF ...
  SUB R3,R1          IF IS    stall stall RF EX ...
  OR  R4,R1             IF    stall stall IS RF ...

The ADD stalls two cycles waiting for the load data, which is forwarded from the end of DS to EX; the SUB and OR stall behind it.
Case Study: MIPS R4000 — Branch Latency

The R4000 uses a predict-NOT-TAKEN strategy. The branch target address and condition are available only after the EX stage.
  NOT TAKEN: one-cycle delay slot.
  TAKEN: one-cycle delay slot followed by two stall cycles — a 3-cycle branch latency.

NOT TAKEN (delay slot only):

  Br instr      IF IS RF EX DF DS TC WB
  Delay slot       IF IS RF EX DF DS TC ...
  Br instr +2         IF IS RF EX DF DS ...
  Br instr +3            IF IS RF EX DF ...
  Br instr +4               IF IS RF EX ...

TAKEN (delay slot plus 2 stall cycles):

  Br instr         IF IS RF EX DF DS TC WB
  Delay slot          IF IS RF EX DF DS TC ...
  Stall
  Stall
  Br target instr              IF IS RF EX ...
Extending DLX to Handle Floating-Point Operations

[Figure: IF and ID feed four execution units in parallel — the integer unit (EX), the FP/integer multiplier, the FP adder, and the FP divider — all of which feed MEM and WB]
MIPS R4000 FP Unit

• FP adder, FP multiplier, FP divider
• The last step of the FP multiplier/divider uses the FP adder hardware
• 8 kinds of stages in the FP units (a single copy of each):

  Stage  Functional unit  Description
  A      FP adder         Mantissa ADD stage
  D      FP divider       Divide pipeline stage
  E      FP multiplier    Exception test stage
  M      FP multiplier    First stage of multiplier
  N      FP multiplier    Second stage of multiplier
  R      FP adder         Rounding stage
  S      FP adder         Operand shift stage
  U                       Unpack FP numbers
MIPS R4000 FP Pipe Stages

  FP instr        Stage sequence                              Latency
  Add, subtract   U  S+A  A+R  R+S                                4
  Multiply        U  E+M  M  M  M  N  N+A  R                      8
  Divide          U  A  R  D(x27)  D+A  D+R, D+A, D+R, A, R      36
  Square root     U  E  (A+R)(x108)  A  R                        112
  Negate          U  S                                            2
  Absolute value  U  S                                            2
  FP compare      U  A  R                                         3

  Stages: A mantissa ADD; D divide pipeline stage; E exception test; M multiplier first stage; N multiplier second stage; R rounding; S operand shift; U unpack FP numbers
Latency and Initiation Intervals

  FP instruction    Latency   Initiation interval
  Add, subtract         4             3
  Multiply              8             4
  Divide               36            35
  Square root         112           111
  Negate                2             1
  Absolute value        2             1
  FP compare            3             2
MIPS R4000 FP Pipe Stages: Issue and Stall Patterns

[Figure: issue diagram over clock cycles 0-12. A multiply occupies U M M M M N N+A R. Adds issued 1-3 cycles after it, or 6 or more cycles after it, issue without conflict (U S+A A+R R+S). An ADD issued 4 cycles after the multiply stalls 2 cycles, and one issued 5 cycles after stalls 1 cycle, because the multiply needs the single shared A stage in its N+A cycle.]
R4000 Performance

The pipeline does not achieve an ideal CPI of 1 because of:
– Load stalls
– Branch stalls: 2 cycles for each taken branch, plus unfilled or cancelled branch delay slots
– FP result stalls: RAW data hazards (latency)
– FP structural stalls: not enough FP hardware (parallelism)

[Figure: pipeline CPI (0 to 4.5) broken down into base, load stalls, branch stalls, FP result stalls, and FP structural stalls, for the integer programs eqntott, espresso, gcc, and li, and the floating-point programs doduc, nasa7, ora, spice2g6, su2cor, and tomcatv]
Advanced Pipeline and Instruction Level Parallelism
Advanced Pipelining and Instruction Level Parallelism

• In gcc, 17% of instructions are control transfers: roughly 5 instructions plus 1 branch per basic block
  – We must look beyond a single basic block to get more instruction-level parallelism
• Loop-level parallelism is one opportunity, exploitable in SW and HW

[Figure: a basic block — a straight-line sequence of instructions entered only at a branch target and exited only at a branch instruction]
Advanced Pipelining and Instruction Level Parallelism

  Technique                                     Reduces
  Loop unrolling                                Control stalls
  Basic pipeline scheduling                     RAW stalls
  Dynamic scheduling with scoreboarding         RAW stalls
  Dynamic scheduling with register renaming     WAR and WAW stalls
  Dynamic branch prediction                     Control stalls
  Issuing multiple instructions per cycle       Ideal CPI
  Compiler dependence analysis                  Ideal CPI and data stalls
  Software pipelining and trace scheduling      Ideal CPI and data stalls
  Speculation                                   All data and control stalls
  Dynamic memory disambiguation                 RAW stalls involving memory
Basic Pipeline Scheduling and Loop Unrolling

FP unit latencies:

  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU op                      Another FP ALU op               3
  FP ALU op                      Store double                    2
  Load double*                   FP ALU op                       1
  Load double*                   Store double                    0

  * Same as integer load, since there is a 64-bit data path from/to memory.

The functional units are fully pipelined or replicated, so there are no structural hazards and an instruction can issue on every clock cycle.

Running example:

  for (i = 1; i <= 1000; i++)
      x[i] = x[i] + s;
FP Loop Hazards

  Loop: LD   F0,0(R1)   ;R1 is the pointer to a vector
        ADDD F4,F0,F2   ;F2 contains a scalar value
        SD   0(R1),F4   ;store back result
        SUBI R1,R1,#8   ;decrement pointer by 8 bytes (DW)
        BNEZ R1,Loop    ;branch if R1 != zero
        NOP             ;delayed branch slot

Where are the stalls?

  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU op                      Another FP ALU op               3
  FP ALU op                      Store double                    2
  Load double                    FP ALU op                       1
  Load double                    Store double                    0
  Integer op                     Integer op                      0
FP Loop Showing Stalls

   1 Loop: LD   F0,0(R1)   ;F0 = vector element
   2       stall
   3       ADDD F4,F0,F2   ;add scalar in F2
   4       stall
   5       stall
   6       SD   0(R1),F4   ;store result
   7       SUBI R1,R1,#8   ;decrement pointer by 8 bytes (DW)
   8       stall
   9       BNEZ R1,Loop    ;branch if R1 != zero
  10       stall           ;delayed branch slot

10 clock cycles per iteration.
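The 10-cycle count can be cross-checked with a small Python sketch (illustrative, not the lecture's tool) that applies the latency table to the loop body; the instruction "kinds" and the `LAT` keys are assumptions of this simple model:

```python
# Latency table: intervening cycles required between producer and consumer.
LAT = {
    ("LOAD",  "FPALU"):  1,
    ("FPALU", "STORE"):  2,
    ("INT",   "BRANCH"): 1,   # BNEZ needs the SUBI result
}

# (text, kind, dest, sources) for the loop body, in program order
loop = [
    ("LD   F0,0(R1)", "LOAD",   "F0", []),
    ("ADDD F4,F0,F2", "FPALU",  "F4", ["F0"]),
    ("SD   0(R1),F4", "STORE",  None, ["F4"]),
    ("SUBI R1,R1,#8", "INT",    "R1", []),
    ("BNEZ R1,Loop",  "BRANCH", None, ["R1"]),
]

produced = {}   # register -> (issue_cycle, producer_kind)
cycle = 0
for text, kind, dest, srcs in loop:
    earliest = cycle + 1
    for r in srcs:
        if r in produced:
            pcycle, pkind = produced[r]
            earliest = max(earliest, pcycle + 1 + LAT.get((pkind, kind), 0))
    stalls = earliest - (cycle + 1)
    cycle = earliest
    if dest:
        produced[dest] = (cycle, kind)
    print(f"cycle {cycle:2d}: {text}  ({stalls} stall(s))")

total = cycle + 1   # plus the delayed branch slot
print("total:", total, "cycles per iteration")
```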
Can we rewrite the code to minimize the stalls?
Reducing Stalls

   1 Loop: LD   F0,0(R1)
   2       stall           <- load delay slot
   3       ADDD F4,F0,F2
   4       stall
   5       stall
   6       SD   0(R1),F4
   7       SUBI R1,R1,#8
   8       stall
   9       BNEZ R1,Loop
  10       stall

For the Load-ALU latency (cycle 2), consider moving SUBI into the load delay slot. Can we? Yes: LD reads R1 before SUBI writes it. When we do this, we must change the immediate value in SD from 0 to 8, since SD now executes after the pointer has been decremented.

For the ALU-ALU latency (cycles 4-5), only one instruction is left to move, BNEZ. When we move it up, the SD instruction fills the delayed branch slot.
Revised FP Loop to Minimize Stalls

   1 Loop: LD   F0,0(R1)
   2       SUBI R1,R1,#8
   3       ADDD F4,F0,F2
   4       stall
   5       BNEZ R1,Loop    ;delayed branch
   6       SD   8(R1),F4   ;offset altered when moved past SUBI

6 clock cycles per iteration.

  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU op                      Another FP ALU op               3
  FP ALU op                      Store double                    2
  Load double                    FP ALU op                       1

Next: unroll the loop 4 times to make the code faster.
Unroll Loop 4 Times

   1 Loop: LD   F0,0(R1)
   2       ADDD F4,F0,F2
   3       SD   0(R1),F4     ;drop SUBI & BNEZ
   4       LD   F6,-8(R1)
   5       ADDD F8,F6,F2
   6       SD   -8(R1),F8    ;drop SUBI & BNEZ
   7       LD   F10,-16(R1)
   8       ADDD F12,F10,F2
   9       SD   -16(R1),F12  ;drop SUBI & BNEZ
  10       LD   F14,-24(R1)
  11       ADDD F16,F14,F2
  12       SD   -24(R1),F16
  13       SUBI R1,R1,#32    ;altered to 4*8
  14       BNEZ R1,Loop
  15       NOP

15 + 4 x (1 + 2) + 1 = 28 clock cycles, or 7 per element:
  1 cycle per copy:  LD-to-ADDD stall
  2 cycles per copy: ADDD-to-SD stalls
  1 cycle:           data dependence on R1 (SUBI to BNEZ)
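The cycle arithmetic quoted above can be checked directly; this sketch just restates the slide's own accounting (the stall counts come from the latency table):

```python
instrs    = 15   # 14 instructions plus the NOP in the branch delay slot
ld_addd   = 1    # LD -> ADDD stall, once per unrolled copy
addd_sd   = 2    # ADDD -> SD stalls, once per unrolled copy
subi_bnez = 1    # data dependence on R1

total = instrs + 4 * (ld_addd + addd_sd) + subi_bnez
print(total, total / 4)   # 28 clock cycles, 7.0 per element
```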
Rewrite loop to minimize the stalls
Unrolled Loop to Minimize Stalls

   1 Loop: LD   F0,0(R1)
   2       LD   F6,-8(R1)
   3       LD   F10,-16(R1)
   4       LD   F14,-24(R1)
   5       ADDD F4,F0,F2
   6       ADDD F8,F6,F2
   7       ADDD F12,F10,F2
   8       ADDD F16,F14,F2
   9       SD   0(R1),F4
  10       SD   -8(R1),F8
  11       SUBI R1,R1,#32
  12       SD   16(R1),F12   ;16 - 32 = -16
  13       BNEZ R1,LOOP
  14       SD   8(R1),F16    ;8 - 32 = -24

Assumptions:
– It is OK to move SD past SUBI even though SUBI changes R1 (the offset is adjusted to compensate):

    SUBI  IF ID EX MEM WB
    SD       IF ID EX MEM WB
    BNEZ        IF ID EX MEM WB

– It is OK to move loads before stores (we still get the right data).
– When is it safe for the compiler to make such changes?

14 clock cycles, or 3.5 per element.
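The progression across these slides can be summarized in a short sketch (the labels are mine, the numbers are the slides'):

```python
# Cycles per element for each version of the vector loop discussed above.
versions = {
    "original, with stalls":      10 / 1,
    "scheduled":                   6 / 1,
    "unrolled 4x, unscheduled":   28 / 4,
    "unrolled 4x and scheduled":  14 / 4,
}
for name, cpe in versions.items():
    print(f"{name:28s} {cpe:4.1f} cycles/element")
```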
Compiler Perspectives on Code Movement

• Definitions: the compiler is concerned about dependences in the program; whether a dependence causes a HW hazard depends on the given pipeline
• Data dependence (RAW if a hazard for HW): instruction j is data dependent on instruction i if either
  – instruction i produces a result used by instruction j, or
  – instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
• Easy to determine for registers (fixed names)
• Hard for memory:
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?
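The memory question is plain address arithmetic; this sketch (the `overlap` helper and the register values are invented for illustration) shows why the compiler usually cannot answer it, while at run time it is a trivial check:

```python
def overlap(base1, off1, base2, off2, width=8):
    """Do two `width`-byte (double-word) accesses at base+offset overlap?"""
    a, b = base1 + off1, base2 + off2
    return abs(a - b) < width

# The compiler rarely knows R4 and R6 at compile time; at run time the
# answer to "does 100(R4) = 20(R6)?" is easy.
R4, R6 = 1000, 1080
print(overlap(R4, 100, R6, 20))   # both name address 1100 -> True
R6 = 2000
print(overlap(R4, 100, R6, 20))   # 1100 vs 2020 -> False
```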
Compiler Perspectives on Code Movement

• Name dependence: two instructions use the same name (register or memory location) but do not exchange data
• Two kinds of name dependence, where instruction i precedes instruction j:
  – Antidependence (WAR if a hazard for HW): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first
  – Output dependence (WAW if a hazard for HW): instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved
Compiler Perspectives on Code Movement

• Again, hard for memory accesses:
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?
• Our example required the compiler to know that if R1 doesn't change, then

    0(R1) != -8(R1) != -16(R1) != -24(R1)

  There were then no dependences between some loads and stores, so they could be moved past each other.
Compiler Perspectives on Code Movement

• Control dependence example:

    if p1 { S1; };
    if p2 { S2; };

  S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Compiler Perspectives on Code Movement

• Two (obvious) constraints on control dependences:
  – An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.
  – An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution becomes controlled by the branch.
• Control dependences may be relaxed in some systems to get more parallelism; we get the same effect if we preserve the order of exceptions and the data flow.
When Is It Safe to Unroll a Loop?

• Example: when this loop is unrolled, where are the data dependences? (A, B, C are distinct, non-overlapping arrays)

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

1. S2 uses the value A[i+1] computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration: iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].

This is a loop-carried dependence between iterations:
• It implies that the iterations are dependent and can't be executed in parallel.
• This was not the case for our earlier example: each of its iterations was independent.
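A small runnable restatement of the loop (the array contents are invented for illustration) makes the loop-carried chain visible: each A[i+1] folds in every earlier C value, so iteration i+1 cannot start before iteration i finishes S1:

```python
N = 100
A = [1.0] * (N + 2)
B = [2.0] * (N + 2)
C = [0.5] * (N + 1)

for i in range(1, N + 1):
    A[i + 1] = A[i] + C[i]        # S1: reads A[i], written by the previous iteration
    B[i + 1] = B[i] + A[i + 1]    # S2: reads A[i+1] from the same iteration

# The recurrence means A[N+1] = A[1] + C[1] + ... + C[N]
print(A[N + 1])   # 1.0 + 100 * 0.5 = 51.0
```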
When Is It Safe to Unroll a Loop?

• Example: where are the data dependences? (A, B, C, D are distinct and non-overlapping)

The following looks like it has a loop-carried dependence:

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

However, we can rewrite it to be free of loop-carried dependences:

    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i = i + 1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];
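The claim that the rewrite preserves the computation can be checked with a quick sketch (the random array contents are invented for the test):

```python
import random

def original(A, B, C, D):
    for i in range(1, 101):
        A[i] = A[i] + B[i]         # S1
        B[i + 1] = C[i] + D[i]     # S2
    return A, B

def rewritten(A, B, C, D):
    A[1] = A[1] + B[1]
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]
        A[i + 1] = A[i + 1] + B[i + 1]
    B[101] = C[100] + D[100]
    return A, B

random.seed(0)
mk = lambda: [random.random() for _ in range(102)]
A, B, C, D = mk(), mk(), mk(), mk()
same = original(A[:], B[:], C, D) == rewritten(A[:], B[:], C, D)
print(same)   # True: same values, but no loop-carried dependence
```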
Summary

• Instruction-level parallelism can be exploited in SW or HW
• Loop-level parallelism is the easiest to see
• SW parallelism: dependences are defined for a program; hazards arise when HW cannot resolve them
• SW dependences and compiler sophistication determine whether the compiler can unroll loops