Lecture 8: Advanced Pipeline


Page 1: Lecture 8 Advanced Pipeline

CS510 Computer Architectures

Lecture 8: Advanced Pipeline

Page 2: Lecture 8 Advanced Pipeline


Advanced Pipelining and Instruction Level Parallelism

Technique                                      Reduces
Loop unrolling                                 Control stalls
Basic pipeline scheduling                      RAW stalls
Dynamic scheduling with scoreboarding          RAW stalls
Dynamic scheduling with register renaming      WAR and WAW stalls
Dynamic branch prediction                      Control stalls
Issuing multiple instructions per cycle        Ideal CPI
Compiler dependence analysis                   Ideal CPI and data stalls
Software pipelining and trace scheduling       Ideal CPI and data stalls
Speculation                                    All data and control stalls
Dynamic memory disambiguation                  RAW stalls involving memory

Page 3: Lecture 8 Advanced Pipeline


Basic Pipeline Scheduling and Loop Unrolling

FP unit latencies

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double*                    FP ALU op                   1
Load double*                    Store double                0

* Same as integer Load, since there is a 64-bit data path from/to memory.

Functional units are fully pipelined or replicated, so there are no structural hazards and an instruction can issue on every clock cycle.

for (i = 1; i <= 1000; i++)
    x[i] = x[i] + s;

Page 4: Lecture 8 Advanced Pipeline


Loop:  LD    F0,0(R1)    ;R1 is the pointer to a vector
       ADDD  F4,F0,F2    ;F2 contains a scalar value
       SD    0(R1),F4    ;store back result
       SUBI  R1,R1,#8    ;decrement pointer 8B (DW)
       BNEZ  R1,Loop     ;branch if R1 != zero
       NOP               ;delayed branch slot

FP Loop Hazards

Where are the stalls?

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0
Integer op                      Integer op                  0

Page 5: Lecture 8 Advanced Pipeline


FP Loop Showing Stalls

1 Loop: LD F0,0(R1) ;F0=vector element

2 stall

3 ADDD F4,F0,F2 ;add scalar in F2

4 stall

5 stall

6 SD 0(R1),F4 ;store result

7 SUBI R1,R1,#8 ;decrement pointer 8B (DW)

8 stall

9 BNEZ R1,Loop ;branch R1!=zero

10 stall ;delayed branch slot

Rewrite the code to minimize stalls?

Page 6: Lecture 8 Advanced Pipeline


Reducing Stalls

1 Loop: LD F0,0(R1)

2 stall

3 ADDD F4,F0,F2

4 stall

5 stall

6 SD 0(R1),F4

7 SUBI R1,R1,#8

8 stall

9 BNEZ R1,Loop

10 stall

For the Load-ALU latency (the stall after LD): consider moving SUBI into this load delay slot. Reading R1 by LD is done before writing R1 by SUBI, so yes, we can. When we do this, we need to change the immediate value 0 to 8 in SD.

For the FP ALU-to-store latency (the stalls after ADDD): there is only one instruction left that can be moved, i.e., BNEZ. When we do that, the SD instruction fills the delayed branch slot.

Page 7: Lecture 8 Advanced Pipeline


Revised FP Loop to Minimize Stalls

1 Loop: LD F0,0(R1)

2 SUBI R1,R1,#8

3 ADDD F4,F0,F2

4 stall

5 BNEZ R1,Loop ;delayed branch

6 SD 8(R1),F4 ;offset altered because SD was moved past SUBI

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1

Unroll loop 4 times to make the code faster
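At the source level, unrolling by four replicates the loop body and adjusts the index. A minimal C sketch (assuming the trip count, here 1000, is a multiple of 4):

    for (i = 1; i <= 1000; i += 4) {   /* one copy of loop overhead per 4 elements */
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }

The assembly version on the next slide does the same thing at the instruction level, dropping the intermediate SUBI and BNEZ instructions.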

Page 8: Lecture 8 Advanced Pipeline


Unroll Loop 4 Times

 1 Loop:  LD    F0,0(R1)
 2        ADDD  F4,F0,F2
 3        SD    0(R1),F4      ;drop SUBI & BNEZ
 4        LD    F6,-8(R1)
 5        ADDD  F8,F6,F2
 6        SD    -8(R1),F8     ;drop SUBI & BNEZ
 7        LD    F10,-16(R1)
 8        ADDD  F12,F10,F2
 9        SD    -16(R1),F12   ;drop SUBI & BNEZ
10        LD    F14,-24(R1)
11        ADDD  F16,F14,F2
12        SD    -24(R1),F16
13        SUBI  R1,R1,#32     ;alter to 4*8
14        BNEZ  R1,Loop
15        NOP

15 + 4 x (1 + 2) + 1 = 28 clock cycles, or 7 per iteration
  15: instruction issue cycles
   1: LD to ADDD stall, 1 cycle per copy
   2: ADDD to SD stall, 2 cycles per copy
   1: data dependency on R1 (SUBI to BNEZ)

Rewrite loop to minimize the stalls

Page 9: Lecture 8 Advanced Pipeline


Unrolled Loop to Minimize Stalls

 1 Loop:  LD    F0,0(R1)
 2        LD    F6,-8(R1)
 3        LD    F10,-16(R1)
 4        LD    F14,-24(R1)
 5        ADDD  F4,F0,F2
 6        ADDD  F8,F6,F2
 7        ADDD  F12,F10,F2
 8        ADDD  F16,F14,F2
 9        SD    0(R1),F4
10        SD    -8(R1),F8
11        SUBI  R1,R1,#32
12        SD    16(R1),F12    ;-16 + 32 = 16
13        BNEZ  R1,LOOP
14        SD    8(R1),F16     ;-24 + 32 = 8

14 clock cycles, or 3.5 per iteration

Page 10: Lecture 8 Advanced Pipeline


Compiler Perspectives on Code Movement

• Definitions: the compiler is concerned with dependences in the program; whether a dependence causes a HW hazard depends on the given pipeline.

• Data dependencies (RAW if a hazard for HW): Instruction j is data dependent on instruction i if either

– Instruction i produces a result used by instruction j, or

– Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.

• Easy to determine for registers (fixed names)
• Hard for memory:

– Does 100(R4) = 20(R6)?

– From different loop iterations, does 20(R6) = 20(R6)?
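As a hedged illustration (a hypothetical C function, not from the slides), the compiler usually cannot answer these questions when the two addresses come through different pointers:

    /* Unless the compiler can prove p and q never alias, it cannot tell
       whether the store and the load below touch the same location --
       the register-level question "does 100(R4) equal 20(R6)?"          */
    void update(double *p, double *q, int i) {
        p[i] = p[i] + 1.0;     /* store through p                         */
        double t = q[i];       /* load through q: may or may not read the */
        q[i] = t * 2.0;        /* value just stored                       */
    }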

Page 11: Lecture 8 Advanced Pipeline


Compiler Perspectives on Code Movement

• Name Dependence: Two instructions use the same name (register or memory location) but they do not exchange data

• Two kinds of name dependence (assume instruction i precedes instruction j):

  – Antidependence (WAR if a hazard for HW): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first.

  – Output dependence (WAW if a hazard for HW): instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved.
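A minimal C sketch of both name dependences (hypothetical code, not from the slides); only the name t is reused, and no data flows between the conflicting statements:

    void name_dep_example(double t, double b, double d, double out[2]) {
        out[0] = t + 1.0;     /* i1: reads t                                      */
        t = b * 2.0;          /* i2: writes t  -> antidependence (WAR) with i1    */
        out[1] = t + 3.0;     /* i3: reads the new t (a true, RAW dependence)     */
        t = d - 4.0;          /* i4: writes t  -> output dependence (WAW) with i2 */
        out[1] = out[1] + t;
    }

Renaming the later writes of t (to, say, t1 and t2) removes the WAR and WAW dependences without changing the data flow; this is what dynamic register renaming does in hardware, which is why it removes WAR and WAW stalls in the table on slide 2.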

Page 12: Lecture 8 Advanced Pipeline


Compiler Perspectives on Code Movement

• Again, hard for memory accesses:

  – Does 100(R4) = 20(R6)?

  – From different loop iterations, does 20(R6) = 20(R6)?

• Our example required the compiler to know that, if R1 doesn't change, then

    0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)

  There were then no dependences between some of the loads and stores, so they could be moved past each other.

Page 13: Lecture 8 Advanced Pipeline


Compiler Perspectives on Code Movement

• Control Dependence

• Example

if p1 { S1; }

if p2 { S2; }

S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1.

Page 14: Lecture 8 Advanced Pipeline


Compiler Perspectives on Code Movement

• Two (obvious) constraints on control dependencies:

– An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch (see the C sketch below).

– An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch.

• Control dependences may be relaxed in some systems to obtain more parallelism; we get the same effect if we preserve the order of exceptions and the data flow.
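A minimal C sketch (a hypothetical function, not from the slides) of the first constraint above: the division is control dependent on the branch, and hoisting it above the test could raise an exception the original program would never raise.

    int safe_div(int a, int b) {
        int r = 0;
        if (b != 0) {
            r = a / b;    /* control dependent on (b != 0); executing it    */
        }                 /* speculatively when b == 0 would trap, changing */
                          /* the program's exception behavior               */
        return r;
    }

This is also why the last bullet insists that any relaxed scheduling must preserve exception behavior and data flow.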

Page 15: Lecture 8 Advanced Pipeline


When Safe to Unroll Loop?

• Example: When a loop is unrolled, where are data dependencies? (A,B,C distinct, non-overlapping)

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}

1. S2 uses the value A[i+1], computed by S1 in the same iteration.

2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].

• The dependence in (2) is a loop-carried dependence between iterations.
• This implies that the iterations are dependent and cannot be executed in parallel.
• That was not the case for our earlier example; there, each iteration was independent of the others.

Page 16: Lecture 8 Advanced Pipeline


When Safe to Unroll Loop?

• Example: Where are data dependencies? (A,B,C,D distinct & non-overlapping)

The following looks as if it has a loop-carried dependence:

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

However, we can rewrite it as follows so that it is free of loop-carried dependences:

A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
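As a sanity check, here is a small, self-contained C program (hypothetical test inputs, not from the slides) confirming that the transformed loop computes the same A and B as the original:

    #include <stdio.h>
    #define N 100

    int main(void) {
        double A1[N + 2], B1[N + 2], A2[N + 2], B2[N + 2], C[N + 1], D[N + 1];
        for (int i = 0; i <= N + 1; i++) { A1[i] = A2[i] = i; B1[i] = B2[i] = 2.0 * i; }
        for (int i = 0; i <= N; i++)     { C[i] = 3.0 * i;    D[i] = i + 1.0; }

        /* Original loop: apparent loop-carried dependence through B */
        for (int i = 1; i <= N; i++) {
            A1[i] = A1[i] + B1[i];          /* S1 */
            B1[i + 1] = C[i] + D[i];        /* S2 */
        }

        /* Transformed, loop-carried-dependence-free version */
        A2[1] = A2[1] + B2[1];
        for (int i = 1; i <= N - 1; i++) {
            B2[i + 1] = C[i] + D[i];
            A2[i + 1] = A2[i + 1] + B2[i + 1];
        }
        B2[N + 1] = C[N] + D[N];

        /* Compare results element by element */
        int same = 1;
        for (int i = 1; i <= N + 1; i++)
            if (A1[i] != A2[i] || B1[i] != B2[i]) same = 0;
        printf("%s\n", same ? "loops match" : "loops differ");
        return 0;
    }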

Page 17: Lecture 8 Advanced Pipeline


Software Pipelining

• Observation: if the iterations of a loop are independent, then we can get ILP by taking instructions from different iterations.

• Software pipelining: reorganizes loops so that each iteration of the new loop is made from instructions chosen from different iterations of the original loop.

[Figure: instructions drawn from iterations 0 through 4 of the original loop are overlapped to form one software-pipelined iteration.]

Page 18: Lecture 8 Advanced Pipeline


SW Pipelining Example

Before: unrolled 3 times

 1 LOOP  LD    F0,0(R1)
 2       ADDD  F4,F0,F2
 3       SD    0(R1),F4
 4       LD    F6,-8(R1)
 5       ADDD  F8,F6,F2
 6       SD    -8(R1),F8
 7       LD    F10,-16(R1)
 8       ADDD  F12,F10,F2
 9       SD    -16(R1),F12
10       SUBI  R1,R1,#24
11       BNEZ  R1,LOOP

After: software-pipelined version of the loop

Start-up code:
         LD    F0,0(R1)
         ADDD  F4,F0,F2
         LD    F0,-8(R1)

 1 LOOP  SD    0(R1),F4      ;stores into M[i]
 2       ADDD  F4,F0,F2      ;adds to M[i-1]
 3       LD    F0,-16(R1)    ;loads from M[i-2]
 4       SUBI  R1,R1,#8
 5       BNEZ  R1,LOOP

Finish code:
         SD    0(R1),F4
         ADDD  F4,F0,F2
         SD    -8(R1),F4

[Figure: pipeline diagram (IF ID EX MEM WB) of the SD, ADDD, and LD in one software-pipelined iteration, overlapping original iterations i, i+1, and i+2; the register uses do not conflict: F4 is read for iteration i before it is written for iteration i+1, and F0 is read for iteration i before it is written for iteration i+2.]
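The same idea can be sketched at the source level in C (a hypothetical scalar version, not from the slides; a real compiler software-pipelines the scheduled machine instructions, not the source):

    /* Software-pipelined form of: for (i = 0; i < n; i++) x[i] = x[i] + s; */
    void sw_pipelined(double *x, double s, int n) {    /* assumes n >= 2 */
        /* Start-up code: begin the first two iterations */
        double loaded = x[0];           /* load for element 0 */
        double summed = loaded + s;     /* add  for element 0 */
        loaded = x[1];                  /* load for element 1 */

        /* Steady-state loop: each pass mixes three original iterations */
        for (int i = 0; i < n - 2; i++) {
            x[i] = summed;              /* store for element i     */
            summed = loaded + s;        /* add   for element i + 1 */
            loaded = x[i + 2];          /* load  for element i + 2 */
        }

        /* Finish code: complete the last two iterations */
        x[n - 2] = summed;
        x[n - 1] = loaded + s;
    }

Each pass of the loop does the store for one element, the add for the next, and the load for the one after that, mirroring the SD / ADDD / LD body above; the start-up and finish code fill and drain this pattern.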

Page 19: Lecture 8 Advanced Pipeline


SW Pipelining Example

• Software pipelining acts as symbolic loop unrolling:
  – Less code space
  – Overhead is paid only once, vs. at each unrolled iteration in loop unrolling

[Figure: number of overlapped operations over time, for software pipelining vs. loop unrolling (100 iterations = 25 loops with 4 unrolled iterations each).]