Lecture 8: Advanced Pipeline
Pipeline Complications CS510 Computer Architectures Lecture 8 - 1
Lecture 8: Advanced Pipeline
Extending the DLX to Handle Multi-cycle Operations

[Figure: the DLX pipeline (IF ID EX MEM WB) with three additional unpipelined floating-point functional units alongside the EX integer unit: an EX FP/integer multiplier, an EX FP adder, and an EX FP/integer divider]
Multicycle Operations

[Figure: the DLX EX stage expanded into four functional units between ID and MEM: the integer unit (EX, one stage), the FP/integer multiplier (stages M1 M2 M3 M4 M5 M6 M7), the FP adder (stages A1 A2 A3 A4), and the FP/integer divider (DIV, unpipelined, 24 clock cycles)]
Latency and Initiation Interval

Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses that result. (It runs from the point where the result becomes available to the point where a dependent instruction needs the data.)
Initiation interval: the number of cycles that must elapse between issuing two operations of a given type.

  Functional unit   Latency   Initiation interval
  Integer ALU           0             1
  Load                  1             1
  FP add                3             1
  FP multiply           6             1
  FP divide            24            25

Example (stages from issue to result):

  MULTD  IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
  ADDD   IF ID A1 A2 A3 A4 MEM WB
  LD*    IF ID EX MEM WB
  SD*    IF ID EX MEM WB

  * FP LD and SD are the same as their integer counterparts, since there is a 64-bit path to memory.
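The two definitions can be made concrete with a small sketch (not part of the lecture) that turns the table above into a stall count; the unit names and the `raw_stalls` helper are invented for illustration:

```python
# DLX functional-unit table from this slide: unit -> (latency, initiation interval).
# Latency = intervening cycles needed between producer and dependent consumer.
DLX_FP = {
    "int ALU": (0, 1),
    "load":    (1, 1),
    "FP add":  (3, 1),
    "FP mul":  (6, 1),
    "FP div":  (24, 25),
}

def raw_stalls(producer_unit, gap):
    """Stall cycles seen by a dependent instruction issued `gap`
    cycles after the producer (gap=1 means the very next cycle)."""
    latency, _ = DLX_FP[producer_unit]
    return max(0, latency - (gap - 1))

print(raw_stalls("FP mul", 1))  # MULTD then a dependent ADDD next cycle: 6 stalls
print(raw_stalls("load", 1))    # LD then a dependent FP ALU op: 1 stall
print(raw_stalls("FP add", 4))  # consumer issued 4 cycles later: 0 stalls
```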
Floating-Point Operations (Reality: MIPS R4000)

  FP instruction    Latency   Initiation interval
  Add, subtract         4             3
  Multiply              8             4
  Divide               36            35
  Square root         112           111
  Negate                2             1
  Absolute value        2             1
  FP compare            3             2

  Latency: cycles before the result can be used.
  Initiation interval: cycles before issuing another instruction of the same type.

Floating-point operations have long execution times. A pipelined FP execution unit, however, may initiate new instructions without waiting for the full latency of earlier ones.
Complications Due to FP Operations (in DLX)

– Because the divide unit is not fully pipelined, structural hazards can occur.
– WAW hazards are possible, since instructions no longer reach WB in order. (WAR hazards are not possible, since register reads always occur in ID.)
– Instructions can complete in a different order than they were issued, causing problems with exceptions.
– Because of the longer latency of operations, stalls for RAW hazards are more frequent.
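A hypothetical sketch of the WAW point: with multi-cycle units, a later but shorter instruction can write a register before an earlier, longer one. The completion cycles below are illustrative, chosen from the DLX latencies (divide 24, add 3); the `find_waw` helper is invented for this example:

```python
def find_waw(instrs):
    """instrs: (name, dest_reg, completion_cycle) in program order.
    Returns pairs where a later write to the same register finishes first."""
    hazards = []
    for i, (ni, di, ci) in enumerate(instrs):
        for nj, dj, cj in instrs[i + 1:]:
            if di == dj and cj < ci:
                hazards.append((ni, nj))
    return hazards

stream = [
    ("DIVD F0,F2,F4", "F0", 26),  # issued cycle 0, ~24-cycle latency
    ("ADDD F0,F6,F8", "F0", 5),   # issued cycle 1, finishes long before DIVD
]
print(find_waw(stream))  # the ADDD write would be overwritten by DIVD
```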
Summary of Pipelining Basics

• Hazards limit performance
  – Structural: need more HW resources
  – Data: need forwarding, compiler scheduling
  – Control: early evaluation of PC, delayed branch, prediction
• Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency
• Interrupts and the FP instruction set make pipelining harder
• Compilers reduce the cost of data and control hazards
  – Load delay slots
  – Branch delay slots
  – Branch prediction
Case Study: MIPS R4000 and Introduction to Advanced Pipelining
Case Study: MIPS R4000 Pipeline

8-stage pipeline:
IF - First half of instruction fetch: PC selection, initiation of instruction cache access
IS - Second half of instruction fetch: access to instruction cache
RF - Instruction decode, register fetch, hazard checking, and instruction cache hit detection (tag check)
EX - Execution: effective address calculation, ALU operation, branch target computation and condition evaluation
DF - First half of access to data cache
DS - Second half of access to data cache
TC - Tag check for data cache hit
WB - Write back for loads and register-register operations
The Pipeline Structure of the R4000

  IF IS RF EX DF DS TC WB

[Figure: instruction memory spans IF-IS (the instruction is available at the end of IS), register read in RF, ALU in EX, data memory spans DF-DS (load data available at the end of DS), tag check in TC, register write in WB]
Case Study: MIPS R4000 — LOAD Latency

2-cycle load latency: load data is available (with forwarding) at the end of DS, but a dependent instruction needs it at the start of EX.

  LD  R1,X       IF IS RF EX    DF    DS TC WB
  ...                IF IS RF    EX    DF DS ...
  ADD R3,R1,R2          IF IS    RF    stall stall EX ...

An ADD that needs R1 stalls 2 cycles before entering EX.
Case Study: MIPS R4000 — LOAD Followed by ALU Instructions

2-cycle load latency with a forwarding circuit:

  LW  R1,...   IF IS RF EX    DF    DS    TC WB
  ADD R2,R1       IF IS RF    stall stall EX DF ...
  SUB R3,R1          IF IS    stall stall RF EX ...
  OR  R4,R1             IF    stall stall IS RF ...

The ADD stalls two cycles waiting for the load data, which is forwarded from the end of DS to EX; the SUB and OR stall behind it.
Case Study: MIPS R4000 — Branch Latency

The R4000 uses a predict-NOT-TAKEN strategy. The branch target address and condition are available only after the EX stage.
  NOT TAKEN: one-cycle delay slot.
  TAKEN: one-cycle delay slot followed by two stall cycles — a 3-cycle branch latency.

NOT TAKEN (delay slot only):

  Br instr      IF IS RF EX DF DS TC WB
  Delay slot       IF IS RF EX DF DS TC ...
  Br instr +2         IF IS RF EX DF DS ...
  Br instr +3            IF IS RF EX DF ...
  Br instr +4               IF IS RF EX ...

TAKEN (delay slot plus 2 stall cycles):

  Br instr         IF IS RF EX DF DS TC WB
  Delay slot          IF IS RF EX DF DS TC ...
  Stall
  Stall
  Br target instr              IF IS RF EX ...
Extending DLX to Handle Floating-Point Operations

[Figure: IF and ID feed four execution units in parallel — the integer unit (EX), the FP/integer multiplier, the FP adder, and the FP divider — all of which feed MEM and WB]
MIPS R4000 FP Unit

• FP adder, FP multiplier, FP divider
• The last step of the FP multiplier/divider uses the FP adder hardware
• 8 kinds of stages in the FP units (a single copy of each):

  Stage  Functional unit  Description
  A      FP adder         Mantissa ADD stage
  D      FP divider       Divide pipeline stage
  E      FP multiplier    Exception test stage
  M      FP multiplier    First stage of multiplier
  N      FP multiplier    Second stage of multiplier
  R      FP adder         Rounding stage
  S      FP adder         Operand shift stage
  U                       Unpack FP numbers
MIPS R4000 FP Pipe Stages

  FP instr        Stage sequence                              Latency
  Add, subtract   U  S+A  A+R  R+S                                4
  Multiply        U  E+M  M  M  M  N  N+A  R                      8
  Divide          U  A  R  D(x27)  D+A  D+R, D+A, D+R, A, R      36
  Square root     U  E  (A+R)(x108)  A  R                        112
  Negate          U  S                                            2
  Absolute value  U  S                                            2
  FP compare      U  A  R                                         3

  Stages: A mantissa ADD; D divide pipeline stage; E exception test; M multiplier first stage; N multiplier second stage; R rounding; S operand shift; U unpack FP numbers
Latency and Initiation Intervals

  FP instruction    Latency   Initiation interval
  Add, subtract         4             3
  Multiply              8             4
  Divide               36            35
  Square root         112           111
  Negate                2             1
  Absolute value        2             1
  FP compare            3             2
MIPS R4000 FP Pipe Stages: Issue and Stall Patterns

[Figure: issue diagram over clock cycles 0-12. A multiply occupies U M M M M N N+A R. Adds issued 1-3 cycles after it, or 6 or more cycles after it, issue without conflict (U S+A A+R R+S). An ADD issued 4 cycles after the multiply stalls 2 cycles, and one issued 5 cycles after stalls 1 cycle, because the multiply needs the single shared A stage in its N+A cycle.]
R4000 Performance

The pipeline does not achieve an ideal CPI of 1 because of:
– Load stalls
– Branch stalls: 2 cycles for each taken branch, plus unfilled or cancelled branch delay slots
– FP result stalls: RAW data hazards (latency)
– FP structural stalls: not enough FP hardware (parallelism)

[Figure: pipeline CPI (0 to 4.5) broken down into base, load stalls, branch stalls, FP result stalls, and FP structural stalls, for the integer programs eqntott, espresso, gcc, and li, and the floating-point programs doduc, nasa7, ora, spice2g6, su2cor, and tomcatv]
Advanced Pipeline and Instruction Level Parallelism
Advanced Pipelining and Instruction Level Parallelism

• In gcc, 17% of instructions are control transfers: roughly 5 instructions plus 1 branch per basic block
  – We must look beyond a single basic block to get more instruction-level parallelism
• Loop-level parallelism is one opportunity, exploitable in SW and HW

[Figure: a basic block — a straight-line sequence of instructions entered only at a branch target and exited only at a branch instruction]
Advanced Pipelining and Instruction Level Parallelism

  Technique                                     Reduces
  Loop unrolling                                Control stalls
  Basic pipeline scheduling                     RAW stalls
  Dynamic scheduling with scoreboarding         RAW stalls
  Dynamic scheduling with register renaming     WAR and WAW stalls
  Dynamic branch prediction                     Control stalls
  Issuing multiple instructions per cycle       Ideal CPI
  Compiler dependence analysis                  Ideal CPI and data stalls
  Software pipelining and trace scheduling      Ideal CPI and data stalls
  Speculation                                   All data and control stalls
  Dynamic memory disambiguation                 RAW stalls involving memory
Basic Pipeline Scheduling and Loop Unrolling

FP unit latencies:

  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU op                      Another FP ALU op               3
  FP ALU op                      Store double                    2
  Load double*                   FP ALU op                       1
  Load double*                   Store double                    0

  * Same as integer load, since there is a 64-bit data path from/to memory.

The functional units are fully pipelined or replicated, so there are no structural hazards and an instruction can issue on every clock cycle.

Running example:

  for (i = 1; i <= 1000; i++)
      x[i] = x[i] + s;
FP Loop Hazards

  Loop: LD   F0,0(R1)   ;R1 is the pointer to a vector
        ADDD F4,F0,F2   ;F2 contains a scalar value
        SD   0(R1),F4   ;store back result
        SUBI R1,R1,#8   ;decrement pointer by 8 bytes (DW)
        BNEZ R1,Loop    ;branch if R1 != zero
        NOP             ;delayed branch slot

Where are the stalls?

  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU op                      Another FP ALU op               3
  FP ALU op                      Store double                    2
  Load double                    FP ALU op                       1
  Load double                    Store double                    0
  Integer op                     Integer op                      0
FP Loop Showing Stalls

   1 Loop: LD   F0,0(R1)   ;F0 = vector element
   2       stall
   3       ADDD F4,F0,F2   ;add scalar in F2
   4       stall
   5       stall
   6       SD   0(R1),F4   ;store result
   7       SUBI R1,R1,#8   ;decrement pointer by 8 bytes (DW)
   8       stall
   9       BNEZ R1,Loop    ;branch if R1 != zero
  10       stall           ;delayed branch slot

10 clock cycles per iteration.
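The 10-cycle count can be cross-checked with a small Python sketch (illustrative, not the lecture's tool) that applies the latency table to the loop body; the instruction "kinds" and the `LAT` keys are assumptions of this simple model:

```python
# Latency table: intervening cycles required between producer and consumer.
LAT = {
    ("LOAD",  "FPALU"):  1,
    ("FPALU", "STORE"):  2,
    ("INT",   "BRANCH"): 1,   # BNEZ needs the SUBI result
}

# (text, kind, dest, sources) for the loop body, in program order
loop = [
    ("LD   F0,0(R1)", "LOAD",   "F0", []),
    ("ADDD F4,F0,F2", "FPALU",  "F4", ["F0"]),
    ("SD   0(R1),F4", "STORE",  None, ["F4"]),
    ("SUBI R1,R1,#8", "INT",    "R1", []),
    ("BNEZ R1,Loop",  "BRANCH", None, ["R1"]),
]

produced = {}   # register -> (issue_cycle, producer_kind)
cycle = 0
for text, kind, dest, srcs in loop:
    earliest = cycle + 1
    for r in srcs:
        if r in produced:
            pcycle, pkind = produced[r]
            earliest = max(earliest, pcycle + 1 + LAT.get((pkind, kind), 0))
    stalls = earliest - (cycle + 1)
    cycle = earliest
    if dest:
        produced[dest] = (cycle, kind)
    print(f"cycle {cycle:2d}: {text}  ({stalls} stall(s))")

total = cycle + 1   # plus the delayed branch slot
print("total:", total, "cycles per iteration")
```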
Can we rewrite the code to minimize the stalls?
Reducing Stalls

   1 Loop: LD   F0,0(R1)
   2       stall           <- load delay slot
   3       ADDD F4,F0,F2
   4       stall
   5       stall
   6       SD   0(R1),F4
   7       SUBI R1,R1,#8
   8       stall
   9       BNEZ R1,Loop
  10       stall

For the Load-ALU latency (cycle 2), consider moving SUBI into the load delay slot. Can we? Yes: LD reads R1 before SUBI writes it. When we do this, we must change the immediate value in SD from 0 to 8, since SD now executes after the pointer has been decremented.

For the ALU-ALU latency (cycles 4-5), only one instruction is left to move, BNEZ. When we move it up, the SD instruction fills the delayed branch slot.
Revised FP Loop to Minimize Stalls

   1 Loop: LD   F0,0(R1)
   2       SUBI R1,R1,#8
   3       ADDD F4,F0,F2
   4       stall
   5       BNEZ R1,Loop    ;delayed branch
   6       SD   8(R1),F4   ;offset altered when moved past SUBI

6 clock cycles per iteration.

  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU op                      Another FP ALU op               3
  FP ALU op                      Store double                    2
  Load double                    FP ALU op                       1

Next: unroll the loop 4 times to make the code faster.
Unroll Loop 4 Times

   1 Loop: LD   F0,0(R1)
   2       ADDD F4,F0,F2
   3       SD   0(R1),F4     ;drop SUBI & BNEZ
   4       LD   F6,-8(R1)
   5       ADDD F8,F6,F2
   6       SD   -8(R1),F8    ;drop SUBI & BNEZ
   7       LD   F10,-16(R1)
   8       ADDD F12,F10,F2
   9       SD   -16(R1),F12  ;drop SUBI & BNEZ
  10       LD   F14,-24(R1)
  11       ADDD F16,F14,F2
  12       SD   -24(R1),F16
  13       SUBI R1,R1,#32    ;altered to 4*8
  14       BNEZ R1,Loop
  15       NOP

15 + 4 x (1 + 2) + 1 = 28 clock cycles, or 7 per element:
  1 cycle per copy:  LD-to-ADDD stall
  2 cycles per copy: ADDD-to-SD stalls
  1 cycle:           data dependence on R1 (SUBI to BNEZ)
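The cycle arithmetic quoted above can be checked directly; this sketch just restates the slide's own accounting (the stall counts come from the latency table):

```python
instrs    = 15   # 14 instructions plus the NOP in the branch delay slot
ld_addd   = 1    # LD -> ADDD stall, once per unrolled copy
addd_sd   = 2    # ADDD -> SD stalls, once per unrolled copy
subi_bnez = 1    # data dependence on R1

total = instrs + 4 * (ld_addd + addd_sd) + subi_bnez
print(total, total / 4)   # 28 clock cycles, 7.0 per element
```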
Rewrite loop to minimize the stalls
Unrolled Loop to Minimize Stalls

   1 Loop: LD   F0,0(R1)
   2       LD   F6,-8(R1)
   3       LD   F10,-16(R1)
   4       LD   F14,-24(R1)
   5       ADDD F4,F0,F2
   6       ADDD F8,F6,F2
   7       ADDD F12,F10,F2
   8       ADDD F16,F14,F2
   9       SD   0(R1),F4
  10       SD   -8(R1),F8
  11       SUBI R1,R1,#32
  12       SD   16(R1),F12   ;16 - 32 = -16
  13       BNEZ R1,LOOP
  14       SD   8(R1),F16    ;8 - 32 = -24

Assumptions:
– It is OK to move SD past SUBI even though SUBI changes R1 (the offset is adjusted to compensate):

    SUBI  IF ID EX MEM WB
    SD       IF ID EX MEM WB
    BNEZ        IF ID EX MEM WB

– It is OK to move loads before stores (we still get the right data).
– When is it safe for the compiler to make such changes?

14 clock cycles, or 3.5 per element.
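The progression across these slides can be summarized in a short sketch (the labels are mine, the numbers are the slides'):

```python
# Cycles per element for each version of the vector loop discussed above.
versions = {
    "original, with stalls":      10 / 1,
    "scheduled":                   6 / 1,
    "unrolled 4x, unscheduled":   28 / 4,
    "unrolled 4x and scheduled":  14 / 4,
}
for name, cpe in versions.items():
    print(f"{name:28s} {cpe:4.1f} cycles/element")
```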
Compiler Perspectives on Code Movement

• Definitions: the compiler is concerned about dependences in the program; whether a dependence causes a HW hazard depends on the given pipeline
• Data dependence (RAW if a hazard for HW): instruction j is data dependent on instruction i if either
  – instruction i produces a result used by instruction j, or
  – instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
• Easy to determine for registers (fixed names)
• Hard for memory:
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?
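The memory question is plain address arithmetic; this sketch (the `overlap` helper and the register values are invented for illustration) shows why the compiler usually cannot answer it, while at run time it is a trivial check:

```python
def overlap(base1, off1, base2, off2, width=8):
    """Do two `width`-byte (double-word) accesses at base+offset overlap?"""
    a, b = base1 + off1, base2 + off2
    return abs(a - b) < width

# The compiler rarely knows R4 and R6 at compile time; at run time the
# answer to "does 100(R4) = 20(R6)?" is easy.
R4, R6 = 1000, 1080
print(overlap(R4, 100, R6, 20))   # both name address 1100 -> True
R6 = 2000
print(overlap(R4, 100, R6, 20))   # 1100 vs 2020 -> False
```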
Compiler Perspectives on Code Movement

• Name dependence: two instructions use the same name (register or memory location) but do not exchange data
• Two kinds of name dependence, where instruction i precedes instruction j:
  – Antidependence (WAR if a hazard for HW): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first
  – Output dependence (WAW if a hazard for HW): instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved
Compiler Perspectives on Code Movement

• Again, hard for memory accesses:
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?
• Our example required the compiler to know that if R1 doesn't change, then

    0(R1) != -8(R1) != -16(R1) != -24(R1)

  There were then no dependences between some loads and stores, so they could be moved past each other.
Compiler Perspectives on Code Movement

• Control dependence example:

    if p1 { S1; };
    if p2 { S2; };

  S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Compiler Perspectives on Code Movement

• Two (obvious) constraints on control dependences:
  – An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.
  – An instruction that is not control dependent on a branch cannot be moved after the branch, so that its execution becomes controlled by the branch.
• Control dependences may be relaxed in some systems to get more parallelism; we get the same effect if we preserve the order of exceptions and the data flow.
When Is It Safe to Unroll a Loop?

• Example: when this loop is unrolled, where are the data dependences? (A, B, C are distinct, non-overlapping arrays)

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

1. S2 uses the value A[i+1] computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration: iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].

This is a loop-carried dependence between iterations:
• It implies that the iterations are dependent and can't be executed in parallel.
• This was not the case for our earlier example: each of its iterations was independent.
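A small runnable restatement of the loop (the array contents are invented for illustration) makes the loop-carried chain visible: each A[i+1] folds in every earlier C value, so iteration i+1 cannot start before iteration i finishes S1:

```python
N = 100
A = [1.0] * (N + 2)
B = [2.0] * (N + 2)
C = [0.5] * (N + 1)

for i in range(1, N + 1):
    A[i + 1] = A[i] + C[i]        # S1: reads A[i], written by the previous iteration
    B[i + 1] = B[i] + A[i + 1]    # S2: reads A[i+1] from the same iteration

# The recurrence means A[N+1] = A[1] + C[1] + ... + C[N]
print(A[N + 1])   # 1.0 + 100 * 0.5 = 51.0
```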
When Is It Safe to Unroll a Loop?

• Example: where are the data dependences? (A, B, C, D are distinct and non-overlapping)

The following looks like it has a loop-carried dependence:

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

However, we can rewrite it to be free of loop-carried dependences:

    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i = i + 1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];
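The claim that the rewrite preserves the computation can be checked with a quick sketch (the random array contents are invented for the test):

```python
import random

def original(A, B, C, D):
    for i in range(1, 101):
        A[i] = A[i] + B[i]         # S1
        B[i + 1] = C[i] + D[i]     # S2
    return A, B

def rewritten(A, B, C, D):
    A[1] = A[1] + B[1]
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]
        A[i + 1] = A[i + 1] + B[i + 1]
    B[101] = C[100] + D[100]
    return A, B

random.seed(0)
mk = lambda: [random.random() for _ in range(102)]
A, B, C, D = mk(), mk(), mk(), mk()
same = original(A[:], B[:], C, D) == rewritten(A[:], B[:], C, D)
print(same)   # True: same values, but no loop-carried dependence
```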
Summary

• Instruction-level parallelism can be exploited in SW or HW
• Loop-level parallelism is the easiest to see
• SW parallelism: dependences are defined for a program; hazards arise when HW cannot resolve them
• SW dependences and compiler sophistication determine whether the compiler can unroll loops