Transcript posted on 21-Dec-2015
\course\ELEG652-03Fall\Topic3-652 1
Exploitation of Instruction-Level Parallelism (ILP)
Reading List
• Slides: Topic4x
• Henn&Patt: Chapter 4
• Other assigned readings from homework and classes
Design Space for Processors

[Figure: the processor design space plotted as Cycles per Instruction (0.05 to 20) versus Clock Rate (5 to 1000 MHz). Regions shown: Scalar CISC, Scalar RISC, Superpipelined, Multithreaded, Superscalar, Vector Supercomputer, VLIW. The "most likely future processor space" lies in the low-CPI, high-clock-rate corner, annotated with the question "Enough parallelism?" [TheobaldGaoHen1992,1993,1994].]
Pipelining - A Review

Hazards
• Structural: resource conflicts, when the hardware cannot support all possible combinations of instructions in overlapped execution.
• Data: an instruction depends on the result of a previous instruction.
• Control: due to branches and other instructions that change the PC.
• A hazard causes a "stall".
• In a pipeline a stall is serious: it holds up multiple instructions.
RISC Concepts: Revisited
• What makes it a success?
  - pipelining
  - caches
• What prevents CPI = 1?
  - hazards and their resolution
  - Def: the dependence graph
Structural hazards
  - non-pipelined functional units
  - one port on the register file
  - one port on memory

Data hazards
  - for some data hazards (e.g. ALU/ALU ops): forwarding (bypassing)
  - for others: pipeline interlock + pipeline stall (bypassing cannot deliver the value in time), e.g.

      LD  R1, A
      ADD R4, R1, R7

    this may need a "stall" or bubble
Example of Structural Hazard

Instruction        Clock cycle number
                   1    2    3    4      5    6    7    8    9
Load instruction   IF   ID   EX   MEM    WB
Instruction i+1         IF   ID   EX     MEM  WB
Instruction i+2              IF   ID     EX   MEM  WB
Instruction i+3                   stall  IF   ID   EX   MEM  WB
Instruction i+4                          IF   ID   EX   MEM
Data Hazard

Instruction        1    2    3    4    5    6
ADD instruction    IF   ID   EX   MEM  WB    <- data written here (WB)
SUB instruction         IF   ID   EX   MEM   WB
                             ^ data read here (ID)

The ADD instruction writes a register that is a source operand for the SUB instruction, but the ADD does not finish writing the data into the register file until three clock cycles after SUB begins reading it!
(1) The data hazard may cause SUB to read the wrong value.
(2) This is dangerous: the result may be non-deterministic.
(3) Solution: forwarding (bypassing).
ADD R1, R2, R3     IF   ID   EX   MEM  WB
SUB R4, R1, R5          IF   ID   EX   MEM  WB
AND R6, R1, R7               IF   ID   EX   MEM  WB
OR  R8, R1, R9                    IF   ID   EX   MEM  WB
XOR R10, R1, R11                       IF   ID   EX   MEM  WB

A set of instructions in the pipeline that need forwarded results.
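The pattern above can be sketched in a few lines of Python. This is not from the slides: it assumes the classic five-stage pipeline in which the register file is written in the first half of WB and read in the second half of ID, so a consumer three or more slots after the producer needs no forwarding.

```python
# Hypothetical sketch: which consumers of R1 need the result forwarded.
# Assumes split-cycle register-file access: write in the first half of
# WB, read in the second half of ID.
def needs_forwarding(distance):
    """distance = number of instruction slots after the producing ADD."""
    return distance < 3

consumers = ["SUB", "AND", "OR", "XOR"]  # all read R1
for d, op in enumerate(consumers, start=1):
    print(op, "forwarded" if needs_forwarding(d) else "read from register file")
```

Under this assumption SUB and AND take R1 over the bypass paths, while OR and XOR read the updated register file directly.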
A = B + C
...
E = A + D

Flow dependence
(R/W conflicts)
A = B + C
...
A = B - C

Output dependence
(W/W conflicts)
Leaves A in the wrong state if the order is changed.
A = A + B
...
A = C + D

Anti-dependence
(W/R conflicts)
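The three dependence classes above can be computed mechanically from the read and write sets of two statements. A minimal sketch (the function name and set encoding are illustrative, not from the slides):

```python
# Classify the dependence between two statements given as
# (writes, reads) pairs of variable-name sets.
def classify(first, second):
    w1, r1 = first
    w2, r2 = second
    deps = []
    if w1 & r2:
        deps.append("flow (R-after-W)")    # second reads what first wrote
    if w1 & w2:
        deps.append("output (W-after-W)")  # both write the same variable
    if r1 & w2:
        deps.append("anti (W-after-R)")    # second overwrites what first read
    return deps

# A = B + C ; E = A + D  ->  flow dependence only
print(classify(({"A"}, {"B", "C"}), ({"E"}, {"A", "D"})))
```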
Not all data hazards can be eliminated by bypassing:

LW  R1, 32(R6)
ADD R4, R1, R7
SUB R5, R1, R8
AND R6, R1, R7
• The load latency cannot be eliminated by forwarding alone.
• It is often handled by a "pipeline interlock", which detects the hazard and stalls the pipeline.
• The delay cycle is called a stall or "bubble".

Any instruction   IF   ID   EX   MEM    WB
LW  R1, 32(R6)         IF   ID   EX     MEM    WB
ADD R4, R1, R7              IF   ID     stall  EX   MEM  WB
SUB R5, R1, R8                   IF     stall  ID   EX   MEM  WB
AND R6, R1, R7                          stall  IF   ID   EX   MEM  WB
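The interlock timing above can be mimicked with a toy model. This is an assumption-laden sketch of my own, not the DLX hardware: it charges one bubble when an instruction reads the destination of the load immediately before it, and assumes full forwarding otherwise.

```python
LOAD_LATENCY = 1  # one delay slot between a load and its first use

def total_cycles(instrs):
    """instrs: list of (dest, sources, is_load). Returns the cycle in
    which the last instruction completes WB in a 5-stage pipeline with
    full forwarding but a 1-cycle load-use interlock."""
    issue = 0  # cycle in which an instruction leaves ID
    prev = None
    for dest, srcs, is_load in instrs:
        issue += 1
        if prev and prev[2] and prev[0] in srcs:  # load-use hazard
            issue += LOAD_LATENCY
        prev = (dest, srcs, is_load)
    return issue + 3  # EX, MEM, WB after the final issue

seq = [("R1", {"R6"}, True),         # LW  R1, 32(R6)
       ("R4", {"R1", "R7"}, False),  # ADD R4, R1, R7  (stalls once)
       ("R5", {"R1", "R8"}, False),  # SUB R5, R1, R8  (forwarded)
       ("R6", {"R1", "R7"}, False)]  # AND R6, R1, R7
print(total_cycles(seq))  # one cycle more than the stall-free total of 7
```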
"Issue" - passing the ID stage.
"Issued instructions" - instructions that have passed ID.
DLX always issues an instruction only when there is no hazard.
Detecting the interlock early in the pipeline has the advantage that the hardware never needs to suspend an instruction and undo state changes.
Exploitation of Instruction-Level Parallelism

Static scheduling:
• simple scheduling
• loop unrolling
• loop unrolling + scheduling
• software pipelining

Dynamic scheduling:
• out-of-order execution
• dataflow computers
Constraint Graph
• Directed edges: data dependences.
• Undirected edges: resource constraints.
• An edge (u,v) (directed or undirected) of length e represents an interlock between nodes u and v: they must be separated by e time units.

[Figure: an example constraint graph on nodes S1..S6, with edge lengths given by the operation latencies; the exact edges are not recoverable from the transcript.]
Code Scheduling for a Single Pipeline (the CSSP problem)

Input: a constraint graph G = (V, E).
Output: a sequence of operations in G, v1, v2, ..., vn, with a number of no-ops no greater than k, such that:
1. if the no-ops are deleted, the result is a topological sort of G;
2. any two nodes u, v in the sequence are separated by a distance >= d(u, v).
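A checker for condition 2 is easy to write. A hedged sketch (the graph, distances, and candidate schedules below are invented for illustration, not taken from the slides):

```python
# Verify a candidate schedule against the CSSP distance constraints.
def valid_schedule(seq, dist):
    """seq: list of node names, with None standing for a no-op slot.
    dist: dict {(u, v): d} meaning v must come at least d slots after u
    (the interlock length of edge (u, v)). Checks condition 2; when
    every graph edge appears in dist with d >= 1, the check also
    enforces topological order (condition 1)."""
    pos = {v: i for i, v in enumerate(seq) if v is not None}
    return all(pos[v] - pos[u] >= d for (u, v), d in dist.items())

dist = {("S1", "S2"): 1, ("S1", "S3"): 2, ("S2", "S4"): 2}
print(valid_schedule(["S1", "S2", "S3", None, "S4"], dist))  # legal
print(valid_schedule(["S1", "S2", "S4", "S3"], dist))        # S4 too early
```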
Advanced Pipelining
• Instruction reordering/scheduling within the loop body
• Loop unrolling: the code is not compact
• Superscalar: compact code + multiple issue of different classes of instructions
• VLIW
Loop:  LD    F0, 0(R1)     ; load the vector element
       ADDD  F4, F0, F2    ; add the scalar in F2
       SD    0(R1), F4     ; store the vector element
       SUB   R1, R1, #8    ; decrement the pointer by 8 bytes (per DW)
       BNEZ  R1, Loop      ; branch when R1 is not zero

An Example: x[i] = x[i] + a
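In source form, the loop above adds the scalar a (held in F2) to every element of a vector. A Python equivalent:

```python
# Source-level form of the DLX loop: x[i] = x[i] + a for every element.
def add_scalar(x, a):
    for i in range(len(x)):
        x[i] = x[i] + a   # LD / ADDD / SD, one element per iteration
    return x

print(add_scalar([1.0, 2.0, 3.0], 0.5))  # → [1.5, 2.5, 3.5]
```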
Instruction producing result   Destination instruction   Latency in clock cycles
FP ALU op                      Another FP ALU op         3
FP ALU op                      Store double              2
Load double                    FP ALU op                 1
Load double                    Store double              0

Latencies of FP operations used in this section. The first column shows the originating instruction type. The second column is the type of the consuming instruction. The last column is the number of intervening clock cycles needed to avoid a stall. These numbers are similar to the average latencies we would see on an FP unit, like the one we described for DLX in the last chapter. The major change versus the DLX FP pipeline was to reduce the latency of FP multiply; this helps keep our examples from becoming unwieldy. The latency of a floating-point load to a store is zero, since the result of the load can be bypassed without stalling the store. We will continue to assume an integer load latency of 1 and an integer ALU operation latency of 0.
Without any scheduling, the loop executes as follows:

                              Clock cycle issued
Loop:  LD    F0, 0(R1)        1
       stall                  2
       ADDD  F4, F0, F2       3
       stall                  4
       stall                  5
       SD    0(R1), F4        6
       SUB   R1, R1, #8       7
       BNEZ  R1, Loop         8
       stall                  9

This requires 9 clock cycles per iteration.
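The 9-cycle count can be reproduced from the latency table with a small model. This is a sketch: the tuple encoding and the extra cycle charged for the stall in the branch-delay slot are simplifications of my own, not from the slides.

```python
# Intervening cycles required between producer and consumer,
# from the latency table for this section.
LATENCY = {("LD", "ADDD"): 1, ("ADDD", "SD"): 2,
           ("ADDD", "ADDD"): 3, ("LD", "SD"): 0}

loop = [("LD",   "F0", ()),            # LD   F0, 0(R1)
        ("ADDD", "F4", ("F0", "F2")),  # ADDD F4, F0, F2
        ("SD",   None, ("F4",)),       # SD   0(R1), F4
        ("SUB",  "R1", ("R1",)),       # SUB  R1, R1, #8
        ("BNEZ", None, ("R1",))]       # BNEZ R1, Loop

def cycles_per_iteration(loop):
    issue = {}
    t = 0
    for i, (op, dest, srcs) in enumerate(loop):
        t += 1  # one issue slot per cycle
        for j in range(i):
            pop, pdest, _ = loop[j]
            if pdest and pdest in srcs:  # delay past the producer's latency
                t = max(t, issue[j] + 1 + LATENCY.get((pop, op), 0))
        issue[i] = t
    return t + 1  # one stall in the branch-delay slot

print(cycles_per_iteration(loop))  # → 9
```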
We can schedule the loop to obtain:

Loop:  LD    F0, 0(R1)
       stall
       ADDD  F4, F0, F2
       SUB   R1, R1, #8
       BNEZ  R1, Loop       ; delayed branch
       SD    8(R1), F4      ; changed because interchanged with SUB

Average: 6 cycles per element
Loop unrolling:

Here is the result after dropping the unnecessary SUB and BNEZ operations duplicated during unrolling.

Loop:  LD    F0, 0(R1)
       ADDD  F4, F0, F2
       SD    0(R1), F4      ; drop SUB & BNEZ
       LD    F6, -8(R1)
       ADDD  F8, F6, F2
       SD    -8(R1), F8     ; drop SUB & BNEZ
       LD    F10, -16(R1)
       ADDD  F12, F10, F2
       SD    -16(R1), F12   ; drop SUB & BNEZ
       LD    F14, -24(R1)
       ADDD  F16, F14, F2
       SD    -24(R1), F16
       SUB   R1, R1, #32
       BNEZ  R1, Loop

Average: 6.8 cycles per element
Unrolling + Scheduling

Show the unrolled loop in the previous example after it has been scheduled on DLX.

Loop:  LD    F0, 0(R1)
       LD    F6, -8(R1)
       LD    F10, -16(R1)
       LD    F14, -24(R1)
       ADDD  F4, F0, F2
       ADDD  F8, F6, F2
       ADDD  F12, F10, F2
       ADDD  F16, F14, F2
       SD    0(R1), F4
       SD    -8(R1), F8
       SD    -16(R1), F12
       SUB   R1, R1, #32    ; branch dependence
       BNEZ  R1, Loop
       SD    8(R1), F16     ; 8 - 32 = -24

The execution time of the unrolled loop has dropped to a total of 14 clock cycles, or 3.5 clock cycles per element, compared to 6.8 per element before scheduling.
Simple unrolling:

[Figure: dataflow graph of the unrolled loop body: four LD -> + -> SD chains (F0/F4 at offset 0, F6/F8 at -8, F10/F12 at -16, F14/F16 at -24 from R1), executed one after another over cycles 1-12.]

We have eliminated three branches and three decrements of R1. The addresses on the loads and stores have been compensated for. Without scheduling, every operation is followed by a dependent operation and will therefore cause a stall. This loop runs in 27 clock cycles (each LD takes 2 clock cycles, each ADDD 3, the branch 2, and all other instructions 1), or 6.8 clock cycles for each of the four elements.

y[i] = x[i] + a

27 cycles / 4 elements ≈ 6.8 cycles/element
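The 27-cycle arithmetic checks out with the per-instruction costs quoted above:

```python
# Cost of the unrolled, unscheduled loop body, using the per-instruction
# cycle counts from the text: LD 2 (load-use stall), ADDD 3 (two stalls
# before the dependent SD), BNEZ 2 (delay-slot stall), everything else 1.
cost = {"LD": 2, "ADDD": 3, "SD": 1, "SUB": 1, "BNEZ": 2}
body = ["LD", "ADDD", "SD"] * 4 + ["SUB", "BNEZ"]
total = sum(cost[op] for op in body)
print(total, total / 4)  # 27 cycles; 6.75, which the slides round to 6.8
```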
Unrolling + Scheduling

[Figure: dataflow graph of the scheduled unrolled loop: the same four LD -> + -> SD chains (F0/F4, F6/F8, F10/F12, F14/F16), but with the four LDs issued in cycles 1-4 and the four ADDDs in cycles 5-8, so loads, adds, and stores of different elements overlap.]

14 cycles / 4 elements = 3.5 cycles/element