Advanced Compiler Techniques
LIU Xianhua
School of EECS, Peking University
Parallelism & Locality
Outline
• Data dependences
• Loop transformation
• Software pipelining
• Software prefetching & data layout
Data Dependence of Variables
• True dependence:   a = …  followed by  … = a
• Anti-dependence:   … = a  followed by  a = …
• Output dependence: a = …  followed by  a = …
• Input dependence:  … = a  followed by  … = a
Domain of Data Dependence Analysis

Only use loop bounds and array indexes that are affine functions of loop variables:

for i = 1 to n
  for j = 2i to 100
    a[i + 2j + 3][4i + 2j][i * i] = …
    … = a[1][2i + 1][j]

Assume a data dependence between the read and write operation if there exist integers iw, jw, ir, jr such that:

1 ≤ iw, ir ≤ n
2iw ≤ jw ≤ 100
2ir ≤ jr ≤ 100
iw + 2jw + 3 = 1
4iw + 2jw = 2ir + 1

Equate each dimension of the array access; ignore non-affine ones (such as i * i). No solution ⇒ no data dependence. A solution ⇒ there may be a dependence.
Iteration Space
An abstraction for loops
Iteration is represented as coordinates in iteration space.
for i = 0, 5
  for j = 0, 3
    a[i, j] = 3

[figure: the 6 × 4 rectangular iteration space, axes i and j]
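The iteration-space abstraction can be made concrete with a short sketch (variable names are mine) that enumerates the points of the loop nests on these slides:

```python
# Enumerate the rectangular iteration space of:
#   for i = 0, 5
#     for j = 0, 3
rect = [(i, j) for i in range(0, 6) for j in range(0, 4)]

# Enumerate the triangular space of the variant with j = i, 3:
tri = [(i, j) for i in range(0, 6) for j in range(i, 4)]

print(len(rect))  # 24 points: a 6 x 4 rectangle
print(len(tri))   # 10 points: iterations with i > 3 execute no j iterations
```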
Iteration Space

An abstraction for loops:

for i = 0, 5
  for j = i, 3
    a[i, j] = 0

[figure: the triangular iteration space, axes i and j]
Iteration Space

An abstraction for loops:

for i = 0, 5
  for j = i, 7
    a[i, j] = 0

[figure: the iteration space with its bounds written as a system of linear inequalities in i and j]
Affine Access

An array access is affine when every index expression is a linear function of the loop variables plus a constant, e.g. a[i + 2j + 3].
Affine Transform

An affine transform maps each iteration (i, j) to a new iteration (u, v):

(u, v)ᵀ = B (i, j)ᵀ + b
Loop Transformation
Original:

for i = 1, 100
  for j = 1, 200
    A[i, j] = A[i, j] + 3
  end_for
end_for

After interchange:

for u = 1, 200
  for v = 1, 100
    A[v, u] = A[v, u] + 3
  end_for
end_for

This interchange is the unimodular transform (u, v)ᵀ = [[0, 1], [1, 0]] (i, j)ᵀ.
Interchange Loops?
for i = 2, 1000
  for j = 1, 1000
    A[i, j] = A[i-1, j+1] + 3
  end_for
end_for

• e.g. dependence vector d_old = (1, −1)

Interchanging swaps the components, giving (−1, 1), which is lexicographically negative, so the interchanged version below is illegal:

for u = 1, 1000
  for v = 2, 1000
    A[v, u] = A[v-1, u+1] + 3
  end_for
end_for
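The legality check can be sketched in a few lines (function names are illustrative, not from the lecture): interchange swaps the components of every dependence vector, and the result must remain lexicographically positive.

```python
def lex_positive(v):
    # A dependence vector is valid when its first nonzero component is positive.
    for x in v:
        if x != 0:
            return x > 0
    return False  # the zero vector carries no cross-iteration dependence

def interchange_legal(dep_vectors):
    # Interchange swaps (d1, d2) -> (d2, d1); it is legal only if every
    # swapped vector remains lexicographically positive.
    return all(lex_positive((d2, d1)) for (d1, d2) in dep_vectors)

print(interchange_legal([(1, -1)]))  # False: (-1, 1) is lexicographically negative
print(interchange_legal([(1, 1)]))   # True
```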
GCD Test
Is there any dependence?

for i = 1, 100
  a[2*i] = …
  … = a[2*i+1] + 3

Solve a linear Diophantine equation: 2·iw = 2·ir + 1.
GCD

The greatest common divisor (GCD) of integers a1, a2, …, an, denoted gcd(a1, a2, …, an), is the largest integer that evenly divides all these integers.

Theorem: the linear Diophantine equation

a1 x1 + a2 x2 + … + an xn = c

has an integer solution x1, x2, …, xn iff gcd(a1, a2, …, an) divides c.
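The theorem translates directly into code; applied to the previous slide's equation (2·iw − 2·ir = 1), it proves independence. A minimal sketch:

```python
from functools import reduce
from math import gcd

def gcd_test(coeffs, c):
    """a1*x1 + ... + an*xn = c has an integer solution
    iff gcd(a1, ..., an) divides c."""
    g = reduce(gcd, (abs(a) for a in coeffs))
    return c % g == 0

# a[2*i] = ...  versus  ... = a[2*i+1]  gives  2*iw - 2*ir = 1
print(gcd_test([2, -2], 1))   # False: gcd = 2 does not divide 1, no dependence
print(gcd_test([2, -2], 4))   # True: a solution exists, there may be a dependence
```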
Loop Transformation

• Loop Fusion
• Loop Distribution/Fission
• Loop Re-Indexing
• Loop Scaling
• Loop Reversal
• Loop Permutation/Interchange
• Loop Skewing

Unimodular Transformation
Cache Locality

Suppose array A has column-major layout. Then this loop nest has poor spatial cache locality:

for i = 1, 100
  for j = 1, 200
    A[i, j] = A[i, j] + 3
  end_for
end_for

Memory order: A[1,1], A[2,1], …, A[1,2], A[2,2], …, A[1,3], …; the inner j loop strides across columns instead of walking down one.
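The effect can be illustrated with a deliberately crude miss model (names and the 8-word line size are my assumptions): a single line-sized cache slot, column-major addressing, counting a miss whenever the traversal leaves the current line.

```python
def line_crossings(walk, rows, line_words=8):
    # Column-major layout: element (i, j) lives at address j*rows + i.
    # Count a miss every time the walk leaves the current cache line
    # (a one-line cache model, enough to show the stride effect).
    last, misses = None, 0
    for i, j in walk:
        line = (j * rows + i) // line_words
        if line != last:
            misses += 1
            last = line
    return misses

rows, cols = 100, 200
by_rows = [(i, j) for i in range(rows) for j in range(cols)]  # the loop above
by_cols = [(i, j) for j in range(cols) for i in range(rows)]  # interchanged

print(line_crossings(by_rows, rows))  # 20000: every access starts a new line
print(line_crossings(by_cols, rows))  # 2500: one miss per 8-word line
```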
Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

[figure: performance improvements of roughly 1×–3× on compress, cholesky (nasa7), spice, mxm (nasa7), btrix (nasa7), tomcatv, gmty (nasa7), and vpenta (nasa7), from merged arrays, loop interchange, loop fusion, and blocking]
Resource Constraints

Functional units: ld, st, alu, fmpy, fadd, br, …

Each instruction type has a resource reservation table (resources × time slots):

• Pipelined functional units occupy only one slot
• Non-pipelined functional units occupy multiple time slots
• Instructions may use more than one resource
• There may be multiple units of the same resource
• Limited instruction issue slots may also be managed like a resource
List Scheduling

• Scope: DAGs
• Schedules operations in topological order
• Never backtracks

Variations: priority function for node n
• critical path: max clocks from n to any node
• resource requirements
• source order
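A minimal list scheduler in this spirit, using critical-path length as the priority (the tiny DAG and the issue width are made up for illustration):

```python
def list_schedule(ops, edges, latency, width=1):
    preds = {o: [] for o in ops}
    succs = {o: [] for o in ops}
    for p, s in edges:
        preds[s].append(p)
        succs[p].append(s)

    prio = {}
    def critical_path(o):  # longest latency-weighted path from o to any leaf
        if o not in prio:
            prio[o] = latency[o] + max((critical_path(s) for s in succs[o]),
                                       default=0)
        return prio[o]

    start, cycle = {}, 0
    while len(start) < len(ops):
        # ready = all predecessors scheduled and finished by this cycle
        ready = sorted((o for o in ops
                        if o not in start
                        and all(p in start and start[p] + latency[p] <= cycle
                                for p in preds[o])),
                       key=lambda o: -critical_path(o))
        for o in ready[:width]:   # fill this cycle's issue slots, no backtracking
            start[o] = cycle
        cycle += 1
    return start

# tiny DAG: a and b feed c; a has a 2-cycle latency
sched = list_schedule(["a", "b", "c"], [("a", "c"), ("b", "c")],
                      {"a": 2, "b": 1, "c": 1})
print(sched)  # {'a': 0, 'b': 1, 'c': 2}
```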
Global Scheduling
Assume each clock can execute 2 operations of any kind.

Source:

    if (a == 0) goto L
    e = d + d
    c = b
L:

Machine code, by basic block:

B1:  LD R6 <- 0(R1)
     nop
     BEQZ R6, L
B2:  LD R8 <- 0(R4)
     nop
     ADD R8 <- R8, R8
     ST 0(R5) <- R8
L: B3:
     LD R7 <- 0(R2)
     nop
     ST 0(R3) <- R7
Result of Code Scheduling

B1:  LD R6 <- 0(R1) ; LD R8 <- 0(R4)
     LD R7 <- 0(R2)
     ADD R8 <- R8, R8 ; BEQZ R6, L
B3’ (copy of B3 merged into the fall-through path):
     ST 0(R5) <- R8 ; ST 0(R3) <- R7
L: B3:
     ST 0(R3) <- R7
Code Motions
Goal: shorten execution time probabilistically.

Moving instructions up:
• Move instruction to a cut set (from entry)
• Speculation: execute even when not anticipated

Moving instructions down:
• Move instruction to a cut set (from exit)
• May execute extra instructions
• Can duplicate code
General-Purpose Applications
• Lots of data dependences
• Key performance factor: memory latencies
• Move memory fetches up
  – Speculative memory fetches can be expensive
• Control-intensive: get execution profile
  – Static estimation
    • Innermost loops are frequently executed; back edges are likely to be taken
    • Edges that branch to exit and exception routines are not likely to be taken
  – Dynamic profiling
    • Instrument code and measure using representative data
A Basic Global Scheduling Algorithm
• Schedule innermost loops first
• Only upward code motion
• No creation of copies
• Only one level of speculation

Extra: prepass before scheduling: loop unrolling
Software Pipelining
Software Pipelining
• Obtain parallelism by executing iterations of a loop in an overlapping way.
• Try to maximize parallelism across loop iterations
• We’ll focus on simplest case: the do-all loop, where iterations are independent.
• Goal: Initiate iterations as frequently as possible.
• Limitation: Use same schedule and delay for each iteration.
Example of DoAll Loops
Machine (per clock): 1 read, 1 write, 1 (2-stage) arithmetic op, with a hardware loop op and auto-incrementing addressing mode.

Source code:

for i = 1 to n
  D[i] = A[i] * B[i] + c

Code for one iteration (empty slots are pipeline delays):

1. LD R5, 0(R1++)
2. LD R6, 0(R2++)
3. MUL R7, R5, R6
4.
5. ADD R8, R7, R4
6.
7. ST 0(R3++), R8

There is no parallelism within this basic block.
Unrolling

 1. L: LD
 2.    LD
 3.    LD
 4.    MUL  LD
 5.    MUL  LD
 6.    ADD  LD
 7.    ADD  LD
 8.    ST   MUL  LD
 9.    MUL
10.    ST   ADD
11.    ADD
12.    ST
13.    ST   BL (L)

• Let u be the degree of unrolling:
  – Length of u iterations = 7 + 2(u − 1)
  – Execution time per source iteration = (7 + 2(u − 1)) / u = 2 + 5/u
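The 2 + 5/u formula is easy to evaluate; it shows unrolling approaching, but never reaching, the 2-cycle bound per source iteration:

```python
def cycles_per_source_iteration(u):
    # u unrolled iterations take 7 + 2*(u - 1) cycles on this machine model
    return (7 + 2 * (u - 1)) / u

for u in (1, 2, 4, 8, 100):
    print(u, cycles_per_source_iteration(u))
# 7.0, 4.5, 3.25, 2.625, 2.05: approaches but never reaches 2 cycles/iteration
```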
Software Pipelined Code

 1. LD
 2. LD
 3. MUL  LD
 4.      LD
 5. MUL  LD
 6. ADD  LD
 7. MUL  LD
 8. ST   ADD  LD
 9. MUL  LD
10. ST   ADD  LD
11. MUL
12. ST   ADD
13.
14. ST   ADD
15.
16. ST

• Unlike unrolling, software pipelining can give an optimal result.
• Locally compacted code may not be globally optimal.
• DOALL: can fill arbitrarily long pipelines with infinitely many iterations.
Example of DoAcross Loop

Loop: Sum = Sum + A[i];
      B[i] = A[i] * c;

One iteration:        Software pipelined:
1. LD                 1. LD
2. MUL                2. MUL
3. ADD                3. ADD  LD
4. ST                 4. ST   MUL
                      5.      ADD
                      6.      ST

• Doacross loops: recurrences can be parallelized.
• Harder to fully utilize hardware with large degrees of parallelism.
Problem Formulation

Goals:
• maximize throughput
• small code size

Find:
• an identical relative schedule S(n) for every iteration
• a constant initiation interval (T)

such that the initiation interval is minimized.

Complexity: NP-complete in general.

Example (T = 2):

t=0: LD
t=1: MUL
t=2: ADD  LD
t=3: ST   MUL
t=4:      ADD
t=5:      ST
Scheduling Constraints: Resource

• RT: resource reservation table for a single iteration
• RTs: modulo resource reservation table

RTs[i] = ∪ { RT[t] : t mod T = i }

[figure: the LD/ALU/ST reservation tables of iterations 1–4 overlapped at T = 2, and the resulting steady-state modulo reservation table]
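The folding rule above can be sketched directly; resource names follow the running example, and the Counter-based tallies are my addition so that over-subscription stays visible:

```python
from collections import Counter

def modulo_reservation(rt, T):
    # RTs[i] = union over { t : t mod T == i } of RT[t], keeping counts
    rts = [Counter() for _ in range(T)]
    for t, resources in enumerate(rt):
        rts[t % T].update(resources)
    return rts

def fits(rts, machine):
    # every modulo slot must stay within the machine's per-clock resources
    return all(slot[r] <= machine.get(r, 0) for slot in rts for r in slot)

# one iteration: LD (MEM) at t=0, MUL (ALU) at t=1, ADD (ALU) at t=2, ST (MEM) at t=3
rt = [["MEM"], ["ALU"], ["ALU"], ["MEM"]]
print(fits(modulo_reservation(rt, 2), {"MEM": 1, "ALU": 1}))  # True: T = 2 works
print(fits(modulo_reservation(rt, 1), {"MEM": 1, "ALU": 1}))  # False: T = 1 overcommits
```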
Is It Possible to Have an Iteration Start at Every Clock?

• Hint: no.
• Why? An iteration injects 2 MEM and 2 ALU resource requirements.
  – If injected every clock, the machine cannot possibly satisfy all requests.
• Minimum delay = 2.
Assigning Registers
• We don’t need an infinite number of registers.
• We can reuse registers for iterations that do not overlap in time.
• But we can’t just use the same old registers for every iteration.
Assigning Registers (2)

• The inner loop may have to involve more than one copy of the smallest repeating pattern.
  – Enough so that registers may be reused at each iteration of the expanded inner loop.
• Our example: 3 iterations coexist, so we need 3 sets of registers and 3 copies of the pattern.
Example: Assigning Registers
• Our original loop used registers:
  – r9 to hold a constant 4N
  – r8 to count iterations and index the arrays
  – r1 to copy a[i] into b[i]
• The expanded loop needs:
  – r9 to hold 12N
  – r6, r7, r8 to count iterations and index
  – r1, r2, r3 to copy certain array elements
The Loop Body
L:   ADD r8,r8,#12    nop             LD r3,a(r6)
     BGE r8,r9,L'     ST b(r7),r2     nop
     LD r1,a(r8)      ADD r7,r7,#12   nop
     nop              BGE r7,r9,L''   ST b(r6),r3
     nop              LD r2,a(r7)     ADD r6,r6,#12
     ST b(r8),r1      nop             BLT r6,r9,L

Each register handles every third element of the arrays. The BGE branches break the loop early; L' and L'' are places for appropriate codas.
Cyclic Data Dependence Graph
• We assumed that data at an iteration depends only on data computed at the same iteration.
  – Not even true for our example: r8 is computed from its previous iteration.
  – But it doesn’t matter in this example.
• Fixup: edge labels have two components: (iteration change, delay).
Cyclic Data Dependence Graph

Label each edge with ⟨δ, d⟩, where δ = iteration difference and d = delay. The constraint for an edge n1 → n2 is:

δ × T + S(n2) − S(n1) ≥ d

(A) LD r1,a(r8)
(B) ST b(r8),r1
(C) ADD r8,r8,#4
(D) BLT r8,r9,L

Edges: A →⟨0,2⟩ B, B →⟨0,1⟩ C, C →⟨0,1⟩ D, C →⟨1,1⟩ A.

(C) must wait at least one clock after the (B) from the same iteration. (A) must wait at least one clock after the (C) from the previous iteration.
Matrix of Delays
• Let T be the delay between the start times of one iteration and the next.
• Replace edge label ⟨i, j⟩ by the delay j − iT.
• Compute, for each pair of nodes n and m, the total delay along the longest acyclic path from n to m.
• This gives upper and lower bounds relating the times at which n and m can be scheduled.
Example: Delay Matrix

[tables: the edge-delay matrix over nodes A, B, C, D (entries include 2, 1, 1 and 1 − T) and its acyclic transitive closure (entries include 3, 4, 2 − T and 3 − T)]

From the closure:

S(B) ≥ S(A) + 2
S(A) ≥ S(B) + 2 − T

i.e. S(B) − 2 ≥ S(A) ≥ S(B) + 2 − T.

Note: this implies T ≥ 4 (because only one register is used for loop-counting). If T = 4, then A (LD) must be exactly 2 clocks before B (ST). If T = 5, A can be 2–3 clocks before B.
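These bounds can also be checked mechanically: each edge ⟨δ, d⟩ imposes S(dst) − S(src) ≥ d − δT, and a candidate T is feasible iff the constraint graph has no positive-weight cycle. A Bellman-Ford-style sketch over the four-node graph (the encoding is mine):

```python
def feasible(edges, nodes, T):
    # edge (src, dst, delta, d) encodes S(dst) - S(src) >= d - delta*T;
    # satisfiable iff there is no positive-weight cycle in the graph
    dist = {n: 0 for n in nodes}
    for _ in nodes:                       # |V| longest-path relaxation passes
        for s, t, delta, d in edges:
            dist[t] = max(dist[t], dist[s] + d - delta * T)
    # any further improvement means a positive cycle -> infeasible
    return all(dist[s] + d - delta * T <= dist[t] for s, t, delta, d in edges)

edges = [("A", "B", 0, 2), ("B", "C", 0, 1), ("C", "D", 0, 1), ("C", "A", 1, 1)]
print(min(T for T in range(1, 10) if feasible(edges, "ABCD", T)))  # 4
```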
Iterative Modulo Scheduling
• Compute the lower bound (MII) on the delay between the start times of one iteration and the next (the initiation interval, aka II):
  – due to resources
  – due to recurrences
• Try to find a schedule for II = MII.
• If no schedule can be found, try a larger II.
The Algorithm
1. Choose an initiation interval, ii
   > Compute lower bounds on ii
   > Shorter ii means faster overall execution
2. Generate a loop body that takes ii cycles
   > Try to schedule into ii cycles, using a modulo scheduler
   > If it fails, bump ii by one and try again
3. Generate the needed prologue & epilogue code
   > For the prologue, work backward from upward-exposed uses in the scheduled loop body
   > For the epilogue, work forward from downward-exposed definitions in the scheduled loop body

M. Lam, "Software pipelining: An effective scheduling technique for VLIW machines", in Proceedings of the ACM SIGPLAN 88 Conference on Programming Language Design and Implementation (PLDI 88), July 1988, pages 318-328. Also published as ACM SIGPLAN Notices 23(7).
A Simple Sum Reduction Loop

Source code:

c = 0
for i = 1 to n
  c = c + a[i]

LLIR code:

      rc ← 0
      r@a ← @a
      r1 ← n × 4
      rub ← r1 + r@a
      if r@a > rub goto Exit
Loop: ra ← MEM(r@a)
      rc ← rc + ra
      r@a ← r@a + 4
      if r@a ≤ rub goto Loop
Exit: c ← rc
OOO Execution

[figures: the loop’s steady-state behavior, prologue, and epilogue under out-of-order execution]
Implementing the Concept

General schema for the loop:

Prologue:
      ra ← MEM(r@a)
      r@a ← r@a + 4
      if r@a > rub goto Exit

Kernel:
Loop: ra ← MEM(r@a)
      r@a ← r@a + 4
      rc ← rc + ra
      if r@a ≤ rub goto Loop

Epilogue:
      r@a ← r@a + 4
      rc ← rc + ra

The actual schedule must respect both the data dependences and the operation latencies.
The Example

The code:

 1.       rc ← 0
 2.       r@a ← @a
 3.       r1 ← n × 4
 4.       rub ← r1 + r@a
 5.       if r@a > rub goto Exit
 6. Loop: ra ← MEM(r@a)
 7.       rc ← rc + ra
 8.       r@a ← r@a + 4
 9.       if r@a ≤ rub goto Loop
10. Exit: c ← rc

Focus on the loop body (ops 6–9).

[figure: the dependence graph; op 6 is not involved in a cycle]
The Example

Template for the modulo schedule (ii = 2):

Label  | Fetch Unit | Integer Unit | Branch Unit
Loop:  |            |              |
       |            |              |

Scheduling the loop body (ops 6–9):

• Schedule 6 on the fetch unit
• Schedule 8 on the integer unit
• Advance the scheduler’s clock
• Schedule 9 on the branch unit
• Advance the clock (modulo ii)
• Advance the clock again
• Schedule 7 on the integer unit
• No unscheduled ops remain in the loop body

The final schedule for the loop’s body:

Label  | Fetch Unit | Integer Unit | Branch Unit
Loop:  | 6          | 8            |
       |            | 7            | 9
The Example: Final Schedule

Label  | Fetch Unit    | Integer Unit   | Branch Unit
       | nop           | r@a ← @a       | nop
       | nop           | r1 ← n × 4     | nop
       | nop           | rub ← r1 + r@a | nop
       | ra ← MEM(r@a) | r@a ← r@a + 4  | nop
       | nop           | rc ← 0         | if r@a > rub goto Exit
Loop:  | ra ← MEM(r@a) | r@a ← r@a + 4  | nop
       | nop           | rc ← rc + ra   | if r@a ≤ rub goto Loop
Exit:  | nop           | nop            | nop
       | nop           | rc ← rc + ra   | nop

What about the empty slots? Fill them (if needed) in some other way, e.g. fuse the loop with another loop that is memory bound.
Global Software Pipelining: Conditional Execution

• Control dependence → data dependence
• Hardware support: FU, operation field, extra pipe stages
• Conditional composition: code inflation, complicated scheduling
• Branch prediction: dynamic vs. static
Summary

Software pipelining vs. loop unrolling: software pipelining eliminates most dependences and improves ILP, with not much code inflation, but at the cost of register pressure.
Data Prefetching
Why Data Prefetching
• Increasing processor–memory “distance”
• Caches do work!!! … IF …
  – the data set is cache-able and accesses are local (in space/time)
• Else? …
Data Prefetching
• What is it?
  – A request for a future data need is initiated
  – Useful execution continues during the access
  – Data moves from slow/far memory to fast/near cache
  – Data is ready in cache when needed (load/store)
Data Prefetching
• When can it be used?
  – Future data needs are (somewhat) predictable
• How is it implemented?
  – In hardware: history-based prediction of future accesses
  – In software: compiler-inserted prefetch instructions
Software Data Prefetching
• Compiler-scheduled prefetches
• Moves entire cache lines (not just a datum)
  – Spatial locality assumed (often the case)
• Typically a non-faulting access
  – Compiler is free to speculate the prefetch address
• Hardware is not obligated to obey
  – A performance enhancement, no functional impact
  – Loads/stores may be preferentially treated
Software Data Prefetching Use
• Mostly in scientific codes
  – Vectorizable loops accessing arrays deterministically
    • Data access pattern is predictable
    • Prefetch scheduling is easy (far in time, near in code)
  – Large working data sets consumed
    • Even large caches are unable to capture access locality
• Sometimes in integer codes
  – Loops with pointer dereferences
Selective Data Prefetch
do j = 1, n
  do i = 1, m
    A(i,j) = B(1,i) + B(1,i+1)
  enddo
enddo

E.g. A(i,j) has spatial locality, therefore only one prefetch is required for every cache line.

[figure: access patterns over A (column by column) and over row 1 of B, elements 1..m]
68
Formal Definitions

• Temporal locality occurs when a given reference reuses exactly the same data location.
• Spatial locality occurs when a given reference accesses different data locations that fall within the same cache line.
• Group locality occurs when different references access the same cache line.
69
Prefetch Predicates

• If an access has spatial locality, only the first access to a cache line will incur a miss.
• For temporal locality, only the first access will incur a cache miss.
• If an access has group locality, only the leading reference incurs a cache miss.
• If an access has no locality, it will miss in every iteration.
70
Example Code with Prefetches

do j = 1, n
  do i = 1, m
    A(i,j) = B(1,i) + B(1,i+1)
    if (iand(i,7) == 0) prefetch (A(i+k,j))
    if (j == 1) prefetch (B(1,i+t))
  enddo
enddo

Assumed cache line size (CLS) = 64 bytes and data size = 8 bytes; k and t are prefetch distance values.
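The effect of the iand(i,7) == 0 predicate is easy to quantify: with 8-byte elements in 64-byte lines, one prefetch covers eight iterations. A small sketch (helper name is mine):

```python
def prefetches_issued(m):
    # the predicate iand(i,7) == 0 fires once every 8 iterations,
    # i.e. once per 64-byte cache line of 8-byte elements
    return sum(1 for i in range(1, m + 1) if (i & 7) == 0)

m = 1000
print(prefetches_issued(m))  # 125 prefetches instead of 1000 (one per iteration)
```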
71
Spreading of Prefetches

• If more than one reference has spatial locality within the same loop nest, spread these prefetches across the 8-iteration window.
• This reduces stress on the memory subsystem by minimizing the number of outstanding prefetches.
Example Code with Spreading

do j = 1, n
  do i = 1, m
    C(i,j) = D(i-1,j) + D(i+1,j)
    if (iand(i,7) == 0) prefetch (C(i+k,j))
    if (iand(i,7) == 1) prefetch (D(i+k+1,j))
  enddo
enddo

Assumed CLS = 64 bytes and data size = 8 bytes; k is the prefetch distance value.
73
Prefetch Strategy: Conditional

Example loop:

L: Load A(I)
   Load B(I)
   ...
   I = I + 1
   Br L, if I < n

Conditional prefetching:

L: Load A(I)
   Load B(I)
   Cmp pA = (I mod 8 == 0)
   if (pA) prefetch A(I+X)
   Cmp pB = (I mod 8 == 1)
   if (pB) prefetch B(I+X)
   ...
   I = I + 1
   Br L, if I < n

Costs: code for condition generation; prefetches occupy issue slots.
74
Prefetch Strategy: Unroll

Example loop:

L: Load A(I)
   Load B(I)
   ...
   I = I + 1
   Br L, if I < n

Unrolled by 8, with prefetches spread across the window:

Unr_Loop:
   prefetch A(I+X)
   load A(I)
   load B(I)
   ...
   prefetch B(I+X)
   load A(I+1)
   load B(I+1)
   ...
   prefetch C(I+X)
   load A(I+2)
   load B(I+2)
   ...
   prefetch D(I+X)
   load A(I+3)
   load B(I+3)
   ...
   prefetch E(I+X)
   load A(I+4)
   load B(I+4)
   ...
   load A(I+5)
   load B(I+5)
   ...
   load A(I+6)
   load B(I+6)
   ...
   load A(I+7)
   load B(I+7)
   ...
   I = I + 8
   Br Unr_Loop, if I < n

Costs: code bloat (>8×); a remainder loop is needed.
Software Data Prefetching Cost

• Requires memory instruction resources
  – A prefetch instruction for each access stream
• Issues every iteration, but needed less often
  – If branched around, inefficient execution results
  – If conditionally executed, more instruction overhead results
  – If the loop is unrolled, code bloat results
Software Data Prefetching Cost (2)

• Redundant prefetches get in the way
  – Resources are consumed until the prefetches are discarded
• Non-redundant prefetches need careful scheduling
  – Resources are overwhelmed when many issue and miss
• Redundant prefetches increase power/energy consumption
Prefetch Execution Diagram

[figure omitted]

Measurements

[figure omitted]
Data Re-placement/Layout

for (i = 0; i < 12; i++)
  A[i] = B[i+3] + B[i];

• Traditional analysis: max = 27 words
• Detailed analysis: max = 15 words

[figure: a 16-word cache backed by main memory; laying out A[] and B[] so the 15-word working set fits in the cache, versus the traditional 27-word estimate]
Summary
• Today
  – Data dependences
  – Loop transformation
  – Software pipelining
  – Software prefetching
  – Data replacement
• Next
  – Compilers for modern architectures
  – Future compilers