TRANSCRIPT
1
CR18: Advanced Compilers
L05: Scheduling for Locality
Tomofumi Yuki
2
Recap
Last time we saw scheduling techniques: max. parallelism != best performance
This time: how can we do better?
3
Pluto Strategy
We want: only 1D parallelism; coarse-grained (outer) parallelism; good data locality
We want tiling: wave-front parallelism is guaranteed; each tile can be executed atomically; good for sequential performance
4
Intuition of Pluto Algorithm
Skew and Tile
[figure: iteration space before and after skewing, axes i and j]
5
Tiling Hyper-Planes
Another name for a 1D schedule θ; a set of θs defines a tiling.
Defines the transform: (i,j -> i+j,i) corresponds to the skew in the prev. slide
[figure: skewed iteration space with hyper-planes θ1=i+j and θ2=i]
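To make the transform concrete, here is a small sketch (illustrative only, not Pluto code; the dependence (i,j -> i+1,j-1) is assumed for the example) that applies these hyper-planes point-wise and checks weak satisfaction:

```python
# Illustrative sketch (not Pluto code): the hyper-planes theta1 = i+j and
# theta2 = i, applied point-wise, realize the skew from the slide.

def theta(i, j):
    # The transform (i, j -> i+j, i).
    return (i + j, i)

# Hypothetical stencil-like dependence (i, j) -> (i+1, j-1).
# For legal tiling, every dependence must have non-negative components
# in ALL transformed dimensions (weak satisfaction).
ok = True
for i in range(8):
    for j in range(1, 8):
        src, dst = theta(i, j), theta(i + 1, j - 1)
        ok &= dst[0] >= src[0] and dst[1] >= src[1]
print("weakly satisfied in both dimensions:", ok)
```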
6
Legality of Tiling
Each tiling hyper-plane must satisfy: θT(t) - θS(s) ≥ 0 for every dependence S[s] -> T[t]
What is the difference from the causality condition? Note that this is about the affine transform, not the schedule.
Must be weakly satisfied for each dimension!
7
What does the condition mean? 1. Fully Permutable
Recall that the θs define the transform: all statements are mapped to a common d-D space. Let i1, ..., in be the new indices.
Weakly satisfied in all dimensions: i1 ≥ i'1, ..., in ≥ i'n for all dependences.
This is a reformulation of the fully permutable condition; it works for scheduling imperfect loop nests.
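A quick illustration of what full permutability buys (an assumed 2D recurrence, not from the slides): when every dependence is non-negative in every dimension, the loops can be interchanged without changing the result.

```python
# Sketch: a nest with dependences (1,0) and (0,1) -- non-negative in all
# dimensions, hence fully permutable: the i and j loops may be swapped.
N = 6

def run(order):
    A = [[1.0] * N for _ in range(N)]
    for i, j in order:
        A[i][j] = A[i - 1][j] + A[i][j - 1]
    return A

ij = [(i, j) for i in range(1, N) for j in range(1, N)]   # i outer
ji = [(i, j) for j in range(1, N) for i in range(1, N)]   # j outer
print(run(ij) == run(ji))
```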
8
What does the condition mean? 2. All statements are fused
Somewhat implied by full permutability. What are the possible dependences? From S1 to S2, and from S2 to S1.
Exception: when S1 does not use values of S2.
for i
  for j
    S1
  for j
    S2
9
Selecting Tiling Hyper-planes Which is better?
[figure: two candidate tilings of the iteration space, axes i and j]
10
Cost Functions in Pluto
Formulated as: δe = θT(t) - θS(s) for each dependence e: S[s] -> T[t]
What does this capture?
dep: (i,j -> i+1,j-1)
δ1 = ((i+1)+(j-1)) - (i+j) = 0
δ2 = (i+1) - (i) = 1
[figure: hyper-planes θ1=i+j and θ2=i over the iteration space]
11
Cost Functions in Pluto
Formulated as: δe = θT(t) - θS(s) for each dependence e: S[s] -> T[t]
What does this capture?
dep: (i,j -> i+1,j-1)
δ1 = ((i+1)+(j-1)) - (i+j) = 0
δ2 = ((i+1)-(j-1)) - (i-j) = 2
[figure: hyper-planes θ1=i+j and θ2=i-j over the iteration space]
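The δs on these two slides can be checked numerically; this sketch (illustrative, not Pluto's implementation) evaluates δe = θ(target) - θ(source) for the dependence (i,j -> i+1,j-1) under both choices of θ2:

```python
# Sketch: delta_e = theta(target) - theta(source) for the dependence
# (i, j) -> (i+1, j-1), under the hyper-planes from the two slides.

def delta(theta, src, dst):
    return theta(*dst) - theta(*src)

src, dst = (3, 7), (4, 6)               # one instance of (i,j) -> (i+1,j-1)
theta1 = lambda i, j: i + j             # shared first hyper-plane
d1 = delta(theta1, src, dst)            # 0: reuse stays on the hyper-plane
d2a = delta(lambda i, j: i, src, dst)       # slide 10: theta2 = i
d2b = delta(lambda i, j: i - j, src, dst)   # slide 11: theta2 = i - j
print(d1, d2a, d2b)
```

A smaller δ means a shorter reuse distance along that dimension, which is why θ2 = i is the better choice here.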
12
Reuse Distance
When the θ corresponds to a sequential loop.
Two dependences: (i,j -> i+1,j) and (i,0 -> i,j) : j>0. What are the δs?
δ represents the number of iterations of the loop (corresponding to θ) until reuse via the dependence e.
[figure: hyper-planes θ1=i and θ2=j over the iteration space]
13
Communication Volume
When the θ corresponds to a parallel loop. Let si, sj be the tile sizes.
Horizontal dependence: sj values to the horizontal neighbor.
Vertical dependence: si values to N/sj tiles.
Constant is better; 0 is even better!
[figure: communication between tiles, axes i and j]
14
Iterative Search
We need d hyper-planes for a d-D space. Note that we are not looking for parallelism here; parallelism comes with the tile wave-fronts.
Approach: find one θ for each statement; constrain the space to be linearly independent of the θs already found; repeat.
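The iterative structure can be sketched as follows (heavily simplified: the real algorithm finds each θ by solving an ILP, whereas the candidate hyper-planes here are hypothetical; only the linear-independence filtering is shown):

```python
# Sketch of the iterative search skeleton (not Pluto's ILP): keep
# accepting hyper-planes that are linearly independent of those found.
from fractions import Fraction

def is_lin_indep(found, cand):
    # Rank check by Gaussian elimination over the rationals.
    rows = [[Fraction(x) for x in v] for v in found + [cand]]
    rank = 0
    for col in range(len(cand)):
        piv = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        for r in range(rank + 1, len(rows)):
            f = rows[r][col] / rows[rank][col]
            rows[r] = [a - f * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank == len(found) + 1

found = []
# Hypothetical candidates for a 2D space; (2, 2) is a multiple of (1, 1).
for cand in [(1, 1), (2, 2), (1, 0)]:
    if is_lin_indep(found, cand):
        found.append(cand)
print(found)   # d independent hyper-planes for d = 2
```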
15
Tilable Band
Band of loops/schedules: a consecutive sequence of dimensions.
Tilable band: a band that satisfies the legality condition for a common set of dependences.
Pluto tiles the outermost tilable band.
16
So, which is better?
What are the θs and δs? What is the order?
[figure: the two candidate transformed iteration spaces, axes i and j]
17
Solving with ILP
Farkas' Lemma again; we had enough of Farkas last time.
There is a problem when the constraint is:
18
The “Practical” Choice
Given the schedule prototype: θS(x) = c1x1 + ... + cnxn + c0
Constrain the coefficients to: ci ≥ 0, and not all zero.
What does this mean? (Relaxed recently by a paper on PLUTO+.)
19
Example 1: Jacobi 1D
One example implementation, but it is rather contrived due to limitations in polyhedral compilers. The dependences are simple.

for t = 0 .. T
  for i = 1 .. N-1
S1: B[i] = foo(A[i], A[i-1], A[i+1]);
  for i = 1 .. N-1
S2: A[i] = foo(B[i], B[i-1], B[i+1]);

for t = 0 .. T
  for i = 1 .. N-1
S1: A[t,i] = foo(A[t-1,i], A[t-1,i-1], A[t-1,i+1]);
20
Example 1: Jacobi 1D
Prototype: θS1(t,i) = a1t + a2i + a0
S1[t,i] -> S1[t+1,i]:   δ1 = θS1(t+1,i) - θS1(t,i)   = a1(t+1)+a2i+a0 - (a1t+a2i+a0)     = a1
S1[t,i] -> S1[t+1,i+1]: δ2 = θS1(t+1,i+1) - θS1(t,i) = a1(t+1)+a2(i+1)+a0 - (a1t+a2i+a0) = a1+a2
S1[t,i] -> S1[t+1,i-1]: δ3 = θS1(t+1,i-1) - θS1(t,i) = a1(t+1)+a2(i-1)+a0 - (a1t+a2i+a0) = a1-a2
21
Example 1: Jacobi 1D
Same prototype, dependences, and δs as the previous slide; additionally, the next θ must be linearly independent with the previous.
22
Example 1: Jacobi 1D
We have a set of hyper-planes θS1(t,i) = (t,t+i)
[figure: original and skewed iteration spaces, axes t and i]
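As a sanity check on this result (an illustrative sketch; foo and the tile size are assumed stand-ins), tiles of the skewed space defined by (t, t+i) can be executed atomically in lexicographic order without changing the computed values:

```python
# Sketch: hyper-planes theta1 = t and theta2 = t+i are both legal tiling
# directions, so tiles of the skewed space can run atomically in
# lexicographic order. foo() is an assumed stand-in stencil.
T, N, S = 6, 10, 3   # S = tile size (hypothetical)

def foo(c, l, r):
    return (c + l + r) / 3.0

def jacobi(order):
    A = [[0.0] * N for _ in range(T + 1)]
    A[0] = [float(i) for i in range(N)]
    for t, i in order:
        A[t][i] = foo(A[t - 1][i], A[t - 1][i - 1], A[t - 1][i + 1])
    return A[T]

points = [(t, i) for t in range(1, T + 1) for i in range(1, N - 1)]
# Visit tiles of (theta1, theta2) = (t, t+i) lexicographically,
# points within a tile in (t, i) order.
tiled = sorted(points, key=lambda p: (p[0] // S, (p[0] + p[1]) // S, p))
print(jacobi(points) == jacobi(tiled))
```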
23
Example 2: 2mm
Simplified a bit:

for i = 0 .. N
  for j = 0 .. N
    for k = 0 .. N
S1:   C[i,j] += A[i,k] * B[k,j];

for i = 0 .. N
  for j = 0 .. N
    for k = 0 .. N
S2:   E[i,j] += C[i,k] * D[k,j];

Dependences:
S1[i,j,k] -> S1[i,j,k+1]
S2[i,j,k] -> S2[i,j,k+1]
S1[i,j,N] -> S2[i',j',k']: i=i' and j=k'
24
Example 2: 2mm (dim 1)
Prototype: θS1(i,j,k) = a1i+a2j+a3k+a0
θS2(x,y,z) = b1x+b2y+b3z+b0
Easy ones:
S1[i,j,k] -> S1[i,j,k+1] : a3=0
S2[x,y,z] -> S2[x,y,z+1] : b3=0
The interesting case is the inter-statement dependence:
S1[i,j,N] -> S2[x,y,z]: i=x and j=z
or S2[x,y,z] -> S1[x,z,N]
25
Example 2: 2mm (dim 1)
Prototype: θS1(i,j,k) = a1i+a2j+a3k+a0
θS2(x,y,z) = b1x+b2y+b3z+b0
The interesting case is the inter-statement dep.
S1[i,j,N] -> S2[x,y,z]: i=x and j=z (or S2[x,y,z] -> S1[x,z,N])
δ = θS2(x,y,z) - θS1(x,z,N)
  = (b1x+b2y+b3z+b0) - (a1x+a2z+a3N+a0)
  = (b1-a1)x + b2y + (b3-a2)z - a3N + (b0-a0)
With a3=b3=0: (b1-a1)x + b2y - a2z + (b0-a0)
26
Example 2: 2mm (dim 1)
Prototype: θS1(i,j,k) = a1i+a2j+a3k+a0
θS2(x,y,z) = b1x+b2y+b3z+b0
Minimize: (b1-a1)x + b2y - a2z
subject to: a1+a2+b1+b2 ≠ 0 (to avoid the trivial zero solution), a,b ≥ 0 (plus weakly satisfied)
We get: θS1(i,j,k) = i, θS2(x,y,z) = x
27
Example 2: 2mm (dim 2)
Prototype: θS1(i,j,k) = a1i+a2j+a3k+a0
θS2(x,y,z) = b1x+b2y+b3z+b0
Minimize: (b1-a1)x + b2y - a2z
subject to: linearly independent with the previous
We get: θS1(i,j,k) = j, θS2(x,y,z) = z
28
Example 2: 2mm (dim 3)
Prototype: θS1(i,j,k) = a1i+a2j+a3k+a0
θS2(x,y,z) = b1x+b2y+b3z+b0
Minimize: (b1-a1)x + b2y - a2z
subject to: linearly independent with the previous
Does θS1 = k and θS2 = y work? (a3=1, b2=1, rest 0)
Check against: S1[i,j,N] -> S2[x,y,z]: i=x and j=z
29
Example 2: 2mm (dim 3)
Same as the previous slide, now also considering the other direction: S2[x,y,z] -> S1[x,z,N]
30
Example 2: 2mm (dim 3)
Prototype: θS1(i,j,k) = a1i+a2j+a3k+a0
θS2(x,y,z) = b1x+b2y+b3z+b0
Minimize: (b1-a1)x + b2y - a2z
subject to: linearly independent with the previous
We have to split here: θS1 = 0 and θS2 = 1
S1[i,j,N] -> S2[x,y,z]: i=x and j=z
or S2[x,y,z] -> S1[x,z,N]
31
Example 2: 2mm (dim 4)
Proceed to the 4th dimension, because the 3rd dimension is only for statement ordering. Now solve the problem independently for each statement.
Case S1: S1[i,j,k] -> S1[i,j,k+1], linearly independent with [i] and [j]
Case S2: S2[x,y,z] -> S2[x,y,z+1], linearly independent with [x] and [z]
We get [k] and [y].
32
Example 2: 2mm
Finally, we have a set of hyper-planes θS1(i,j,k) = (i,j,0,k)
θS2(x,y,z) = (x,z,1,y)
for i = 0 .. N
  for j = 0 .. N
    for k = 0 .. N
S1:   C[i,j] += A[i,k] * B[k,j];

for i = 0 .. N
  for j = 0 .. N
    for k = 0 .. N
S2:   E[i,j] += C[i,k] * D[k,j];

Tilable band (the i and j loops of the transformed code):
for i = 0 .. N
  for j = 0 .. N {
    for k = 0 .. N
S1:   C[i,j] += A[i,k] * B[k,j];
    for k = 0 .. N
S2:   E[i,k] += C[i,j] * D[j,k];
  }
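The final schedule can be checked with a small sketch (illustrative; the random inputs are an assumption) that runs both the original and the transformed loop nests and compares the results:

```python
# Sketch: verify that the fused/permuted 2mm nest from the slide computes
# the same E as the original two nests (array names as in the slide).
import random
random.seed(0)
N = 5
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]
D = [[random.random() for _ in range(N)] for _ in range(N)]

def original():
    C = [[0.0] * N for _ in range(N)]
    E = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * B[k][j]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                E[i][j] += C[i][k] * D[k][j]
    return E

def transformed():
    # theta_S1 = (i,j,0,k), theta_S2 = (x,z,1,y): S1 and S2 share the
    # outer (i, j) loops; C[i,j] is consumed right after it is produced.
    C = [[0.0] * N for _ in range(N)]
    E = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * B[k][j]
            for k in range(N):
                E[i][k] += C[i][j] * D[j][k]
    return E

print(original() == transformed())
```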
33
Example 2: 2mm
Output of Pluto
34
Summary of Pluto
Paper in 2008; huge impact: 350+ citations already
Works very well as the default strategy
But, it is far from perfect!