cr18: advanced compilers l05: scheduling for locality tomofumi yuki 1

1

CR18: Advanced Compilers

L05: Scheduling for Locality

Tomofumi Yuki

2

Recap

Last time
  we saw scheduling techniques
  max. parallelism != best performance

This time
  how can we do better?

3

Pluto Strategy

We want
  only 1D parallelism
  coarse-grained (outer) parallelism
  good data locality

We want Tiling
  wave-front parallelism is guaranteed
  each tile can be executed atomically
  good for sequential performance

4

Intuition of Pluto Algorithm

Skew and Tile

[Figure: two iteration-space diagrams with axes i and j]

5

Tiling Hyper-Planes

Another name for 1D schedule θ
  set of θs define tiling

Defines the transform
  (i,j->i+j,i) corresponds to the skew in prev. slide

[Figure: iteration space with axes i and j, showing the hyper-planes θ1=i+j and θ2=i]

6

Legality of Tiling

Each tiling hyper-plane must satisfy:

  θT(t) - θS(s) ≥ 0   for every dependence from S[s] to T[t]

What is the difference from the causality condition?
  note this is about an affine transform, not a schedule

Must be weakly satisfied for each dimension!

7

What does the condition mean? 1. Fully Permutable

recall θs define the transform
  all statements mapped to a common d-D space
  let i1, ..., in be the new indices

Weakly satisfied in all dimensions
  i1≥i’1, ..., in≥i’n for all dependences

Reformulation of the fully permutable condition
  works for scheduling imperfect loop nests
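
One way to see why weak satisfaction in every dimension gives full permutability is to check it by brute force on a small domain. The sketch below is only an illustration: it reuses the dependence (i,j->i+1,j-1) from the cost-function slides, while the domain size N and the helper names (delta, weakly_satisfied, order_respects) are assumptions made for this example.

```python
from itertools import product

def delta(theta, src, dst):
    """theta(dst) - theta(src) for a linear hyper-plane given as a coefficient tuple."""
    return sum(c * (d - s) for c, s, d in zip(theta, src, dst))

def weakly_satisfied(thetas, deps):
    """True if every hyper-plane has a non-negative distance on every dependence instance."""
    return all(delta(th, s, d) >= 0 for th in thetas for (s, d) in deps)

def order_respects(thetas, perm, deps):
    """True if running the transformed loops in the order `perm` executes every
    dependence source strictly before its target."""
    def time(point):
        new = [sum(c * x for c, x in zip(th, point)) for th in thetas]
        return tuple(new[k] for k in perm)
    return all(time(s) < time(d) for (s, d) in deps)

N = 4  # assumed domain size, small enough to enumerate
# all instances of the dependence (i,j -> i+1,j-1) inside an N x N domain
deps = [((i, j), (i + 1, j - 1)) for i, j in product(range(N - 1), range(1, N))]

skewed = [(1, 1), (1, 0)]    # theta1 = i+j, theta2 = i
identity = [(1, 0), (0, 1)]  # theta1 = i,   theta2 = j

for name, thetas in (("skewed", skewed), ("identity", identity)):
    print(name,
          "weakly satisfied:", weakly_satisfied(thetas, deps),
          "order (1,2) ok:", order_respects(thetas, (0, 1), deps),
          "order (2,1) ok:", order_respects(thetas, (1, 0), deps))
```

With the skewed pair both loop orders are valid, whereas the untransformed (i, j) space only tolerates the original order, which is why it cannot be tiled directly.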

8

What does the condition mean? 2. All statements are fused

somewhat implied by full permutability
  what are possible dependences
    from S1 to S2
    from S2 to S1

Exception when S1 does not use values of S2

for i
  for j
    S1
  for j
    S2

9

Selecting Tiling Hyper-planes

Which is better?

[Figure: two iteration-space diagrams with axes i and j]

10

Cost Functions in Pluto

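Formulated as:

  δe(s,t) = θT(t) - θS(s)   for a dependence e from S[s] to T[t]

What does this capture?
  dep: (i,j->i+1,j-1)
  δ1 = (i+1+j-1) – (i+j) = 0
  δ2 = (i+1) – (i) = 1

[Figure: iteration space with axes i and j, showing the hyper-planes θ1=i+j and θ2=i]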

11

Cost Functions in Pluto

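Formulated as:

  δe(s,t) = θT(t) - θS(s)   for a dependence e from S[s] to T[t]

What does this capture?
  dep: (i,j->i+1,j-1)
  δ1 = (i+1+j-1) – (i+j) = 0
  δ2 = (i+1-(j-1)) – (i-j) = 2

[Figure: iteration space with axes i and j, showing the hyper-planes θ1=i+j and θ2=i-j]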
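
The two δ computations on these slides can be reproduced mechanically. The following is a minimal sketch, assuming a small set of sample points and the helper names delta and distances, none of which come from the slides; it evaluates θ(target) - θ(source) for the dependence (i,j->i+1,j-1) under both choices of θ2.

```python
def delta(theta, src, dst):
    """Dependence distance along one hyper-plane: theta(target) - theta(source)."""
    return theta(*dst) - theta(*src)

def dep(i, j):
    """Target of the dependence (i,j -> i+1,j-1)."""
    return (i + 1, j - 1)

choice_a = {"theta1": lambda i, j: i + j, "theta2": lambda i, j: i}
choice_b = {"theta1": lambda i, j: i + j, "theta2": lambda i, j: i - j}

def distances(choice, points):
    """Set of distinct delta values seen for each hyper-plane over the sample points."""
    return {name: {delta(th, p, dep(*p)) for p in points}
            for name, th in choice.items()}

points = [(i, j) for i in range(5) for j in range(1, 5)]  # assumed sample domain
da = distances(choice_a, points)
db = distances(choice_b, points)
print(da)  # {'theta1': {0}, 'theta2': {1}}
print(db)  # {'theta1': {0}, 'theta2': {2}}
```

Both choices are legal and give a constant distance, but θ2=i keeps the distance at 1 while θ2=i-j doubles it, which is what the cost function is meant to distinguish.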

12

Reuse Distance

When the θ corresponds to sequential loop
  δ represents #iterations in the loop (corresponding to θ) until reuse via e

Two dependences
  (i,j->i+1,j)
  (i,0->i,j) : j>0
  what are the δs?

[Figure: iteration space with axes i and j, with θ1=i and θ2=j]

13

Communication Volume

When the θ corresponds to parallel loop
  Let si, sj be the tile sizes
  Horizontal dependence
    sj values to the horizontal neighbor
  Vertical dependence
    si values to N/sj tiles
  Constant is better
    0 is even better!

[Figure: tiled iteration space with axes i and j]

14

Iterative Search

We need d hyper-planes for a d-D space
  note that we are not looking for parallelism
  parallelism comes with the tile wave-fronts

Approach:
  find one θ for each statement
  constrain the space to be linearly independent of the θs already found
  repeat
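
The find-constrain-repeat loop can be sketched with a brute-force search standing in for the ILP. Everything below is an assumption made for illustration: the single-statement setting, the coefficient range 0..max_coeff, the cost (largest δ, then sum of coefficients), and the function names. The input is the set of three uniform dependence distance vectors of the Jacobi 1D example that appears later in the deck.

```python
from fractions import Fraction
from itertools import product

def rank(rows):
    """Rank of a small integer matrix, by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in r] for r in rows]
    r = 0
    for c in range(len(m[0])):
        pivot = next((k for k in range(r, len(m)) if m[k][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for k in range(r + 1, len(m)):
            f = m[k][c] / m[r][c]
            m[k] = [a - f * b for a, b in zip(m[k], m[r])]
        r += 1
    return r

def find_hyperplanes(dist_vectors, dim, max_coeff=2):
    """Greedy search: one hyper-plane per round, legal for all dependences and
    linearly independent of the hyper-planes found in earlier rounds."""
    found = []
    while len(found) < dim:
        best = None
        # only non-negative coefficients are enumerated (assumed search range)
        for theta in product(range(max_coeff + 1), repeat=dim):
            deltas = [sum(c * d for c, d in zip(theta, v)) for v in dist_vectors]
            if min(deltas) < 0:
                continue  # legality: every dependence weakly satisfied
            if rank(found + [list(theta)]) != len(found) + 1:
                continue  # must add a new independent direction (also rejects 0)
            cost = (max(deltas), sum(theta))
            if best is None or cost < best[0]:
                best = (cost, theta)
        if best is None:
            break  # nothing legal left in the searched range
        found.append(list(best[1]))
    return [tuple(t) for t in found]

jacobi = [(1, 0), (1, 1), (1, -1)]  # distance vectors in (t, i)
print(find_hyperplanes(jacobi, dim=2))  # [(1, 0), (1, 1)]
```

For these three dependences the search returns the coefficient rows of t and t+i. For the single dependence (1,-1) of the cost-function slides it returns (1,1) then (1,0), that is i+j and i.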

15

Tilable Band

Band of Loops/Schedules
  consecutive sequence of dimensions

Tilable band
  a band that satisfies the legality condition for a common set of dependences

PLuTo tiles the outermost tilable band

16

So, which is better?

What are the θs and δs?
  what is the order?

[Figure: two iteration-space diagrams with axes i and j]

17

Solving with ILP

Farkas Lemma again
  we had enough of Farkas last time

There is a problem when the constraint is:

18

The “Practical” Choice

Given the schedule prototype:

Constrain the coefficients to:

and

What does this mean?
  Relaxed recently by a paper on PLUTO+

19

Example 1: Jacobi 1D

One example implementation:

for t = 0 .. T
  for i = 1 .. N-1
S1:   B[i] = foo(A[i], A[i-1], A[i+1]);
  for i = 1 .. N-1
S2:   A[i] = foo(B[i], B[i-1], B[i+1]);

but it is rather contrived due to limitations in polyhedral compilers

The dependences are simple

for t = 0 .. T
  for i = 1 .. N-1
S1:   A[t,i] = foo(A[t-1,i], A[t-1,i-1], A[t-1,i+1]);

20


Prototype: θS1(t,i) = a1t+a2i+a0

S1[t,i] -> S1[t+1,i]
  δ1 = θS1(t+1,i)-θS1(t,i) = a1(t+1)+a2i+a0-(a1t+a2i+a0) = a1

S1[t,i] -> S1[t+1,i+1]
  δ2 = θS1(t+1,i+1)-θS1(t,i) = a1(t+1)+a2(i+1)+a0-(a1t+a2i+a0) = a1+a2

S1[t,i] -> S1[t+1,i-1]
  δ3 = θS1(t+1,i-1)-θS1(t,i) = a1(t+1)+a2(i-1)+a0-(a1t+a2i+a0) = a1-a2

21


Prototype: θS1(t,i) = a1t+a2i+a0

S1[t,i] -> S1[t+1,i]
  δ1 = θS1(t+1,i)-θS1(t,i) = a1(t+1)+a2i+a0-(a1t+a2i+a0) = a1

S1[t,i] -> S1[t+1,i+1]
  δ2 = θS1(t+1,i+1)-θS1(t,i) = a1(t+1)+a2(i+1)+a0-(a1t+a2i+a0) = a1+a2

S1[t,i] -> S1[t+1,i-1]
  δ3 = θS1(t+1,i-1)-θS1(t,i) = a1(t+1)+a2(i-1)+a0-(a1t+a2i+a0) = a1-a2

linearly independent with the previous

22


We have a set of hyper-planes
  θS1(t,i) = (t,t+i)

[Figure: two iteration-space diagrams with axes t and i]
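
To check that the space (t, t+i) really can be tiled, the sketch below executes the single-assignment Jacobi 1D tile by tile in the transformed coordinates and compares the result with the plain loop nest. The body of foo, the initial values, the boundary handling, the tile sizes and all function names are assumptions for this example; the read-before-write assertion is only a debugging aid that flags an illegal execution order.

```python
def foo(a, b, c):
    # stand-in for the slide's foo: a plain 3-point average (assumption)
    return (a + b + c) / 3.0

def init(T, N):
    # row t=0 and the two boundary columns are known; None marks "not computed yet"
    A = [[None] * (N + 1) for _ in range(T + 1)]
    A[0] = [float(i % 5) for i in range(N + 1)]
    for t in range(1, T + 1):
        A[t][0], A[t][N] = A[0][0], A[0][N]
    return A

def update(A, t, i):
    args = (A[t - 1][i], A[t - 1][i - 1], A[t - 1][i + 1])
    assert None not in args, "read before write: this execution order is illegal"
    A[t][i] = foo(*args)

def reference(T, N):
    A = init(T, N)
    for t in range(1, T + 1):
        for i in range(1, N):
            update(A, t, i)
    return A

def tiled(T, N, sp, sq, skew=1):
    # tile the space (p, q) = (t, skew*t + i); skew=1 is the (t, t+i) transform
    A = init(T, N)
    for tp in range(0, T + 1, sp):                 # tile origins along p
        for tq in range(0, skew * T + N, sq):      # tile origins along q
            for p in range(max(tp, 1), min(tp + sp, T + 1)):
                lo, hi = skew * p + 1, skew * p + N   # q range for this p, hi exclusive
                for q in range(max(tq, lo), min(tq + sq, hi)):
                    update(A, p, q - skew * p)
    return A

T, N = 6, 10
print(tiled(T, N, sp=4, sq=4) == reference(T, N))  # True
```

Calling tiled(T, N, 4, 4, skew=0) tiles the original (t, i) space instead and trips the assertion, because the dependence with distance (1,-1) points backwards along i.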

23

Example 2: 2mm

Simplified a bit

for i = 0 .. N
  for j = 0 .. N
    for k = 0 .. N
S1:   C[i,j] += A[i,k] * B[k,j];

for i = 0 .. N
  for j = 0 .. N
    for k = 0 .. N
S2:   E[i,j] += C[i,k] * D[k,j];

S1[i,j,k] -> S1[i,j,k+1]

S2[i,j,k] -> S2[i,j,k+1]

S1[i,j,N] -> S2[i’,j’,k’]: i=i’ and j=k’

24

Example 2: 2mm (dim 1)

Prototype: θS1(i,j,k) = a1i+a2j+a3k+a0
           θS2(x,y,z) = b1x+b2y+b3z+b0

Easy ones:
  S1[i,j,k] -> S1[i,j,k+1]   a3=0
  S2[x,y,z] -> S2[x,y,z+1]   b3=0

Interesting case is the inter-statement dep.
  S1[i,j,N] -> S2[x,y,z]: i=x and j=z
  S2[i,j,k] -> S1[i,k,N], or S2[x,y,z] -> S1[x,z,N]

25




Interesting case is the inter-statement dep.
  S2[x,y,z] -> S1[x,z,N]:

  b1x+b2y+b3z+b0 - (a1x+a2z+a3N+a0)
  = (b1-a1)x+b2y+(b3-a2)z+b0-a3N-a0
  = (b1-a1)x+b2y-a2z        with a3=b3=0, constants a0, b0 left out

26




Minimize: (b1-a1)x+b2y-a2z
  subject to a1+a2+b1+b2≠0
             a,b≥0
             (plus weakly satisfied)

We get
  θS1(i,j,k) = i
  θS2(x,y,z) = x

27




Minimize: (b1-a1)x+b2y+(b3-a2)z+b0-a3N-a0
  subject to
    linearly independent with the previous

We get
  θS1(i,j,k) = j
  θS2(x,y,z) = z

29





linearly independent with the previous
  does θS1=k and θS2=y work?
  a3=1, b2=1, rest 0

For the inter-statement dep. S2[x,y,z] -> S1[x,z,N]:
  (b1-a1)x+b2y+(b3-a2)z+b0-a3N-a0 = y-N, which is negative whenever y<N

30





linearly independent with the previous
  we have to split here
  θS1=0 and θS2=1

For the inter-statement dep. S2[x,y,z] -> S1[x,z,N]:
  θS2 - θS1 = 1
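
The candidate pairs considered for each dimension can be checked numerically against the inter-statement dependence S1[x,z,N] -> S2[x,y,z]. This is a rough sketch under assumptions: N is fixed to a small value, all instances are enumerated rather than solved symbolically, and the names inter_statement_deltas, candidates and report are made up for the example.

```python
from itertools import product

N = 4  # assumed problem size; indices range over 0..N as on the slides

def inter_statement_deltas(theta_s1, theta_s2):
    """Sorted distinct distances theta_S2(x,y,z) - theta_S1(x,z,N)
    over all instances of the dependence S1[x,z,N] -> S2[x,y,z]."""
    return sorted({theta_s2(x, y, z) - theta_s1(x, z, N)
                   for x, y, z in product(range(N + 1), repeat=3)})

# (theta_S1 over (i,j,k), theta_S2 over (x,y,z)) for each candidate
candidates = {
    "(i, x)": (lambda i, j, k: i, lambda x, y, z: x),
    "(j, z)": (lambda i, j, k: j, lambda x, y, z: z),
    "(k, y)": (lambda i, j, k: k, lambda x, y, z: y),
    "(0, 1)": (lambda i, j, k: 0, lambda x, y, z: 1),
}

report = {name: inter_statement_deltas(t1, t2)
          for name, (t1, t2) in candidates.items()}
for name, ds in report.items():
    # a candidate is legal for this dependence if no distance is negative
    print(name, "legal" if ds[0] >= 0 else "illegal", ds)
```

The pairs (i, x) and (j, z) give a constant distance of 0, (k, y) produces negative distances and is rejected, and the scalar pair (0, 1) satisfies the dependence with distance 1, which matches the decision to split at the third dimension.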

31


Proceed to the 4th dimension
  because the 3rd dimension is only for statement ordering

Now solve the problem independently for each statement
  Case S1: S1[i,j,k] -> S1[i,j,k+1]
    linearly independent with [i] and [j]
  Case S2: S2[x,y,z] -> S2[x,y,z+1]
    linearly independent with [x] and [z]
  We get [k] and [y]

32

Example 2: 2mm

Finally, we have a set of hyper-planes
  θS1(i,j,k) = (i,j,0,k)
  θS2(i,j,k) = (i,k,1,j)

Tilable Band

for i = 0 .. N
  for j = 0 .. N
    for k = 0 .. N
S1:   C[i,j] += A[i,k] * B[k,j];

for i = 0 .. N
  for j = 0 .. N
    for k = 0 .. N
S2:   E[i,j] += C[i,k] * D[k,j];

for i = 0 .. N
  for j = 0 .. N {
    for k = 0 .. N
S1:   C[i,j] += A[i,k] * B[k,j];
    for k = 0 .. N
S2:   E[i,k] += C[i,j] * D[j,k];
  }
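
As a sanity check on the fused loop nest, the sketch below runs both versions on small random integer matrices and compares the resulting E. The matrix size, the random inputs, the zero initialisation of C and E, and the function names are assumptions for this example; integers are used so that the comparison is exact.

```python
import random

def two_mm_original(A, B, D, n):
    C = [[0] * n for _ in range(n)]
    E = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]   # S1
    for i in range(n):
        for j in range(n):
            for k in range(n):
                E[i][j] += C[i][k] * D[k][j]   # S2
    return E

def two_mm_fused(A, B, D, n):
    # loop order given by thetaS1 = (i,j,0,k) and thetaS2 = (i,k,1,j)
    C = [[0] * n for _ in range(n)]
    E = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]   # S1: C[i][j] is complete after this loop
            for k in range(n):
                E[i][k] += C[i][j] * D[j][k]   # S2: consumes the finished C[i][j]
    return E

random.seed(0)
n = 5
A, B, D = ([[random.randint(-3, 3) for _ in range(n)] for _ in range(n)]
           for _ in range(3))
print(two_mm_fused(A, B, D, n) == two_mm_original(A, B, D, n))  # True
```

Each C[i][j] is consumed right after it is produced, which is the locality benefit of sharing the outer two loops between S1 and S2.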

33

Example 2: 2mm

Output of Pluto

34

Summary of Pluto

Paper in 2008
  huge impact: 350+ citations already

Works very well as the default strategy

But, it is far from perfect!
