CR18: Advanced Compilers L05: Scheduling for Locality Tomofumi Yuki


Upload: brianne-berry

Post on 21-Jan-2016



CR18: Advanced Compilers

L05: Scheduling for Locality

Tomofumi Yuki


Recap

Last time
  we saw scheduling techniques
  max. parallelism != best performance

This time
  how can we do better?


Pluto Strategy

We want only 1D parallelism
  coarse-grained (outer) parallelism
  good data locality

We want tiling
  wave-front parallelism is guaranteed
  each tile can be executed atomically
  good for sequential performance


Intuition of Pluto Algorithm

Skew and Tile

[Figure: two iteration-space diagrams with axes i and j]


Tiling Hyper-Planes

Another name for a 1D schedule θ
  a set of θs defines a tiling

The set defines the transform
  (i,j -> i+j,i) corresponds to the skew in the previous slide

[Figure: (i, j) iteration space with hyper-planes θ1 = i+j and θ2 = i]


Legality of Tiling

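Each tiling hyper-plane must satisfy, for every dependence from a source instance p to a target instance q:

  θ(q) - θ(p) ≥ 0

What is the difference from the causality condition?
  note this is about an affine transform, not a schedule

Must be weakly satisfied for each dimension!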
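A minimal sketch of this check, assuming uniform dependences given as constant distance vectors (real polyhedral dependences are affine relations, so this is a simplification). Each hyper-plane is a coefficient vector, and "weakly satisfied" means θ(target) - θ(source) ≥ 0. The names `delta` and `is_legal_tiling` and the example vectors are illustrative, not from the lecture.

```python
def delta(theta, dep):
    """theta(target) - theta(source) for a uniform dependence vector."""
    return sum(c * d for c, d in zip(theta, dep))


def is_legal_tiling(hyperplanes, deps):
    """True if every hyper-plane weakly satisfies every dependence."""
    return all(delta(theta, dep) >= 0 for theta in hyperplanes for dep in deps)


# Hypothetical dependences of a 1D stencil over (t, i).
deps = [(1, 0), (1, 1), (1, -1)]

print(is_legal_tiling([(1, 0), (0, 1)], deps))  # hyper-planes t and i
print(is_legal_tiling([(1, 0), (1, 1)], deps))  # hyper-planes t and t+i
```

The first set fails because the hyper-plane i gives a negative difference for the dependence (1, -1); skewing to t+i repairs that.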


What does the condition mean?

1. Fully Permutable
  recall that the θs define the transform
  all statements are mapped to a common d-D space
  let i1, ..., in be the new indices

Weakly satisfied in all dimensions
  i1≥i’1, ..., in≥i’n for all dependences

Reformulation of the fully permutable condition
  works for scheduling imperfect loop nests


What does the condition mean?

2. All statements are fused
  somewhat implied by full permutability
  what are the possible dependences?
    from S1 to S2
    from S2 to S1

Exception: when S1 does not use values of S2

  for i
    for j
      S1
    for j
      S2


Selecting Tiling Hyper-planes

Which is better?

[Figure: two iteration-space diagrams with axes i and j]


Cost Functions in Pluto

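Formulated as:

  δe(p,q) = θ(q) - θ(p) for a dependence e from instance p to instance q

What does this capture?
  dep: (i,j -> i+1,j-1)
  δ1 = (i+1+j-1) - (i+j) = 0
  δ2 = (i+1) - (i) = 1

[Figure: (i, j) iteration space with hyper-planes θ1 = i+j and θ2 = i]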


Cost Functions in Pluto

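Formulated as:

  δe(p,q) = θ(q) - θ(p) for a dependence e from instance p to instance q

What does this capture?
  dep: (i,j -> i+1,j-1)
  δ1 = (i+1+j-1) - (i+j) = 0
  δ2 = (i+1-(j-1)) - (i-j) = 2

[Figure: (i, j) iteration space with hyper-planes θ1 = i+j and θ2 = i-j]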
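The two δ computations can be reproduced numerically. The sketch below evaluates θ(target) - θ(source) for the dependence (i,j -> i+1,j-1) at one sample point; the helper name `delta` and the sample point (3, 5) are arbitrary choices for illustration.

```python
def delta(theta, src, dst):
    """delta = theta(dst) - theta(src) for one hyper-plane theta."""
    return theta(*dst) - theta(*src)


# Dependence from the slides: (i, j) -> (i+1, j-1), at a sample point.
src = (3, 5)
dst = (src[0] + 1, src[1] - 1)

skew_a = [lambda i, j: i + j, lambda i, j: i]      # theta1 = i+j, theta2 = i
skew_b = [lambda i, j: i + j, lambda i, j: i - j]  # theta1 = i+j, theta2 = i-j

deltas_a = [delta(th, src, dst) for th in skew_a]
deltas_b = [delta(th, src, dst) for th in skew_b]
print(deltas_a, deltas_b)  # [0, 1] [0, 2]
```

Because the dependence is uniform, the result does not depend on the sample point. The second choice of θ2 doubles the distance along that dimension.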


Reuse Distance

When the θ corresponds to a sequential loop

Two dependences
  (i,j -> i+1,j)
  (i,0 -> i,j) : j>0
  what are the δs?

δ represents the number of iterations of the loop (corresponding to θ) until reuse via e

[Figure: (i, j) iteration space with θ1 = i and θ2 = j]
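One way to work out the question, taking δ as θ(target) - θ(source) with θ1 = i and θ2 = j (the labels e1 and e2 for the two dependences are introduced here for convenience):

```latex
\begin{align*}
e_1 &: (i,j) \to (i+1,j) &
  \delta_1 &= (i+1) - i = 1, &
  \delta_2 &= j - j = 0 \\
e_2 &: (i,0) \to (i,j),\ j>0 &
  \delta_1 &= i - i = 0, &
  \delta_2 &= j - 0 = j
\end{align*}
```

Under this reading, the value carried by e1 is reused after one iteration of the i loop, while the distance of e2 along j is not constant: it grows with j.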


Communication Volume

When the θ corresponds to a parallel loop

Let si, sj be the tile sizes

Horizontal dependence
  sj values to the horizontal neighbor

Vertical dependence
  si values to N/sj tiles

Constant is better
  0 is even better!

[Figure: iteration space with axes i and j]


Iterative Search

We need d hyper-planes for a d-D space
  note that we are not looking for parallelism
  parallelism comes with the tile wave-fronts

Approach:
  find one θ for each statement
  constrain the space to be linearly independent of the θs already found
  repeat
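The loop above can be sketched with a brute-force search standing in for the ILP. This is not how Pluto solves the problem: it assumes a single statement, uniform dependence vectors, non-negative coefficients bounded by a hypothetical `max_coeff`, and uses the largest δ as the cost. The names `rank` and `find_hyperplanes` are illustrative.

```python
from fractions import Fraction
from itertools import product


def rank(rows):
    """Rank of a small integer matrix via Gaussian elimination over rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(m[0]) if m else 0):
        pivot = next((k for k in range(r, len(m)) if m[k][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for k in range(r + 1, len(m)):
            f = m[k][col] / m[r][col]
            m[k] = [a - f * b for a, b in zip(m[k], m[r])]
        r += 1
    return r


def find_hyperplanes(deps, dim, max_coeff=2):
    """Pick, round by round, the cheapest legal theta independent of earlier ones."""
    def deltas(theta):
        return [sum(c * d for c, d in zip(theta, dep)) for dep in deps]

    # Candidates: non-negative coefficients, not all zero (assumption).
    candidates = [c for c in product(range(max_coeff + 1), repeat=dim) if any(c)]
    legal = [c for c in candidates if all(d >= 0 for d in deltas(c))]
    # Cost: largest delta; ties broken by smaller coefficients.
    legal.sort(key=lambda c: (max(deltas(c), default=0), sum(c), c))

    found = []
    for _ in range(dim):
        for c in legal:
            if rank(found + [list(c)]) == len(found) + 1:
                found.append(list(c))
                break
        else:
            break  # no further legal, independent hyper-plane in the search box
    return found


# Hypothetical uniform dependences of a 1D stencil over (t, i).
print(find_hyperplanes([(1, 0), (1, 1), (1, -1)], dim=2))  # [[1, 0], [1, 1]]
```

For these dependences the search returns the hyper-planes t and t+i. The real algorithm handles several statements at once and affine (non-uniform) dependences, which is where Farkas' lemma and the ILP come in.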


Tilable Band

Band of loops/schedules
  consecutive sequence of dimensions

Tilable band
  a band that satisfies the legality condition for a common set of dependences

PLuTo tiles the outermost tilable band


So, which is better?

What are the θs and δs?
  what is the order?

[Figure: two iteration-space diagrams with axes i and j]


Solving with ILP

Farkas Lemma again
  we had enough of Farkas last time

There is a problem when the constraint is:


The “Practical” Choice

Given the schedule prototype:

Constrain the coefficients to:

What does this mean?
  relaxed recently by a paper on PLUTO+


Example 1: Jacobi 1D

One example implementation:

  for t = 0 .. T
    for i = 1 .. N-1
      S1: B[i] = foo(A[i], A[i-1], A[i+1]);
    for i = 1 .. N-1
      S2: A[i] = foo(B[i], B[i-1], B[i+1]);

but it is rather contrived due to limitations in polyhedral compilers

The dependences are simple

  for t = 0 .. T
    for i = 1 .. N-1
      S1: A[t,i] = foo(A[t-1,i], A[t-1,i-1], A[t-1,i+1]);


Example 1: Jacobi 1D

Prototype: θS1(t,i) = a1t+a2i+a0

S1[t,i] -> S1[t+1,i]
  δ1 = θS1(t+1,i) - θS1(t,i)
     = a1(t+1)+a2i+a0 - (a1t+a2i+a0) = a1

S1[t,i] -> S1[t+1,i+1]
  δ2 = θS1(t+1,i+1) - θS1(t,i)
     = a1(t+1)+a2(i+1)+a0 - (a1t+a2i+a0) = a1+a2

S1[t,i] -> S1[t+1,i-1]
  δ3 = θS1(t+1,i-1) - θS1(t,i)
     = a1(t+1)+a2(i-1)+a0 - (a1t+a2i+a0) = a1-a2


Example 1: Jacobi 1D

Prototype: θS1(t,i) = a1t+a2i+a0

S1[t,i] -> S1[t+1,i]
  δ1 = θS1(t+1,i) - θS1(t,i)
     = a1(t+1)+a2i+a0 - (a1t+a2i+a0) = a1

S1[t,i] -> S1[t+1,i+1]
  δ2 = θS1(t+1,i+1) - θS1(t,i)
     = a1(t+1)+a2(i+1)+a0 - (a1t+a2i+a0) = a1+a2

S1[t,i] -> S1[t+1,i-1]
  δ3 = θS1(t+1,i-1) - θS1(t,i)
     = a1(t+1)+a2(i-1)+a0 - (a1t+a2i+a0) = a1-a2

linearly independent with the previous


Example 1: Jacobi 1D

We have a set of hyper-planes
  θS1(t,i) = (t,t+i)

[Figure: two iteration-space diagrams with axes t and i]
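To see that the hyper-planes (t, t+i) allow rectangular tiles to run atomically, here is a small experiment. It is a sketch under several assumptions not in the slides: `foo` is replaced by a three-point average, t starts at 1 so that row 0 holds the input, boundary cells are copied from the input, and the tile sizes `ts` and `ss` are arbitrary. Unwritten cells hold `None`, so any read that violates a dependence raises a `TypeError`.

```python
def jacobi_reference(T, N, init):
    """Untiled single-assignment Jacobi: A[t][i] depends on row t-1."""
    A = [list(init)] + [[None] * (N + 1) for _ in range(T)]
    for t in range(1, T + 1):
        A[t][0], A[t][N] = init[0], init[N]          # fixed boundary values
        for i in range(1, N):
            A[t][i] = (A[t-1][i-1] + A[t-1][i] + A[t-1][i+1]) / 3
    return A


def jacobi_tiled(T, N, init, ts=2, ss=3):
    """Same computation, tiled in the skewed space (t, s) with s = t + i."""
    A = [list(init)] + [[None] * (N + 1) for _ in range(T)]
    for t in range(1, T + 1):
        A[t][0], A[t][N] = init[0], init[N]
    for tt in range(1, T + 1, ts):                   # tiles along theta1 = t
        for s0 in range(0, T + N + 1, ss):           # tiles along theta2 = t + i
            for t in range(tt, min(tt + ts, T + 1)):
                for s in range(s0, s0 + ss):
                    i = s - t                        # undo the skew
                    if 1 <= i <= N - 1:
                        # A None operand here would mean an illegal order.
                        A[t][i] = (A[t-1][i-1] + A[t-1][i] + A[t-1][i+1]) / 3
    return A


init = [float(x * x % 7) for x in range(9)]          # N = 8, arbitrary data
assert jacobi_tiled(4, 8, init) == jacobi_reference(4, 8, init)
```

Each tile is executed completely before the next one starts, and the results match the untiled order exactly because every cell is computed from the same three operands.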


Example 2: 2mm

Simplified a bit

  for i = 0 .. N
    for j = 0 .. N
      for k = 0 .. N
        S1: C[i,j] += A[i,k] * B[k,j];

  for i = 0 .. N
    for j = 0 .. N
      for k = 0 .. N
        S2: E[i,j] += C[i,k] * D[k,j];

S1[i,j,k] -> S1[i,j,k+1]

S2[i,j,k] -> S2[i,j,k+1]

S1[i,j,N] -> S2[i’,j’,k’]: i=i’ and j=k’


Example 2: 2mm (dim 1)

Prototype:
  θS1(i,j,k) = a1i+a2j+a3k+a0
  θS2(x,y,z) = b1x+b2y+b3z+b0

Easy ones:
  S1[i,j,k] -> S1[i,j,k+1]    a3=0
  S2[x,y,z] -> S2[x,y,z+1]    b3=0

Interesting case is the inter-statement dep.
  S1[i,j,N] -> S2[x,y,z]: i=x and j=z
  or, written from consumer to producer: S2[i,j,k] -> S1[i,k,N], i.e. S2[x,y,z] -> S1[x,z,N]


Example 2: 2mm (dim 1)

Prototype:
  θS1(i,j,k) = a1i+a2j+a3k+a0
  θS2(x,y,z) = b1x+b2y+b3z+b0

Interesting case is the inter-statement dep.
  S1[i,j,N] -> S2[x,y,z]: i=x and j=z
  or S2[x,y,z] -> S1[x,z,N]

δ = θS2(x,y,z) - θS1(x,z,N)
  = b1x+b2y+b3z+b0 - (a1x+a2z+a3N+a0)
  = (b1-a1)x+b2y+(b3-a2)z+b0-a3N-a0
  = (b1-a1)x+b2y-a2z    (a3=b3=0, constants b0-a0 omitted)


Example 2: 2mm (dim 1)

Prototype:
  θS1(i,j,k) = a1i+a2j+a3k+a0
  θS2(x,y,z) = b1x+b2y+b3z+b0

Minimize: (b1-a1)x+b2y-a2z
  subject to a1+a2+b1+b2≠0 (excludes the trivial all-zero solution)
  a,b≥0 (plus weakly satisfied)

We get
  θS1(i,j,k) = i
  θS2(x,y,z) = x


Example 2: 2mm (dim 2)

Prototype:
  θS1(i,j,k) = a1i+a2j+a3k+a0
  θS2(x,y,z) = b1x+b2y+b3z+b0

Minimize: (b1-a1)x+b2y+(b3-a2)z+b0-a3N-a0
  (reduced form used for dim 1, with a3=b3=0: (b1-a1)x+b2y-a2z)
  subject to: linearly independent with the previous

We get
  θS1(i,j,k) = j
  θS2(x,y,z) = z


Example 2: 2mm (dim 3)

Prototype:
  θS1(i,j,k) = a1i+a2j+a3k+a0
  θS2(x,y,z) = b1x+b2y+b3z+b0

Minimize: (b1-a1)x+b2y+(b3-a2)z+b0-a3N-a0
  (reduced form used for dim 1, with a3=b3=0: (b1-a1)x+b2y-a2z)
  subject to: linearly independent with the previous

does θS1=k and θS2=y work?
  a3=1, b2=1, rest 0

S1[i,j,N] -> S2[x,y,z]: i=x and j=z


Example 2: 2mm (dim 3)

Prototype:
  θS1(i,j,k) = a1i+a2j+a3k+a0
  θS2(x,y,z) = b1x+b2y+b3z+b0

Minimize: (b1-a1)x+b2y+(b3-a2)z+b0-a3N-a0
  (reduced form used for dim 1, with a3=b3=0: (b1-a1)x+b2y-a2z)
  subject to: linearly independent with the previous

does θS1=k and θS2=y work?
  a3=1, b2=1, rest 0

S1[i,j,N] -> S2[x,y,z]: i=x and j=z
  or S2[x,y,z] -> S1[x,z,N]


Example 2: 2mm (dim 3)

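Prototype:
  θS1(i,j,k) = a1i+a2j+a3k+a0
  θS2(x,y,z) = b1x+b2y+b3z+b0

Minimize: (b1-a1)x+b2y+(b3-a2)z+b0-a3N-a0
  (reduced form used for dim 1, with a3=b3=0: (b1-a1)x+b2y-a2z)
  subject to: linearly independent with the previous

we have to split here
  with θS1=k and θS2=y (a3=1, b2=1, rest 0) the inter-statement δ is y-N, which is negative for y<N
  θS1=0 and θS2=1

S1[i,j,N] -> S2[x,y,z]: i=x and j=z
  or S2[x,y,z] -> S1[x,z,N]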


Example 2: 2mm (dim 4)

Proceed to the 4th dimension
  because the 3rd dimension is only for statement ordering

Now solve the problem independently for each statement
  Case S1: S1[i,j,k] -> S1[i,j,k+1]
    linearly independent with [i] and [j]
  Case S2: S2[x,y,z] -> S2[x,y,z+1]
    linearly independent with [x] and [z]

We get [k] and [y]


Example 2: 2mm

Finally, we have a set of hyper-planes
  θS1(i,j,k) = (i,j,0,k)
  θS2(i,j,k) = (i,k,1,j)

Tilable band: the first two dimensions

Original:

  for i = 0 .. N
    for j = 0 .. N
      for k = 0 .. N
        S1: C[i,j] += A[i,k] * B[k,j];

  for i = 0 .. N
    for j = 0 .. N
      for k = 0 .. N
        S2: E[i,j] += C[i,k] * D[k,j];

Transformed:

  for i = 0 .. N
    for j = 0 .. N {
      for k = 0 .. N
        S1: C[i,j] += A[i,k] * B[k,j];
      for k = 0 .. N
        S2: E[i,k] += C[i,j] * D[j,k];
    }
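As a sanity check, the original and transformed loop nests can be run side by side. The sketch below uses plain Python lists with half-open bounds `range(n)` and integer data, so the comparison is exact. The function names and the test matrices are illustrative only.

```python
def two_mm_original(A, B, D, n):
    """Two separate matrix products: C = A*B, then E = C*D."""
    C = [[0] * n for _ in range(n)]
    E = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]          # S1
    for i in range(n):
        for j in range(n):
            for k in range(n):
                E[i][j] += C[i][k] * D[k][j]          # S2
    return E


def two_mm_fused(A, B, D, n):
    """Loop nest after the schedule S1 -> (i,j,0,k), S2 -> (i,k,1,j)."""
    C = [[0] * n for _ in range(n)]
    E = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]          # S1: finishes C[i][j]
            for k in range(n):
                E[i][k] += C[i][j] * D[j][k]          # S2: consumes C[i][j] at once
    return E


n = 3
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[2, 0, 1], [1, 3, 0], [0, 1, 4]]
D = [[1, 1, 0], [0, 2, 1], [3, 0, 2]]
assert two_mm_fused(A, B, D, n) == two_mm_original(A, B, D, n)
```

In the fused version each C[i][j] is fully accumulated by S1 and then used by S2 right away, which is the locality benefit the schedule is after.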


Example 2: 2mm

Output of Pluto


Summary of Pluto

Paper in 2008
  huge impact: 350+ citations already

Works very well as the default strategy

But, it is far from perfect!