Loop Transformations and Locality

Wei Li, Stanford University, CS243 Winter 2006


Page 1: Loop Transformations and Locality

Wei Li, Stanford University

CS243 Winter 2006

Loop Transformations and Locality

Page 2: Loop Transformations and Locality

Agenda

• Introduction
• Loop Transformations
• Affine Transform Theory

Page 3: Loop Transformations and Locality

Memory Hierarchy

[Figure: memory hierarchy diagram with boxes labeled CPU, C, C (cache), and Memory.]

Page 4: Loop Transformations and Locality

Cache Locality

    for i = 1, 100
      for j = 1, 200
        A[i, j] = A[i, j] + 3
      end_for
    end_for

• Suppose array A has column-major layout:

    A[1,1]  A[2,1]  …  A[1,2]  A[2,2]  …  A[1,3]  …

• Loop nest has poor spatial cache locality: the inner loop steps j, so consecutive iterations touch elements a whole column apart in memory.

Page 5: Loop Transformations and Locality

Loop Interchange

    for i = 1, 100
      for j = 1, 200
        A[i, j] = A[i, j] + 3
      end_for
    end_for

• Suppose array A has column-major layout:

    A[1,1]  A[2,1]  …  A[1,2]  A[2,2]  …  A[1,3]  …

• New loop nest has better spatial cache locality.

    for j = 1, 200
      for i = 1, 100
        A[i, j] = A[i, j] + 3
      end_for
    end_for
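To make the locality claim concrete, here is a minimal Python sketch that lists the memory offsets each loop order touches and measures how often consecutive accesses are adjacent in memory. It assumes A is exactly 100 x 200 and stored column-major; the helper names (`column_major_offset`, `access_offsets`, `unit_stride_fraction`) are hypothetical and not part of the slides.

```python
def column_major_offset(i, j, rows):
    """Linear offset of A[i, j] (1-based indices) in a column-major array."""
    return (j - 1) * rows + (i - 1)


def access_offsets(outer, n_i=100, n_j=200):
    """Offsets touched by the loop nest, in execution order.

    outer="i" is the original nest; outer="j" is the interchanged nest.
    Assumes the array has exactly n_i rows (an assumption, not from the slides).
    """
    if outer == "i":
        pairs = [(i, j) for i in range(1, n_i + 1) for j in range(1, n_j + 1)]
    else:
        pairs = [(i, j) for j in range(1, n_j + 1) for i in range(1, n_i + 1)]
    return [column_major_offset(i, j, n_i) for i, j in pairs]


def unit_stride_fraction(offsets):
    """Fraction of consecutive accesses that land on the next memory element."""
    steps = [b - a for a, b in zip(offsets, offsets[1:])]
    return sum(1 for s in steps if s == 1) / len(steps)


print(unit_stride_fraction(access_offsets("i")))  # original loop order
print(unit_stride_fraction(access_offsets("j")))  # interchanged loop order
```

Under these assumptions the original order never steps to the adjacent element, while the interchanged order always does.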

Page 6: Loop Transformations and Locality

Interchange Loops?

    for i = 2, 100
      for j = 1, 200
        A[i, j] = A[i-1, j+1] + 3
      end_for
    end_for

• e.g. dependence from (3,3) to (4,2): iteration (3,3) writes A[3,3], and iteration (4,2) reads it.

[Figure: iteration space with axes i and j showing the dependence.]

Page 7: Loop Transformations and Locality

Dependence Vectors

[Figure: iteration space with axes i and j.]

• Distance vector (1,-1) = (4,2) - (3,3)
• Direction vector (+, -) from the signs of the distance vector
• Loop interchange is not legal if there exists a dependence (+, -): swapping the loops turns it into (-, +), so the reading iteration would run before the writing one.
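The distance vector can also be found by brute force. The sketch below enumerates the iterations of the loop nest A[i, j] = A[i-1, j+1] + 3 with bounds i = 2..100 and j = 1..200, pairs every write with the later iteration that reads the same element, and collects the differences. The function names are hypothetical.

```python
def distance_vectors(lo_i=2, hi_i=100, lo_j=1, hi_j=200):
    """Distance vectors of  A[i, j] = A[i-1, j+1] + 3  over the given bounds."""
    vectors = set()
    for i in range(lo_i, hi_i + 1):
        for j in range(lo_j, hi_j + 1):
            # Iteration (i, j) writes A[i, j]. The iteration (i2, j2) that
            # reads it satisfies (i2 - 1, j2 + 1) == (i, j).
            i2, j2 = i + 1, j - 1
            if lo_i <= i2 <= hi_i and lo_j <= j2 <= hi_j:
                vectors.add((i2 - i, j2 - j))
    return vectors


def direction(vector):
    """Direction vector: the sign of each distance component."""
    return tuple("+" if d > 0 else "-" if d < 0 else "0" for d in vector)


for d in distance_vectors():
    print(d, direction(d))
```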

Page 8: Loop Transformations and Locality

Agenda

• Introduction
• Loop Transformations
• Affine Transform Theory

Page 9: Loop Transformations and Locality

Loop Fusion

    for i = 1, 1000
      A[i] = B[i] + 3
    end_for

    for j = 1, 1000
      C[j] = A[j] + 5
    end_for

becomes

    for i = 1, 1000
      A[i] = B[i] + 3
      C[i] = A[i] + 5
    end_for

• Better reuse between the write of A[i] and the read of A[i]
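A small Python sketch can confirm that the fused loop computes the same values as the two separate loops. It uses 0-based lists instead of the slides' 1-based arrays, and the function names are hypothetical.

```python
def separate_loops(B):
    """Two loops: A is fully written before C reads any of it."""
    n = len(B)
    A = [0] * n
    C = [0] * n
    for i in range(n):
        A[i] = B[i] + 3
    for j in range(n):
        C[j] = A[j] + 5
    return A, C


def fused_loop(B):
    """One loop: C[i] reads A[i] right after it is written."""
    n = len(B)
    A = [0] * n
    C = [0] * n
    for i in range(n):
        A[i] = B[i] + 3
        C[i] = A[i] + 5  # A[i] was produced in this same iteration
    return A, C


B = list(range(1000))
print(separate_loops(B) == fused_loop(B))
```

Fusion is safe here because C[i] only needs A[i], which the same iteration has already produced.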

Page 10: Loop Transformations and Locality

Loop Distribution

    for i = 1, 1000
      A[i] = A[i-1] + 3
      C[i] = B[i] + 5
    end_for

becomes

    for i = 1, 1000
      A[i] = A[i-1] + 3
    end_for

    for i = 1, 1000
      C[i] = B[i] + 5
    end_for

• 2nd loop is parallel

Page 11: Loop Transformations and Locality

Register Blocking

    for j = 1, 2*m
      for i = 1, 2*n
        A[i, j] = A[i-1, j] + A[i-1, j-1]
      end_for
    end_for

becomes

    for j = 1, 2*m, 2
      for i = 1, 2*n, 2
        A[i, j]     = A[i-1, j]   + A[i-1, j-1]
        A[i, j+1]   = A[i-1, j+1] + A[i-1, j]
        A[i+1, j]   = A[i, j]     + A[i, j-1]
        A[i+1, j+1] = A[i, j+1]   + A[i, j]
      end_for
    end_for

• Better reuse between the write of A[i,j] and its later reads

Page 12: Loop Transformations and Locality

Virtual Register Allocation

    for j = 1, 2*M, 2
      for i = 1, 2*N, 2
        r1 = A[i-1, j]
        r2 = r1 + A[i-1, j-1]
        A[i, j] = r2
        r3 = A[i-1, j+1] + r1
        A[i, j+1] = r3
        A[i+1, j] = r2 + A[i, j-1]
        A[i+1, j+1] = r3 + r2
      end_for
    end_for

• Memory operations reduced to register load/store
• 8MN loads to 4MN loads
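The load counts can be checked with a small instrumented sketch. `CountingArray` is a hypothetical helper that counts element reads. Elements that are never written (row 0 and column 0) read as 1, which is an assumption made only so the example runs.

```python
class CountingArray:
    """Sparse 2-D array that counts loads. Unwritten elements read as 1 (assumed)."""

    def __init__(self):
        self.data = {}
        self.loads = 0

    def __getitem__(self, key):
        self.loads += 1
        return self.data.get(key, 1)

    def __setitem__(self, key, value):
        self.data[key] = value


def original_nest(M, N):
    A = CountingArray()
    for j in range(1, 2 * M + 1):
        for i in range(1, 2 * N + 1):
            A[i, j] = A[i - 1, j] + A[i - 1, j - 1]
    return A


def blocked_with_registers(M, N):
    A = CountingArray()
    for j in range(1, 2 * M + 1, 2):
        for i in range(1, 2 * N + 1, 2):
            r1 = A[i - 1, j]
            r2 = r1 + A[i - 1, j - 1]
            A[i, j] = r2
            r3 = A[i - 1, j + 1] + r1
            A[i, j + 1] = r3
            A[i + 1, j] = r2 + A[i, j - 1]
            A[i + 1, j + 1] = r3 + r2
    return A


M, N = 3, 4
before, after = original_nest(M, N), blocked_with_registers(M, N)
print(before.loads, after.loads, before.data == after.data)
```

Each of the 2M x 2N original iterations does 2 loads (8MN in total), while each of the M x N blocked iterations does 4 loads (4MN in total).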

Page 13: Loop Transformations and Locality

Scalar Replacement

    for i = 2, N+1
      … = A[i-1] + 1
      A[i] = …
    end_for

becomes

    t1 = A[1]
    for i = 2, N+1
      … = t1 + 1
      t1 = …
      A[i] = t1
    end_for

• Eliminate loads and stores for array references

Page 14: Loop Transformations and Locality

Unroll-and-Jam

    for j = 1, 2*M
      for i = 1, N
        A[i, j] = A[i-1, j] + A[i-1, j-1]
      end_for
    end_for

becomes

    for j = 1, 2*M, 2
      for i = 1, N
        A[i, j]   = A[i-1, j]   + A[i-1, j-1]
        A[i, j+1] = A[i-1, j+1] + A[i-1, j]
      end_for
    end_for

• Expose more opportunity for scalar replacement

Page 15: Loop Transformations and Locality

Large Arrays

    for i = 1, 1000
      for j = 1, 1000
        A[i, j] = A[i, j] + B[j, i]
      end_for
    end_for

• Suppose arrays A and B have row-major layout
• B has poor cache locality.
• Loop interchange will not help: it would make the accesses to B contiguous but the accesses to A strided.

Page 16: Loop Transformations and Locality

Loop Blocking

    for v = 1, 1000, 20
      for u = 1, 1000, 20
        for j = v, v+19
          for i = u, u+19
            A[i, j] = A[i, j] + B[j, i]
          end_for
        end_for
      end_for
    end_for

• Access to small blocks of the arrays has good cache locality.
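As a sanity check on the blocked loop nest, the following Python sketch runs both versions on small arrays and compares the results. It uses 0-based nested lists, and the `min(...)` guards are an addition for sizes that are not a multiple of the block size. The function names are hypothetical.

```python
def update_original(A, B, n):
    for i in range(n):
        for j in range(n):
            A[i][j] = A[i][j] + B[j][i]


def update_blocked(A, B, n, block=20):
    # v and u walk over block origins; j and i sweep one block at a time.
    for v in range(0, n, block):
        for u in range(0, n, block):
            for j in range(v, min(v + block, n)):
                for i in range(u, min(u + block, n)):
                    A[i][j] = A[i][j] + B[j][i]


n = 50
A1 = [[i + j for j in range(n)] for i in range(n)]
A2 = [row[:] for row in A1]
B = [[i * j for j in range(n)] for i in range(n)]
update_original(A1, B, n)
update_blocked(A2, B, n)
print(A1 == A2)
```

The reordering is safe because each iteration touches its own element of A and only reads B, so no iteration depends on another.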

Page 17: Loop Transformations and Locality

Loop Unrolling for ILP

    for i = 1, 10
      a[i] = b[i];
      *p = ...
    end_for

becomes

    for i = 1, 10, 2
      a[i] = b[i];
      *p = ...
      a[i+1] = b[i+1];
      *p = ...
    end_for

• Large scheduling regions
• Fewer dynamic branches
• Increased code size

Page 18: Loop Transformations and Locality

Agenda

• Introduction
• Loop Transformations
• Affine Transform Theory

Page 19: Loop Transformations and Locality

Objective

• Unify a large class of program transformations.
• Example:

    float Z[100];
    for i = 0, 9
      Z[i+10] = Z[i];
    end_for

Page 20: Loop Transformations and Locality

Iteration Space

• A d-deep loop nest has d index variables, and is modeled by a d-dimensional space. The space of iterations is bounded by the lower and upper bounds of the loop indices.
• Iteration space i = 0, 1, …, 9

    for i = 0, 9
      Z[i+10] = Z[i];
    end_for

Page 21: Loop Transformations and Locality

Matrix Formulation

• The iterations in a d-deep loop nest can be represented mathematically as

\[ \{\, \mathbf{i} \in Z^d \mid B\mathbf{i} + \mathbf{b} \ge \mathbf{0} \,\} \]

• Z is the set of integers
• B is an integer matrix with d columns and one row per loop-bound constraint
• b is an integer vector with one entry per constraint, and
• 0 is a vector of 0's of the same length.

Page 22: Loop Transformations and Locality

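Example

    for i = 0, 5
      for j = i, 7
        Z[j,i] = 0;

\[ \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ -1 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} 0 \\ 5 \\ 0 \\ 7 \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \]

E.g. the 3rd row -i+j ≥ 0 is from the lower bound j ≥ i for loop j.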
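One way to check a constraint system against the loop bounds is to enumerate the integer points it admits. The sketch below assumes the matrix B and vector b implied by the bounds 0 ≤ i ≤ 5 and i ≤ j ≤ 7, and compares the resulting set with the iterations the loop nest actually executes. The function names and the search window are hypothetical.

```python
from itertools import product

# Rows encode: i >= 0, -i + 5 >= 0, -i + j >= 0, -j + 7 >= 0.
B = [[1, 0], [-1, 0], [-1, 1], [0, -1]]
b = [0, 5, 0, 7]


def in_iteration_space(point, B=B, b=b):
    """True if B * point + b >= 0 holds in every row."""
    return all(
        sum(c * x for c, x in zip(row, point)) + offset >= 0
        for row, offset in zip(B, b)
    )


def enumerate_space(search=range(-2, 10)):
    """Integer points (i, j) in a small search window that satisfy the constraints."""
    return [p for p in product(search, repeat=2) if in_iteration_space(p)]


from_loops = [(i, j) for i in range(0, 6) for j in range(i, 8)]
print(enumerate_space() == from_loops, len(from_loops))
```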

Page 23: Loop Transformations and Locality

Symbolic Constants

    for i = 0, n
      Z[i] = 0;

\[ \left\{\, i \in Z \;\middle|\; \begin{bmatrix} -1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ n \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \end{bmatrix} \right\} \]

E.g. the 1st row -i+n ≥ 0 is from the upper bound i ≤ n.

Page 24: Loop Transformations and Locality

Data Space

• An n-dimensional array is modeled by an n-dimensional space. The space is bounded by the array bounds.
• Data space a = 0, 1, …, 99

    float Z[100]
    for i = 0, 9
      Z[i+10] = Z[i];
    end_for

Page 25: Loop Transformations and Locality

Processor Space

• Initially assume an unbounded number of virtual processors (vp1, vp2, …) organized in a multi-dimensional space.
    (iteration 1, vp1), (iteration 2, vp2), …
• After parallelization, map to physical processors (p1, p2).
    (vp1, p1), (vp2, p2), (vp3, p1), (vp4, p2), …

Page 26: Loop Transformations and Locality

Affine Array Index Function

• Each array access in the code specifies a mapping from an iteration in the iteration space to an array element in the data space.
• Both i+10 and i are affine.

    float Z[100]
    for i = 0, 9
      Z[i+10] = Z[i];
    end_for

Page 27: Loop Transformations and Locality

Array Affine Access

An array access in a loop is affine if:

• The bounds of the loop are expressed as affine expressions of the surrounding loop variables and symbolic constants, and
• The index for each dimension of the array is also an affine expression of surrounding loop variables and symbolic constants.

Page 28: Loop Transformations and Locality

Matrix Formulation

• Array access maps a vector i within the bounds

\[ \{\, \mathbf{i} \in Z^d \mid B\mathbf{i} + \mathbf{b} \ge \mathbf{0} \,\} \]

  to array element location Fi+f.
• E.g. access X[i-1] in loop nest i,j:

\[ F\mathbf{i} + \mathbf{f} = \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} -1 \end{bmatrix} \]
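The mapping Fi+f is easy to evaluate directly. In the sketch below, `affine_access` is a hypothetical helper that applies a matrix F and offset vector f to an iteration vector. The values of F and f are those for X[i-1] in a loop nest with index variables i, j.

```python
def affine_access(F, f, iteration):
    """Array element location F * iteration + f, one component per array dimension."""
    return tuple(
        sum(c * x for c, x in zip(row, iteration)) + offset
        for row, offset in zip(F, f)
    )


# X[i-1] in loop nest (i, j): F = [1 0], f = [-1].
F = [[1, 0]]
f = [-1]
print(affine_access(F, f, (5, 7)))  # element of X touched at iteration i=5, j=7
```

The j column of F is zero, so the access does not depend on j: every iteration of the inner loop touches the same element of X.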

Page 29: Loop Transformations and Locality

Affine Partitioning

• An affine function to assign iterations in an iteration space to processors in the processor space.
• E.g. iteration i to processor 10-i.

    float Z[100]
    for i = 0, 9
      Z[i+10] = Z[i];
    end_for

Page 30: Loop Transformations and Locality

Data Access Region

• The set of array elements that an access touches over the whole iteration space.
• Region for Z[i+10] is {a | 10 ≤ a ≤ 19}, since i ranges over 0, …, 9.

    float Z[100]
    for i = 0, 9
      Z[i+10] = Z[i];
    end_for

Page 31: Loop Transformations and Locality

Data Dependences

• Solution to linear constraints as shown in the last lecture.
• A dependence exists if there exist i_r, i_w such that 0 ≤ i_r, i_w ≤ 9 and i_w + 10 = i_r.
• Here no solution exists, because i_w + 10 ≥ 10 > 9 ≥ i_r, so the loop carries no data dependence.

    float Z[100]
    for i = 0, 9
      Z[i+10] = Z[i];
    end_for
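For a loop this small, the dependence constraints can be checked by exhaustive search rather than with a linear solver. The sketch below is only an illustration with hypothetical names. It also computes the sets of elements written and read, which do not overlap.

```python
def dependences(n_iter=10, shift=10):
    """All (i_w, i_r) with 0 <= i_w, i_r < n_iter where the write Z[i_w + shift]
    and the read Z[i_r] touch the same element."""
    return [
        (i_w, i_r)
        for i_w in range(n_iter)
        for i_r in range(n_iter)
        if i_w + shift == i_r
    ]


write_region = {i + 10 for i in range(10)}  # elements touched by Z[i+10]
read_region = {i for i in range(10)}        # elements touched by Z[i]

print(dependences())                # dependences in the slide's loop
print(write_region & read_region)   # elements both written and read
print(len(dependences(shift=3)))    # a smaller shift does create dependences
```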

Page 32: Loop Transformations and Locality

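Affine Transform

[Figure: iteration space with axes i, j mapped to a transformed space with axes u, v.]

\[ \begin{bmatrix} u \\ v \end{bmatrix} = B \begin{bmatrix} i \\ j \end{bmatrix} + \mathbf{b} \]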

Page 33: Loop Transformations and Locality

Locality Optimization

    for i = 1, 100
      for j = 1, 200
        A[i, j] = A[i, j] + 3
      end_for
    end_for

becomes

    for u = 1, 200
      for v = 1, 100
        A[v, u] = A[v, u] + 3
      end_for
    end_for

\[ \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \]

Page 34: Loop Transformations and Locality

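Old Iteration Space

\[ \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \]

    for i = 1, 100
      for j = 1, 200
        A[i, j] = A[i, j] + 3
      end_for
    end_for

\[ \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} + \begin{bmatrix} -1 \\ 100 \\ -1 \\ 200 \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \]

Substituting (i, j) in terms of (u, v):

\[ \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}^{-1} \begin{bmatrix} u \\ v \end{bmatrix} + \begin{bmatrix} -1 \\ 100 \\ -1 \\ 200 \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \]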

Page 35: Loop Transformations and Locality

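New Iteration Space

\[ \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}^{-1} \begin{bmatrix} u \\ v \end{bmatrix} + \begin{bmatrix} -1 \\ 100 \\ -1 \\ 200 \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \]

simplifies to

\[ \begin{bmatrix} 0 & 1 \\ 0 & -1 \\ 1 & 0 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} + \begin{bmatrix} -1 \\ 100 \\ -1 \\ 200 \end{bmatrix} \ge \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \]

    for u = 1, 200
      for v = 1, 100
        A[v, u] = A[v, u] + 3
      end_for
    end_for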
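The new bounds can be derived mechanically by multiplying the old constraint matrix by the inverse of the transform. The sketch below assumes the old constraints are 1 ≤ i ≤ 100 and 1 ≤ j ≤ 200 written in the form Bi + b ≥ 0, and uses the interchange matrix, which is its own inverse. All names are hypothetical.

```python
def matmul(X, Y):
    """Plain-Python product of two matrices given as lists of rows."""
    return [
        [sum(X[r][k] * Y[k][c] for k in range(len(Y))) for c in range(len(Y[0]))]
        for r in range(len(X))
    ]


# Old constraints: i - 1 >= 0, -i + 100 >= 0, j - 1 >= 0, -j + 200 >= 0.
B_old = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b = [-1, 100, -1, 200]

# Interchange (u, v) = (j, i); a permutation matrix that is its own inverse.
T_inv = [[0, 1], [1, 0]]

B_new = matmul(B_old, T_inv)


def satisfies(point, B, b):
    return all(
        sum(c * x for c, x in zip(row, point)) + offset >= 0
        for row, offset in zip(B, b)
    )


print(B_new)
print(satisfies((200, 100), B_new, b))  # (u, v) = (200, 100)
```

The rows of `B_new` read as 1 ≤ v ≤ 100 and 1 ≤ u ≤ 200, which are the bounds of the interchanged loop nest.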

Page 36: Loop Transformations and Locality

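Old Array Accesses

\[ \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \]

\[ A\left[\, \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix},\ \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \,\right] \]

    for i = 1, 100
      for j = 1, 200
        A[i, j] = A[i, j] + 3
      end_for
    end_for

Substituting (i, j) in terms of (u, v):

\[ A\left[\, \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}^{-1} \begin{bmatrix} u \\ v \end{bmatrix},\ \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}^{-1} \begin{bmatrix} u \\ v \end{bmatrix} \,\right] \]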

Page 37: Loop Transformations and Locality

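New Array Accesses

\[ A\left[\, \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}^{-1} \begin{bmatrix} u \\ v \end{bmatrix},\ \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}^{-1} \begin{bmatrix} u \\ v \end{bmatrix} \,\right] \]

simplifies to

\[ A\left[\, \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix},\ \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} \,\right] \]

    for u = 1, 200
      for v = 1, 100
        A[v, u] = A[v, u] + 3
      end_for
    end_for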

Page 38: Loop Transformations and Locality

Interchange Loops?

    for i = 2, 1000
      for j = 1, 1000
        A[i, j] = A[i-1, j+1] + 3
      end_for
    end_for

• e.g. dependence vector d_old = (1,-1)

[Figure: iteration space with axes i and j showing the dependence.]

Page 39: Loop Transformations and Locality

Interchange Loops?

\[ \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix} \]

\[ d_{new} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} d_{old} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \end{bmatrix} \]

• A transformation is legal if the new dependence is lexicographically positive, i.e. the leading non-zero in the dependence vector is positive.
• Here d_new = (-1, 1) has a negative leading non-zero, so interchanging these loops is not legal.
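The legality test lends itself to a few lines of code. The sketch below applies a transform matrix to a dependence vector and checks that the result is lexicographically positive. The function names are hypothetical, and the zero vector is treated as not positive.

```python
def transform(T, d):
    """New dependence vector T * d."""
    return tuple(sum(t * x for t, x in zip(row, d)) for row in T)


def lexicographically_positive(d):
    """True if the leading non-zero component of d is positive."""
    for x in d:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # all-zero vector: not lexicographically positive


def is_legal(T, dependence_vectors):
    return all(lexicographically_positive(transform(T, d)) for d in dependence_vectors)


INTERCHANGE = [[0, 1], [1, 0]]

print(transform(INTERCHANGE, (1, -1)))   # d_new for d_old = (1, -1)
print(is_legal(INTERCHANGE, [(1, -1)]))  # the loop nest from the slide
print(is_legal(INTERCHANGE, [(1, 1)]))   # a dependence that survives interchange
```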

Page 40: Loop Transformations and Locality

Summary

• Locality Optimizations
• Loop Transformations
• Affine Transform Theory