
Improving Locality through Loop Transformations

Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved.

Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use.

Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

COMP 512, Spring 2011


COMP 512, Rice University


The Opportunities

• Compilers have always focused on loops
  - Higher execution counts
  - Repeated, related operations

• Much of the real work takes place in loops (linear algebra)

Several effects to attack

• Overhead
  - Decrease control-structure cost per iteration

• Locality
  - Spatial locality ⇒ use of co-resident data
  - Temporal locality ⇒ reuse of same data

• Parallelism
  - Move loop with independent iterations to inner (outer) position

Remember Fortran H

Eliminating Overhead

Loop unrolling (the oldest trick in the book)

• To reduce overhead, replicate the loop body

    do i = 1 to 100 by 1
      a(i) = a(i) + b(i)
    end

becomes (unroll by 4)

    do i = 1 to 100 by 4
      a(i)   = a(i)   + b(i)
      a(i+1) = a(i+1) + b(i+1)
      a(i+2) = a(i+2) + b(i+2)
      a(i+3) = a(i+3) + b(i+3)
    end

Sources of Improvement

• Less overhead per useful operation

• Longer basic blocks for local optimization


Loop unrolling with unknown bounds

• Generate guard loops

• While loop needs an explicit update

• Used in this form in the BLAS & in BitBlt

    do i = 1 to n by 1
      a(i) = a(i) + b(i)
    end

becomes (unroll by 4)

    i = 1
    do while (i+3 <= n)
      a(i)   = a(i)   + b(i)
      a(i+1) = a(i+1) + b(i+1)
      a(i+2) = a(i+2) + b(i+2)
      a(i+3) = a(i+3) + b(i+3)
      i = i + 4
    end

    do while (i <= n)
      a(i) = a(i) + b(i)
      i = i + 1
    end
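The guard-loop scheme for an unknown trip count can be sketched in C. This is a minimal illustration, not code from the slides: the function name `add_unrolled` is hypothetical, and because C arrays are 0-based the loop tests use `<` against `n` where the 1-based pseudocode compares against `n` inclusively.

```c
#include <stddef.h>

/* Hypothetical helper: a[i] += b[i] for 0 <= i < n, unrolled by 4. */
void add_unrolled(double *a, const double *b, size_t n)
{
    size_t i = 0;

    /* Main loop: runs only while a full group of four elements remains. */
    for (; i + 3 < n; i += 4) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }

    /* Guard (cleanup) loop: handles the remaining 0 to 3 elements. */
    for (; i < n; i++)
        a[i] += b[i];
}
```

The cleanup loop is what makes the transformation correct for any `n`, including values that are not multiples of 4 and `n = 0`.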


One other use for loop unrolling

• Eliminate copies at the end of a loop

    t1 = b(0)
    do i = 1 to 100
      t2 = b(i)
      a(i) = a(i) + t1 + t2
      t1 = t2
    end

becomes (unroll + rename)

    t1 = b(0)
    do i = 1 to 100 by 2
      t2 = b(i)
      a(i) = a(i) + t1 + t2
      t1 = b(i+1)
      a(i+1) = a(i+1) + t2 + t1
    end

More complex cases

• Multiple cross-iteration copy cycles ⇒ unroll by the LCM of the cycle lengths

• Result has been rediscovered many times [Ken’s thesis]
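As a rough C rendering of the copy-elimination idea, the two functions below compute the same result; the second unrolls by 2 and swaps the roles of `t1` and `t2` in the second copy of the body, so the copy `t1 = t2` disappears. The function names are hypothetical, both arrays are assumed to hold `n + 1` elements (indices 0 through `n`), and the unrolled version assumes `n` is even.

```c
/* Rolled form: t1 carries b[i-1] into the next iteration through a copy. */
void sum_pairs_copy(double *a, const double *b, int n)
{
    double t1 = b[0], t2;
    for (int i = 1; i <= n; i++) {
        t2 = b[i];
        a[i] += t1 + t2;
        t1 = t2;                 /* cross-iteration copy */
    }
}

/* Unrolled by 2 with renaming: no copy is needed. Assumes n is even. */
void sum_pairs_unrolled(double *a, const double *b, int n)
{
    double t1 = b[0], t2;
    for (int i = 1; i <= n; i += 2) {
        t2 = b[i];
        a[i] += t1 + t2;
        t1 = b[i + 1];           /* t1 now plays the role t2 had */
        a[i + 1] += t2 + t1;
    }
}
```

The copy cycle here has length 2 (`t1`, `t2`), which is why unrolling by 2 is enough to remove it.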


Loop unswitching

• Hoist invariant control-flow out of loop nest

• Replicate the loop & specialize it

• No tests, branches in loop body

• Longer segments of straight-line code

• Does this happen in real code? If so, it’s worth doing

    do i = 1 to 100
      a(i) = a(i) + b(i)
      if (expression) then d(i) = 0
    end

becomes (unswitch)

    if (expression) then
      do i = 1 to 100
        a(i) = a(i) + b(i)
        d(i) = 0
      end
    else
      do i = 1 to 100
        a(i) = a(i) + b(i)
      end

See also: Cytron, Lowry, & Zadeck in 13th POPL (1986)
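A small C sketch of the unswitched form follows. The name `update_unswitched` and the `flag` parameter are illustrative stand-ins for the slide's loop-invariant `expression`; the transformation only applies when that condition cannot change inside the loop.

```c
#include <stdbool.h>

/* flag is loop invariant, so it is tested once, outside both loop copies. */
void update_unswitched(double *a, const double *b, double *d, int n, bool flag)
{
    if (flag) {
        for (int i = 0; i < n; i++) {
            a[i] += b[i];
            d[i] = 0.0;
        }
    } else {
        for (int i = 0; i < n; i++)
            a[i] += b[i];
    }
}
```

Each copy of the loop body is now branch-free, at the cost of duplicated code.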

Eliminating Overhead & Helping Locality

Loop fusion

• Two loops over same iteration space ⇒ one loop

    do i = 1 to n
      c(i) = a(i) + b(i)
    end

    do j = 1 to n
      d(j) = a(j) * e(j)
    end

  For big enough arrays, a(j) will not be in the cache when the second loop runs

becomes (fuse)

    do i = 1 to n
      c(i) = a(i) + b(i)
      d(i) = a(i) * e(i)
    end

  At the second statement, a(i) will be found in the cache

Advantages

• Fewer total operations (statically & dynamically)

• Longer basic blocks for local optimization & scheduling

• Can convert inter-loop reuse to intra-loop reuse

This is safe if it does not change the values used or defined by any statement in either loop
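The fused loop might look like this in C. It is a sketch under the assumption that none of the five arrays overlap (otherwise fusion could change the values a statement sees); the name `add_mul_fused` is hypothetical.

```c
/* One pass computes both results; a[i] is read once and reused. */
void add_mul_fused(double *c, double *d,
                   const double *a, const double *b, const double *e, int n)
{
    for (int i = 0; i < n; i++) {
        double ai = a[i];      /* inter-loop reuse turned into intra-loop reuse */
        c[i] = ai + b[i];
        d[i] = ai * e[i];
    }
}
```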

Increasing Overhead & Helping Locality

Loop distribution (or fission)

• Single loop with independent statements ⇒ multiple loops

    do i = 1 to n
      a(i) = b(i) + c(i)
      d(i) = e(i) * f(i)
      g(i) = h(i) - k(i)
    end

  Reads b, c, e, f, h, & k; writes a, d, & g

becomes (fission)

    do i = 1 to n
      a(i) = b(i) + c(i)
    end

  Reads b & c; writes a

    do i = 1 to n
      d(i) = e(i) * f(i)
    end

  Reads e & f; writes d

    do i = 1 to n
      g(i) = h(i) - k(i)
    end

  Reads h & k; writes g

Advantages

• Enables other transformations (e.g., vectorization)

• Each resulting loop has a smaller cache footprint
  - More reuse hits in the cache

It is safe if all the statements that form a cycle in the dependence graph end up in the same loop (see COMP 515 from Ken)
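For comparison, here is the distributed form written in C. The function name `three_loops` is hypothetical, and the sketch assumes the nine arrays are distinct, so the three statements are independent and no dependence cycle spans them.

```c
/* Each loop touches only three arrays, so its cache footprint is smaller. */
void three_loops(double *a, const double *b, const double *c,
                 double *d, const double *e, const double *f,
                 double *g, const double *h, const double *k, int n)
{
    for (int i = 0; i < n; i++)      /* reads b, c; writes a */
        a[i] = b[i] + c[i];

    for (int i = 0; i < n; i++)      /* reads e, f; writes d */
        d[i] = e[i] * f[i];

    for (int i = 0; i < n; i++)      /* reads h, k; writes g */
        g[i] = h[i] - k[i];
}
```

The price is three sets of loop-control operations instead of one, which is the increased overhead the slide title refers to.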

Reordering Loops for Locality

Loop interchange

• Swap inner & outer loops to rearrange iteration space

Effect

• Improves reuse by using more elements per cache line

• Goal is to get as much reuse into inner loop as possible

    do i = 1 to 50
      do j = 1 to 100
        a(i,j) = b(i,j) * c(i,j)
      end
    end

  As little as 1 used element per cache line

becomes (interchange)

    do j = 1 to 100
      do i = 1 to 50
        a(i,j) = b(i,j) * c(i,j)
      end
    end

  After interchange, the direction of iteration is changed: the inner loop runs down the cache line

In Fortran’s column-major order, a(4,4) would lay out as

    a(1,1) a(2,1) a(3,1) a(4,1) a(1,2) a(2,2) ... a(4,4)

  (figure: consecutive elements of a column share a cache line)

In row-major order, the opposite loop ordering causes the same effects
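Because C stores arrays in row-major order, the good and bad orderings are the reverse of the Fortran case. The sketch below shows both; the function names and the `ROWS`/`COLS` constants are illustrative, chosen to match the 50 by 100 bounds on the slide.

```c
enum { ROWS = 50, COLS = 100 };

/* Inner loop walks down a column: consecutive accesses are COLS doubles apart. */
void mul_strided(double a[ROWS][COLS], double b[ROWS][COLS], double c[ROWS][COLS])
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            a[i][j] = b[i][j] * c[i][j];
}

/* After interchange: inner loop walks along a row, so accesses are unit stride. */
void mul_unit_stride(double a[ROWS][COLS], double b[ROWS][COLS], double c[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = b[i][j] * c[i][j];
}
```

Both functions compute identical results; interchange is safe here because each iteration touches only its own element of each array.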


Loop permutation

• Interchange is degenerate case
  - Two perfectly nested loops

• More general problem is called permutation

Safety

• Permutation is safe iff no data dependences are reversed
  - The flow of data from definitions to uses is preserved

Effects

• Change order of access & order of computation

• Move accesses closer in time ⇒ increase temporal locality

• Move computations farther apart ⇒ cover pipeline latencies


Strip mining

• Splits a loop into two loops

    do j = 1 to 100
      do i = 1 to 50
        a(i,j) = b(i,j) * c(i,j)
      end
    end

becomes (strip mine)

    do j = 1 to 100
      do ii = 1 to 50 by 8
        do i = ii to min(ii+7,50)
          a(i,j) = b(i,j) * c(i,j)
        end
      end
    end

Effects (may slow down the code)

• Used to match loop bounds to vector length

• Used as prelude to loop permutation (for tiling)

This is always safe
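A one-dimensional C sketch of strip mining is shown below. It covers only the inner `i` loop of the slide's example; the strip length `STRIP` and the function name are assumptions made for illustration.

```c
enum { STRIP = 8 };   /* illustrative strip length, e.g. a vector length */

/* The single loop over i becomes a loop over strips plus a loop within a strip. */
void mul_strip_mined(double *a, const double *b, const double *c, int n)
{
    for (int ii = 0; ii < n; ii += STRIP) {
        /* The last strip may be short when n is not a multiple of STRIP. */
        int hi = (ii + STRIP < n) ? ii + STRIP : n;
        for (int i = ii; i < hi; i++)
            a[i] = b[i] * c[i];
    }
}
```

The iterations execute in exactly the original order, which is why strip mining is always safe on its own.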


Loop tiling (or blocking)

• Combination of strip mining and interchange

Effects

• Reduces volume of data between reuses
  - Works on one “tile” at a time (tile size is tk by tj)

• Choice of tile size is crucial

    do i = 1 to m
      do k = 1 to n
        do j = 1 to n
          c(j,i) = c(j,i) + a(k,i) * b(j,k)
        end
      end
    end

becomes (tiling)

    do kk = 1 to n by tk
      do jj = 1 to n by tj
        do i = 1 to m
          do k = kk to min(kk+tk-1,n)
            do j = jj to min(jj+tj-1,n)
              c(j,i) = c(j,i) + a(k,i) * b(j,k)
            end
          end
        end
      end
    end

  Strip mine & interchange: applied to both the k loop and the j loop

Interchange must be safe
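The tiled nest can be transcribed into C roughly as follows. This is a sketch: the name `matmul_tiled` and the flat row-major storage are assumptions, and the subscripts are transposed relative to the column-major pseudocode so that the innermost `j` loop stays unit stride.

```c
static int min_int(int x, int y) { return x < y ? x : y; }

/* c is m x n, a is m x n, b is n x n, all row-major in flat arrays.
 * Computes c[i][j] += a[i][k] * b[k][j] one tk-by-tj tile of b at a time. */
void matmul_tiled(double *c, const double *a, const double *b,
                  int m, int n, int tk, int tj)
{
    for (int kk = 0; kk < n; kk += tk)
        for (int jj = 0; jj < n; jj += tj)
            for (int i = 0; i < m; i++)
                for (int k = kk; k < min_int(kk + tk, n); k++)
                    for (int j = jj; j < min_int(jj + tj, n); j++)
                        c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

For a fixed `(kk, jj)` pair, every value of `i` reuses the same tile of `b`, so that tile can stay in cache across the whole `i` loop if `tk` and `tj` are chosen to fit.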

Rewriting Loops for Better Register Allocation

Scalar Replacement

• Allocators never keep c(i) in a register

• We can trick the allocator by rewriting the references

The plan

• Locate patterns of consistent reuse

• Make loads and stores use temporary scalar variable

• Replace references with temporary’s name

• May need copies at end of loop to keep reused values straight
  - If reuse spans more than one iteration, need to “pipeline” it


Scalar Replacement

Effects

• Decreases number of loads and stores

• Keeps reused values in names that can be allocated to registers

• In essence, this exposes the reuse of a(i) to subsequent passes

    do i = 1 to n
      do j = 1 to n
        a(i) = a(i) + b(j)
      end
    end

becomes (scalar replacement)

    do i = 1 to n
      t = a(i)
      do j = 1 to n
        t = t + b(j)
      end
      a(i) = t
    end

Almost any register allocator can get t into a register
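In C, the scalar-replaced loop could be written as below. The function name is hypothetical, and the sketch assumes `a` and `b` do not overlap; without that knowledge a C compiler generally cannot make this rewrite itself, because a store to `a[i]` might change some `b[j]`.

```c
/* a[i] is loaded once, accumulated in the scalar t, and stored once. */
void add_all_scalar(double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++) {
        double t = a[i];
        for (int j = 0; j < n; j++)
            t += b[j];
        a[i] = t;
    }
}
```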


What if we are not in Fortran? What about C?

Register Promotion (PLDI: Lu 1997, Sastry & Ju 1998, Chow et al 1998)

• Promote pointer-based references into scalar temporaries

• Requires data-flow information on pointers + a transformation

• Equivalent of scalar replacement for pointer-based values

Lu’s formulation

• Perform interprocedural analysis to disambiguate pointers

• Find loops & solve intraprocedural problem for each loop

• Rewrite code based on results of analysis

His work relied on ILOC’s memory tags


Register Promotion

• Every memory operation has a set of tags: textual names that describe which memory locations it can address
  - | tag set | = 1 means the reference is unambiguous
  - | tag set | > 1 means the reference is ambiguous
  - Interprocedural pointer analysis shrinks tag sets …

Initial data-flow information

• B_Explicit(b) contains all tags referenced by an explicit memory operation in b

• B_Ambiguous(b) contains all tags referenced by a memory operation with multiple tags or by a procedure call in b

Algorithm computes B_Explicit and B_Ambiguous for each block


Register Promotion

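• Solve the equations

\[
\begin{aligned}
\text{L\_Explicit}(L) &= \bigcup_{b \,\in\, \text{loop } L} \text{B\_Explicit}(b) \\
\text{L\_Ambiguous}(L) &= \bigcup_{b \,\in\, \text{loop } L} \text{B\_Ambiguous}(b) \\
\text{L\_Promotable}(L) &= \text{L\_Explicit}(L) - \text{L\_Ambiguous}(L) \\
\text{L\_Lift}(L) &=
  \begin{cases}
    \text{L\_Promotable}(L) & \text{if } L \text{ is an outermost loop} \\
    \text{L\_Promotable}(L) - \text{L\_Promotable}(S) & \text{if } S \text{ surrounds } L
  \end{cases}
\end{aligned}
\]

• For each promotable tag, t, in a loop:
  - Create a virtual register, v
  - Rewrite each reference to t with a copy from v

Loop-by-loop approach motivated by the fact that we really care about loops more than other parts of the program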
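The set equations are simple enough to sketch with bit vectors. Everything below is an illustrative assumption rather than Lu's implementation: tags are numbered 0 to 31 and stored as bits of a `TagSet`, and the type and function names are made up for this example.

```c
#include <stdint.h>

typedef uint32_t TagSet;        /* bit t set => tag t is in the set (assumed encoding) */

typedef struct {
    TagSet explicit_tags;       /* B_Explicit(b)  */
    TagSet ambiguous_tags;      /* B_Ambiguous(b) */
} BlockInfo;

/* L_Promotable(L) for a loop L made up of the given blocks. */
TagSet loop_promotable(const BlockInfo *blocks, int nblocks)
{
    TagSet l_explicit = 0, l_ambiguous = 0;
    for (int i = 0; i < nblocks; i++) {
        l_explicit  |= blocks[i].explicit_tags;    /* union over blocks */
        l_ambiguous |= blocks[i].ambiguous_tags;
    }
    return l_explicit & ~l_ambiguous;              /* set difference */
}

/* L_Lift(L): pass 0 for outer_promotable when L is an outermost loop. */
TagSet loop_lift(TagSet promotable, TagSet outer_promotable)
{
    return promotable & ~outer_promotable;
}
```

For example, if a loop's blocks reference tags {0, 1, 2} explicitly and tag 2 also appears in an ambiguous reference, `loop_promotable` yields {0, 1}; if the surrounding loop can already promote tag 0, `loop_lift` leaves only tag 1 for this loop.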


Register Promotion - Example from the paper

    for (i = 0; i < DIM_X; i++) {
        B[i] = 0;
        for (j = 0; j < DIM_Y; j++) {
            B[i] += A[i][j];
        }
    }

  B has 1 tag

becomes (register promotion)

    for (i = 0; i < DIM_X; i++) {
        rb = 0;
        for (j = 0; j < DIM_Y; j++) {
            rb += A[i][j];
        }
        B[i] = rb;
    }

With this simple scheme

• 0 to 16% of loads removed in test codes

• 0 to 50% of stores removed in test codes

• Other authors tried PRE-based extensions to Lu’s work

Balancing Memory Bound Loops

Balance is the ratio of memory accesses to flops

• Machine balance is $M = \dfrac{\text{accesses/cycle}}{\text{flops/cycle}}$

• Loop balance is $L = \dfrac{\text{accesses/iteration}}{\text{flops/iteration}}$

Loops run better if they are balanced or compute bound

• If L > M, the loop is memory bound

• If L = M, the loop is balanced

• If L < M, the loop is compute bound

Making memory bound loops into balanced or compute-bound loops

• Need more reuse to decrease number of accesses

• Combine scalar replacement, unrolling, and fusion


Example of Loop Balance


Original loop nest

    do i = 1 to n
      do j = 1 to n
        a(i) = a(i) + b(j)
      end
    end

  $L = \dfrac{3 \text{ accesses/iteration}}{1 \text{ flop/iteration}}$

After scalar replacement

    do i = 1 to n
      t = a(i)
      do j = 1 to n
        t = t + b(j)
      end
      a(i) = t
    end

  $L = \dfrac{1 \text{ access/iteration}}{1 \text{ flop/iteration}}$

As memory accesses cost more, this gets better!

Balancing Memory Loops

Unroll and Jam

• Unroll the outer loop

• Fuse the resulting inner loops

    do i = 1 to n
      do j = 1 to n
        a(i) = a(i) + b(j)
      end
    end

becomes (unroll & jam)

    do i = 1 to n by 2
      do j = 1 to n
        a(i) = a(i) + b(j)
        a(i+1) = a(i+1) + b(j)
      end
    end

  b(j): 1 access, 2 flops

Effect

• Increases reuse in the inner loop

• Decreases overhead, too

• Combine with scalar replacement for full benefits

Impact

• Original loop had $L = \dfrac{3 \text{ accesses/iteration}}{1 \text{ flop/iteration}}$

• Scalar replacement got L down to $\dfrac{1 \text{ access/iteration}}{1 \text{ flop/iteration}}$

• Unroll & jam + scalar replacement produced $L = \dfrac{1 \text{ access/iteration}}{2 \text{ flops/iteration}}$


Adding Scalar Replacement

• Rewrites all three cases of reuse

    do i = 1 to n by 2
      do j = 1 to n
        a(i) = a(i) + b(j)
        a(i+1) = a(i+1) + b(j)
      end
    end

becomes (scalar replacement)

    do i = 1 to n by 2
      t1 = a(i)
      t2 = a(i+1)
      do j = 1 to n
        t3 = b(j)
        t1 = t1 + t3
        t2 = t2 + t3
      end
      a(i) = t1
      a(i+1) = t2
    end

  t3 = b(j) captures the reuse
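Putting unroll & jam and scalar replacement together in C might look like the sketch below. The function name is hypothetical and `a` and `b` are assumed not to overlap. The cleanup loop for odd `n` is an addition that the slides do not show; the pseudocode as written needs an even `n`.

```c
/* Outer loop unrolled by 2, inner loops fused, array references replaced by scalars. */
void add_all_unroll_jam(double *a, const double *b, int n)
{
    int i = 0;
    for (; i + 1 < n; i += 2) {
        double t1 = a[i];
        double t2 = a[i + 1];
        for (int j = 0; j < n; j++) {
            double t3 = b[j];    /* one load feeds two additions */
            t1 += t3;
            t2 += t3;
        }
        a[i] = t1;
        a[i + 1] = t2;
    }

    /* Cleanup for odd n: one remaining row, scalar replaced only. */
    for (; i < n; i++) {
        double t = a[i];
        for (int j = 0; j < n; j++)
            t += b[j];
        a[i] = t;
    }
}
```

Inside the main inner loop there is one memory access (`b[j]`) for every two floating-point additions, matching the loop balance of 1 access to 2 flops quoted on the slide.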


The Big Picture

• Scalar replacement + unroll & jam helps

• Factors of 2 to 6 for matrix multiply (matrix300)

What does it take to do this in a real compiler?

• Data dependence analysis (see COMP 515)

• Method to discover consistent reuse patterns (see Carr)

• Way to choose the unroll amount
  - Constrained by available registers
  - Need heuristics to predict allocator’s behavior
