Improving Locality through Loop Transformations
Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved.
Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use.
Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.
Comp 512, Spring 2011
COMP 512, Rice University
The Opportunities
• Compilers have always focused on loops
  - Higher execution counts
  - Repeated, related operations
• Much of real work takes place in loops (linear algebra)
Several effects to attack
• Overhead
  - Decrease control-structure cost per iteration
• Locality
  - Spatial locality: use of co-resident data
  - Temporal locality: reuse of same data
• Parallelism
  - Move loop w/independent iterations to inner (outer) position
Remember Fortran H
Eliminating Overhead
Loop unrolling (the oldest trick in the book)
• To reduce overhead, replicate the loop body
Sources of Improvement
• Less overhead per useful operation
• Longer basic blocks for local optimization
    do i = 1 to 100 by 1
      a(i) = a(i) + b(i)
    end
becomes (unroll by 4)
    do i = 1 to 100 by 4
      a(i)   = a(i)   + b(i)
      a(i+1) = a(i+1) + b(i+1)
      a(i+2) = a(i+2) + b(i+2)
      a(i+3) = a(i+3) + b(i+3)
    end
Eliminating Overhead
Loop unrolling with unknown bounds
• Generate guard loops
• While loop needs an explicit update
• Used in this form in the BLAS & in BitBlt
    do i = 1 to n by 1
      a(i) = a(i) + b(i)
    end
becomes (unroll by 4)
    i = 1
    do while (i+3 <= n)
      a(i)   = a(i)   + b(i)
      a(i+1) = a(i+1) + b(i+1)
      a(i+2) = a(i+2) + b(i+2)
      a(i+3) = a(i+3) + b(i+3)
      i = i + 4
    end
    do while (i <= n)
      a(i) = a(i) + b(i)
      i = i + 1
    end
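The guard-loop pattern can be sketched in C as follows. This is a minimal sketch, assuming 0-based arrays that do not overlap; the function names `add_simple` and `add_unrolled4` are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop: one add and one loop test per element. */
void add_simple(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Unrolled by 4 for a trip count n that is unknown at compile time
   (hypothetical helper name). */
void add_unrolled4(double *a, const double *b, size_t n) {
    assert(a != NULL && b != NULL);
    size_t i = 0;
    /* Main loop: runs only while four full elements remain. */
    while (i + 3 < n) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
        i += 4;                 /* explicit update of the index */
    }
    /* Guard loop: handles the 0 to 3 leftover elements. */
    while (i < n) {
        a[i] += b[i];
        i++;
    }
}
```

With 0-based indexing the main-loop test is `i + 3 < n`; with the 1-based, inclusive bounds used on the slide the equivalent test is `i+3 <= n`.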
Eliminating Overhead
One other use for loop unrolling
• Eliminate copies at the end of a loop
    t1 = b(0)
    do i = 1 to 100
      t2 = b(i)
      a(i) = a(i) + t1 + t2
      t1 = t2
    end
becomes (unroll + rename)
    t1 = b(0)
    do i = 1 to 100 by 2
      t2 = b(i)
      a(i) = a(i) + t1 + t2
      t1 = b(i+1)
      a(i+1) = a(i+1) + t2 + t1
    end
More complex cases
• Multiple cross-iteration copy cycles: use LCM of cycle lengths
• Result has been rediscovered many times [Ken’s thesis]
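A C rendering of the same unroll-and-rename step, as a sketch only. It keeps the slide's indexing (`a[1..n]`, `b[0..n]`) and assumes `n` is even and the arrays do not overlap; `sum_with_copy` and `sum_unrolled` are hypothetical names.

```c
#include <assert.h>

/* Original: t1 carries b[i-1] into the next iteration through a copy.
   Arrays are indexed as on the slide: a[1..n], b[0..n]. */
void sum_with_copy(double *a, const double *b, int n) {
    double t1 = b[0];
    for (int i = 1; i <= n; i++) {
        double t2 = b[i];
        a[i] = a[i] + t1 + t2;
        t1 = t2;                 /* cross-iteration copy */
    }
}

/* Unrolled by 2: the two temporaries swap roles in the second copy
   of the body, so the copy disappears. Assumes n is even. */
void sum_unrolled(double *a, const double *b, int n) {
    assert(n % 2 == 0);
    double t1 = b[0];
    for (int i = 1; i <= n; i += 2) {
        double t2 = b[i];
        a[i] = a[i] + t1 + t2;
        t1 = b[i + 1];           /* t1 now holds the newer value */
        a[i + 1] = a[i + 1] + t2 + t1;
    }
}
```

The copy cycle here has length 2, which is why unrolling by 2 removes it; longer or multiple cycles need the LCM of the cycle lengths.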
Eliminating Overhead
Loop unswitching
• Hoist invariant control-flow out of loop nest
• Replicate the loop & specialize it
• No tests, branches in loop body
• Longer segments of straight-line code
• Does this happen in real code? If so, it’s worth doing
    do i = 1 to 100
      a(i) = a(i) + b(i)
      if (expression) then d(i) = 0
    end
becomes (unswitch)
    if (expression) then
      do i = 1 to 100
        a(i) = a(i) + b(i)
        d(i) = 0
      end
    else
      do i = 1 to 100
        a(i) = a(i) + b(i)
      end
See also: Cytron, Lowry, & Zadeck in 13th POPL (1986)
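A small C sketch of unswitching, assuming the condition is loop invariant and is passed in as the flag `cond`; the function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: the loop-invariant test runs on every iteration. */
void update_branchy(double *a, const double *b, double *d,
                    size_t n, int cond) {
    for (size_t i = 0; i < n; i++) {
        a[i] += b[i];
        if (cond)
            d[i] = 0.0;
    }
}

/* Unswitched form: test once, then run a specialized copy of the loop
   with no branch in its body. */
void update_unswitched(double *a, const double *b, double *d,
                       size_t n, int cond) {
    assert(a != NULL && b != NULL && d != NULL);
    if (cond) {
        for (size_t i = 0; i < n; i++) {
            a[i] += b[i];
            d[i] = 0.0;
        }
    } else {
        for (size_t i = 0; i < n; i++)
            a[i] += b[i];
    }
}
```

The cost is code growth: each unswitched condition doubles the number of loop copies.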
Eliminating Overhead & Helping Locality
Loop fusion
• Two loops over same iteration space ⇒ one loop
Advantages
• Fewer total operations (statically & dynamically)
• Longer basic blocks for local optimization & scheduling
• Can convert inter-loop reuse to intra-loop reuse
    do i = 1 to n
      c(i) = a(i) + b(i)
    end
    do j = 1 to n
      d(j) = a(j) * e(j)
    end
  For big enough arrays, a(i) will not be in the cache when the second loop reads it
becomes (fuse)
    do i = 1 to n
      c(i) = a(i) + b(i)
      d(i) = a(i) * e(i)
    end
  a(i) will be found in the cache at its second use
This is safe if it does not change the values used or defined by any statement in either loop
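The fusion example translates to C roughly as below. This is a sketch that assumes the output arrays `c` and `d` do not overlap the inputs `a`, `b`, and `e` (otherwise fusion could change the values read); the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Two separate loops: a[] is streamed through the cache twice. */
void add_then_mul(double *c, double *d, const double *a,
                  const double *b, const double *e, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
    for (size_t j = 0; j < n; j++)
        d[j] = a[j] * e[j];
}

/* Fused loop: a[i] is read twice in the same iteration, so the second
   read hits in cache (or in a register). */
void add_mul_fused(double *c, double *d, const double *a,
                   const double *b, const double *e, size_t n) {
    assert(c != NULL && d != NULL && a != NULL && b != NULL && e != NULL);
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
        d[i] = a[i] * e[i];
    }
}
```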
Increasing Overhead & Helping Locality
Loop distribution (or fission)
• Single loop with independent statements ⇒ multiple loops
Advantages
• Enables other transformations (e.g., vectorization)
• Each resulting loop has a smaller cache footprint
  - More reuse hits in the cache
    do i = 1 to n
      a(i) = b(i) + c(i)
      d(i) = e(i) * f(i)
      g(i) = h(i) - k(i)
    end
  Reads b, c, e, f, h, & k; writes a, d, & g
becomes (fission)
    do i = 1 to n
      a(i) = b(i) + c(i)
    end
  Reads b & c; writes a
    do i = 1 to n
      d(i) = e(i) * f(i)
    end
  Reads e & f; writes d
    do i = 1 to n
      g(i) = h(i) - k(i)
    end
  Reads h & k; writes g
It is safe if all the statements that form a cycle in the dependence graph end up in the same loop (see COMP 515 from Ken)
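A C sketch of the same distribution, assuming all nine arrays are distinct so the three statements are independent (no dependence cycle links them). The `Arrays` struct and both function names are hypothetical conveniences, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bundle of the nine arrays used on the slide. */
typedef struct {
    double *a, *d, *g;                    /* written */
    const double *b, *c, *e, *f, *h, *k;  /* read only */
} Arrays;

/* Single loop: each iteration touches all nine arrays. */
void combined(const Arrays *v, size_t n) {
    for (size_t i = 0; i < n; i++) {
        v->a[i] = v->b[i] + v->c[i];
        v->d[i] = v->e[i] * v->f[i];
        v->g[i] = v->h[i] - v->k[i];
    }
}

/* Distributed: each loop touches only three arrays at a time. */
void distributed(const Arrays *v, size_t n) {
    assert(v != NULL);
    for (size_t i = 0; i < n; i++)
        v->a[i] = v->b[i] + v->c[i];
    for (size_t i = 0; i < n; i++)
        v->d[i] = v->e[i] * v->f[i];
    for (size_t i = 0; i < n; i++)
        v->g[i] = v->h[i] - v->k[i];
}
```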
Reordering Loops for Locality
Loop interchange
• Swap inner & outer loops to rearrange iteration space
Effect
• Improves reuse by using more elements per cache line
• Goal is to get as much reuse into inner loop as possible
    do i = 1 to 50
      do j = 1 to 100
        a(i,j) = b(i,j) * c(i,j)
      end
    end
becomes (interchange)
    do j = 1 to 100
      do i = 1 to 50
        a(i,j) = b(i,j) * c(i,j)
      end
    end
[Figure: layout of a(4,4) in Fortran’s column-major order, with cache lines marked]
Before interchange: as little as 1 used element per cache line
After interchange: direction of iteration is changed; the inner loop runs down the cache line
In row-major order, the opposite loop ordering causes the same effects
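Because C stores arrays in row-major order, the favorable loop order in C is the opposite of the Fortran one. The sketch below shows both orders for the same computation; the array dimensions follow the slide, and the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 50, COLS = 100 };

/* Inner loop over rows: in C's row-major layout, consecutive
   inner-loop iterations touch elements COLS doubles apart. */
void mul_strided(double a[ROWS][COLS], double b[ROWS][COLS],
                 double c[ROWS][COLS]) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            a[i][j] = b[i][j] * c[i][j];
}

/* After interchange: the inner loop walks along a row, so it uses
   every element of each cache line it brings in. */
void mul_unit_stride(double a[ROWS][COLS], double b[ROWS][COLS],
                     double c[ROWS][COLS]) {
    assert(a != NULL && b != NULL && c != NULL);
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = b[i][j] * c[i][j];
}
```

Interchange is legal here because each iteration writes a distinct element and reads nothing written by another iteration.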
Reordering Loops for Locality
Loop permutation
• Interchange is degenerate case
  - Two perfectly nested loops
• More general problem is called permutation
Safety
• Permutation is safe iff no data dependences are reversed
  - The flow of data from definitions to uses is preserved
Effects
• Change order of access & order of computation
• Move accesses closer in time ⇒ increase temporal locality
• Move computations farther apart ⇒ cover pipeline latencies
Reordering Loops for Locality
Strip mining
• Splits a loop into two loops
Effects (may slow down the code)
• Used to match loop bounds to vector length
• Used as prelude to loop permutation (for tiling)
    do j = 1 to 100
      do i = 1 to 50
        a(i,j) = b(i,j) * c(i,j)
      end
    end
becomes (strip mine)
    do j = 1 to 100
      do ii = 1 to 50 by 8
        do i = ii to min(ii+7,50)
          a(i,j) = b(i,j) * c(i,j)
        end
      end
    end
This is always safe
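Strip mining on a one-dimensional loop can be sketched in C as follows. The strip length of 8 matches the slide; the names `STRIP`, `min_size`, `mul_simple`, and `mul_strip_mined` are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

enum { STRIP = 8 };   /* strip length, e.g. a vector length */

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Original single loop. */
void mul_simple(double *a, const double *b, const double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}

/* Strip-mined: the outer loop steps by STRIP, the inner loop covers one
   strip. The min() handles a final partial strip. Iterations execute
   in exactly the original order, which is why this is always safe. */
void mul_strip_mined(double *a, const double *b, const double *c, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t ii = 0; ii < n; ii += STRIP) {
        size_t end = min_size(ii + STRIP, n);
        for (size_t i = ii; i < end; i++)
            a[i] = b[i] * c[i];
    }
}
```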
Reordering Loops for Locality
Loop tiling (or blocking)
• Combination of strip mining and interchange
Effects
• Reduces volume of data between reuses
  - Works on one “tile” at a time (tile size is tk by tj)
• Choice of tile size is crucial
    do i = 1 to m
      do k = 1 to n
        do j = 1 to n
          c(j,i) = c(j,i) + a(k,i) * b(j,k)
        end
      end
    end
becomes (tiling)
    do kk = 1 to n by tk                  (strip mine & interchange)
      do jj = 1 to n by tj                (strip mine & interchange)
        do i = 1 to m
          do k = kk to min(kk+tk-1,n)
            do j = jj to min(jj+tj-1,n)
              c(j,i) = c(j,i) + a(k,i) * b(j,k)
            end
          end
        end
      end
    end
Interchange must be safe
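The tiled nest can be written in C as below. This sketch stores the matrices as flat row-major arrays, so `c[i*n + j]` occupies the same memory position as the slide's column-major `c(j,i)`; the function names and the flat layout are assumptions of the example, and `tk` and `tj` must be positive.

```c
#include <assert.h>

static int min_int(int x, int y) { return x < y ? x : y; }

/* Untiled nest: c is m-by-n, a is m-by-n, b is n-by-n (row-major). */
void matmul_simple(double *c, const double *a, const double *b,
                   int m, int n) {
    for (int i = 0; i < m; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Tiled nest: the k and j loops are strip mined and their strip loops
   (kk, jj) are interchanged outward past the i loop, so one tk-by-tj
   tile of b is reused for every i before moving on. */
void matmul_tiled(double *c, const double *a, const double *b,
                  int m, int n, int tk, int tj) {
    assert(tk > 0 && tj > 0);
    for (int kk = 0; kk < n; kk += tk)
        for (int jj = 0; jj < n; jj += tj)
            for (int i = 0; i < m; i++)
                for (int k = kk; k < min_int(kk + tk, n); k++)
                    for (int j = jj; j < min_int(jj + tj, n); j++)
                        c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

For any fixed element of `c`, the tiled nest still adds its `k` terms in increasing `k` order, so the two versions accumulate in the same order.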
Rewriting Loops for Better Register Allocation
Scalar Replacement
• Allocators never keep c(i) in a register
• We can trick the allocator by rewriting the references
The plan
• Locate patterns of consistent reuse
• Make loads and stores use temporary scalar variable
• Replace references with temporary’s name
• May need copies at end of loop to keep reused values straight
  - If reuse spans more than one iteration, need to “pipeline” it
Rewriting Loops for Better Register Allocation
Scalar Replacement
Effects
• Decreases number of loads and stores
• Keeps reused values in names that can be allocated to registers
• In essence, this exposes the reuse of a(i) to subsequent passes
    do i = 1 to n
      do j = 1 to n
        a(i) = a(i) + b(j)
      end
    end
becomes (scalar replacement)
    do i = 1 to n
      t = a(i)
      do j = 1 to n
        t = t + b(j)
      end
      a(i) = t
    end
Almost any register allocator can get t into a register
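In C the same rewrite looks like this. The sketch assumes `a` and `b` do not overlap; a C compiler generally cannot prove that on its own, which is why it may not perform the rewrite unaided. Function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Original: a[i] is loaded and stored on every inner iteration. */
void accumulate(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i] = a[i] + b[j];
}

/* Scalar replaced: a[i] lives in the local t across the inner loop,
   leaving one load (b[j]) per inner iteration. */
void accumulate_scalar(double *a, const double *b, size_t n) {
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        double t = a[i];
        for (size_t j = 0; j < n; j++)
            t = t + b[j];
        a[i] = t;
    }
}
```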
Rewriting Loops for Better Register Allocation
What if we are not in Fortran? What about C?
Register Promotion (PLDI: Lu 1997, Sastry & Ju 1998, Chow et al 1998)
• Promote pointer-based references into scalar temporaries
• Requires data-flow information on pointers + a transformation
• Equivalent of scalar promotion for pointer-based values
Lu’s formulation
• Perform interprocedural analysis to disambiguate pointers
• Find loops & solve intraprocedural problem for each loop
• Rewrite code based on results of analysis
His work relied on ILOC’s memory tags
Rewriting Loops for Better Register Allocation
Register Promotion
• Every memory operation has a set of tags -- textual names that describe which memory locations it can address
  - | tag set | = 1 means the reference is unambiguous
  - | tag set | > 1 means the reference is ambiguous
  - Interprocedural pointer analysis shrinks tag sets …
Initial data-flow information
• B_Explicit(b) contains all tags referenced by an explicit memory operation in b
• B_Ambiguous(b) contains all tags referenced by a memory operation with multiple tags or by a procedure call in b
Algorithm computes B_Explicit and B_Ambiguous for each block
Rewriting Loops for Better Register Allocation
Register Promotion
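• Solve the equations
    $\mathit{L\_Explicit}(L) = \bigcup_{b \,\in\, \text{loop } L} \mathit{B\_Explicit}(b)$
    $\mathit{L\_Ambiguous}(L) = \bigcup_{b \,\in\, \text{loop } L} \mathit{B\_Ambiguous}(b)$
    $\mathit{L\_Promotable}(L) = \mathit{L\_Explicit}(L) - \mathit{L\_Ambiguous}(L)$
    $\mathit{L\_Lift}(L) = \begin{cases} \mathit{L\_Promotable}(L) & \text{if } L \text{ is an outermost loop} \\ \mathit{L\_Promotable}(L) - \mathit{L\_Promotable}(S) & \text{otherwise, where } S \text{ surrounds } L \end{cases}$
• For each promotable tag, t, in a loop:
• Create a virtual register, v
• Rewrite each reference to t with a copy from v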
Loop-by-loop approach motivated by fact that we really care about loops more than other parts of the program
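The set equations can be sketched with bit vectors. Everything in the block below is a hypothetical encoding chosen for illustration (one bit per tag in a 32-bit word, a `BlockInfo` record per basic block); it is not the data structure from Lu's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding: bit t set means tag t is in the set. */
typedef uint32_t TagSet;

typedef struct {
    TagSet explicit_tags;   /* B_Explicit(b)  */
    TagSet ambiguous_tags;  /* B_Ambiguous(b) */
} BlockInfo;

/* L_Promotable(L) = L_Explicit(L) - L_Ambiguous(L), where each L_ set
   is the union of the matching B_ sets over the blocks of loop L. */
TagSet loop_promotable(const BlockInfo *blocks, size_t nblocks) {
    assert(blocks != NULL || nblocks == 0);
    TagSet l_explicit = 0, l_ambiguous = 0;
    for (size_t i = 0; i < nblocks; i++) {
        l_explicit  |= blocks[i].explicit_tags;
        l_ambiguous |= blocks[i].ambiguous_tags;
    }
    return l_explicit & ~l_ambiguous;
}

/* L_Lift(L): tags promoted around L itself. Tags that the surrounding
   loop S already promotes are excluded. */
TagSet loop_lift(TagSet promotable, int is_outermost,
                 TagSet surrounding_promotable) {
    if (is_outermost)
        return promotable;
    return promotable & ~surrounding_promotable;
}
```

The sketch assumes the caller passes every block in the loop, including blocks of any loops nested inside it.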
Rewriting Loops for Better Register Allocation
Register Promotion - Example from the paper
With this simple scheme
• 0 to 16% of loads removed in test codes
• 0 to 50% of stores removed in test codes
• Other authors tried PRE-based extensions to Lu’s work
    for (i = 0; i < DIM_X; i++) {
      B[i] = 0;
      for (j = 0; j < DIM_Y; j++) {
        B[i] += A[i][j];
      }
    }
becomes
    for (i = 0; i < DIM_X; i++) {
      rb = 0;
      for (j = 0; j < DIM_Y; j++) {
        rb += A[i][j];
      }
      B[i] = rb;
    }
B has 1 tag
Balancing Memory Bound Loops
Balance is the ratio of memory accesses to flops
• Machine balance is $M = \frac{\text{accesses/cycle}}{\text{flops/cycle}}$
• Loop balance is $L = \frac{\text{accesses/iteration}}{\text{flops/iteration}}$
Loops run better if they are balanced or compute bound
• If L > M, the loop is memory bound
• If L = M, the loop is balanced
• If L < M, the loop is compute bound
Making memory bound loops into balanced or compute-bound loops
• Need more reuse to decrease number of accesses
• Combine scalar replacement, unrolling, and fusion
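The comparison of loop balance against machine balance is simple arithmetic; a minimal sketch follows. The type and function names are hypothetical, and any machine balance value used with it is an assumption about the target, not a figure from the slides.

```c
#include <assert.h>

typedef enum { COMPUTE_BOUND, BALANCED, MEMORY_BOUND } LoopKind;

/* Balance = memory accesses per flop (per cycle for a machine,
   per iteration for a loop). */
double balance(double accesses, double flops) {
    assert(flops > 0.0);
    return accesses / flops;
}

/* Classify a loop by comparing its balance L with machine balance M. */
LoopKind classify_loop(double loop_balance, double machine_balance) {
    if (loop_balance > machine_balance)
        return MEMORY_BOUND;
    if (loop_balance < machine_balance)
        return COMPUTE_BOUND;
    return BALANCED;
}
```

For example, on a hypothetical machine that sustains 1 access and 1 flop per cycle (M = 1), a loop with 3 accesses and 1 flop per iteration is memory bound, while a loop with 1 access and 2 flops per iteration is compute bound.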
Rewriting Loops for Better Register Allocation
Example of Loop Balance
Original loop nest
    do i = 1 to n
      do j = 1 to n
        a(i) = a(i) + b(j)
      end
    end
  $L = \frac{3\ \text{accesses/iteration}}{1\ \text{flop/iteration}}$
After scalar replacement
    do i = 1 to n
      t = a(i)
      do j = 1 to n
        t = t + b(j)
      end
      a(i) = t
    end
  $L = \frac{1\ \text{access/iteration}}{1\ \text{flop/iteration}}$
As memory accesses cost more, this gets better!
Balancing Memory Loops
Unroll and Jam
• Unroll the outer loop
• Fuse the resulting inner loops
Effect
• Increases reuse in the inner loop*
• Decreases overhead, too
• Combine with scalar replacement for full benefits
    do i = 1 to n
      do j = 1 to n
        a(i) = a(i) + b(j)
      end
    end
becomes (unroll & jam)
    do i = 1 to n by 2
      do j = 1 to n
        a(i) = a(i) + b(j)
        a(i+1) = a(i+1) + b(j)
      end
    end
* 1 access per 2 flops
Impact
• Original loop had $L = \frac{3\ \text{accesses/iteration}}{1\ \text{flop/iteration}}$
• Scalar replacement got L down to $\frac{1\ \text{access/iteration}}{1\ \text{flop/iteration}}$
• Unroll & jam + scalar replacement produced $L = \frac{1\ \text{access/iteration}}{2\ \text{flops/iteration}}$
Balancing Memory Loops
Adding Scalar Replacement
• Rewrites all three cases of reuse
    do i = 1 to n by 2
      do j = 1 to n
        a(i) = a(i) + b(j)
        a(i+1) = a(i+1) + b(j)
      end
    end
becomes (scalar replacement)
    do i = 1 to n by 2
      t1 = a(i)
      t2 = a(i+1)
      do j = 1 to n
        t3 = b(j)
        t1 = t1 + t3
        t2 = t2 + t3
      end
      a(i) = t1
      a(i+1) = t2
    end
  (the temporaries t1, t2, & t3 capture the reuse)
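The combined transformation can be sketched in C as follows, assuming `n` is even and that `a` and `b` do not overlap; the function names are illustrative. A production version would need a cleanup iteration for odd `n`.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline nest: 3 memory accesses per flop in the inner loop. */
void accumulate_base(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i] = a[i] + b[j];
}

/* Unroll & jam by 2 plus scalar replacement: the inner loop performs
   one load (b[j]) for two adds. Assumes n is even. */
void accumulate_uj2(double *a, const double *b, size_t n) {
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        double t1 = a[i];
        double t2 = a[i + 1];
        for (size_t j = 0; j < n; j++) {
            double t3 = b[j];   /* loaded once, used twice */
            t1 = t1 + t3;
            t2 = t2 + t3;
        }
        a[i] = t1;
        a[i + 1] = t2;
    }
}
```

Unrolling the outer loop further raises the flops per load but needs one more live temporary per unrolled iteration, which is where register pressure limits the unroll amount.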
Balancing Memory Loops
The Big Picture
• Scalar replacement + unroll & jam helps
• Factors of 2 to 6 for matrix multiply (matrix300)
What does it take to do this in a real compiler?
• Data dependence analysis (see COMP 515)
• Method to discover consistent reuse patterns (see Carr)
• Way to choose the unroll amount
  - Constrained by available registers
  - Need heuristics to predict allocator’s behavior