recursion unrolling for divide and conquer programs

Recursion Unrolling for Divide and Conquer Programs

Radu Rugina and Martin Rinard

Laboratory for Computer Science

Massachusetts Institute of Technology

What This Talk Is About

•Automatic generation of efficient large base cases for divide and conquer programs

Outline

1. Motivating Example
2. Computation Structure
3. Transformations
4. Related Work
5. Conclusion

1. Motivating Example

Divide and Conquer Matrix Multiply

• Divide matrices into sub-matrices: A0, A1, A2, etc.

• Use blocked matrix multiply equations

A x B = R:

[ A0 A1 ]   [ B0 B1 ]   [ A0B0+A1B2  A0B1+A1B3 ]
[ A2 A3 ] x [ B2 B3 ] = [ A2B0+A3B2  A2B1+A3B3 ]

• Recursively multiply sub-matrices

• Terminate recursion with a simple base case

[ a0 ] x [ b0 ] = [ a0b0 ]

void matmul(int *A, int *B, int *R, int n) {
  if (n == 1) {
    (*R) += (*A) * (*B);
  } else {
    matmul(A, B, R, n/4);
    matmul(A, B+(n/4), R+(n/4), n/4);
    matmul(A+2*(n/4), B, R+2*(n/4), n/4);
    matmul(A+2*(n/4), B+(n/4), R+3*(n/4), n/4);
    matmul(A+(n/4), B+2*(n/4), R, n/4);
    matmul(A+(n/4), B+3*(n/4), R+(n/4), n/4);
    matmul(A+3*(n/4), B+2*(n/4), R+2*(n/4), n/4);
    matmul(A+3*(n/4), B+3*(n/4), R+3*(n/4), n/4);
  }
}

• Implements R += A B

• Divides matrices into sub-matrices and recursively multiplies the sub-matrices

• Identifies sub-matrices with pointers

• Uses a simple algorithm for the base case
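As a hedged illustration, the sketch below runs the procedure on 2 x 2 matrices. It assumes that n counts matrix elements, that n is a power of 4, and that each matrix is stored quadrant by quadrant (for 2 x 2 this coincides with row-major order). The helper matmul_2x2_demo and the local q are hypothetical names introduced only for this example.

```c
#include <assert.h>

/* R += A * B for matrices of n elements stored quadrant by quadrant.
   Assumption for this sketch: n is a power of 4. */
void matmul(int *A, int *B, int *R, int n) {
  assert(n >= 1);
  if (n == 1) {
    (*R) += (*A) * (*B);
  } else {
    int q = n / 4;  /* number of elements in one quadrant */
    matmul(A,       B,       R,       q);
    matmul(A,       B + q,   R + q,   q);
    matmul(A + 2*q, B,       R + 2*q, q);
    matmul(A + 2*q, B + q,   R + 3*q, q);
    matmul(A + q,   B + 2*q, R,       q);
    matmul(A + q,   B + 3*q, R + q,   q);
    matmul(A + 3*q, B + 2*q, R + 2*q, q);
    matmul(A + 3*q, B + 3*q, R + 3*q, q);
  }
}

/* Hypothetical helper: computes [1 2; 3 4] * [5 6; 7 8] into out[4]. */
void matmul_2x2_demo(int out[4]) {
  int A[4] = {1, 2, 3, 4};
  int B[4] = {5, 6, 7, 8};
  for (int i = 0; i < 4; i++) out[i] = 0;
  matmul(A, B, out, 4);
}
```

Even for this tiny input, eight recursive calls are made to perform eight scalar multiply-adds, which is the overhead the rest of the talk addresses.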

• Advantage of small base case: simplicity

• Code is easy to:
  • Write
  • Maintain
  • Debug
  • Understand

• Disadvantage: inefficiency

• Large control flow overhead:
  • Most of the time is spent dividing the matrix into sub-matrices

Hand Coded Implementation

void serialmul(block *As, block *Bs, block *Rs)
{
  int i, j;
  DOUBLE *A = (DOUBLE *) As;
  DOUBLE *B = (DOUBLE *) Bs;
  DOUBLE *R = (DOUBLE *) Rs;

  for (j = 0; j < 16; j += 2) {
    DOUBLE *bp = &B[j];
    for (i = 0; i < 16; i += 2) {
      DOUBLE *ap = &A[i * 16];
      DOUBLE *rp = &R[j + i * 16];
      register DOUBLE s0_0 = rp[0], s0_1 = rp[1];
      register DOUBLE s1_0 = rp[16], s1_1 = rp[17];

      s0_0 += ap[0] * bp[0];     s0_1 += ap[0] * bp[1];
      s1_0 += ap[16] * bp[0];    s1_1 += ap[16] * bp[1];
      s0_0 += ap[1] * bp[16];    s0_1 += ap[1] * bp[17];
      s1_0 += ap[17] * bp[16];   s1_1 += ap[17] * bp[17];
      s0_0 += ap[2] * bp[32];    s0_1 += ap[2] * bp[33];
      s1_0 += ap[18] * bp[32];   s1_1 += ap[18] * bp[33];
      s0_0 += ap[3] * bp[48];    s0_1 += ap[3] * bp[49];
      s1_0 += ap[19] * bp[48];   s1_1 += ap[19] * bp[49];
      s0_0 += ap[4] * bp[64];    s0_1 += ap[4] * bp[65];
      s1_0 += ap[20] * bp[64];   s1_1 += ap[20] * bp[65];
      s0_0 += ap[5] * bp[80];    s0_1 += ap[5] * bp[81];
      s1_0 += ap[21] * bp[80];   s1_1 += ap[21] * bp[81];
      s0_0 += ap[6] * bp[96];    s0_1 += ap[6] * bp[97];
      s1_0 += ap[22] * bp[96];   s1_1 += ap[22] * bp[97];
      s0_0 += ap[7] * bp[112];   s0_1 += ap[7] * bp[113];
      s1_0 += ap[23] * bp[112];  s1_1 += ap[23] * bp[113];
      s0_0 += ap[8] * bp[128];   s0_1 += ap[8] * bp[129];
      s1_0 += ap[24] * bp[128];  s1_1 += ap[24] * bp[129];
      s0_0 += ap[9] * bp[144];   s0_1 += ap[9] * bp[145];
      s1_0 += ap[25] * bp[144];  s1_1 += ap[25] * bp[145];
      s0_0 += ap[10] * bp[160];  s0_1 += ap[10] * bp[161];
      s1_0 += ap[26] * bp[160];  s1_1 += ap[26] * bp[161];
      s0_0 += ap[11] * bp[176];  s0_1 += ap[11] * bp[177];
      s1_0 += ap[27] * bp[176];  s1_1 += ap[27] * bp[177];
      s0_0 += ap[12] * bp[192];  s0_1 += ap[12] * bp[193];
      s1_0 += ap[28] * bp[192];  s1_1 += ap[28] * bp[193];
      s0_0 += ap[13] * bp[208];  s0_1 += ap[13] * bp[209];
      s1_0 += ap[29] * bp[208];  s1_1 += ap[29] * bp[209];
      s0_0 += ap[14] * bp[224];  s0_1 += ap[14] * bp[225];
      s1_0 += ap[30] * bp[224];  s1_1 += ap[30] * bp[225];
      s0_0 += ap[15] * bp[240];  s0_1 += ap[15] * bp[241];
      s1_0 += ap[31] * bp[240];  s1_1 += ap[31] * bp[241];

      rp[0] = s0_0;
      rp[1] = s0_1;
      rp[16] = s1_0;
      rp[17] = s1_1;
    }
  }
}

cilk void matrixmul(long nb, block *A, block *B, block *R)
{
  if (nb == 1) {
    serialmul(A, B, R);
  } else if (nb >= 4) {
    spawn matrixmul(nb/4, A, B, R);
    spawn matrixmul(nb/4, A, B+(nb/4), R+(nb/4));
    spawn matrixmul(nb/4, A+2*(nb/4), B+(nb/4), R+3*(nb/4));
    spawn matrixmul(nb/4, A+2*(nb/4), B, R+2*(nb/4));
    sync;
    spawn matrixmul(nb/4, A+(nb/4), B+2*(nb/4), R);
    spawn matrixmul(nb/4, A+(nb/4), B+3*(nb/4), R+(nb/4));
    spawn matrixmul(nb/4, A+3*(nb/4), B+2*(nb/4), R+2*(nb/4));
    spawn matrixmul(nb/4, A+3*(nb/4), B+3*(nb/4), R+3*(nb/4));
    sync;
  }
}

• The programmer writes simple code with small base cases

• The compiler automatically generates efficient code with large base cases

2. Computation Structure

Running Example – Array Increment

void f(char *p, int n) {
  if (n == 1) {
    /* base case: increment one element */
    (*p) += 1;
  } else {
    f(p, n/2);      /* increment first half */
    f(p+n/2, n/2);  /* increment second half */
  }
}

Dynamic Call Tree for n=4

Execution of f(p,4). Each node is an activation frame on the stack and lists the instructions executed in that frame:

Test n=1, Call f, Call f
  Test n=1, Call f, Call f
    Test n=1, Inc *p
    Test n=1, Inc *p
  Test n=1, Call f, Call f
    Test n=1, Inc *p
    Test n=1, Inc *p

Control Flow Overhead

Execution of f(p,4):

• Control flow overhead = call overhead (Call f) + test overhead (Test n=1)

• Computation: Inc *p
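One way to make the overhead concrete is to instrument the running example. The sketch below is an illustration only: the counters n_tests, n_calls and n_incs and the name f_counted are hypothetical additions, and the input size is assumed to be a power of two.

```c
#include <assert.h>

/* Hypothetical counters, not part of the original procedure. */
long n_tests = 0;  /* evaluations of the test n == 1 */
long n_calls = 0;  /* recursive calls issued         */
long n_incs  = 0;  /* useful work: increments of *p  */

/* The running example f, instrumented. Assumes n is a power of two. */
void f_counted(char *p, int n) {
  assert(n >= 1);
  n_tests++;
  if (n == 1) {
    n_incs++;
    (*p) += 1;
  } else {
    n_calls += 2;
    f_counted(p, n/2);        /* increment first half  */
    f_counted(p + n/2, n/2);  /* increment second half */
  }
}
```

For n=4 this counts 7 tests and 6 recursive calls for only 4 increments, matching the call tree: three internal frames and four leaf frames.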

Large Base Cases = Reduced Overhead

Execution of f(p,4) with a base case for n=2:

Test n=2, Call f, Call f
  Test n=2, Inc *p, Inc *(p+1)
  Test n=2, Inc *p, Inc *(p+1)
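The same kind of instrumentation can be applied to a version with a base case for n=2. This is a hand-written sketch, not compiler output; f_base2 and the counters are hypothetical names, and n is assumed to be a power of two that is at least 2.

```c
#include <assert.h>

/* Hypothetical counters for this illustration. */
long n_tests = 0;  /* evaluations of the test n == 2 */
long n_calls = 0;  /* recursive calls issued         */
long n_incs  = 0;  /* increments performed           */

/* Array increment with a larger base case.
   Assumes n is a power of two and n >= 2. */
void f_base2(char *p, int n) {
  assert(n >= 2);
  n_tests++;
  if (n == 2) {
    n_incs += 2;
    *p += 1;
    *(p+1) += 1;
  } else {
    n_calls += 2;
    f_base2(p, n/2);
    f_base2(p + n/2, n/2);
  }
}
```

For n=4 the counts drop to 3 tests and 2 recursive calls for the same 4 increments.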

3. Transformations

Transformation 1: Recursion Inlining

void f(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f(p, n/2);
    f(p+n/2, n/2);
  }
}

Start with the original recursive procedure

void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}

void f2(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}

Make two copies of the original procedure

void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}

void f2(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}

Transform direct recursion to mutual recursion

Inline procedure f2 at call sites in f1

void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    if (n/2 == 1) {
      *p += 1;
    } else {
      f1(p, n/2/2);
      f1(p+n/2/2, n/2/2);
    }
    if (n/2 == 1) {
      *(p+n/2) += 1;
    } else {
      f1(p+n/2, n/2/2);
      f1(p+n/2+n/2/2, n/2/2);
    }
  }
}

• Reduced procedure call overhead

• More code exposed at the intra-procedural level

• Opportunities to simplify control flow in the inlined code:
  • identical condition expressions

Transformation 2: Conditional Fusion

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f1(p, n/2/2);
    f1(p+n/2/2, n/2/2);
    f1(p+n/2, n/2/2);
    f1(p+n/2+n/2/2, n/2/2);
  }
}

Merge if statements with identical conditions

• Reduced branching overhead and bigger basic blocks

• Larger base case for n/2 = 1

Unrolling Iterations

Repeatedly apply inlining and conditional fusion

Second Unrolling Iteration

void f2(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}

As in the first iteration, the recursive calls in f2 are then redirected to f1:

void f2(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}

Result of Second Unrolling Iteration

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2/2/2);
    f1(p+n/2/2/2, n/2/2/2);
    f1(p+n/2/2, n/2/2/2);
    f1(p+n/2/2+n/2/2/2, n/2/2/2);
    f1(p+n/2, n/2/2/2);
    f1(p+n/2+n/2/2/2, n/2/2/2);
    f1(p+n/2+n/2/2, n/2/2/2);
    f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
  }
}

Unrolling Iterations

• The unrolling process stops when the number of iterations reaches the desired unrolling factor

• The unrolled recursive procedure:
  • Has base cases for larger problem sizes
  • Divides the given problem into more sub-problems of smaller sizes

• In our example:
  • Base cases for n=1, n=2, and n=4
  • Problems are divided into 8 problems of 1/8 size

Speedup for Matrix Multiply: Matrix of 512 x 512 elements

[Figure: speedup versus unrolling factor (1, 2) for inline and inline+fusion]

Efficiency of Unrolled Recursive Part

• Because the recursive part is also unrolled, recursion may not exercise the large base cases

• Which base case is executed depends on the size of the input problem

• In our example:
  • For a problem of size n=8, the base case for n=1 is executed
  • For a problem of size n=16, the base case for n=2 is executed
  • The efficient base case for n=4 is not executed in these cases
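The dependence on the input size can be checked with a small experiment. The sketch below instruments the twice-unrolled procedure from the example; the array base_hits, the local e, and the driver run_unrolled are hypothetical names added for illustration, and inputs are assumed to be powers of two no larger than 64.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical instrumentation: base_hits[0], [1], [2] count executions
   of the base cases for n=1, n=2 and n=4 respectively. */
long base_hits[3];

/* Unrolled twice (inlining + conditional fusion), not re-rolled. */
void f1_unrolled(char *p, int n) {
  assert(n >= 1);
  if (n == 1) {
    base_hits[0]++;
    *p += 1;
  } else if (n/2 == 1) {
    base_hits[1]++;
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    base_hits[2]++;
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    int e = n/2/2/2;  /* size of each of the 8 sub-problems */
    f1_unrolled(p, e);
    f1_unrolled(p+e, e);
    f1_unrolled(p+n/2/2, e);
    f1_unrolled(p+n/2/2+e, e);
    f1_unrolled(p+n/2, e);
    f1_unrolled(p+n/2+e, e);
    f1_unrolled(p+n/2+n/2/2, e);
    f1_unrolled(p+n/2+n/2/2+e, e);
  }
}

/* Hypothetical driver: resets the counters and runs f1_unrolled on a
   zeroed array of n chars (1 <= n <= 64). */
void run_unrolled(int n) {
  char buf[64];
  assert(n >= 1 && n <= 64);
  memset(buf, 0, sizeof buf);
  memset(base_hits, 0, sizeof base_hits);
  f1_unrolled(buf, n);
}
```

Under these assumptions, n=8 reaches only the n=1 base case (8 times) and n=16 only the n=2 base case (8 times); the n=4 base case is reached for n=32.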

Solution: Recursion Re-Rolling

• Roll back the recursive part of the unrolled procedure after the large base cases are generated

• Re-Rolling ensures that larger base cases are always executed, independent of the input problem size

• The compiler unrolls the recursive part only temporarily, to generate the base cases

Transformation 3: Recursion Re-Rolling

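Identify the recursive part (the final else branch of the unrolled procedure)

Replace with the recursive part of the original procedure

else {
  f1(p, n/2);
  f1(p+n/2, n/2);
}

Final Result

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}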
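To see the effect of re-rolling, the re-rolled procedure can be instrumented in the same spirit. This is a sketch under stated assumptions: base_hits, f1_rerolled and run_rerolled are hypothetical names, and inputs are assumed to be powers of two between 1 and 64.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical instrumentation: base_hits[0], [1], [2] count executions
   of the base cases for n=1, n=2 and n=4 respectively. */
long base_hits[3];

/* Large base cases from two unrolling iterations, with the recursive
   part rolled back to the original two-way split. */
void f1_rerolled(char *p, int n) {
  assert(n >= 1);
  if (n == 1) {
    base_hits[0]++;
    *p += 1;
  } else if (n/2 == 1) {
    base_hits[1]++;
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    base_hits[2]++;
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1_rerolled(p, n/2);
    f1_rerolled(p+n/2, n/2);
  }
}

/* Hypothetical driver: resets the counters and runs f1_rerolled on a
   zeroed array of n chars (1 <= n <= 64). */
void run_rerolled(int n) {
  char buf[64];
  assert(n >= 1 && n <= 64);
  memset(buf, 0, sizeof buf);
  memset(base_hits, 0, sizeof base_hits);
  f1_rerolled(buf, n);
}
```

With the two-way split restored, every power-of-two input of size 4 or more bottoms out in the n=4 base case: n=8 reaches it twice and n=16 four times, and the smaller base cases are not used.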

[Figure: unrolling factor (1, 2, 3) on the horizontal axis; series: inline, inline+fusion, inline+fusion+reroll]

Other Optimizations

• Inlining moves code from the inter-procedural level to the intra-procedural level

• Conditional fusion brings code from the inter-basic-block level to the intra-basic-block level

• Together, inlining and conditional fusion give subsequent compiler passes the opportunity to perform more aggressive optimizations

Comparison to Hand Coded Programs

• Two applications: Matrix multiply, LU decomposition

• Three machines: Pentium III, Origin 2000, PowerPC

• Two different problem sizes

• Compare automatically unrolled programs to optimized, hand coded versions from the Cilk benchmarks

• Best automatically unrolled version performs:
  • Between 2.2 and 2.9 times worse for matrix multiply
  • As good as hand coded version for LU

Related Work

• Procedure Inlining:
  • Scheifler (1977)
  • Richardson, Ganapathi (1989)
  • Chambers, Ungar (1989)
  • Cooper, Hall, Torczon (1991)
  • Appel (1992)
  • Chang, Mahlke, Chen, Hwu (1992)

Conclusion

• Recursion Unrolling
  • analogous to the loop unrolling transformation

• Divide and Conquer Programs
  • The programmer writes simple base cases
  • The compiler automatically generates large base cases

• Key Techniques
  • Inlining: conceptually inline recursive calls
  • Conditional Fusion: simplify intra-procedural control flow
  • Re-Rolling: ensure that large base cases are executed

Comparison to Hand Coded Programs

• Matrix multiply for 512 x 512 elements:
  • Best automatically unrolled program: 2.55 sec.
  • Hand coded with three nested loops: 3.46 sec.
  • Hand coded Cilk program: 1.16 sec.

• Matrix multiply for 1024 x 1024 elements:
  • Best automatically unrolled program: 20.47 sec.
  • Hand coded with three nested loops: 27.40 sec.
  • Hand coded Cilk program: 9.19 sec.

Correctness

• Recursion unrolling preserves the semantics of the program:

• The unrolled program terminates if and only if the original recursive program terminates

• When both the original and the unrolled program terminate, they yield the same result
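For the running example, the claim can be spot-checked empirically. The sketch below compares the original f with the once-unrolled f1 (inlining plus conditional fusion) on every size from 1 to 64; it is a test harness written for illustration, not a proof, and same_result is a hypothetical helper.

```c
#include <assert.h>
#include <string.h>

/* Original recursive procedure. */
void f(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f(p, n/2);
    f(p+n/2, n/2);
  }
}

/* After one unrolling iteration (inlining + conditional fusion). */
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f1(p, n/2/2);
    f1(p+n/2/2, n/2/2);
    f1(p+n/2, n/2/2);
    f1(p+n/2+n/2/2, n/2/2);
  }
}

/* Hypothetical check: returns 1 if f and f1 leave identical arrays
   when started on zeroed arrays with size n (1 <= n <= 64). */
int same_result(int n) {
  char a[64], b[64];
  assert(n >= 1 && n <= 64);
  memset(a, 0, sizeof a);
  memset(b, 0, sizeof b);
  f(a, n);
  f1(b, n);
  return memcmp(a, b, sizeof a) == 0;
}
```

Sizes that are not powers of two are included on purpose: integer division makes both versions skip the same trailing elements, so the results still agree.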

Speedup for Matrix Multiply: Pentium III, Matrix of 512 x 512 elements

Speedup for Matrix Multiply: Pentium III, Matrix of 1024 x 1024 elements

Speedup for Matrix Multiply: Power PC, Matrix of 512 x 512 elements

Speedup for Matrix Multiply: Power PC, Matrix of 1024 x 1024 elements

Speedup for Matrix Multiply: Origin 2000, Matrix of 512 x 512 elements

Speedup for Matrix Multiply: Origin 2000, Matrix of 1024 x 1024 elements

Speedup for LU: Pentium III, Matrix of 512 x 512 elements

Speedup for LU: Pentium III, Matrix of 1024 x 1024 elements

Speedup for LU: Power PC, Matrix of 512 x 512 elements

Speedup for LU: Power PC, Matrix of 1024 x 1024 elements

Speedup for LU: Origin 2000, Matrix of 1024 x 1024 elements

Speedup for LU: Origin 2000, Matrix of 512 x 512 elements
