Transcript
Page 1: Recursion Unrolling  for Divide and Conquer Programs

Recursion Unrolling for Divide and Conquer Programs

Radu Rugina and Martin Rinard

Laboratory for Computer Science

Massachusetts Institute of Technology

Page 2: Recursion Unrolling  for Divide and Conquer Programs

What This Talk Is About

•Automatic generation of efficient large base cases for divide and conquer programs

Page 3: Recursion Unrolling  for Divide and Conquer Programs

Outline

1. Motivating Example
2. Computation Structure
3. Transformations
4. Related Work
5. Conclusion

Page 4: Recursion Unrolling  for Divide and Conquer Programs

1. Motivating Example

Page 5: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

• Divide matrices into sub-matrices: A0, A1, A2, etc.

• Use blocked matrix multiply equations

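\[
\begin{pmatrix} A_0 & A_1 \\ A_2 & A_3 \end{pmatrix}
\times
\begin{pmatrix} B_0 & B_1 \\ B_2 & B_3 \end{pmatrix}
=
\begin{pmatrix} A_0 B_0 + A_1 B_2 & A_0 B_1 + A_1 B_3 \\ A_2 B_0 + A_3 B_2 & A_2 B_1 + A_3 B_3 \end{pmatrix}
\]

A × B = R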

Page 6: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

• Recursively multiply sub-matrices

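\[
\begin{pmatrix} A_0 & A_1 \\ A_2 & A_3 \end{pmatrix}
\times
\begin{pmatrix} B_0 & B_1 \\ B_2 & B_3 \end{pmatrix}
=
\begin{pmatrix} A_0 B_0 + A_1 B_2 & A_0 B_1 + A_1 B_3 \\ A_2 B_0 + A_3 B_2 & A_2 B_1 + A_3 B_3 \end{pmatrix}
\]

A × B = R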

Page 7: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

• Terminate recursion with a simple base case

A × B = R

[Figure: base case on single elements: a0 × b0 = a0b0]

Page 8: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

void matmul(int *A, int *B, int *R, int n) {
  if (n == 1) {
    (*R) += (*A) * (*B);
  } else {
    matmul(A, B, R, n/4);
    matmul(A, B+(n/4), R+(n/4), n/4);
    matmul(A+2*(n/4), B, R+2*(n/4), n/4);
    matmul(A+2*(n/4), B+(n/4), R+3*(n/4), n/4);
    matmul(A+(n/4), B+2*(n/4), R, n/4);
    matmul(A+(n/4), B+3*(n/4), R+(n/4), n/4);
    matmul(A+3*(n/4), B+2*(n/4), R+2*(n/4), n/4);
    matmul(A+3*(n/4), B+3*(n/4), R+3*(n/4), n/4);
  }
}

Implements R += A B
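The procedure above can be tried out with the following minimal sketch. It assumes, based only on the pointer arithmetic in the slide, that n is the total number of elements (a power of 4) and that each matrix is stored quadrant by quadrant, so quadrant k starts at offset k*(n/4). The local name q and the const qualifiers are additions for illustration.

```c
#include <assert.h>

/* R += A * B for square matrices stored quadrant by quadrant.
 * Assumption (inferred from the slide, not stated there): n is the
 * total element count, a power of 4, and quadrant k of each matrix
 * starts at offset k*(n/4). */
void matmul(const int *A, const int *B, int *R, int n) {
    assert(n >= 1);
    if (n == 1) {
        *R += (*A) * (*B);                      /* 1x1 base case */
    } else {
        int q = n / 4;                          /* elements per quadrant */
        matmul(A,       B,       R,       q);   /* R0 += A0 * B0 */
        matmul(A,       B + q,   R + q,   q);   /* R1 += A0 * B1 */
        matmul(A + 2*q, B,       R + 2*q, q);   /* R2 += A2 * B0 */
        matmul(A + 2*q, B + q,   R + 3*q, q);   /* R3 += A2 * B1 */
        matmul(A + q,   B + 2*q, R,       q);   /* R0 += A1 * B2 */
        matmul(A + q,   B + 3*q, R + q,   q);   /* R1 += A1 * B3 */
        matmul(A + 3*q, B + 2*q, R + 2*q, q);   /* R2 += A3 * B2 */
        matmul(A + 3*q, B + 3*q, R + 3*q, q);   /* R3 += A3 * B3 */
    }
}
```

For the 2 x 2 inputs A = {1, 2, 3, 4} and B = {5, 6, 7, 8} with R initially zero, calling matmul(A, B, R, 4) leaves R = {19, 22, 43, 50}.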

Page 9: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

Divide matrices into sub-matrices and recursively multiply sub-matrices

void matmul(int *A, int *B, int *R, int n) {
  if (n == 1) {
    (*R) += (*A) * (*B);
  } else {
    matmul(A, B, R, n/4);
    matmul(A, B+(n/4), R+(n/4), n/4);
    matmul(A+2*(n/4), B, R+2*(n/4), n/4);
    matmul(A+2*(n/4), B+(n/4), R+3*(n/4), n/4);
    matmul(A+(n/4), B+2*(n/4), R, n/4);
    matmul(A+(n/4), B+3*(n/4), R+(n/4), n/4);
    matmul(A+3*(n/4), B+2*(n/4), R+2*(n/4), n/4);
    matmul(A+3*(n/4), B+3*(n/4), R+3*(n/4), n/4);
  }
}

Page 10: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

Identify sub-matrices with pointers

void matmul(int *A, int *B, int *R, int n) {
  if (n == 1) {
    (*R) += (*A) * (*B);
  } else {
    matmul(A, B, R, n/4);
    matmul(A, B+(n/4), R+(n/4), n/4);
    matmul(A+2*(n/4), B, R+2*(n/4), n/4);
    matmul(A+2*(n/4), B+(n/4), R+3*(n/4), n/4);
    matmul(A+(n/4), B+2*(n/4), R, n/4);
    matmul(A+(n/4), B+3*(n/4), R+(n/4), n/4);
    matmul(A+3*(n/4), B+2*(n/4), R+2*(n/4), n/4);
    matmul(A+3*(n/4), B+3*(n/4), R+3*(n/4), n/4);
  }
}

Page 11: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

Use a simple algorithm for the base case

void matmul(int *A, int *B, int *R, int n) {
  if (n == 1) {
    (*R) += (*A) * (*B);
  } else {
    matmul(A, B, R, n/4);
    matmul(A, B+(n/4), R+(n/4), n/4);
    matmul(A+2*(n/4), B, R+2*(n/4), n/4);
    matmul(A+2*(n/4), B+(n/4), R+3*(n/4), n/4);
    matmul(A+(n/4), B+2*(n/4), R, n/4);
    matmul(A+(n/4), B+3*(n/4), R+(n/4), n/4);
    matmul(A+3*(n/4), B+2*(n/4), R+2*(n/4), n/4);
    matmul(A+3*(n/4), B+3*(n/4), R+3*(n/4), n/4);
  }
}

Page 12: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

• Advantage of small base case: simplicity

• Code is easy to:
  • Write
  • Maintain
  • Debug
  • Understand

void matmul(int *A, int *B, int *R, int n) {
  if (n == 1) {
    (*R) += (*A) * (*B);
  } else {
    matmul(A, B, R, n/4);
    matmul(A, B+(n/4), R+(n/4), n/4);
    matmul(A+2*(n/4), B, R+2*(n/4), n/4);
    matmul(A+2*(n/4), B+(n/4), R+3*(n/4), n/4);
    matmul(A+(n/4), B+2*(n/4), R, n/4);
    matmul(A+(n/4), B+3*(n/4), R+(n/4), n/4);
    matmul(A+3*(n/4), B+2*(n/4), R+2*(n/4), n/4);
    matmul(A+3*(n/4), B+3*(n/4), R+3*(n/4), n/4);
  }
}

Page 13: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

• Disadvantage: inefficiency

• Large control flow overhead:

  • Most of the time is spent dividing the matrices into sub-matrices

void matmul(int *A, int *B, int *R, int n) {
  if (n == 1) {
    (*R) += (*A) * (*B);
  } else {
    matmul(A, B, R, n/4);
    matmul(A, B+(n/4), R+(n/4), n/4);
    matmul(A+2*(n/4), B, R+2*(n/4), n/4);
    matmul(A+2*(n/4), B+(n/4), R+3*(n/4), n/4);
    matmul(A+(n/4), B+2*(n/4), R, n/4);
    matmul(A+(n/4), B+3*(n/4), R+(n/4), n/4);
    matmul(A+3*(n/4), B+2*(n/4), R+2*(n/4), n/4);
    matmul(A+3*(n/4), B+3*(n/4), R+3*(n/4), n/4);
  }
}

Page 14: Recursion Unrolling  for Divide and Conquer Programs

Hand Coded Implementation

void serialmul(block *As, block *Bs, block *Rs)
{
  int i, j;
  DOUBLE *A = (DOUBLE *) As;
  DOUBLE *B = (DOUBLE *) Bs;
  DOUBLE *R = (DOUBLE *) Rs;
  for (j = 0; j < 16; j += 2) {
    DOUBLE *bp = &B[j];
    for (i = 0; i < 16; i += 2) {
      DOUBLE *ap = &A[i * 16];
      DOUBLE *rp = &R[j + i * 16];
      register DOUBLE s0_0 = rp[0], s0_1 = rp[1];
      register DOUBLE s1_0 = rp[16], s1_1 = rp[17];
      s0_0 += ap[0] * bp[0]; s0_1 += ap[0] * bp[1]; s1_0 += ap[16] * bp[0]; s1_1 += ap[16] * bp[1];
      s0_0 += ap[1] * bp[16]; s0_1 += ap[1] * bp[17]; s1_0 += ap[17] * bp[16]; s1_1 += ap[17] * bp[17];
      s0_0 += ap[2] * bp[32]; s0_1 += ap[2] * bp[33]; s1_0 += ap[18] * bp[32]; s1_1 += ap[18] * bp[33];
      s0_0 += ap[3] * bp[48]; s0_1 += ap[3] * bp[49]; s1_0 += ap[19] * bp[48]; s1_1 += ap[19] * bp[49];
      s0_0 += ap[4] * bp[64]; s0_1 += ap[4] * bp[65]; s1_0 += ap[20] * bp[64]; s1_1 += ap[20] * bp[65];
      s0_0 += ap[5] * bp[80]; s0_1 += ap[5] * bp[81]; s1_0 += ap[21] * bp[80]; s1_1 += ap[21] * bp[81];
      s0_0 += ap[6] * bp[96]; s0_1 += ap[6] * bp[97]; s1_0 += ap[22] * bp[96]; s1_1 += ap[22] * bp[97];
      s0_0 += ap[7] * bp[112]; s0_1 += ap[7] * bp[113]; s1_0 += ap[23] * bp[112]; s1_1 += ap[23] * bp[113];
      s0_0 += ap[8] * bp[128]; s0_1 += ap[8] * bp[129]; s1_0 += ap[24] * bp[128]; s1_1 += ap[24] * bp[129];
      s0_0 += ap[9] * bp[144]; s0_1 += ap[9] * bp[145]; s1_0 += ap[25] * bp[144]; s1_1 += ap[25] * bp[145];
      s0_0 += ap[10] * bp[160]; s0_1 += ap[10] * bp[161]; s1_0 += ap[26] * bp[160]; s1_1 += ap[26] * bp[161];
      s0_0 += ap[11] * bp[176]; s0_1 += ap[11] * bp[177]; s1_0 += ap[27] * bp[176]; s1_1 += ap[27] * bp[177];
      s0_0 += ap[12] * bp[192]; s0_1 += ap[12] * bp[193]; s1_0 += ap[28] * bp[192]; s1_1 += ap[28] * bp[193];
      s0_0 += ap[13] * bp[208]; s0_1 += ap[13] * bp[209]; s1_0 += ap[29] * bp[208]; s1_1 += ap[29] * bp[209];
      s0_0 += ap[14] * bp[224]; s0_1 += ap[14] * bp[225]; s1_0 += ap[30] * bp[224]; s1_1 += ap[30] * bp[225];
      s0_0 += ap[15] * bp[240]; s0_1 += ap[15] * bp[241]; s1_0 += ap[31] * bp[240]; s1_1 += ap[31] * bp[241];
      rp[0] = s0_0; rp[1] = s0_1; rp[16] = s1_0; rp[17] = s1_1;
    }
  }
}

cilk void matrixmul(long nb, block *A, block *B, block *R)
{
  if (nb == 1) {
    flops = serialmul(A, B, R);
  } else if (nb >= 4) {
    spawn matrixmul(nb/4, A, B, R);
    spawn matrixmul(nb/4, A, B+(nb/4), R+(nb/4));
    spawn matrixmul(nb/4, A+2*(nb/4), B+(nb/4), R+2*(nb/4));
    spawn matrixmul(nb/4, A+2*(nb/4), B, R+3*(nb/4));
    sync;
    spawn matrixmul(nb/4, A+(nb/4), B+2*(nb/4), R);
    spawn matrixmul(nb/4, A+(nb/4), B+3*(nb/4), R+(nb/4));
    spawn matrixmul(nb/4, A+3*(nb/4), B+3*(nb/4), R+2*(nb/4));
    spawn matrixmul(nb/4, A+3*(nb/4), B+3*(nb/4), R+3*(nb/4));
    sync;
  }
}

Page 15: Recursion Unrolling  for Divide and Conquer Programs

Goal

• The programmer writes simple code with small base cases

• The compiler automatically generates efficient code with large base cases

Page 16: Recursion Unrolling  for Divide and Conquer Programs

2. Computation Structure

Page 17: Recursion Unrolling  for Divide and Conquer Programs

Running Example – Array Increment

void f(char *p, int n) {
  if (n == 1) {
    /* base case: increment one element */
    (*p) += 1;
  } else {
    f(p, n/2);      /* increment first half */
    f(p+n/2, n/2);  /* increment second half */
  }
}

Page 18: Recursion Unrolling  for Divide and Conquer Programs

Dynamic Call Tree for n=4

Execution of f(p,4)

Page 19: Recursion Unrolling  for Divide and Conquer Programs

Dynamic Call Tree for n=4

Execution of f(p,4)

[Figure: root node of the call tree]
n=4:  [Test n=1 | Call f | Call f]

Page 20: Recursion Unrolling  for Divide and Conquer Programs

Dynamic Call Tree for n=4

Execution of f(p,4)

[Figure: root node of the call tree, labeled "Activation Frame on the Stack"]
n=4:  [Test n=1 | Call f | Call f]

Page 21: Recursion Unrolling  for Divide and Conquer Programs

Dynamic Call Tree for n=4

Execution of f(p,4)

[Figure: root node of the call tree, contents labeled "Executed Instructions"]
n=4:  [Test n=1 | Call f | Call f]

Page 22: Recursion Unrolling  for Divide and Conquer Programs

Dynamic Call Tree for n=4

Execution of f(p,4)

[Figure: root node of the call tree]
n=4:  [Test n=1 | Call f | Call f]

Page 23: Recursion Unrolling  for Divide and Conquer Programs

Dynamic Call Tree for n=4

Execution of f(p,4)

[Figure: first two levels of the call tree]
n=4:  [Test n=1 | Call f | Call f]
n=2:  [Test n=1 | Call f | Call f]  [Test n=1 | Call f | Call f]

Page 24: Recursion Unrolling  for Divide and Conquer Programs

Dynamic Call Tree for n=4

Execution of f(p,4)

[Figure: complete call tree]
n=4:  [Test n=1 | Call f | Call f]
n=2:  [Test n=1 | Call f | Call f]  [Test n=1 | Call f | Call f]
n=1:  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]

Page 25: Recursion Unrolling  for Divide and Conquer Programs

Control Flow Overhead

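Execution of f(p,4)

[Figure: complete call tree, with the "Call f" entries highlighted as call overhead]
n=4:  [Test n=1 | Call f | Call f]
n=2:  [Test n=1 | Call f | Call f]  [Test n=1 | Call f | Call f]
n=1:  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]

Call overhead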

Page 26: Recursion Unrolling  for Divide and Conquer Programs

Control Flow Overhead

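Execution of f(p,4)

[Figure: complete call tree, with the "Call f" and "Test n=1" entries highlighted as overhead]
n=4:  [Test n=1 | Call f | Call f]
n=2:  [Test n=1 | Call f | Call f]  [Test n=1 | Call f | Call f]
n=1:  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]

Call overhead + Test overhead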

Page 27: Recursion Unrolling  for Divide and Conquer Programs

Computation

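Execution of f(p,4)

[Figure: complete call tree, with overhead and computation entries distinguished]
n=4:  [Test n=1 | Call f | Call f]
n=2:  [Test n=1 | Call f | Call f]  [Test n=1 | Call f | Call f]
n=1:  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]  [Test n=1 | Inc *p]

Call overhead + Test overhead

Computation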

Page 28: Recursion Unrolling  for Divide and Conquer Programs

Large Base Cases = Reduced Overhead

Execution of f(p,4)

[Figure: call tree with a base case for n=2]
n=4:  [Test n=2 | Call f | Call f]
n=2:  [Test n=2 | Inc *p | Inc *(p+1)]  [Test n=2 | Inc *p | Inc *(p+1)]
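The two call trees can be compared by counting operations. The sketch below is an illustration only: the counts struct and the names f_small and f_large are hypothetical, and f_large assumes n is a power of two with n >= 2.

```c
#include <assert.h>

/* Hypothetical counters, added only to make the overhead visible. */
struct counts { int calls; int tests; int incs; };

/* Original procedure: base case for n == 1. */
void f_small(char *p, int n, struct counts *c) {
    assert(n >= 1);
    c->calls++;                       /* one activation frame */
    c->tests++;                       /* one base-case test */
    if (n == 1) {
        *p += 1;
        c->incs++;
    } else {
        f_small(p, n/2, c);
        f_small(p + n/2, n/2, c);
    }
}

/* Same computation with a base case for n == 2.
 * Assumes n is a power of two and n >= 2. */
void f_large(char *p, int n, struct counts *c) {
    assert(n >= 2);
    c->calls++;
    c->tests++;
    if (n == 2) {
        *p += 1;
        *(p + 1) += 1;
        c->incs += 2;
    } else {
        f_large(p, n/2, c);
        f_large(p + n/2, n/2, c);
    }
}
```

For n=4, f_small performs 7 calls and 7 tests for 4 increments, while f_large performs 3 calls and 3 tests for the same 4 increments, matching the node counts in the two trees.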

Page 29: Recursion Unrolling  for Divide and Conquer Programs

3. Transformations

Page 30: Recursion Unrolling  for Divide and Conquer Programs

Transformation 1: Recursion Inlining

void f(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f(p, n/2);
    f(p+n/2, n/2);
  }
}

Start with the original recursive procedure

Page 31: Recursion Unrolling  for Divide and Conquer Programs

Transformation 1: Recursion Inlining

void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}

void f2(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}

Make two copies of the original procedure

Page 32: Recursion Unrolling  for Divide and Conquer Programs

Transformation 1: Recursion Inlining

void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}

void f2(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}

Transform direct recursion to mutual recursion

Page 33: Recursion Unrolling  for Divide and Conquer Programs

Transformation 1: Recursion Inlining

void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}

void f2(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}

Inline procedure f2 at call sites in f1

Page 34: Recursion Unrolling  for Divide and Conquer Programs

Transformation 1: Recursion Inlining

void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    if (n/2 == 1) {
      *p += 1;
    } else {
      f1(p, n/2/2);
      f1(p+n/2/2, n/2/2);
    }
    if (n/2 == 1) {
      *(p+n/2) += 1;
    } else {
      f1(p+n/2, n/2/2);
      f1(p+n/2+n/4, n/2/2);
    }
  }
}

Page 35: Recursion Unrolling  for Divide and Conquer Programs

Transformation 1: Recursion Inlining

void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    if (n/2 == 1) {
      *p += 1;
    } else {
      f1(p, n/2/2);
      f1(p+n/2/2, n/2/2);
    }
    if (n/2 == 1) {
      *(p+n/2) += 1;
    } else {
      f1(p+n/2, n/2/2);
      f1(p+n/2+n/4, n/2/2);
    }
  }
}

• Reduced procedure call overhead

• More code exposed at the intra-procedural level

• Opportunities to simplify control flow in the inlined code

Page 36: Recursion Unrolling  for Divide and Conquer Programs

Transformation 1: Recursion Inlining

void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    if (n/2 == 1) {
      *p += 1;
    } else {
      f1(p, n/2/2);
      f1(p+n/2/2, n/2/2);
    }
    if (n/2 == 1) {
      *(p+n/2) += 1;
    } else {
      f1(p+n/2, n/2/2);
      f1(p+n/2+n/4, n/2/2);
    }
  }
}

• Reduced procedure call overhead

• More code exposed at the intra-procedural level

• Opportunities to simplify control flow in the inlined code:

• identical condition expressions

Page 37: Recursion Unrolling  for Divide and Conquer Programs

Transformation 2: Conditional Fusion

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f1(p, n/2/2);
    f1(p+n/2/2, n/2/2);
    f1(p+n/2, n/2/2);
    f1(p+n/2+n/4, n/2/2);
  }
}

Merge if statements with identical conditions

Page 38: Recursion Unrolling  for Divide and Conquer Programs

Transformation 2: Conditional Fusion

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f1(p, n/2/2);
    f1(p+n/2/2, n/2/2);
    f1(p+n/2, n/2/2);
    f1(p+n/2+n/4, n/2/2);
  }
}

Merge if statements with identical conditions

• Reduced branching overhead and bigger basic blocks

• Larger base case for n/2 = 1
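As a sanity check on inlining followed by conditional fusion, the sketch below runs the original procedure and the fused one on zeroed buffers and compares the results. The helper same_result and the 64-byte buffer size are assumptions made for illustration; they are not part of the slides.

```c
#include <assert.h>
#include <string.h>

/* Original recursive procedure. */
void f(char *p, int n) {
    if (n == 1) {
        *p += 1;
    } else {
        f(p, n/2);
        f(p + n/2, n/2);
    }
}

/* After one round of recursion inlining and conditional fusion. */
void f1(char *p, int n) {
    if (n == 1) {
        *p += 1;
    } else if (n/2 == 1) {
        *p += 1;
        *(p + n/2) += 1;
    } else {
        f1(p, n/2/2);
        f1(p + n/2/2, n/2/2);
        f1(p + n/2, n/2/2);
        f1(p + n/2 + n/4, n/2/2);
    }
}

/* Hypothetical helper: returns 1 if f and f1 leave identical buffers.
 * Only sizes 1..64 are supported by this sketch. */
int same_result(int n) {
    char a[64] = {0};
    char b[64] = {0};
    assert(n >= 1 && n <= 64);
    f(a, n);
    f1(b, n);
    return memcmp(a, b, sizeof a) == 0;
}
```

Because the fused code is obtained purely by inlining and merging identical tests, same_result is expected to hold for every supported n, not only for powers of two.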

Page 39: Recursion Unrolling  for Divide and Conquer Programs

Unrolling Iterations

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f1(p, n/2/2);
    f1(p+n/2/2, n/2/2);
    f1(p+n/2, n/2/2);
    f1(p+n/2+n/4, n/2/2);
  }
}

Repeatedly apply inlining and conditional fusion

Page 40: Recursion Unrolling  for Divide and Conquer Programs

Second Unrolling Iteration

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f1(p, n/2/2);
    f1(p+n/2/2, n/2/2);
    f1(p+n/2, n/2/2);
    f1(p+n/2+n/4, n/2/2);
  }
}

void f2(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}

Page 41: Recursion Unrolling  for Divide and Conquer Programs

Second Unrolling Iteration

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f2(p, n/2/2);
    f2(p+n/2/2, n/2/2);
    f2(p+n/2, n/2/2);
    f2(p+n/2+n/4, n/2/2);
  }
}

void f2(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}

Page 42: Recursion Unrolling  for Divide and Conquer Programs

Result of Second Unrolling Iteration

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2/2/2);
    f1(p+n/2/2/2, n/2/2/2);
    f1(p+n/2/2, n/2/2/2);
    f1(p+n/2/2+n/2/2/2, n/2/2/2);
    f1(p+n/2, n/2/2/2);
    f1(p+n/2+n/2/2/2, n/2/2/2);
    f1(p+n/2+n/2/2, n/2/2/2);
    f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
  }
}

Page 43: Recursion Unrolling  for Divide and Conquer Programs

Unrolling Iterations

• The unrolling process stops when the number of iterations reaches the desired unrolling factor

• The unrolled recursive procedure:
  • Has base cases for larger problem sizes
  • Divides the given problem into more sub-problems of smaller sizes

• In our example:
  • Base cases for n=1, n=2, and n=4
  • Problems are divided into 8 problems of 1/8 size

Page 44: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2); series: inline, inline+fusion]

Page 45: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2); series: inline, inline+fusion]

Page 46: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2); series: inline, inline+fusion]

Page 47: Recursion Unrolling  for Divide and Conquer Programs

Efficiency of Unrolled Recursive Part

• Because the recursive part is also unrolled, recursion may not exercise the large base cases

• Which base case is executed depends on the size of the input problem

• In our example:
  • For a problem of size n=8, the base case for n=1 is executed
  • For a problem of size n=16, the base case for n=2 is executed
  • The efficient base case for n=4 is not executed in these cases
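The effect described above can be observed by instrumenting the twice-unrolled procedure. In this sketch the name f1_unrolled and the hits array (hits[0], hits[1], hits[2] count executions of the base cases for n=1, n=2, and n=4) are hypothetical additions for illustration.

```c
#include <assert.h>

/* Twice-unrolled procedure with a fully unrolled recursive part.
 * hits[k] counts how often the base case for n == 1, 2, 4 runs. */
void f1_unrolled(char *p, int n, int hits[3]) {
    assert(n >= 1);
    if (n == 1) {
        hits[0]++;
        *p += 1;
    } else if (n/2 == 1) {
        hits[1]++;
        *p += 1;
        *(p + n/2) += 1;
    } else if (n/2/2 == 1) {
        hits[2]++;
        *p += 1;
        *(p + n/2/2) += 1;
        *(p + n/2) += 1;
        *(p + n/2 + n/2/2) += 1;
    } else {
        int h = n/2, q = n/2/2, e = n/2/2/2;   /* half, quarter, eighth */
        f1_unrolled(p,             e, hits);
        f1_unrolled(p + e,         e, hits);
        f1_unrolled(p + q,         e, hits);
        f1_unrolled(p + q + e,     e, hits);
        f1_unrolled(p + h,         e, hits);
        f1_unrolled(p + h + e,     e, hits);
        f1_unrolled(p + h + q,     e, hits);
        f1_unrolled(p + h + q + e, e, hits);
    }
}
```

With n=8 the recursive part jumps straight to eight sub-problems of size 1, and with n=16 to eight sub-problems of size 2, so hits[2] stays at zero in both runs. Only sizes such as n=4 or n=32 reach the base case for n=4.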

Page 48: Recursion Unrolling  for Divide and Conquer Programs

Solution: Recursion Re-Rolling

• Roll back the recursive part of the unrolled procedure after the large base cases are generated

• Re-Rolling ensures that larger base cases are always executed, independent of the input problem size

• The compiler unrolls the recursive part only temporarily, to generate the base cases

Page 49: Recursion Unrolling  for Divide and Conquer Programs

Transformation 3: Recursion Re-Rolling

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2/2/2);
    f1(p+n/2/2/2, n/2/2/2);
    f1(p+n/2/2, n/2/2/2);
    f1(p+n/2/2+n/2/2/2, n/2/2/2);
    f1(p+n/2, n/2/2/2);
    f1(p+n/2+n/2/2/2, n/2/2/2);
    f1(p+n/2+n/2/2, n/2/2/2);
    f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
  }
}

Page 50: Recursion Unrolling  for Divide and Conquer Programs

Transformation 3: Recursion Re-Rolling

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2/2/2);
    f1(p+n/2/2/2, n/2/2/2);
    f1(p+n/2/2, n/2/2/2);
    f1(p+n/2/2+n/2/2/2, n/2/2/2);
    f1(p+n/2, n/2/2/2);
    f1(p+n/2+n/2/2/2, n/2/2/2);
    f1(p+n/2+n/2/2, n/2/2/2);
    f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
  }
}

Identify the recursive part (the final else branch)

Page 51: Recursion Unrolling  for Divide and Conquer Programs

Transformation 3: Recursion Re-Rolling

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}

Replace with the recursive part of the original procedure

Page 52: Recursion Unrolling  for Divide and Conquer Programs

Final Result

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}
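To see that the re-rolled procedure does reach the largest base case, the final result can be instrumented as in this sketch. The name f1_rerolled and the hits array (counting executions of the base cases for n=1, n=2, and n=4) are hypothetical additions for illustration.

```c
#include <assert.h>

/* Final re-rolled procedure: large base cases, original recursive part.
 * hits[k] counts how often the base case for n == 1, 2, 4 runs. */
void f1_rerolled(char *p, int n, int hits[3]) {
    assert(n >= 1);
    if (n == 1) {
        hits[0]++;
        *p += 1;
    } else if (n/2 == 1) {
        hits[1]++;
        *p += 1;
        *(p + n/2) += 1;
    } else if (n/2/2 == 1) {
        hits[2]++;
        *p += 1;
        *(p + n/2/2) += 1;
        *(p + n/2) += 1;
        *(p + n/2 + n/2/2) += 1;
    } else {
        /* Recursive part taken from the original procedure. */
        f1_rerolled(p, n/2, hits);
        f1_rerolled(p + n/2, n/2, hits);
    }
}
```

Because the recursion now halves the problem one step at a time, every power-of-two size of at least 4 bottoms out in the base case for n=4: twice for n=8 and four times for n=16, with the smaller base cases never executed.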

Page 53: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 54: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 55: Recursion Unrolling  for Divide and Conquer Programs

Other Optimizations

• Inlining moves code from the inter-procedural level to the intra-procedural level

• Conditional fusion brings code from the inter-basic-block level to the intra-basic-block level

• Together, inlining and conditional fusion give subsequent compiler passes the opportunity to perform more aggressive optimizations

Page 56: Recursion Unrolling  for Divide and Conquer Programs

Comparison to Hand Coded Programs

• Two applications: Matrix multiply, LU decomposition

• Three machines: Pentium III, Origin 2000, PowerPC

• Two different problem sizes

• Compare automatically unrolled programs to optimized, hand coded versions from the Cilk benchmarks

• Best automatically unrolled version performs:
  • Between 2.2 and 2.9 times worse for matrix multiply
  • As good as hand coded version for LU

Page 57: Recursion Unrolling  for Divide and Conquer Programs

Related Work

• Procedure Inlining:
  • Scheifler (1977)
  • Richardson, Ganapathi (1989)
  • Chambers, Ungar (1989)
  • Cooper, Hall, Torczon (1991)
  • Appel (1992)
  • Chang, Mahlke, Chen, Hwu (1992)

Page 58: Recursion Unrolling  for Divide and Conquer Programs

Conclusion

• Recursion Unrolling
  • analogous to the loop unrolling transformation

• Divide and Conquer Programs
  • The programmer writes simple base cases
  • The compiler automatically generates large base cases

• Key Techniques
  • Inlining: conceptually inline recursive calls
  • Conditional Fusion: simplify intra-procedural control flow
  • Re-Rolling: ensure that large base cases are executed

Page 62: Recursion Unrolling  for Divide and Conquer Programs

Comparison to Hand Coded Programs

• Matrix multiply for 512 x 512 elements:
  • Best automatically unrolled program: 2.55 sec.
  • Hand coded with three nested loops: 3.46 sec.
  • Hand coded Cilk program: 1.16 sec.

• Matrix multiply for 1024 x 1024 elements:
  • Best automatically unrolled program: 20.47 sec.
  • Hand coded with three nested loops: 27.40 sec.
  • Hand coded Cilk program: 9.19 sec.

Page 63: Recursion Unrolling  for Divide and Conquer Programs

Correctness

• Recursion unrolling preserves the semantics of the program:

• The unrolled program terminates if and only if the original recursive program terminates

• When both the original and the unrolled program terminate, they yield the same result

Page 64: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Pentium III, Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 65: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Pentium III, Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 66: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

PowerPC, Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 67: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

PowerPC, Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 68: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Origin 2000, Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 69: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Origin 2000, Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 70: Recursion Unrolling  for Divide and Conquer Programs

Speedup for LU

Pentium III, Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 71: Recursion Unrolling  for Divide and Conquer Programs

Speedup for LU

Pentium III, Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 72: Recursion Unrolling  for Divide and Conquer Programs

Speedup for LU

PowerPC, Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 73: Recursion Unrolling  for Divide and Conquer Programs

Speedup for LU

PowerPC, Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 74: Recursion Unrolling  for Divide and Conquer Programs

Speedup for LU

Origin 2000, Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 75: Recursion Unrolling  for Divide and Conquer Programs

Speedup for LU

Origin 2000, Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y axis, 0 to 10) against unrolling factor (x axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

