enhancing fine- grained parallelism chapter 5 of allen and kennedy mirit & haim

Enhancing Fine-Grained Parallelism

Chapter 5 of Allen and Kennedy

Mirit & Haim

Overview

Motivation Loop Interchange Scalar Expansion Scalar and Array Renaming

Motivational Example Matrix Multiplication:

DO J = 1, M DO I = 1, N T = 0.0 DO K = 1,L

T = T + A(I,K) * B(K,J) ENDDO C(I,J) = T ENDDO ENDDO

Codegen: tries to find parallelism using transformations of loop distribution and statement reordering

codegen will not uncover any vector operations because of the use of the temporary scalar T which creates several dependences carried by each of the loops.

Motivational Example I

Replacing the scalar T with the vector T$ will eliminate most of the dependences:

DO J = 1, M DO I = 1, N T$(I) = 0.0 DO K = 1,L T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO C(I,J) = T$(I) ENDDO ENDDO

Distribution of the I-loop gives us: DO J = 1, M DO I = 1, N T$(I) = 0.0 ENDDO DO I = 1, N T$(I) = 0.0 DO K = 1,L T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO C(I,J) = T$(I) ENDDO DO I = 1, N C(I,J) = T$(I) ENDDO ENDDO

Motivational Example II

T$(1:N) = 0.0

C(1:N,J) = T$(1:N)

Motivational Example III

Interchanging I and K loops, we get: DO J = 1, M

T$(1:N) = 0.0

DO I = 1,N DO K = 1,L

DO K = 1,L DO I = 1,N

T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO

ENDDO

C(1:N,J) = T$(1:N)

ENDDO

Motivational Example IV

Finally, Vectorizing the I-Loop: DO J = 1, M T$(1:N) = 0.0

DO K = 1,L DO I = 1,N

T$(I) = T$(I) + A(I,K) * B(K,J) ENDDO T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)

ENDDO C(1:N,J) = T$(1:N) ENDDO

Motivational Example - Summary

We used two new transformations:

Loop interchange (replacing I,K loops)

Scalar Expansion (T => T$)

Loop Interchange

Loop Interchange - Switching the nesting order of two loops

DO I = 1, N DO J = 1, MS A(I,J+1) = A(I,J) + B ENDDO ENDDO

Applying loop interchange: DO J = 1, M DO I = 1, NS A(I,J+1) = A(I,J) + B ENDDO ENDDO

leads to: DO J = 1, MS A(1:N,J+1) = A(1:N,J) + B ENDDO

Impossible to vectorize since there is a cyclic dependence from S to itself, carried by the J-Loop.

Loop Interchange: Safety

Safety: not all loop interchanges are safe. S(I,J) – the instance of statement S with the

parameters (iteration numbers) I and J

DO J = 1, M DO I = 1, N S: A(I,J+1) = A(I+1,J) ENDDO ENDDO

S(1,1)

S(1,3)

S(1,2)

S(1,4)

S(2,1)

S(2,3)

S(2,2)

S(2,4)

S(3,1)

S(3,3)

S(3,2)

S(3,4)

S(4,1)

S(4,3)

S(4,2)

S4,4)

J = 1

J = 2

J = 3

J = 4

I = 1 I = 2 I = 3 I = 4

A(2,2) is assigned at S(2,1) and used at S(1,2).


A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.

Distance Vector - Reminder

Consider a dependence in a loop nest of n loops Statement S1 on iteration i is the source

of the dependence Statement S2 on iteration j is the sink of

the dependence The distance vector is a vector of

length n d(i,j) such that: d(i,j)k = jk – ik

Direction Vector - Reminder

Suppose that there is a dependence from statement S1 on iteration i of a loop nest of n loops and statement S2 on iteration j, then the dependence direction vector is D(i,j) is defined as a vector of length n such that

”>“ if d(i,j)k > 0 ”=“ if d(i,j)k = 0

”>“ if d(i,j)k < 0

D(i,j)k=


DO I = 1, N

DO J = 1, M

DO K = 1, L

A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)

ENDDO

ENDDO

ENDDO

(I+1,J+1,K) - (I,J,K) = (1,1,0) => (<,<,=)

(I+1,J+1,K) - (I,J+1,K+1) = (1,0,-1) => (<,=,>)


We say that the I’th loop carries a dependence with direction-vector d, if the leftmost non-”=“ in d is in the I’th place.

(<,<,=) (<,=,>) It is valid to convert a sequential loop

to a parallel loop if the loop carries no dependence.

Loop Interchange: Safety Theorem 5.1 Let D(i,j) be a direction vector for a

dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops is determined by applying the same permutation to the elements of D(i,j).

DO I = 1, N DO J = 1, M DO K = 1, L A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1) ENDDO ENDDO ENDDO (I+1,J+1,K) - (I,J,K) = (1,1,0) => (<,<,=) =>

(<,=,<)

Loop Interchange: Safety The direction matrix for a nest of loops is a matrix in which

each row is a direction vector for some dependence between statements contained in the nest and every such direction vector is represented by a row.

DO I = 1, N DO J = 1, M DO K = 1, L A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1) ENDDO ENDDO ENDDO (1,1,0) => (<,<,=) (1,0,-1) => (<,=,>)

< < =

< = >


Theorem 5.2 A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non- "=" direction in any row.

i j k< < =

< = >

j i k< < =

= < >

j k i< = <

= > <

Loop Interchange : profitabilityIn general, our goal is to enhance vectorization by increasing the number

of innermost loops that do not carry any dependence:

DO I = 1, N DO J = 1, MS A(I,J+1) = A(I,J) + A(I-1,J) + B ENDDO ENDDO

Applying loop interchange: DO J = 1, M DO I = 1, NS A(I,J+1) = A(I,J) + A(I-1,J) + B ENDDO ENDDO

leads to: DO J = 1, MS A(1:N,J+1) = A(1:N,J) + A(0:N-1,J) + B ENDDO

I J = < < <

J I < = < <

Loop Interchange: profitability

Profitability depends on architecture DO I = 1, N DO J = 1, M DO K = 1, LS A(I+1,J+1,K) = A(I,J,K) + B ENDDO ENDDO ENDDO For SIMD machines with large number of FU’s: DO I = 1, NS A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B ENDDO

i j k< < =


Since Fortran stores in column-major order we would like to vectorize the I-loop

Thus, after Interchanging the I-loop to be the inner loop, we can transform to:

DO J = 1, M DO K = 1, LS A(2:N+1,J+1,K) = A(1:N,J,K) + B ENDDO ENDDO

j k i< = <

A(I,1:M,1:L)


MIMD machines with vector execution units want to cut down synchronization costs, therefore, shifting the K-loop to outermost level will get:

PARALLEL DO K = 1, L DO J = 1, M A(2:N+1,J+1,K) = A(1:N,J,K) + B ENDDO END PARALLEL DO

k j i = < <

Loop Shifting

Motivation: Identify loops which can be moved and move them to “optimal” nesting levels.

Trying all the permutations is not practical. Theorem 5.3 In a perfect loop nest, if loops at level i, i+1,...,i+n carry no dependence, it is always legal to shift these loops inside of loop i+n+1. Furthermore, these loops will not carry any dependences in their new position.

> = > = = == > > > > == = = = > >= = = = = >

> = = > = == > > > > == = > = = >= = = = = >

Loop Shifting

Heuristic – move the loops that carry no dependence to the innermost position:

DO I = 1, N DO J = 1, N DO K = 1, N A(I,J,k) = A(I,J,k+1) ENDDO ENDDO ENDDO

DO K= 1, N DO I = 1, N DO J = 1, N A(I,J,k) = A(I,J,k+1) ENDDO ENDDO ENDDO

I J K = = <

K I J < = =

A(1:N,1:N,k) = A(1:N,1:N,k+1)

Loop Shifting

The heuristic fail in:

DO I = 1, N

DO J = 1, M

A(I+1,J+1) = A(I,J) + A(I+1,J)

ENDDO

ENDDO But we still want to get:

DO J = 1, M

A(2:N+1,J+1) = A(1:N,J) + A(2:N+1,J)

ENDDO

I J < < = <

J I < < < =

Loop Shifting

A recursive heuristec: If the outermost loop carries no dependence, then let

the outermost loop that carries a dependence to be the outermost loop. (the former heuristic.)

If the outermost loop carries a dependence, find the outermost loop L that: L can be safely shifted to the outermost position L carries a dependence d whose direction vector contains

an "=" in every position but the Lth.

And move L to the outermost position.

Loop Shifting

< < = == = < <= < = == = = <

< < = = = = < < < = = = = = = <

< = < = = < = << = = == = = <

< = = < = < < =< = = == = < =

Scalar Expansion DO I = 1, N T = A(I) A(I) = B(I) B(I) = T ENDDO Scalar Expansion : DO I = 1, N T$(I) = A(I) A(I) = B(I) B(I) = T$(I) ENDDO T = T$(N) leads to: T$(1:N) = A(1:N) A(1:N) = B(1:N) B(1:N) = T$(1:N) T = T$(N)

Scalar Expansion

However, not always profitable. Consider: DO I = 1, N T = T + A(I) + A(I-1) A(I) = T ENDDO Scalar expansion gives us: T$(0) = T DO I = 1, N T$(I) = T$(I-1) + A(I) + A(I-1) A(I) = T$(I) ENDDO T = T$(N)

Scalar Expansion – How to

Naïve approach: Expand all scalars, vectorize, shrink all unnecessary expansions.

Dependences due to reuse of memory location vs. reuse of values: Dependences due to reuse of values must be

preserved Dependences due to reuse of memory location

can be deleted by expansion We will try to find the removable dependence

Reminder - SSA

X= X= X=

=X =X =X

1S 2S 3S

4S

5S 6S 7S

1S 2S 3S

4S

5S 6S 7S

1X 2X 3X

),,( 3214 XXXX

4X 4X 4X

Covering definition A definition X of a scalar S is a covering definition for loop

L if a definition of S placed at the beginning of L reaches no uses of S that occur past X.

DO I = 1, 100S1 T = X(I)S2 Y(I) = T ENDDO DO I = 1, 100 IF (A(I) .GT. 0) THENS1 T = X(I)S2 Y(I) = T ENDIF ENDDO

T=X(I)

Y(I)=T

1S

2S

)(T

Covering definition A covering definition does not always exist:

DO I = 1, 100 IF (A(I) .GT. 0) THENS2 T = X(I) ENDIFS3 Y(I) = T ENDDO

In SSA terms: There does not exist a covering definition for a variable T if the edge out of the first assignment to T goes to a -function later in the loop which merges its values with those for another control flow path through the loop

T=X(I)

Y(I)=T

2S

)(T

3S

Covering definition

Expanded definition of covering definition:

A collection C of covering definitions for T in a loop in which every control path has a covering definition.

DO I = 1, 100 IF (A(I) .GT. 0) THENS1 T = X(I) ELSES2 T = -X(I) ENDIFS3 Y(I) = T ENDDO

T=X(I) T=-X(I)

Y(I)=T

2S

)(T

3S

1 S

Find Covering definitions DO I = 1, 100

IF (A(I) .GT. 0) THENS1 T = X(I) ENDIFS2 Y(I) = T ENDDO To form a collection of covering definitions, we can insert dummy

assignments: DO I = 1, 100 IF (A(I) .GT. 0) THENS1 T = X(I) ELSES2 T = T ENDIFS3 Y(I) = T ENDDO

Algorithm to insert dummy assignments and compute the collection, C, of covering definitions: Central idea: Look for parallel paths to a -

function following the first assignment, and add dummy assignments.

Find Covering definition

T=X(I)

Y(I)=T

2S

)(T

3S

T=X(I)

Y(I)=T

2S

)(T

3S

T=T DO I = 1, 100 IF (A(I) .GT. 0) THEN T = X(I) ELSE T = T ENDIF Y(I) = TENDDO

Scalar Expansion

Given the collection of covering definitions, we can carry out scalar expansion for a normalized loop: Create an array T$ of appropriate length For each S in the covering definition collection C, replace the T on

the left-hand side by T$(I). For every other definition of T and every use of T in the loop body

reachable by SSA edges that do not pass through S0, the -function at the beginning of the loop, replace T by T$(I).

For every use prior to a covering definition (direct successors of S0 in the SSA graph), replace T by T$(I-1).

If S0 is not null, then insert T$(0) = T before the loop. If there is an SSA edge from any definition in the loop to a use

outside the loop, insert T = T$(U) after the loop, were U is the loop upper bound.

T$(0) = TDO I = 1, 100 IF (A(I) .GT. 0) THENS1 T$(I) = X(I) ELSE T$(I) = T$(I-1) ENDIFS2 Y(I) = T$(I)ENDDO

Deletable dependence: Given the covering Definitions, we can identify

deletable dependence: Backward carried antidependences:

Writing to T[i] after reading from T[i-1] at the previous iteration Backward carried output dependences

Writing to T[i] after writing to T[i-1] at the previous iteration Forward carried output dependences

Writing to T[i] and writing to T[i+1] at the next iteration Loop-independent antidependences into the covering

definition Writing to T[i] after reading from T[i-1] at the same iteration

Loop-carried true dependences from a covering definition reading from T[i] after writing to T[i-1] at the previous iteration

Scalar Expansion: Drawbacks Expansion increases memory requirements Solutions:

Expand in a single loopStrip mine loop before expansion:

Forward substitution:

DO I = 1, N T = A(I) + A(I+1) A(I) = T + B(I) ENDDO

DO I = 1, N,64 DO J = 0,63 T = A(I+J) +A(I+J+1) A(I) = T + B(I+J) ENDDO ENDDO

DO I = 1, N A(I) = A(I) + A(I+1) + B(I) ENDDO

DO I = 1, N T = A(I) + A(I+1) A(I) = T + B(I) ENDDO

Scalar Renaming

S3 T2$(1:100) = D(1:100) - B(1:100)

S4 A(2:101) = T2$(1:100) * T2$(1:100)

S1 T1$(1:100) = A(1:100) + B(1:100)

S2 C(1:100) = T1$(1:100) + T1$(1:100)

T = T2$(100)

DO I = 1, 100S1 T1 = A(I) + B(I)S2 C(I) = T1 + T1S3 T2 = D(I) - B(I)S4 A(I+1) = T2 * T2 ENDDO

DO I = 1, 100S1 T = A(I) + B(I)S2 C(I) = T + TS3 T = D(I) - B(I)S4 A(I+1) = T * T ENDDO

Scalar Renaming Renaming algorithm partitions all definitions

and uses into equivalent classes using the definition-use graph:Pick definitionAdd all uses that the definition reaches to the

equivalence classAdd all definitions that reach any of the

uses… ..until fixed point is reached

Scalar Renaming: Profitability

Scalar renaming will remove a loop-independent output dependence or antidependence.

(Writing to the same scalar at the same iteration or writing to the scalar after reading from it.)

Relatively cheap to use scalar renaming. Usually done by compilers when calculating live

ranges for register allocation.

Array Renaming

DO I = 1, NS1 A(I) = A(I-1) + XS2 Y(I) = A(I) + ZS3 A(I) = B(I) + C ENDDO

Rename A(I) to A$(I): DO I = 1, NS1 A$(I) = A(I-1) + XS2 Y(I) = A$(I) + ZS3 A(I) = B(I) + C ENDDO

A(1:N) =B(1:N) +C

A$(1:N)=A(0:N-1)+X

Y(1:N) =A$(1:N) +Z

Array Renaming: Profitability determine edges that are removed by array renaming

and analyze effects on dependence graph procedure array_partition:

Assumes no control flow in loop body identifies collections of references to

arrays which refer to the same value identifies deletable output dependences

and antidependences Use this procedure to generate code

Minimize amount of copying back to the “original” array at the beginning and the end

enhancing fine- grained parallelism chapter 5 of allen and kennedy mirit & haim

Documents

j enddo t

iteration j

i loop

j enddo s1

j enddo ci

j enddo c1

t ai

iteration i