Aug 15-18, Montreal, Canada 1
Recurrence Chain Partitioning of Non-Uniform Dependences
Yijun Yu, Erik H. D’Hollander
Overview
1. Dependence and Parallelism
2. Non-Uniform Loop Dependences
3. Recurrence Chains Partitioning
4. Related work
5. Implementations
6. Experiment Results
7. Summary
1. Background: Dependence vs. Parallelism

Sequential loop:
DO I = 1,3
  A(I) = A(I-1)
ENDDO

execution trace     shared memory A(0) A(1) A(2) A(3)
(initial)           0 1 2 3
A(1) = A(0)         0 0 2 3
A(2) = A(1)         0 0 0 3
A(3) = A(2)         0 0 0 0

Parallel loop:
DOALL I = 1,3
  A(I) = A(I-1)
ENDDO

execution trace     shared memory A(0) A(1) A(2) A(3)
(initial)           0 1 2 3
A(2) = A(1)         0 1 1 3
A(1) = A(0)         0 0 1 3
A(3) = A(2)         0 0 1 1
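The two traces on this slide can be replayed with a minimal Python sketch. The function name `run` and the list representation of the shared array are illustrative assumptions; the DOALL is modelled simply as one reordering of the three iterations, not as real parallel execution.

```python
def run(order, a=(0, 1, 2, 3)):
    """Execute A(I) = A(I-1) for the iterations listed in `order`.

    `a` holds the initial shared memory A(0)..A(3); a copy is updated
    and returned, so each call starts from the same state.
    """
    a = list(a)
    for i in order:
        a[i] = a[i - 1]
    return a

sequential = run([1, 2, 3])  # order enforced by the DO loop
reordered = run([2, 1, 3])   # one order a DOALL is free to choose

print(sequential)  # [0, 0, 0, 0]
print(reordered)   # [0, 0, 1, 1]
```

The results differ because iteration I reads the element that iteration I-1 writes, so the loop is only safe to run as a DOALL if that order is preserved.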
The CFD application @ WTCM
Computational Fluid Dynamics (CFD)
Navier-Stokes equations
Successive Over-Relaxation (SOR)
temperature
3D geometry + 1D time
The visualized uniform dependences and transformations for the 4D loop
[Figure panels: before transformation, after transformation]
A 3-D unimodular transformation is found after visualizing the 4D loop nest, which has 177 array references at run time for each iteration. Here we use a regular shape. The transformation makes it possible to speed up the program by around N^2/6 times, where N is the diameter of the geometry. (Yu, Parco99)
2. Non-uniform dependences
Uniform loop dependences
  Dependent iterations are a uniform distance apart in the iteration space: a set of distance vectors can predict the dependences and indicate the affine index loop transformation that reveals the maximal loop parallelism.
Non-uniform dependences
  Irregular; can be caused by complex subscripts, compile-time unknowns, etc.
  But not rare: in the SPECfp95 benchmarks, 46% of the nested loops and 12.8% of the coupled subscripts
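To make the distinction concrete, the sketch below enumerates dependence distances by brute force. It is an illustration, not the analysis used in the talk: the helper `distances`, the 1-D loop bounds, and the bounds N1 = N2 = 10 are assumptions, while the two loop bodies are the A(I) = A(I-1) loop from the background slide and the coupled-subscript loop from the results section.

```python
from itertools import product

def distances(space, write, read):
    """Distance vectors j - i for every pair of distinct iterations where
    iteration i writes the array element that iteration j reads."""
    writers = {}
    for i in space:
        writers.setdefault(write(*i), []).append(i)
    found = set()
    for j in space:
        for i in writers.get(read(*j), []):
            if i != j:
                found.add(tuple(y - x for x, y in zip(i, j)))
    return found

# Uniform: A(I) = A(I-1), I = 1..10 (bounds assumed)
line = [(i,) for i in range(1, 11)]
uniform = distances(line, lambda i: (i,), lambda i: (i - 1,))

# Non-uniform: a(3*I1+1, 2*I1+I2-1) = a(I1+3, I2+1), N1 = N2 = 10 (bounds assumed)
square = list(product(range(1, 11), range(1, 11)))
non_uniform = distances(
    square,
    lambda i1, i2: (3 * i1 + 1, 2 * i1 + i2 - 1),
    lambda i1, i2: (i1 + 3, i2 + 1),
)

print(sorted(uniform))
print(sorted(non_uniform))
```

The first loop yields a single distance vector, so one vector describes every dependence; the second yields several, which is what makes a fixed affine transformation insufficient.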
Irregular dependence
Dependences have non-uniform distance
Parallelism analysis: 200 iterations over 15 data flow steps
Speedup: 200/15 ≈ 13.3
Problem: How to exploit it?
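The kind of parallelism analysis quoted on this slide (iterations divided by data flow steps) can be sketched as a longest-path computation over the dependence graph. The code below is a brute-force illustration under assumptions: the function name `dataflow_steps` is hypothetical, and the loop bounds N1 = N2 = 10 are chosen for the example, so it does not reproduce the slide's 200-iteration configuration.

```python
from itertools import product

def dataflow_steps(space, write, read):
    """Return (iterations, steps), where steps is the length of the longest
    dependence chain when each iteration runs as soon as the iterations it
    depends on have finished."""
    space = sorted(space)                      # lexicographic = sequential order
    writers = {}
    for i in space:
        writers.setdefault(write(*i), []).append(i)
    preds = {i: set() for i in space}
    for j in space:
        # iterations that write an element j reads (flow/anti) or writes (output)
        for i in writers.get(read(*j), []) + writers.get(write(*j), []):
            if i < j:
                preds[j].add(i)                # earlier iteration must run first
            elif j < i:
                preds[i].add(j)
    level = {}
    for i in space:                            # predecessors are already levelled
        level[i] = 1 + max((level[p] for p in preds[i]), default=0)
    return len(space), max(level.values())

space = list(product(range(1, 11), range(1, 11)))
n_iter, n_steps = dataflow_steps(
    space,
    lambda i1, i2: (3 * i1 + 1, 2 * i1 + i2 - 1),   # a(3*I1+1, 2*I1+I2-1) = ...
    lambda i1, i2: (i1 + 3, i2 + 1),                # ... = a(I1+3, I2+1)
)
print(n_iter, n_steps, n_iter / n_steps)
```

The ratio printed last is the ideal speedup for that iteration space, in the same sense as the 13.3 figure above.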
3. Recurrence Chain Partitioning
Research objectives
If DO loops fail to reveal the optimal parallelism for irregular dependences, can one use WHILE loops?
  WHEN can one apply WHILE loops?
  HOW to construct WHILE loops?
  WHAT to do when one cannot apply WHILE loops?
  HOW MUCH can be achieved (for evaluation purposes)?
3.1 How to generate code?
DOALL I = INIT(I)
  WHILE !TERMINATE(I) DO
    S(I)
    I = NEXT(I)
  END DO
ENDDOALL
INIT(I) = ?   TERMINATE(I) = ?   NEXT(I) = ?
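The schema above can be written as a small Python sketch in which INIT, TERMINATE, and NEXT are passed in as functions. Everything concrete here is an assumption for illustration: the name `run_chains`, the example loop `A(3*I) = A(I)` for I = 1..20, and the sequential outer loop standing in for the DOALL.

```python
def run_chains(init, terminate, next_iter, body):
    """DOALL over the start iterations; each chain is walked by a WHILE loop."""
    for start in init():            # chains are independent, so this loop could be parallel
        i = start
        while not terminate(i):
            body(i)
            i = next_iter(i)

# Hypothetical loop: DO I = 1,20 ; A(3*I) = A(I).
# Iteration I writes the element that iteration 3*I reads, so NEXT(I) = 3*I,
# and a chain can only start at an I that is not a multiple of 3.
N = 20
visited = []
run_chains(
    init=lambda: [i for i in range(1, N + 1) if i % 3 != 0],
    terminate=lambda i: i > N,
    next_iter=lambda i: 3 * i,
    body=visited.append,
)
print(visited)
```

Each iteration is visited exactly once, and within a chain the writer always runs before the reader, which is the property the three unknown functions have to guarantee.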
3.2 Solving recurrence equations in the unified iteration space
Dependence equations: $iA + a = jB + b$
Recurrence equations: $j = iT + t$, or $i = (j - t)T^{-1} = jT^{-1} - tT^{-1}$, where
  $T = AB^{-1}$
  $t = (a - b)B^{-1}$
A recurrence chain is a sequence of dependent iterations such that
  $i_{k+1} = i_k T + t$, or $i_{k+1} = (i_k - t)T^{-1}$
  $i_0 \in \{\, i \mid \nexists\, j : iA + a = jB + b \ \text{or}\ iB + b = jA + a \,\}$
We have a variable dependence distance $d_k = i_{k+1} - i_k$:
  $d_{k+1} = d_k T$, or $d_k = d_{k+1} T^{-1}$
$d$ is not constant; it changes exponentially with base $\alpha = \max(1/|T|, |T|)$, so the dependence chain length is $O(\log_\alpha L)$, where $L$ is the diameter of the iteration space.
When $|T|$ is negative, one can cut the recurrence chain to 2 iterations by lexicographical ordering.
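As a worked instance of these equations, the sketch below computes T and t for the loop `a(3*I1+1, 2*I1+I2-1) = a(I1+3, I2+1)` from the results section and follows one chain. It is a minimal illustration under stated assumptions: iterations are row vectors, the written reference is taken as iA + a and the read reference as jB + b, the helper names are hypothetical, and the bound of 30 on both indices is chosen only for the example.

```python
from fractions import Fraction

def vecmat(v, m):
    """Row vector times 2x2 matrix."""
    return tuple(v[0] * m[0][k] + v[1] * m[1][k] for k in range(2))

def matmul(x, y):
    return [list(vecmat(row, y)) for row in x]

def inv2(m):
    """Exact inverse of a 2x2 matrix using rational arithmetic."""
    (p, q), (r, s) = m
    det = Fraction(p * s - q * r)
    return [[s / det, -q / det], [-r / det, p / det]]

def recurrence(A, a, B, b):
    """Return (T, t) such that j = i T + t, derived from i A + a = j B + b."""
    Binv = inv2(B)
    return matmul(A, Binv), vecmat((a[0] - b[0], a[1] - b[1]), Binv)

def chain(start, T, t, n):
    """Follow i -> i T + t while both indices stay within 1..n."""
    out, i = [], start
    while all(1 <= x <= n for x in i):
        out.append(i)
        nxt = tuple(p + q for p, q in zip(vecmat(i, T), t))
        if nxt == i:          # guard (an assumption): a fixed point would never leave the space
            break
        i = nxt
    return out

T, t = recurrence(A=[[3, 2], [0, 1]], a=(1, -1), B=[[1, 0], [0, 1]], b=(3, 1))
c = chain((2, 1), T, t, n=30)
d = [tuple(y - x for x, y in zip(p, q)) for p, q in zip(c, c[1:])]
print(c)
print(d)
```

In the printed distances each vector is the previous one multiplied by T, so the steps grow geometrically and the chain leaves a space of diameter L after a number of steps logarithmic in L.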
3.3 Generate code?
DOALL I = i0
  WHILE (I is in Iteration Space) DO
    S(I)
    I = IT+t or I = (I-t)T^{-1}
  ENDDO
ENDDOALL
Problem: How to tell which index update respects the dependency order?
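[Figure: the iteration space with axes I1 and I2, showing recurrence chain points i0, i1, i2, i3, i4; iterations labelled independent or cyclic; chain points labelled integer or non-integer; regions R1, R2, R3, R4; initial set, intermediate set, final set.]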
3.3 Generate code!
DOALL I in P1
  IF (IT+t < I)
    T = T^{-1}; t = -tT
  ENDIF
  WHILE (I is in Iteration Space) DO
    S(I)
    I = IT+t
  ENDDO
ENDDOALL
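The direction test in this code can be sketched in Python as follows. This is an illustration under assumptions rather than the generated code itself: the helper names are hypothetical, "<" is taken to be lexicographic order on index tuples, the inverse update is derived from i = (j - t)T^{-1} (so the new offset is -tT^{-1}), and a fixed-point guard is added that the slide does not show.

```python
from fractions import Fraction

def vecmat(v, m):
    """Row vector times 2x2 matrix."""
    return tuple(v[0] * m[0][k] + v[1] * m[1][k] for k in range(2))

def inv2(m):
    (p, q), (r, s) = m
    det = Fraction(p * s - q * r)
    return [[s / det, -q / det], [-r / det, p / det]]

def step(i, T, t):
    return tuple(x + y for x, y in zip(vecmat(i, T), t))

def oriented(T, t, i):
    """Return (T, t) such that i -> i T + t does not move lexicographically backwards."""
    if step(i, T, t) < i:                        # tuples compare lexicographically
        Tinv = inv2(T)
        t = tuple(-x for x in vecmat(t, Tinv))   # t' = -t T^{-1}
        T = Tinv
    return T, t

def walk(T, t, start, n):
    """Visit the chain from `start` while the points are integers within 1..n."""
    T, t = oriented(T, t, start)
    chain, i = [], start
    while all(Fraction(x).denominator == 1 and 1 <= x <= n for x in i):
        chain.append(i)
        nxt = step(i, T, t)
        if nxt == i:                             # guard (an assumption): stop at a fixed point
            break
        i = nxt
    return chain

# The recurrence of a(3*I1+1, 2*I1+I2-1) = a(I1+3, I2+1), deliberately given backwards:
T_back = [[Fraction(1, 3), Fraction(-2, 3)], [0, 1]]
t_back = (Fraction(2, 3), Fraction(2, 3))
print(walk(T_back, t_back, (2, 1), n=30))
```

Starting from the backward form, the test flips the recurrence once and the WHILE loop then walks the chain in sequential order.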
4. Related work: Strength of REC
(1) Scalability
LEN = length of the chain
In comparison, unique-set oriented methods have to deal with LEN = 2, 3, … differently…
In REC, the WHILE loops adjust their steps automatically…
4. Related work: Strength of REC
(2) Outermost loop parallelism
Set-oriented:
DOALL I in P1
  S(I)
DOALL I in P2
  S(I)
…
DOALL I in Pn
  D(I)
Recurrence Chain:
DOALL I in P1
  IF (I > IT+t)
    T = T^{-1}; t = -tT
  ENDIF
  WHILE (I in IS) DO
    S(I)
    I = IT+t
  ENDDO
ENDDOALL
4. Related work: Shortcoming and alternatives
Restriction in the number of dependence equations
Fall back to the following algorithms:
  A recursive 3-sets partitioning (3P), similar to unique-sets partitioning but more accurate: it can reuse the calculations for P1, P2, P3.
  PDM and other uniformization techniques: PDM is light-weight and can be applied first, followed by 3P.
[Figure: goal model of loop parallelization techniques. Legend: Goal, Task, Softgoal; links are labelled Make, Help, or Hurt.]
Node labels: Non-uniform Dependence Loop Parallelization; Uniform Dependence Loop Parallelization; Uniformization; DOALL; DOACROSS; Set-Oriented; Unique Sets splitting; Recurrence Chain partitioning; Outermost parallelism; Maximal parallelism; Minimal synchronization; Maximal Coverage; Affine Loop bounds; Multiple references; Affine Array subscripts; Non-perfectly Nested Loop; Multiple dimension of loop nests; Statement-level parallelism; Finest Partitioning; Load Balancing; Schedule Regularity
Work cited in the figure: Yu & D’Hollander, 04; Ju & Chaudhary, 97; Cho & , 97; Pean & Chen, 01; Tzen & Ni, 91; Chen & Yew, 96; Punyamurtula et al, 99; Shang et al, 96; Lim & Lam, 99; Yu & D’Hollander, 00; Wolfe, 87; Wolf, 91; Banerjee, 93; D’Hollander, 92
Evaluation labels shown on the goal model: sat, den; partly, fully
4. Implementations
Front end: source-to-source transformations
  PDM/PL in FPT
  Set-oriented algorithms in FPT <-> XML/XSLT <-> OC
Back end: Intel Fortran compiler + OpenMP directives
Experiments on an EPICMP 4-CPU server
5. Results
5.1 Yu, ICPP00
DO I1=1,N1
  DO I2=1,N2
    a(3*I1+1,2*I1+I2-1) = a(I1+3,I2+1)
  ENDDO
ENDDO
5.2 Ju, 1997’s example
DO I=1,N
  DO J=1,N
    a(2*I+3,J+1) = …
    … = a(I+2*J+1,I+J+3)
  ENDDO
ENDDO
det(PDM) = 2
Ju’s Example
Comparison
We corrected the loop-bounds flaw in Ju’s 97 paper; 5 unique sets were derived for this case when N = 12.
But theoretically O(2^(log2 N)) = O(N) unique sets are needed.
In REC partitioning, just one set, P1, needs to be calculated for the initial i0.
5.3 Chen, 96’s Example
DO I=1,N
  DO J=1,I
    DO K=J,I
      ... = a(I+2*K+5,4*K-J)
    ENDDO
    a(I-J,I+J) = ...
  ENDDO
ENDDO
Chen’s Example
A special case
  It is a non-perfectly nested loop
  First convert it into the unified iteration space
  Then symbolically calculate P1, P2, P3 and find P2 = empty
  Therefore the recurrence chains are at most 1 iteration long, regardless of the loop bounds
Both REC and three-region partitioning lead to the same optimal solution
5.4 Cholesky kernel (I,K,J,L)
      DO 1 I = 0,NRHS
      DO 1 K = 0,2*N+1
        IF (K.LE.N) THEN
          I0 = MIN(M,N-K)
        ELSE
          I0 = MIN(M,2*N-K+1)
        ENDIF
      DO 1 J = 0,I0
C$DOISV
      DO 1 L = 0,NMAT
        IF (K.LE.N) THEN
          IF (J.EQ.0) THEN
8           B(I,L,K)=B(I,L,K)*A(L,0,K)
          ELSE
7           B(I,L,K+J)=B(I,L,K+J)-A(L,-J,K+J)*B(I,L,K)
          ENDIF
        ELSE
          IF (J.EQ.0) THEN
9           B(I,L,K)=B(I,L,K)*A(L,0,K)
          ELSE
6           B(I,L,K-J)=B(I,L,K-J)-A(L,-J,K)*B(I,L,K)
          ENDIF
        ENDIF
1     CONTINUE

C     THE ORIGINAL KERNEL
      DO 6 I = 0, NRHS
      DO 7 K = 0, N
      DO 8 L = 0, NMAT
8     B(I,L,K) = B(I,L,K) * A(L,0,K)
      DO 7 J = 1, MIN (M, N-K)
      DO 7 L = 0, NMAT
7     B(I,L,K+J) = B(I,L,K+J) - A(L,-J,K+J) * B(I,L,K)
      DO 6 K = N, 0, -1
      DO 9 L = 0, NMAT
9     B(I,L,K) = B(I,L,K) * A(L,0,K)
      DO 6 J = 1, MIN (M, K)
      DO 6 L = 0, NMAT
6     B(I,L,K-J) = B(I,L,K-J) - A(L,-J,K) * B(I,L,K)
Loop Fusion
Cholesky Kernel
[Figure: loop projections of the (I,K,J,L) iteration space; axis labels I, K, J (plane: L=0) and I, K, L.]
6. Summary
Recurrence Chain partitioning is scalable to any size of the iteration space
REC partitioning reveals outermost parallelism, with no synchronization between partitioned regions
The limitation of REC partitioning and its compensation: we provide fall-back alternatives if REC cannot be applied
  (1) PDM + Minimal distance (always applicable)
  (2) Recursive three-region partitioning (applicable for constant loop bounds; in some cases, e.g. Chen’s example, for any loop bounds)
[Figure labels: PDM, 3R, REC]