J Comput Virol Hack Tech (2014) 10:137–143, DOI 10.1007/s11416-014-0208-9

CORRESPONDENCE

A loop splitting method for single loops with non-uniform dependences

Sam Jin Jeong

Received: 27 December 2013 / Accepted: 31 January 2014 / Published online: 12 March 2014
© Springer-Verlag France 2014

1 Introduction

Computationally expensive programs spend most of their time in the execution of DO-loops [1, 2]. Therefore, an efficient approach to exploiting potential parallelism is to concentrate on the parallelism available in the loops of ordinary programs, which has a considerable effect on the speedup [3, 4].

Through some loop transformations using the dependence distance in single loops [6, 10], a loop can be split into partial loops that execute in parallel without violating data dependence relations; that is, the size of the distance can be used as a reduction factor [6], which is the number of iterations that can be executed in parallel. In the case of a non-constant distance, one that varies between different instances of the dependence, it is much more difficult to maximize the degree of parallelism extracted from a loop.

Partitioning of loops requires efficient and exact data dependence analysis. A precise dependence analysis helps in identifying the dependent and independent iterations of a loop, and it is important that an appropriate dependence analysis be applied to exploit maximum parallelism within loops. We can consider some tests that examine the dependence of one-dimensional subscripted variables: the separability test, the GCD test, and the Banerjee test [8, 11]. In general, the GCD test is applied first because of its simplicity, even though it is an approximate test. When the GCD test succeeds, the separability test is applied next, and through this exact test we can obtain additional information such as the solution set and the minimum and maximum dependence distances, as well as whether a dependence exists at all.

This paper is a contribution to the special issue on Mobile Communication Systems, selected topic from the SDPM, and it is coordinated by Sangyeob Oh, K. Chung, and Supratip Ghose.

S. J. Jeong (B), Division of Information and Communication Engineering, Baekseok University, 115 Anseo-dong, Cheonan City 330-704, Korea. e-mail: [email protected]

When we consider the approach for single loops, we can review two partitioning techniques proposed in [6], namely fixed partitioning with the minimum distance and variable partitioning with ceil(d(i)). However, these leave some parallelism unexploited, and the second case has some constraints.

The rest of this paper is organized as follows. Section 2 describes our loop model and introduces the concept of data-dependence computation in actual programs. In Sect. 3, we review some partitioning techniques for single loops, such as loop splitting by thresholds and Polychronopoulos' loop splitting. In Sect. 4, we propose a generalized and optimal method for partitioning the iteration space of a loop into partitions of variable sizes. Section 5 compares our method with related work. Finally, we conclude in Sect. 6 with directions for enhancing this work.

2 Program model and data dependence computation

For data-dependence computation in actual programs, the most common situation occurs when we compare two variables in a single loop and those variables are elements of a one-dimensional array with subscripts linear in the loop index variable. This kind of loop has the general form shown in Fig. 1. Here, l, u, a1, a2, b1 and b2 are integer constants known at compile time.

For a dependence between statements S1 and S2 to exist, we must have an integer solution (i, j) to Eq. (1), which is a linear Diophantine equation in two variables. The method for solving such equations is well known and is based on the extended Euclidean algorithm [7].

DO I = l, u
  S1: A(a1*I + a2) = ···
  S2: ··· = A(b1*I + b2)
END

Fig. 1 A single loop model

a1i + a2 = b1j + b2, where l ≤ i, j ≤ u. (1)

This equation may have infinitely many solutions (i, j), given by a formula of the form

(i, j) = ((b1/g)t + i1, (a1/g)t + j1),
where (i1, j1) = ((b2 − a2)i0/g, (b2 − a2)j0/g), (2)

Here, i0 and j0 are any two integers such that a1i0 − b1j0 = g = gcd(a1, b1), and t is an arbitrary integer [8, 9]. Acceptable solutions are those for which l ≤ i, j ≤ u, and in this case the range for t is given by

max(min(α, β), min(γ, δ)) ≤ t ≤ min(max(α, β), max(γ, δ)),
where α = −(l − i1)/(b1/g), β = −(u − i1)/(b1/g),
γ = −(l − j1)/(a1/g), δ = −(u − j1)/(a1/g). (3)
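
To make the computation behind Eqs. (1)–(3) concrete, the following Python sketch (my own illustration, not code from the paper) enumerates the dependence solutions with the extended Euclidean algorithm. All function and variable names are assumptions, and nonzero coefficients a1 and b1 are assumed.

```python
from math import ceil, floor

def extended_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g, where |g| == gcd(|a|, |b|)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y

def dependence_solutions(a1, a2, b1, b2, l, u):
    """All integer pairs (i, j) with a1*i + a2 == b1*j + b2 and l <= i, j <= u."""
    assert a1 != 0 and b1 != 0        # loops with a zero coefficient are handled separately
    c = b2 - a2                       # rewrite Eq. (1) as a1*i - b1*j == c
    g, x, y = extended_gcd(a1, b1)    # a1*x + b1*y == g
    if c % g != 0:
        return []                     # no integer solution, hence no dependence (GCD test)
    i1, j1 = x * (c // g), -y * (c // g)     # a particular solution, as in Eq. (2)
    si, sj = b1 // g, a1 // g                # general solution: (i1 + si*t, j1 + sj*t)

    def t_range(start, step):         # all t keeping l <= start + step*t <= u
        lo, hi = (l - start) / step, (u - start) / step
        if step < 0:
            lo, hi = hi, lo
        return ceil(lo), floor(hi)

    (ti_lo, ti_hi), (tj_lo, tj_hi) = t_range(i1, si), t_range(j1, sj)
    t_lo, t_hi = max(ti_lo, tj_lo), min(ti_hi, tj_hi)   # intersect the two ranges (the role of Eq. (3))
    return [(i1 + si * t, j1 + sj * t) for t in range(t_lo, t_hi + 1)]

# With a1, a2, b1, b2, l, u = 3, 1, 2, -4, 1, 16 (the loop of Fig. 3 in Sect. 3),
# this returns [(1, 4), (3, 7), (5, 10), (7, 13), (9, 16)].
```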

3 Related works

Now we review some partitioning techniques for single loops. We can exploit any parallelism available in a single loop of the form in Fig. 1 by distinguishing the four possible cases for a1 and b1, the coefficients of the index variable I, as given by (4).

(a) a1 = b1 = 0
(b) a1 = 0, b1 ≠ 0 or a1 ≠ 0, b1 = 0
(c) a1 = b1 ≠ 0
(d) a1 ≠ 0, b1 ≠ 0, a1 ≠ b1 (4)

In case 4(a), because there is no cross-iteration dependence, the resulting loop can be directly parallelized. In the following subsections, we briefly review several loop splitting methods for cases 4(b) through 4(d).
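
As a small helper (my own sketch, not from the paper), the case analysis of (4) can be written directly; the function name is an assumption.

```python
def classify(a1, b1):
    """Map the coefficients of the index variable I to the cases (a)-(d) of (4)."""
    if a1 == 0 and b1 == 0:
        return "a"              # no cross-iteration dependence
    if a1 == 0 or b1 == 0:
        return "b"              # loop splitting by the turning threshold
    if a1 == b1:
        return "c"              # uniform distance: constant threshold
    return "d"                  # non-uniform dependence
```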

3.1 Loop splitting by thresholds

A threshold indicates the number of times the loop may be executed without creating the dependence. In case 4(b), for a dependence to exist, there must be an integer value i of the index variable I such that b1i + b2 = a2 (if a1 = 0) or a1i + a2 = b2 (if b1 = 0) and l ≤ i ≤ u. If there is no such solution, then there is no cross-iteration dependence and the loop can also be parallelized. If the integer i exists, then there is a flow dependence (or anti-dependence) in the range [l, i] of I and an anti-dependence (or flow dependence) in [i, u]. In this case, by breaking the loop at the iteration I = i (called the turning threshold), the two partial loops can be transformed into parallel loops. Figure 2a shows an example of loop splitting by the turning threshold.
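
A minimal sketch of computing the turning threshold for case 4(b), assuming a1 = 0 (the case b1 = 0 is symmetric); this is my own illustration, not the paper's code.

```python
def turning_threshold(a2, b1, b2, l, u):
    """Iteration i with b1*i + b2 == a2 and l <= i <= u, or None if no dependence."""
    if (a2 - b2) % b1 == 0 and l <= (a2 - b2) // b1 <= u:
        return (a2 - b2) // b1        # break the loop at I = i
    return None                       # no cross-iteration dependence: fully parallel
```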

In case 4(c), let (i, j) be an integer solution to (1); then there exists a dependence in the range of I, and the dependence distance d is |j − i| = |(a2 − b2)/a1|. Here, the loop can be transformed into two perfectly nested loops: a serial outer loop with stride d (called the constant threshold) and a parallel inner loop [8]. Figure 2b shows the general form of loop splitting by the constant threshold.
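
The constant-threshold transformation can be sketched as follows (my own Python rendering of the idea, assuming a positive integer distance d; the helper name and the body callback are assumptions).

```python
def split_by_constant_threshold(l, u, d, body):
    """Case 4(c): uniform dependence distance d >= 1 (the constant threshold)."""
    assert d >= 1
    for start in range(l, u + 1, d):                  # serial outer loop with stride d
        # The d iterations of one block carry no dependence among themselves,
        # so this inner loop could execute in parallel.
        for i in range(start, min(start + d, u + 1)):
            body(i)
```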

In case 4(d), an existing dependence is non-uniform, since it has a non-constant distance, that is, one that varies between different instances of the dependence. We can consider exploiting parallelism in the two cases a1b1 < 0 and a1b1 > 0. Suppose now that a1b1 < 0. If (i, i) is a solution to (1), then all dependence sources may lie in [l, i] and all dependence sinks in [i, u]. Therefore, by splitting the loop at the iteration I = i (called the crossing threshold), the two partial loops can be directly parallelized [8]. Figure 2c shows the general form of loop splitting by the crossing threshold.
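
A small sketch (my own, with assumed names) of computing the crossing threshold for case 4(d) with a1b1 < 0.

```python
def crossing_threshold(a1, a2, b1, b2):
    """Iteration i at which (i, i) solves Eq. (1): a1*i + a2 == b1*i + b2."""
    num, den = b2 - a2, a1 - b1       # a1*b1 < 0 implies a1 != b1, so den != 0
    assert num % den == 0             # assume an integer solution exists, as in the text
    return num // den
# Splitting the loop at I = i, as in Fig. 2c, yields two partial loops,
# each of which can be parallelized directly.
```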

3.2 Polychronopoulos’ loop splitting

We can also consider exploiting parallelism in case 4(d) when a1b1 > 0. We will consider three cases, according to whether only a flow dependence, only an anti-dependence, or both exist in the range of I. First, let (i, j) be an integer solution to (1). If the distance d(i) depending on i, as given by (5), has a positive value, then there exists a flow dependence, and if da(j) depending on j, as given by (6), has a positive value, then there exists an anti-dependence. Next, if (x, x) is a solution to (1) (x may not be an integer), then d(x) = da(x) = 0, and there may exist a flow (or anti-) dependence and an anti- (or flow) dependence before and after I = ceil(x); if x is an integer, then there exists a loop-independent dependence at I = x. Here, suppose that only a flow dependence exists. Then, for each value of I, the element A(a1 I + a2) defined by that iteration cannot be consumed before ceil(d(i)) iterations later, which indicates that ceil(d(i)) iterations can execute in parallel.

d(i) = j − i = D(i)/b1, where D(i) = (a1 − b1)i + (a2 − b2) (5)
da(j) = i − j = Da(j)/a1, where Da(j) = (b1 − a1)j + (b2 − a2) (6)

Consider the loop given in Fig. 3, in which there exist flow dependences. Here d(i) = D(i)/b1 = (i + 5)/2 > 0 for each value of I, and d(i) takes the integer values 3, 4, 5, … as the value of I is incremented.
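
A quick check of these distances (my own sketch; N = 16 is assumed here, as in the example of Sect. 5).

```python
# Fig. 3: S1 writes A(3*I+1), S2 reads A(2*I-4), so a1, a2, b1, b2 = 3, 1, 2, -4.
a1, a2, b1, b2, N = 3, 1, 2, -4, 16
for i in range(1, N + 1):
    D = (a1 - b1) * i + (a2 - b2)              # D(i) of Eq. (5); d(i) = D(i)/b1 = (i+5)/2
    if D % b1 == 0 and D // b1 > 0 and i + D // b1 <= N:
        print(f"flow dependence: source I={i}, sink I={i + D // b1}, d(i)={D // b1}")
# Prints d(i) = 3, 4, 5, 6, 7 for the sources I = 1, 3, 5, 7, 9.
```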

Figure 4 shows the results of applying the two transformations proposed in [6], using the minimum distance and ceil(d(i)), respectively, to the loop in Fig. 3.

Fig. 2 The loop splitting by thresholds: (a) turning threshold, (b) constant threshold, (c) crossing threshold

However, these leave some parallelism unexploited. Moreover, the transformed loop in Fig. 4b has some constraints: there must be only a flow dependence in the original loop, the first iteration of the original loop must be the iteration at which a dependence source exists, and d(i) ≥ 1 must hold for each value of I.

4 Maximizing parallelism for single loops

For a single loop with non-constant distance that satisfies case (d) in (4) with a1b1 > 0, we can derive the following theorems. For convenience, suppose that there is a flow dependence in the loop.

Theorem 1 The number of iterations between a dependence source and the next source, sd, is |b1|/g, where g = gcd(a1, b1).

Proof Let i, i′ be the iterations for a source and the next source, respectively. Then from (2), i = (b1/g)t + i1 and i′ = (b1/g)(t + 1) + i1 for i1, g and t as defined in (2). Therefore, sd = |i′ − i| = |b1|/g. □

DO I = 1, N
  S1: A(3*I+1) = ···
  S2: ··· = A(2*I-4)
ENDDO

Fig. 3 An example of a single loop

From Theorem 1 we also know that if we obtain the iteration of the first dependence source, then we can compute the others easily, and that i ≡ j (mod sd) for any two iterations i and j that are dependence sources.

Theorem 2 The dependence distance, that is, the number of iterations between the source and the sink of a dependence, is D(i)/b1, where D(i) = (a1 − b1)i + (a2 − b2); the rate at which the distance increases per iteration is d′ = (a1 − b1)/b1; and the difference between the distance of a dependence and that of the next dependence is dinc = |a1 − b1|/g.

Proof The distance and d′ follow from (5), and dinc = d′ · sd = ((a1 − b1)/b1) · |b1|/g = |a1 − b1|/g. □

Hence, if we obtain the distance of the first dependence, then we can compute the others easily. Theorems 1 and 2 have analogous forms for an anti-dependence: sd is |a1|/g iterations, where g = gcd(a1, b1), the distance is given by (6), d′ = (b1 − a1)/a1, and dinc = d′ · sd = ((b1 − a1)/a1) · |a1|/g = |b1 − a1|/g.
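
The quantities of Theorems 1 and 2 can be computed directly; the following sketch (my own, with assumed names) checks them against the loop of Fig. 3.

```python
from math import gcd

def dependence_parameters(a1, b1):
    """sd, dinc and d' of Theorems 1 and 2 for a flow dependence."""
    g = gcd(abs(a1), abs(b1))
    sd = abs(b1) // g                 # iterations between consecutive sources (Theorem 1)
    dinc = abs(a1 - b1) // g          # growth of the distance between consecutive dependences
    d_rate = (a1 - b1) / b1           # increase of the distance per iteration, d'
    return sd, dinc, d_rate

print(dependence_parameters(3, 2))    # Fig. 3: (2, 1, 0.5), i.e. a source every 2 iterations
                                      # and distances 3, 4, 5, ... growing by 1
```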

Fig. 4 Polychronopoulos' loop splitting: (a) using the minimum distance, (b) using ceil(d(i))

By using the iteration and distance of the first dependence source, together with the concepts defined by Theorems 1 and 2, we obtain a generalized and optimal algorithm to maximize parallelism from single loops with non-uniform dependences, called the MPSL [5].

Procedure MaxSplit_1 shows the transformation of single loops satisfying case (d) in (4) with a1b1 > 0 into partial parallel loops.

In step 2 of Procedure MaxSplit_1, the iteration of the first dependence sink in each block, Sr_i + d_i, is selected as the first iteration of the next block, St_{i+1}, to maximize the parallelism extracted from the loop. The iteration and the distance of the first dependence source in each block are obtained in steps 3 and 4, respectively. The transformed loop speeds up by a factor of St_{i+1} − St_i (the reduction factor, i.e., the number of iterations that can be executed in parallel), which is the stride of the outer loop.
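
Since Procedure MaxSplit_1 itself is given as a figure, the following is only a reconstruction of the block construction described above (names are my own), for a flow dependence in case 4(d) with a1b1 > 0 and positive coefficients as in Fig. 3.

```python
def max_split_blocks(a1, a2, b1, b2, l, u):
    """Blocks (St_i, St_{i+1} - 1): iterations inside a block can run in parallel,
    while the blocks themselves execute serially, in order."""
    assert a1 > 0 and b1 > 0                       # sketch assumes positive coefficients
    blocks, start = [], l
    while start <= u:
        end = u
        for i in range(start, u + 1):              # first dependence source in the block
            D = (a1 - b1) * i + (a2 - b2)          # D(i) from Eq. (5)
            if D > 0 and D % b1 == 0 and i + D // b1 <= u:
                end = i + D // b1 - 1              # next block starts at the sink Sr_i + d_i
                break
        blocks.append((start, end))
        start = end + 1
    return blocks

# For the loop of Fig. 3 with N = 16 this yields [(1, 3), (4, 9), (10, 16)],
# i.e. the three parallel regions of Fig. 5c.
print(max_split_blocks(3, 1, 2, -4, 1, 16))
```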

Applying Procedure MaxSplit_1 to the loop in Fig. 3, the unrolled version of the result is shown in Fig. 5c. Figure 5a and b show the unrolled versions of Fig. 4a and b, respectively. As shown in Fig. 5, Procedure MaxSplit_1 maximizes the parallelism extracted from single loops with non-constant distances.

The following procedure, MaxSplit_2, generalizes the transformation of general single loops, as shown in Fig. 1, into parallel loops by using Procedure MaxSplit_1 and the ideas mentioned above.

In steps 5–6, if there exist a flow (or anti-) dependence and an anti- (or flow) dependence before and after I = ceil(x), as defined in step 3, then the loop is divided into two parts, and each part can be transformed by Procedure MaxSplit_1. No dependences are violated by the block merged in step 7, since only dependences of different types occur before and after I = ceil(x), and at most a loop-independent dependence may exist at I = x when x is an integer. An example of such a case is shown in Fig. 6a. In the loop of Fig. 6a, there exist both anti- and flow dependences, before and after i = 7. The unrolled version of this loop is shown in Fig. 6b. The results of transforming this loop by steps 5 and 6 are shown in Fig. 6c, and the result of merging them by step 7 is the loop in Fig. 6d.
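
One way to see why no block of the merged result contains both endpoints of any dependence is the following self-contained sketch (my own reconstruction, not the paper's Procedure MaxSplit_2): each block is closed just before the earliest "far end" of a dependence pair whose near end lies in the block. It assumes case 4(d) with positive coefficients a1 and b1.

```python
def far_end(a1, a2, b1, b2, k, u):
    """Nearest iteration m > k that forms a dependence pair with k, or None."""
    ends = []
    Df = (a1 - b1) * k + (a2 - b2)            # flow: write at k, read at k + Df/b1 (Eq. (5))
    if Df > 0 and Df % b1 == 0 and k + Df // b1 <= u:
        ends.append(k + Df // b1)
    Da = (b1 - a1) * k + (b2 - a2)            # anti: read at k, write at k + Da/a1 (Eq. (6))
    if Da > 0 and Da % a1 == 0 and k + Da // a1 <= u:
        ends.append(k + Da // a1)
    return min(ends) if ends else None

def split_with_both_kinds(a1, a2, b1, b2, l, u):
    """Blocks whose iterations are mutually independent, even when both flow and
    anti-dependences occur in [l, u]; the blocks execute serially, in order."""
    assert a1 > 0 and b1 > 0                  # sketch assumes positive coefficients
    blocks, start = [], l
    while start <= u:
        ends = [m for k in range(start, u + 1)
                if (m := far_end(a1, a2, b1, b2, k, u)) is not None]
        end = min(ends) - 1 if ends else u
        blocks.append((start, end))
        start = end + 1
    return blocks
```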

5 Performance analysis

Theoretical speedup for the performance analysis can be computed as follows.

Fig. 5 The unrolled versions of the transformed loops: (a) using the minimum distance, (b) using ceil(d(i)), (c) our proposed method

Fig. 6 The results of a loop with both flow and anti-dependences transformed by Procedure MaxSplit_2: (a) the original loop, (b) its unrolled version, (c) the results of steps 5 and 6, (d) the merged result of step 7

Ignoring synchronization, scheduling, and variable renaming overheads, and assuming an unlimited number of processors, each partition can be executed in one time step. Hence, the execution time is equal to the number of parallel regions, N. Generally, speedup is the ratio of the total sequential execution time to the execution time on the parallel computer system:

Speedup = Ni/N, where Ni is the number of iterations of loop i and N is the number of parallel regions.

Let us consider the loop shown in Fig. 3. Applying the minimum distance tiling method, this loop is tiled into six areas, as shown in Fig. 5a, so the speedup for this method is 16/6 = 2.67.

Using the variable partitioning with ceil(d(i)) method on this example, the loop is tiled into four areas, as shown in Fig. 5b, so the speedup for this method is 16/4 = 4.

Applying our proposed method to this example, the loop is tiled into three areas, as shown in Fig. 5c, so the speedup for this method is 16/3 = 5.3.
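
The three figures above can be recomputed directly under the stated idealized assumptions (unbounded processors, one time step per parallel region).

```python
N_i = 16                                   # iterations of the example loop (Fig. 3)
for method, regions in [("minimum distance", 6), ("ceil(d(i))", 4), ("proposed MPSL", 3)]:
    print(f"{method}: speedup = {N_i / regions:.2f}")   # -> 2.67, 4.00, 5.33
```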

Our proposed methods give more parallelism in comparison with some previous splitting methods.

6 Conclusions

In this paper, we have studied the parallelization of single loops with non-uniform dependences for maximizing parallelism. For single loops, we reviewed two partitioning techniques, fixed partitioning with the minimum distance and variable partitioning with ceil(d(i)). However, these leave some parallelism unexploited, and the second has some constraints. Therefore, we propose a generalized and optimal method, which we call MPSL (Maximizing Parallelism for Single Loops), to partition the iteration space of a loop into partitions of variable sizes.

Our algorithm generalizes how to transform general single loops with flow and anti-dependences into parallel loops.

In comparison with previous splitting methods, our proposed methods give better speedup and extract more parallelism.

Our future research is to consider the extension of the MPSL method to n-dimensional iteration spaces.

References

1. Oprisa, C., Cabau, G., Colesa, A.: Automatic code features extraction using bio-inspired algorithms. J. Comput. Virol. Hacking Tech. 9(4), 171–178 (2013)
2. Bilar, D., Filiol, E.: On self-reproducing computer programs. J. Comput. Virol. 9(1), 9–87 (2009)
3. Midkiff, S.P.: Automatic parallelization: an overview of fundamental compiler techniques. Synth. Lect. Comput. Archit. 1(19), 1–10 (2012)
4. Li, W., Pingali, K.: A singular loop transformation framework based on non-singular matrices. J. Parallel Program. 22(2), 183–205 (1994)
5. Jeong, S.J.: A loop transformation for parallelism from single loops. Int. J. Contents 2(4), 8–11 (2006)
6. Polychronopoulos, C.D.: Compiler optimizations for enhancing parallelism and their impact on architecture design. IEEE Trans. Comput. 37(8), 991–1004 (1988)
7. Knuth, D.E.: The art of computer programming, vol. 2: Seminumerical algorithms. Addison-Wesley, Reading (1981)
8. Zima, H., Chapman, B.: Supercompilers for parallel and vector computers. Addison-Wesley, Reading (1991)
9. Banerjee, U., Eigenmann, R., Nicolau, A., Padua, D.A.: Automatic program parallelization. Proc. IEEE 81(2), 211–243 (1993)
10. Allen, J.R., Kennedy, K.: Automatic translation of Fortran programs to vector form. ACM Trans. Program. Lang. Syst. 9(4), 491–542 (1987)
11. Wolfe, M.J.: Optimizing supercompilers for supercomputers. MIT Press, Cambridge (1989)
