Journal of Systems Architecture 50 (2004) 729–742
www.elsevier.com/locate/sysarc
Exploitation of parallelism to nested loops with dependence cycles
Weng-Long Chang a, Chih-Ping Chu b, Michael (Shan-Hui) Ho a
a Department of Information Management, Southern Taiwan University of Technology, Tainan County 710, Taiwan, ROC
b Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan City 701, Taiwan, ROC
Received 24 March 2001; received in revised form 6 June 2004; accepted 10 June 2004
Available online 25 August 2004
Abstract
In this paper, we analyze recurrences in terms of the breakability of the dependence links formed among general multi-statements in a nested loop. The major findings include: (1) A sink variable renaming technique, which can reposition an undesired anti-dependence and/or output-dependence link, is capable of breaking an anti-dependence and/or output-dependence link. (2) For recurrences connected by only true dependences, a dynamic dependence concept and the technique derived from it are powerful in terms of parallelism exploitation. (3) By employing global dependence testing, a link-breaking strategy, Tarjan's depth-first search algorithm, and topological sorting, an algorithm for resolving a general multi-statement recurrence in a nested loop is proposed. Experiments with benchmarks cited from Vector loops showed that among the 134 subroutines tested, 3 had their parallelism exploitation improved by the proposed method. That is, the proposed algorithm increased the rate of parallelism exploitation of Vector loops by approximately 2.24%.
© 2004 Published by Elsevier B.V.

Keywords: Parallelizing compilers; Vectorizing compilers; Loop optimization; Data dependence analysis; Dependence cycle; Parallelism exploitation
1. Introduction
In high-speed computing [20], the two most popular parallel computational models are distributed memory multiprocessors and shared memory multiprocessors. Because memory hardware techniques [20] have improved, the access time of shared memory in a multiprocessor system has decreased markedly, and such systems have been increasingly used for scientific and engineering applications. However, the major shortcoming of shared memory multiprocessors is the difficulty of programming, because programmers are responsible for analyzing the data dependence relations among the statements in a program and for exploiting the parallelism of those statements on shared memory multiprocessors.
A successful vectorizing/parallelizing compiler is capable of exploiting the parallelism of a program. Three features of such a compiler determine its level of parallelism exploitation: (1) accurate data dependence testing, (2) efficient loop optimization and (3) efficient removal of undesired data dependences.
Techniques for dependence analysis algorithms, which directly support data dependence testing, have been developed and used quite successfully [2,5,7–10,20,22–27]. Computationally expensive programs in general spend most of their time executing loops. Extracting parallelism from the loops of an ordinary program therefore has a considerable effect on the speed-up. Many loop optimization methods have been developed; they fall broadly into two classes: loop vectorization and loop parallelization [3,4,20,21,28–30].
In terms of reducing data dependences, most research concentrates on loop optimization in the front-end of vectorizing/parallelizing compilers, using techniques such as scalar renaming, scalar expansion, scalar forward-substitution and dead code elimination [20]. Studies on the back-end of vectorizing/parallelizing compilers primarily deal with separating the parallel execution from the sequential execution of statements. Relatively less attention has been given to data dependence elimination [1,9,11].

A recurrence is a type of p-block in a general nested loop, which is extracted at the time of loop distribution, a back-end phase [17]. The statement(s) involved in a recurrence are strongly connected via various dependence types. Well-known techniques, such as node splitting, thresholding, cycle shrinking, etc., are able to eliminate the data dependences of a recurrence to a certain extent, depending upon the specific dependence types [1,17,18,20,21].
In this paper, we study parallelism exploitation for loops with dependence cycles on the basis of breaking dependence links. The breaking strategies for dependence cycles in a single loop were formally surveyed in [11]. That method is extended here to break the dependence links in a nest of loops. In Section 2, the concept of data dependence is reviewed. In Section 3, an analysis of the formation of dependence cycles is provided and three dependence link patterns are derived on the basis of the breaking strategy. For each link pattern, its features, breaking techniques and applications are introduced in detail. An algorithm is developed for the resolution of a general multi-statement recurrence. Experimental results showing the advantages of the proposed method are given in Section 4. Finally, Section 5 contains a conclusion.
2. Data dependence
It is assumed that there are two statements within a general loop, and that the general loop contains n common loops; the statements are thus embedded in n common loops. Suppose an array A appears in both statements. If a statement S2 uses an element of the array A defined first by another statement S1, then S2 is true-dependent on S1. If a statement S2 defines an element of the array A used first by another statement S1, then S2 is anti-dependent on S1. If a statement S2 redefines an element of the array A defined first by another statement S1, then S2 is output-dependent on S1. Another kind of dependence, control dependence, which arises from control statements, is not addressed in this paper.
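For illustration, the following Fortran loop is a small sketch of our own (the arrays, bounds and statement bodies are illustrative assumptions, not taken from the paper) in which all three dependence types occur; a few additional loop-independent relations among S2, S3 and S4 also exist but are not annotated.

```fortran
      SUBROUTINE DEP_KINDS(A, B, C, N)
!     Illustrative sketch only (hypothetical example, not from the paper).
      INTEGER N, I
      REAL A(N), B(N), C(N)
      DO I = 2, N - 1
        A(I)   = B(I) + 1.0          ! S1
        C(I)   = A(I-1)              ! S2: S1 [dt] S2, loop-carried true dependence
        B(I-1) = C(I) * 2.0          ! S3: S1 [da] S3, loop-carried anti-dependence
        A(I-1) = C(I) - 1.0          ! S4: S1 [do] S4, loop-carried output dependence
      END DO
      END SUBROUTINE DEP_KINDS
```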
Each iteration of a general loop is identified by an iteration vector whose elements are the values of the iteration variables for that iteration. For example, the instance of the statement S1 during iteration $\vec{i} = (i_1,\ldots,i_n)$ is denoted $S_1(\vec{i})$; the instance of the statement S2 during iteration $\vec{j} = (j_1,\ldots,j_n)$ is denoted $S_2(\vec{j})$. If $(i_1,\ldots,i_n)$ is identical to $(j_1,\ldots,j_n)$, or $(i_1,\ldots,i_n)$ precedes $(j_1,\ldots,j_n)$ lexicographically, then $S_1(\vec{i})$ is said to precede $S_2(\vec{j})$, denoted $S_1(\vec{i}) < S_2(\vec{j})$. Otherwise, $S_2(\vec{j})$ is said to precede $S_1(\vec{i})$, denoted $S_1(\vec{i}) > S_2(\vec{j})$. In the following, Definitions 2.1–2.7, cited from [3,4,11], will be used later.
Definition 2.1. Loop-independent dependence refers to a dependence confined within each single iteration. Loop-independent dependences include loop-independent true-dependence (denoted $d^t$), loop-independent anti-dependence (denoted $d^a$) and loop-independent output-dependence (denoted $d^o$). These relations are represented by the set $D$, i.e., $D = \{d^t, d^a, d^o\}$.
Definition 2.2. Consistent loop-carried dependence refers to a dependence occurring across iteration boundaries. Consistent loop-carried dependences include consistent loop-carried true-dependence (denoted $[d^t]$), consistent loop-carried anti-dependence (denoted $[d^a]$) and consistent loop-carried output-dependence (denoted $[d^o]$). These relations are represented by the set $[D]$, i.e., $[D] = \{[d^t], [d^a], [d^o]\}$.
Definition 2.3. A vector of the form $\vec{\theta} = (\theta_1,\ldots,\theta_n)$ is termed a direction vector. The direction vector $(\theta_1,\ldots,\theta_n)$ is said to be the direction vector from $S_1(\vec{i})$ to $S_2(\vec{j})$ if for $1 \le k \le n$, $i_k\,\theta_k\,j_k$ holds, i.e., the relation $\theta_k$ is defined by
$$\theta_k = \begin{cases} <, & \text{if } i_k < j_k,\\ =, & \text{if } i_k = j_k,\\ >, & \text{if } i_k > j_k,\\ *, & \text{if the relation of } i_k \text{ and } j_k \text{ can be ignored, i.e., can be any one of } \{<,=,>\}. \end{cases}$$
We write $S_1\,d^{\vec{\theta}}\,S_2$, where $d \in D \cup [D]$ and $\vec{\theta} = (\theta_1,\theta_2,\ldots,\theta_n)$.
Definition 2.4. The dependence distance vector from $S_1(\vec{i})$ to $S_2(\vec{j})$ is denoted by $\mathrm{dist}(\vec{i},\vec{j}) = (j_1 - i_1, \ldots, j_n - i_n)$.
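As a concrete illustration of Definitions 2.3 and 2.4 (the loop and arrays are our own example, not drawn from the paper), consider the following nest, in which S2 is true-dependent on S1:

```fortran
      SUBROUTINE DIST_EXAMPLE(A, B, C, N, M)
!     Illustrative sketch only (hypothetical example, not from the paper).
      INTEGER N, M, I, J
      REAL A(N,M), B(N,M), C(N,M)
!     S1 defines A(I,J); S2 reads A(I-1,J-2), i.e. the element that S1 defined
!     one iteration of I and two iterations of J earlier.
!     Dependence distance vector: dist = (1, 2); direction vector: (<, <).
      DO I = 2, N
        DO J = 3, M
          A(I,J) = B(I,J)            ! S1: source of the true dependence
          C(I,J) = A(I-1,J-2)        ! S2: sink of the true dependence
        END DO
      END DO
      END SUBROUTINE DIST_EXAMPLE
```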
Definition 2.5. The dependence distance matrix of
nested loops is a matrix whose columns are the
dependence distance vectors of all the dependences
in nested loops.
Definition 2.6. For an inter-statement or intra-statement dependence, the source variable of the dependence refers to the instance of an indexed variable that is accessed first; the sink variable of the dependence refers to the instance of an indexed variable that is accessed later.
Definition 2.7. The direction vector matrix of
nested loops is a matrix whose columns are the
direction vectors of all the dependences in nested
loops.
In the following, Lemma 2.1, cited from [12], introduces a dependence relation for which direct vectorization of the code is available. Lemma 2.2, cited from [1,21], emphasizes an important fact: no matter how complicated the dependence relations among statements are, as long as the dependence relations do not form a cycle, the statements can always be vectorized via statement reordering.
Lemma 2.1. A nest of loops without inconsistently dependent statement(s) can be fully vectorized directly if the following two conditions hold.

1. There does not exist a single statement S1 such that $S_1(\vec{i})\,[d^t]\,S_1(\vec{j})$.
2. There does not exist a pair of statements S1 and S2, where S1 < S2, such that $S_2(\vec{j})\,d\,S_1(\vec{i})$, where $d \in [D]$.
Lemma 2.2. A nest of loops with two statements S1 and S2, where S1 < S2, $S_2(\vec{j})\,d\,S_1(\vec{i})$ and $d \in [D]$ is the only dependence relation between the statements, can be vectorized via the statement reordering technique, i.e., by reordering S1 and S2 to become S2 < S1.
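A hedged sketch of the situation covered by Lemma 2.2 (the loop and arrays are our own illustration, not from the paper): S1 is true-dependent, across iterations, on the lexically later statement S2; once S2 is moved in front of S1, each statement can be executed as a whole-array operation.

```fortran
      SUBROUTINE REORDER_EXAMPLE(A, B, C, N)
!     Illustrative sketch only (hypothetical example, not from the paper).
      INTEGER N, I
      REAL A(N), B(N), C(N)
!     Original order: S1 reads B(I-1), which S2 defines in an earlier iteration,
!     so S2 [dt] S1 with S1 < S2 (a lexically backward, loop-carried dependence),
!     and this is the only dependence relation.
!       DO I = 2, N
!         A(I) = B(I-1) + 1.0        ! S1
!         B(I) = C(I) * 2.0          ! S2
!       END DO
!     After reordering S2 in front of S1, each statement is vectorizable:
      B(2:N) = C(2:N) * 2.0          ! S2 first
      A(2:N) = B(1:N-1) + 1.0        ! S1 second
      END SUBROUTINE REORDER_EXAMPLE
```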
The vectorization of statements in a nest of loops is inhibited if the dependence relations among the statements form a dependence cycle. Statements involved in a dependence cycle are strongly connected by various dependence edges: there exists at least one path between any pair of statements in the dependence graph. The vectorizability of the dependence cycle(s) depends on whether the dependence cycle(s) can be broken, or on the extent to which dependence links can be eliminated [1,6,11,21].
3. The breaking strategy and exploitation of
parallelism
In general, we can exploit the parallelism of a nest of loops as long as one of the existing dependence links in a dependence cycle is breakable. If we break a dependence cycle and the new dependence links satisfy the condition of Lemma 2.1, then the statements involved in the dependence cycle can be vectorized. If we break a dependence cycle and the new dependence links satisfy the condition of Lemma 2.2, then the statements involved in the dependence cycle can be vectorized via statement reordering. Therefore, to deal with the parallelism exploitation of a dependence cycle, we only need to consider the breakability of its dependence links.
If we examine the possible dependence links of a statement $S_2(\vec{j})$ on a statement $S_1(\vec{i})$, we find that seven dependence links can exist: (1) true-dependence, (2) anti-dependence, (3) output-dependence, (4) true- and anti-dependences, (5) true- and output-dependences, (6) anti- and output-dependences and (7) true-, anti- and output-dependences [11]. With respect to the strategy of breaking a dependence link of a dependence cycle in a nest of loops, the dependence types of a link can be classified into three patterns: (1) an anti-dependence link, (2) an output-dependence link and (3) a true-dependence link, possibly combined with any other dependence link. The first two link patterns can be broken, while the third pattern of dependence link is unbreakable. In order to prove the correctness of the breaking strategy, we use a general dependence cycle to study the breaking strategy of the first two link patterns. Suppose we have n statements $S_0, S_1, \ldots, S_{n-1}$ in a nest of loops, where $S_0 < S_1 < \cdots < S_{n-1}$, and these n statements are involved in a dependence cycle in the nest of loops. If there exists a pair of statements $S_a$ and $S_b$, where $0 \le a, b \le n-1$ and $a \ne b$, such that the dependence link from $S_a$ to $S_b$ is one of the first two link patterns, the breaking techniques and applications are described below.
3.1. Pattern I––anti-dependence link
For an anti-dependence link of a dependence cycle in a single loop, a sink variable renaming algorithm for breaking such a dependence cycle was developed in [11,12]. We extend that algorithm to break such a dependence link of a dependence cycle in a nest of loops. The algorithm is described below.
Algorithm 1. The Breaking Strategy of Link Pattern I

Input:
(1) A nest of loops $L = \{l_1, l_2, \ldots, l_n\}$.
(2) A set of statements $S = \{S_0, S_1, \ldots, S_{n-1}\}$ that are involved in a dependence cycle.
(3) The dependence link from $S_a$ to $S_b$ is an anti-dependence link.
(4) A direction vector matrix $D = (\vec{d}_1, \vec{d}_2, \ldots, \vec{d}_p)$.

Output: New dependence links that do not form a new dependence cycle.

Method:
/*
Let $S_a^{src}$ and $S_b^{sink}$ represent the source and sink variables in the dependence link from $S_a$ to $S_b$, respectively.
Let $\vec{d}_b$ represent the direction vector from $S_a$ to $S_b$.
Let $S_a^{index}$ and $S_b^{index}$ represent the indexed expressions of the source and sink variables in the dependence link from $S_a$ to $S_b$, respectively.
*/
1. If one of the elements of the direction vector matrix D is the direction '>', then the transformation is abandoned and all of the statements are preserved. Otherwise, the two variables X and index represent, respectively, the sink variable name and the indexed expression of the sink variable in the anti-dependence between statement $S_a$ and statement $S_b$. Simultaneously, a Boolean variable is set to false.
2. Determine which variables are true-dependent on the sink variable X of the anti-dependence. If a variable $S_d^{sink}$ in a statement $S_d$ is true-dependent on the sink variable X and $S_d^{sink}$ is not the sink variable of another true-dependence, then the variable $S_d^{sink}$ is renamed and the Boolean variable is set to true. If the variable $S_d^{sink}$ is the sink variable of another true-dependence, then the transformation is abandoned and all of the statements are preserved.
3. If the value of the Boolean variable is true, then the sink variable of the anti-dependence relation is renamed, and each element of an extra temporary variable T takes the value of the corresponding element of the sink variable X. Otherwise, the sink variable in statement $S_b$ is renamed.
4. Add an extra assignment statement $S_c$, where $S_c > S_a$ and $S_c > S_b$.
5. $S_c$ takes the extra temporary variable T as its input variable and assigns the values computed in the nest of loops to the sink variable X.
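To make the renaming concrete, the following hedged Fortran sketch shows a nest of the kind Algorithm 1 targets and one possible transformed form. The arrays, bounds, the name XNEW and the pre-copy of X are our own illustrative assumptions; they are not reproduced from the paper's Fig. 1.

```fortran
      SUBROUTINE BREAK_ANTI(X, B, C, D, N, M)
!     Illustrative sketch only (hypothetical example, not the paper's Fig. 1).
      INTEGER N, M, I, J
      REAL X(0:N,0:M), B(0:N,0:M), C(0:N,0:M), D(0:N,0:M)
      REAL XNEW(0:N,0:M)
!     Original nest (shown as comments): the cycle S1 [dt] S2 [dt] S3 [da] S1,
!     with all direction vectors (<,<), prevents direct vectorization.
!       DO I = 1, N-1
!         DO J = 1, M-1
!           X(I,J) = D(I,J)                     ! S1
!           B(I,J) = X(I-1,J-1) * 2.0           ! S2: S1 [dt] S2
!           C(I,J) = B(I-1,J-1) + X(I+1,J+1)    ! S3: S2 [dt] S3 and S3 [da] S1
!         END DO
!       END DO
!     After renaming the sink variable X of the anti-dependence to XNEW:
      XNEW = X                                   ! keep the boundary values read by S2
      DO I = 1, N-1
        DO J = 1, M-1
          XNEW(I,J) = D(I,J)                     ! S1': sink variable renamed
          B(I,J)    = XNEW(I-1,J-1) * 2.0        ! S2': follows the renamed definition
          C(I,J)    = B(I-1,J-1) + X(I+1,J+1)    ! S3': still reads the original X
        END DO
      END DO
      X(1:N-1,1:M-1) = XNEW(1:N-1,1:M-1)         ! Sc: copy the new values back into X
!     The remaining dependences are acyclic, so the nest can be distributed and
!     each statement executed as a whole-array (vector) operation.
      END SUBROUTINE BREAK_ANTI
```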
Consider the example in Fig. 1(a). The data dependences in this example are $S_1\,[d^t]\,S_2\,[d^t]\,S_3\,[d^a]\,S_1$, and the direction vectors are (<,<), (<,<) and (<,<). Algorithm 1 is applied to Fig. 1(a) to generate the codes shown in Fig. 1(b). The data dependences in Fig. 1(b) are: (1) $S_1\,[d^t]\,S_2\,[d^t]\,S_3\,[d^a]\,S_c$ and (2) $S_1\,d^t\,S_c$. The dependence relations in Fig. 1(b) satisfy the conditions of Lemma 2.1; therefore, the codes in Fig. 1(b) can be fully vectorized directly. An array section is a subset of the elements of an array, or is itself an array. We can refer to an array section by specifying a range of subscript values with a lower bound, an upper bound and an increment. We use this notation to describe the codes shown in Fig. 1(b); Fig. 1(c) shows the form of the vectorized codes. We can thus derive the following theorem.

Fig. 1. (a) A dependence cycle with an anti-dependence link. (b) The codes of transformation. (c) The vectorized codes.
Theorem 3.1. An anti-dependence link of a dependence cycle in a nest of loops is breakable via Algorithm 1.
Proof. To prove the link is breakable we need to show:

(1) All of the dependence relations are preserved via Algorithm 1.
(2) The original cycle does not exist and no new cycle forms.
If $S_b$ is anti-dependent on $S_a$, then by definition there exist some common (indexed) variable(s) in $S_a$ and $S_b$ which are used first by $S_a$ and then defined by $S_b$. This relation is determined by the index expressions and the relative position (left-hand side or right-hand side of the assignment operator) of the variables in these two statements. It is also related to the lexical order of the statements if it is a loop-independent anti-dependence.

If there does not exist any variable which is true-dependent on the output variable of $S_b$, then, since $S_c > S_a$, $S_c > S_b$, and $S_c$ uses the new output variable of $S_b$ to define the original output variable of $S_b$, the order of defining the common variable(s) by $S_a$ and $S_b$ is actually unchanged. Moreover, in terms of the new output variable of $S_b$, the true-dependence relation between $S_b$ and $S_c$ also ensures that $S_b$ uses the same values to define the common variable(s). So, the renaming technique preserves the original anti-dependence.
If there exist variables which are true-dependent on the output variable of $S_b$, then the renaming technique in steps 2 and 3 of Algorithm 1 also preserves such a relationship. In case those variables are themselves also the source variables of other anti-dependences, renaming does not semantically change the anti-dependences. Since there do not exist any dead statements, we do not need to consider the case in which a variable is loop-independently output-dependent on the output variable of $S_b$. Other dependences are obviously not changed by Algorithm 1. Therefore, (1) holds.

The proof of (2) is straightforward. As there is no dependence arc from $S_c$ to $S_b$, the original dependence cycle does not exist and no new cycle forms. Both (1) and (2) hold, thus the link is breakable. □
3.2. Pattern II––output-dependence or anti-
and output-dependence link
For an output-dependence link or anti- and output-dependence links in a dependence cycle in a single loop, a sink variable renaming algorithm for breaking such a dependence cycle was developed in [11,12]. We extend that algorithm to break such dependence links in a dependence cycle in a nest of loops. The algorithm is described below.
Algorithm 2. The Breaking Strategy of Link Pattern II

Input:
(1) A nest of loops $L = \{l_1, l_2, \ldots, l_n\}$.
(2) A set of statements $S = \{S_0, S_1, \ldots, S_{n-1}\}$ that are involved in a dependence cycle.
(3) The dependence link from $S_a$ to $S_b$ may be either an output-dependence link or anti- and output-dependence links.
(4) A direction vector matrix $D = (\vec{d}_1, \vec{d}_2, \ldots, \vec{d}_p)$.

Output: New dependence links that do not form a new dependence cycle.

Method:
/*
Let $S_a^{src}$ and $S_b^{sink}$ represent the source and sink variables in the dependence link from $S_a$ to $S_b$, respectively.
Let $\vec{d}_b$ represent the direction vector from $S_a$ to $S_b$.
Let $S_a^{index}$ and $S_b^{index}$ represent the indexed expressions of the source and sink variables in the dependence link from $S_a$ to $S_b$, respectively.
*/
1. If one of the elements of the direction vector matrix D is the direction '>', then the transformation is abandoned and all of the statements are preserved. Otherwise, X and index represent, respectively, the output variable name and the indexed expression of the sink variable in either the output-dependence or the anti- and output-dependences between statement $S_a$ and statement $S_b$. Simultaneously, a Boolean variable is set to false.
2. Determine which variables are true-dependent on the variable X. If a variable $S_d^{sink}$ in a statement $S_d$ is true-dependent on the variable X and $S_d^{sink}$ is not the sink variable of another true-dependence, then the variable $S_d^{sink}$ is renamed and the Boolean variable is set to true. If the variable $S_d^{sink}$ is the sink variable of another true-dependence, then the transformation is abandoned and all of the statements are preserved.
3. If the value of the Boolean variable is true, then the output variable in statement $S_b$ is renamed and each element of an extra temporary variable T takes the value of the corresponding element of the output variable X. Otherwise, the output variable in statement $S_b$ is renamed.
4. Add an extra assignment statement $S_c$, where $S_c > S_a$ and $S_c > S_b$.
5. $S_c$ takes the extra temporary variable T as its input variable and assigns the values computed in the nest of loops to the sink variable X.
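The following hedged Fortran sketch illustrates the kind of output-variable renaming Algorithm 2 performs, on a hypothetical nest of our own; the arrays, bounds and the name ANEW are illustrative assumptions and the loop is not the paper's Fig. 2.

```fortran
      SUBROUTINE BREAK_OUTPUT(A, B, C, N, M)
!     Illustrative sketch only (hypothetical example, not the paper's Fig. 2).
      INTEGER N, M, I, J
      REAL A(0:N+2,0:M+2), B(0:N+2,0:M+2), C(0:N+2,0:M+2)
      REAL ANEW(0:N+2,0:M+2)
!     Original nest (as comments): S1 [do] S2, S2 [da] S2 and S2 [da] S1
!     form a cycle containing an output-dependence link.
!       DO I = 1, N-1
!         DO J = 1, M-1
!           A(I+1,J+1) = B(I,J) + 1.0            ! S1
!           A(I,J)     = A(I+2,J+2) + C(I,J)     ! S2
!         END DO
!       END DO
!     After renaming the output variable A in the sink statement S2 to ANEW:
      DO I = 1, N-1
        DO J = 1, M-1
          A(I+1,J+1) = B(I,J) + 1.0              ! S1 : unchanged
          ANEW(I,J)  = A(I+2,J+2) + C(I,J)       ! S2': output variable renamed
        END DO
      END DO
      A(1:N-1,1:M-1) = ANEW(1:N-1,1:M-1)         ! Sc : copy the final values back
!     The cycle is broken; the remaining backward anti-dependence S2 [da] S1 is
!     handled by statement reordering (Lemma 2.2): run the S2' loop first, then
!     the S1 loop, then Sc, each as a whole-array operation.
      END SUBROUTINE BREAK_OUTPUT
```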
Consider the example in Fig. 2(a). The data dependences in this example are (1) $S_1\,[d^o]\,S_2\,[d^a]\,S_2$ and (2) $S_2\,[d^a]\,S_1$, and the direction vectors are (<,<), (<,<) and (<,<). Algorithm 2 is applied to Fig. 2(a) to produce the codes shown in Fig. 2(b). The data dependences in Fig. 2(b) are: (1) $S_2\,[d^a]\,S_1\,[d^o]\,S_c$ and (2) $S_2\,[d^a]\,S_c$ and $S_2\,d^t\,S_c$. The dependence relations in Fig. 2(b) satisfy the condition of Lemma 2.2; therefore, the codes in Fig. 2(b) can be vectorized via the statement reordering technique, i.e., by reordering S1 and S2 to become S2 < S1. Fig. 2(c) shows the result of such a transformation, and Fig. 2(d) shows the form of the vectorized code. We can thus derive the following theorem.

Fig. 2. (a) A dependence cycle with an output-dependence link. (b) The codes of transformation. (c) The result of the statement reordering technique. (d) The vectorized codes.
Theorem 3.2. An output-dependence link, or anti- and output-dependence links, of a dependence cycle in a nest of loops is breakable via Algorithm 2.
Proof. Similarly, to prove the link is breakable we need to show:

(1) All of the dependence relations are preserved via Algorithm 2.
(2) The original cycle does not exist and no new cycle forms.
The output-dependence relation is proved here; for the anti- and output-dependence relations the proof is similar.

If $S_b$ is output-dependent on $S_a$, then by definition there exist some common (indexed) variable(s) in $S_a$ and $S_b$ which are defined first by $S_a$ and then defined by $S_b$. This relation is determined by the index expressions and the relative position (left-hand side or right-hand side of the assignment operator) of the variables in these two statements, and is also related to the lexical order of the statements if it is a loop-independent output-dependence.

In case there does not exist any variable which is true-dependent on the output variable of $S_b$, then, since $S_c > S_a$ and $S_c > S_b$, and $S_c$ uses the new output variable of $S_b$ to define the original variable of $S_b$, the order of defining the common variable(s) by $S_a$ and $S_b$ is actually unchanged. Moreover, in terms of the new output variable of $S_b$, the true-dependence relation between $S_b$ and $S_c$ also ensures that $S_b$ uses the same values to define the common variable(s). So, the renaming technique preserves the original output-dependence.
If there exist variables which are true-dependent on the output variable of $S_b$, then the renaming technique in steps 2 and 3 of Algorithm 2 also preserves such a relationship. In case those variables are themselves also the source variables of other anti-dependences, renaming semantically preserves the anti-dependences. Since there do not exist any dead statements, we do not need to consider the case in which a variable is loop-independently output-dependent on the output variable of $S_b$. Other dependences are obviously not changed by Algorithm 2. Therefore, (1) holds.

The proof of (2) is straightforward: because there does not exist a dependence arc from $S_c$ to $S_b$, the original dependence cycle does not exist and no new cycle forms. Both (1) and (2) hold, thus the link is breakable. □
3.3. Pattern III––true-dependence or other links
A dependence cycle formed by a link of pattern III can only be executed in partial vector mode. We first discuss the resolving strategy for a dependence cycle formed only by true-dependences, and then extend the results to dependence cycles involving the other types of pattern III links. Partial vectorization based on the static dependence distance is somewhat conservative in terms of the parallelism exploited in a dependence cycle. To expose the parallelism more fully we need to further incorporate the concept of dynamic dependence distance.
Definition 3.1. If a dependence cycle has no
dependence sub-cycle, then it is called a simple
dependence cycle.
Definition 3.2. The static dependence distance in statements refers to the dependence distance found at compile-time; the dynamic dependence distance in statements refers to the dependence distance found at run-time.
Let us use an n-statement simple dependence cycle, i.e., a dependence cycle which does not contain any sub-cycles, to state the concept of dynamic dependence distance. Suppose we have n statements $S_0, S_1, \ldots, S_{n-1}$ in a nest of loops, $S_0 < S_1 < \cdots < S_{n-1}$ and $S_0\,d^+\,S_1\,d^+\cdots d^+\,S_{n-1}\,d^+\,S_0$, where $d^+ \in D \cup [D]$. The static dependence distance for $S_{(i+1)\bmod n}$ on $S_i$ is $k_i$ for all $i$, where $0 \le i \le n-1$. Since the static dependence distance in the nest of loops for $S_0$ on $S_{n-1}$ is $k_{n-1}$, $S_0$ is free of dependence for $k_{n-1}$ iterations. So, $S_0$ can be executed in vector mode for $k_{n-1}$ iterations at run-time. After $S_0$ has been executed for $k_{n-1}$ iterations, the dynamic dependence distance for $S_1$ on $S_0$ will be $k_{n-1} + k_0$, where $k_0$ is the static dependence distance for $S_1$ on $S_0$. That is, $S_1$ is free of dependence for $k_{n-1} + k_0$ iterations at run-time instead of $k_0$ iterations at compile-time. Therefore, in this manner, at the first cycle of execution of the simple dependence cycle the dynamic dependence distance for $S_i$ will be
$$\Big(\sum_{k=0}^{i-1} k_k + k_{n-1}\Big) \times \Big(\prod_{b=2}^{n} (U_b - L_b + 1)\Big), \quad 0 \le i \le n-1,$$
where $U_b$ and $L_b$ are the upper and lower bounds of the bth loop. Since the dynamic indirect dependence distance for each statement on itself in the dependence cycle is $\sum_{k=0}^{n-1} k_k$, after the first cycle of execution every statement can be executed in vector mode for
$$\Big(\sum_{k=0}^{n-1} k_k\Big) \times \Big(\prod_{b=2}^{n} (U_b - L_b + 1)\Big)$$
iterations sequentially until the upper bound is reached. These observations are summarized in the following definition.
Definition 3.3. In a simple recurrence, the dynamic indirect dependence distance of each statement on itself is the sum of all static direct dependence distances between each successive pair of statements in the path. In the first cycle of execution, the dynamic indirect dependence distance of each statement is the sum of the static direct dependence distances of the statements lexically in front of that particular statement.
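As a small worked instance of these definitions (our own numbers, not from the paper), take a two-statement simple cycle $S_0\,d^+\,S_1\,d^+\,S_0$ in a single loop with static distances $k_0 = 2$ and $k_1 = 3$. Then $S_0$ is initially free of dependence for $k_1 = 3$ iterations; after those have been executed, $S_1$ is free of dependence for $k_1 + k_0 = 5$ iterations; and from the second cycle of execution onwards each statement can be run in vector mode for
$$\sum_{k=0}^{1} k_k = k_0 + k_1 = 5$$
iterations at a time (the product over the inner loops is 1 here, since there is only a single loop).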
Based on these descriptions, in order to partially vectorize the statements in a simple dependence cycle via the dynamic dependence strategy, the statements must be aligned lexically. We use the following algorithm to partially vectorize a simple dependence cycle.
Algorithm 3. A Partial Vectorization Method

Input:
(1) A nest of loops $L = \{l_1, l_2, \ldots, l_n\}$.
(2) A set of statements $S = \{S_0, S_1, \ldots, S_{n-1}\}$ that are involved in a dependence cycle.

Output: Generate the partially vectorizable codes.

Method:
1. Compute the index upper bound of the outermost loop, $\sum_{k=0}^{n-1} k_k$. Generate the outermost loop with the new index upper bound $\sum_{k=0}^{n-1} k_k$.
2. Generate the other loops with the original index bounds.
3. Align the statements in the above loops.
4. Adjust the execution times of each statement based on its dynamic indirect dependence distance; use control statements to do this.
5. Decompose the outermost loop into an inner loop and an outer loop.
6. The index upper bounds of the inner loop and the outer loop are $\sum_{k=0}^{n-1} k_k$ and $\lceil N / \sum_{k=0}^{n-1} k_k \rceil - 1$, respectively, where N is the upper bound of the original outermost loop. Generate the outer loop with the new index upper bound $\lceil N / \sum_{k=0}^{n-1} k_k \rceil - 1$. Generate the inner loop with the new index upper bound $\sum_{k=0}^{n-1} k_k$.
7. Generate the other loops with the original index bounds.
8. Modify the index expression of each indexed variable in the statements so that it yields the same values as the original index expression.
9. Compute the index upper bound of the outermost loop, $\sum_{k=0}^{n-1} k_k - k_{n-1}$. Generate the outermost loop with the new index upper bound $\sum_{k=0}^{n-1} k_k - k_{n-1}$.
10. Generate the other loops with the original index bounds.
11. Adjust the iteration count of each statement based on its remaining iterations.
12. Use control statements to do this.
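To give a flavor of partial vectorization based on dependence distance, the following hedged Fortran sketch strip-mines a single-statement recurrence. It is a simplified illustration under our own assumptions (one statement, constant distance K, and a trip count that is a multiple of K); it is not the multi-statement transformation produced by Algorithm 3 or the code of Fig. 3.

```fortran
      SUBROUTINE PARTIAL_VEC(A, B, N, K)
!     Illustrative sketch only (hypothetical example, not the paper's Fig. 3).
      INTEGER N, K, J
      REAL A(N), B(N)
!     Original recurrence: A(I) = A(I-K) + B(I) for I = K+1, ..., N carries a true
!     dependence on itself with distance K, so it cannot be fully vectorized.
!     Within any block of K consecutive iterations, however, every iteration reads
!     only elements written in earlier blocks, so each block is vectorizable.
!     (Assumes MOD(N - K, K) == 0 to keep the sketch short.)
      DO J = K + 1, N, K                          ! serial loop over blocks
        A(J:J+K-1) = A(J-K:J-1) + B(J:J+K-1)      ! one block executed in vector mode
      END DO
      END SUBROUTINE PARTIAL_VEC
```

Algorithm 3 generalizes this idea: the block length is derived from the cycle's dynamic indirect dependence distance $\sum_{k=0}^{n-1} k_k$ rather than from a single static distance.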
Consider the example in Fig. 3(a). Algorithm 3 is applied to Fig. 3(a) to generate the codes shown in Fig. 3(b), and Fig. 3(c) shows the form of the vectorized codes. This approach is significantly more efficient, although it may involve some difficulty in modifying the codes. The following theorem can thus be derived.

Fig. 3. (a) A simple dependence cycle with true-dependences. (b) Partial vectorization via dynamic dependence distance. (c) The vectorized codes.

Theorem 3.3. A simple dependence cycle can be partially vectorized via Algorithm 3.
Proof. Refer to Theorems 3.1 and 3.2. □
3.4. An algorithm for resolution of a general
recurrence
All of the above results are aimed at developing an efficient algorithm to deal with a general recurrence. In the algorithm presented below, the term p-block is used. A p-block is a node of the dependence graph derived by applying the loop distribution algorithm [14] to a general dependence graph; the node is either a single statement or a set of statement(s) cyclically connected by data dependence.

Suppose there exists a general recurrence consisting of n statements. A recursive algorithm used to resolve it follows.
1. Starting from the first statement, examine one link connected to it. If it is a pattern-I link, follow the strategy described in Section 3.1. If it is a pattern-II link, use the sink variable renaming technique to break it. If it is a pattern-III link, ignore it temporarily and examine another link connected to the statement. If no breakable link is found, do the same for the succeeding link. Once a link is broken, go to step 2. If the last link of the last statement has been examined and no breakable link has been found, go to step 6.
2. After breaking a link, transformed code with new dependence relations is generated. Use a global dependence-testing algorithm to determine the new dependence relations.
3. Use Tarjan's depth-first search technique [19] to find the p-blocks.
4. Use a topological sorting algorithm [16] to reorder the sequence of p-blocks.
5. Examine the reordered p-blocks in order and do the following steps.
(1) If it is a vectorizable p-block, output the code and go back to step 5.
(2) If it is an unvectorizable p-block, go to step 1, then go back to step 5.
(3) If no other p-blocks are left, go to step 7.
6. Use the partial vectorization algorithm to transform the code. For a simple recurrence, use a variation of the cycle shrinking technique. For other patterns of recurrence, use the general partial vectorization strategy. After code transformation, output the code.
7. Stop.
This algorithm may need to include a code optimization phase and a performance evaluation phase at the end. The former is used to remove redundant code from the transformed code. For example, when breaking anti-dependences (or output-dependences), extra independent loops may be generated; those loops may be combined to make the resulting code cleaner and more efficient. The performance evaluation phase is used to evaluate the performance of the transformed code in vector mode, and to decide whether the original code merits vectorization.
4. Experimental results
Both the original and transformed versions of the programs were tested to evaluate the performance of the proposed scheme. As the test machine, we chose the DECmpp model 12000, which has 1024 data processors, in our environment [13]. Vector loops [15] was used as the benchmark. The original programs tested in the benchmark were executed in scalar (sequential) mode. The transformed versions of the same programs were run in parallel mode.
Suppose that $k_{SM}$ and $k_{PM}$ are the execution times of the original programs and the transformed programs, respectively. The speed-up in Table 1 is defined as $k_{SM}/k_{PM}$. Each row in Table 1 shows how many times longer the execution time of the original program was than that of the transformed program. For example, the first row shows that for the subroutine S231 the execution time of the original code was 41 times longer than that of the transformed code.
Table 1
The performance of the proposed method on the loops tested in Vector loops

Benchmark Loop name Scalar mode (s) Parallel mode (s) Speed-up
Vector loops S231 3.485 0.085 41
Vector loops S232 4.656 0.097 48
Vector loops S221 5.559 0.109 51
Vector loops S212 6.435 0.117 55
Vector loops S211 6.897 0.121 57
Vector loops S243 9.258 0.149 62
Vector loops S241 8.875 0.136 65
Vector loops S244 11.243 0.17 66
Vector loops S222 14.697 0.213 69
For all of the subroutines in our experiments, the execution time of the original programs was from 41 to 69 times longer than the execution time of the transformed programs. This indicates that the proposed scheme is very significant in terms of speed-up, which ranged from 41 to 69.
5. Conclusion
Parallelism exploitation for statements with dependence cycles in a nest of loops is necessary. The vectorizability of a dependence cycle depends primarily on its dependence links. There exist seven possible dependence relations of one statement on another statement. A dependence cycle with an anti-dependence link is breakable via node splitting [17]. In case the source variable of the anti-dependence relation is itself the sink variable of another true-dependence, a corrective strategy should be incorporated into the node splitting algorithm. An output-dependence link or an anti- and output-dependence link in a dependence cycle is breakable via a sink variable renaming technique; this technique is also applicable to breaking an anti-dependence link. In practice, the node splitting and sink variable renaming techniques should be used in a complementary manner. For a dependence cycle formed only by true-dependence links, or by all of the other dependences, a general, simple, but less efficient partial vectorization algorithm is available. To improve its efficiency, a dynamic dependence concept and the technique derived from it are powerful; this approach is particularly efficient for dealing with a simple dependence cycle. All of the recurrence-resolving strategies can be integrated with the dependence testing technique, Tarjan's depth-first search algorithm, and topological sorting to develop an automatic recurrence-resolving system.
References
[1] R. Allen, K. Kennedy, Automatic translation of Fortran programs to vector form, ACM Transactions on Programming Languages and Systems 9 (4) (1987) 491–542.
[2] U. Banerjee, Dependence Analysis, Kluwer Academic Publishers, Norwell, MA, 1997.
[3] U. Banerjee, Loop Parallelization, Kluwer Academic Publishers, 1994.
[4] U. Banerjee, Loop Transformations for Restructuring Compilers: The Foundations, Kluwer Academic Publishers, 1993.
[5] W. Blume, R. Eigenmann, Nonlinear and symbolic data dependence testing, IEEE Transactions on Parallel and Distributed Systems 9 (12) (1998) 1180–1194.
[6] P.-Y. Calland, A. Darte, Y. Robert, Plugging anti and output dependence removal techniques into loop parallelization algorithm, Parallel Computing 23 (1997) 251–266.
[7] W.-L. Chang, C.-P. Chu, The extension of the interval test,
Parallel Computing 24 (14) (1998) 2101–2127.
[8] W.-L. Chang, C.-P. Chu, The generalized direction vector
I test, Parallel Computing 27 (8) (2001) 1117–1144.
[9] W.-L. Chang, C.-P. Chu, J. Wu, The generalized lambda test: a multi-dimensional version of Banerjee's algorithm, International Journal of Parallel and Distributed Systems and Networks 2 (2) (1999) 69–78.
[10] W.-L. Chang, C.-P. Chu, The infinity lambda test: a multi-
dimensional version of Banerjee infinity test, Parallel
Computing 26 (10) (2000) 1275–1295.
[11] C.-P. Chu, D.L. Carver, An analysis of recurrence relation in Fortran do-loops for vector processing, in: Proc. Fifth Parallel Processing Symp., IEEE CS Press, Los Alamitos, CA, 1991, pp. 619–625.
[12] C.-P. Chu, D.L. Carver, Exploitation of parallelism in Fortran do-loops for vector processing, Technical Report 91-004, Department of Computer Science, LSU, 1991.
[13] MasPar Group, Parallel Programming Language, Digital Equipment Corporation, Part number: AA-PV4QA-TE, 1993.
[14] D.J. Kuck, R.H. Kuhn, D.A. Padua, B. Leasure, M.
Wolfe, Dependence graphs and compiler optimizations, in:
Conference Record of 8th ACM Symposium on Princ. of
Prog. Lang., 1981, pp. 207–218.
[15] D. Levine, D. Callahan, J. Dongarra, A comparative study
of automatic vectorizing compilers, Parallel Computing 17
(1991) 1223–1244.
[16] D.E. Knuth, The Art of Computer Programming, vol. 1: Fundamental Algorithms, Addison-Wesley, Reading, MA, 1973.
[17] C.D. Polychronopoulos, Advanced loop optimizations for parallel computers, in: Proceedings of the 1987 International Conference on Supercomputing, 1987, pp. 255–277.
[18] W. Shang, M. O'Keefe, J. Fortes, On loop transformation for generalized cycle shrinking, IEEE Transactions on Parallel and Distributed Systems 5 (2) (1994) 193–204.
[19] R. Tarjan, Depth-first search and linear graph algorithms, SIAM Journal on Computing 1 (2) (1972) 146–160.
[20] M. Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley Publishing Company, Redwood City, CA, 1996.
[21] H. Zima, B. Chapman, Supercompilers for Parallel and Vector Computers, Addison-Wesley Publishing Company, 1991.
[22] W.-L. Chang, C.-P. Chu, J.-H. Wu, A precise dependence analysis for multi-dimensional arrays under specific dependence direction, The Journal of Systems and Software 63 (2) (2002) 99–107.
[23] W.-L. Chang, C.-P. Chu, J. Wu, A multi-dimensional
version of the I test, Parallel Computing 27 (13) (2001)
1783–1799.
[24] W.-L. Chang, C.-P. Chu, J.-H. Wu, A polynomial-time dependence test for determining integer-valued solutions in multi-dimensional arrays under variable bounds, Journal of Supercomputing, in press.
[25] M. Guo, W.-L. Chang, J. Cao, The non-continuous direction vector I test, accepted at the International Symposium on Parallel Architectures, Algorithms, and Networks (I-SPAN 2004), Hong Kong, 2004.
[26] W.-L. Chang, J.-W. Huang, C.-P. Chu, The non-continuous I test: an improved dependence test for reducing complexity of source level debugging for parallel compilers, in: The Third International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'02), Kanazawa Bunka Hall, Kanazawa, Japan, 4–6 September 2002, pp. 455–462.
[27] W.-L. Chang, B.-H. Chen, A proof method for the correctness of the interval test to be applied for determining whether there are integer-valued solutions for one-dimensional arrays with subscripts formed by induction variable, in: The Third International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'02), Kanazawa Bunka Hall, Kanazawa, Japan, 4–6 September 2002, pp. 52–57.
[28] A. Bik, M. Girkar, P. Grey, X. Tian, Efficient exploitation
of parallelism on Pentium III and Pentium 4 processor-
based systems, Intel Technology Journal Q1 (March)
(2001) 1–9.
[29] D. Petkov, R. Harr, S. Amarasinghe, Efficient pipelining of
nested loops: unroll-and-squash, in: 16th International
Parallel and Distributed Processing Symposium, Fort
Lauderdale, Florida, April, 2002.
[30] W.-L. Chang, C.-P. Chu, J.-H. Wu, A simple and general
approach to parallelize loops with arbitrary control flow
and uniform data dependence distance, The Journal of
Systems and Software 63 (2) (2002) 91–98.
Weng-Long Chang received his Ph.D. degree in Computer Science and Information Engineering from National Cheng Kung University, Taiwan, Republic of China, in 1999. He is currently an Assistant Professor at the Department of Information Management, Southern Taiwan University of Technology, Tainan County 710, Taiwan, Republic of China. Dr. Chang's research interests include molecular computing (including bio-molecular cryptography, bio-molecular computational complexity and bio-molecular computational theory), parallel and distributed processing, and parallelizing compilers.

Chih-Ping Chu received a B.S. degree in agricultural chemistry from National Chung Hsing University, Taiwan, an M.S. degree in computer science from the University of California, Riverside, and a Ph.D. degree in computer science from Louisiana State University. He is currently a professor in the Department of Computer Science and Information Engineering of National Cheng Kung University, Taiwan. His research interests include parallelizing compilers, parallel computing, parallel processing, internet computing, and software engineering.

Michael (Shan-Hui) Ho, Associate Professor of Southern Taiwan University of Technology, has 25 years of industrial and academic experience in the computing field. He has worked as a Sr. Software Engineer and Project Manager developing B2B/B2C/C2C web and multimedia applications, and as a Sr. database administrator for SQL clustered servers, Oracle, and DB2 databases, including network/systems LAN/WAN system administration, for major US corporations and government organizations such as the GAS Company, Fox, Fidelity, ARCO, US LA Hall Records, etc. He is also certified in Oracle DBA/Developer (OCP), CCNA/CCNP, MCSE, Unix Solaris systems SA, and as a Dell server/computer technician engineer. Dr. Ho had more than 10 years of college teaching/research experience as an Assistant Professor and Research Analyst at Central Missouri State University, the University of Texas at Austin, and BPC International Institute. He earned a Ph.D. in IS/CS with minors in Management and Accounting from the University of Texas at Austin and an M.S. degree from St. Mary's University. His research interests include algorithm and computation theories, software engineering, database and data mining, parallel computing, quantum computing and DNA computing.