
Journal of Systems Architecture 50 (2004) 729–742

www.elsevier.com/locate/sysarc

Exploitation of parallelism to nested loops with dependence cycles

Weng-Long Chang a,*, Chih-Ping Chu b,1, Michael (Shan-Hui) Ho a,2

a Department of Information Management, Southern Taiwan University of Technology, Tainan County 710, Taiwan, ROC
b Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan City 701, Taiwan, ROC

Received 24 March 2001; received in revised form 6 June 2004; accepted 10 June 2004

Available online 25 August 2004

Abstract

In this paper, we analyze recurrences in terms of the breakability of the dependence links formed among general multi-statements in a nested loop. The major findings include: (1) a sink variable renaming technique, which can reposition an undesired anti-dependence and/or output-dependence link, is capable of breaking an anti-dependence and/or output-dependence link; (2) for recurrences connected by only true dependences, a dynamic dependence concept and the technique derived from it are powerful in terms of parallelism exploitation; (3) by employing global dependence testing, a link-breaking strategy, Tarjan's depth-first search algorithm, and topological sorting, an algorithm for resolving a general multi-statement recurrence in a nested loop is proposed. Experiments with benchmarks cited from Vector loops showed that, among the 134 subroutines tested, 3 had their parallelism exploitation improved by our proposed method. That is, the offered algorithm increased the rate of parallelism exploitation of Vector loops by approximately 2.24%.

© 2004 Published by Elsevier B.V.

Keywords: Parallelizing compilers; Vectorizing compilers; Loop optimization; Data dependence analysis; Dependence cycle; Parallelism exploitation

1383-7621/$ - see front matter © 2004 Published by Elsevier B.V.
doi:10.1016/j.sysarc.2004.06.001

* Corresponding author. Tel.: +886 6 6891421/2533131x4300; fax: +886 6 2541621. E-mail addresses: [email protected], [email protected] (W.-L. Chang), [email protected] (C.-P. Chu), [email protected] (M. (Shan-Hui) Ho).
1 Tel.: +886 6 2757575x62527; fax: +886 6 2747076.
2 Tel.: +886 6 2533131x4300; fax: +886 6 2541621.

1. Introduction

In high-speed computing [20], the two most popular parallel computational models are distributed-memory multiprocessors and shared-memory multiprocessors. Because memory hardware techniques [20] have improved, the access time of shared memory in a multiprocessor system has decreased markedly, and such systems have been used increasingly for scientific and engineering applications. However, the major shortcoming of shared-memory multiprocessors is the difficulty of programming: programmers are responsible for analyzing the data dependence relations among the statements in a program and for exploiting the parallelism of those statements on shared-memory multiprocessors.

A successful vectorizing/parallelizing compiler is capable of exploiting the parallelism of a program. Three features of a vectorizing/parallelizing compiler determine its level of parallelism exploitation: (1) accurate data dependence testing, (2) efficient loop optimization and (3) efficient removal of undesired data dependences.

Techniques for dependence analysis, which directly support data dependence testing, have been developed and used quite successfully [2,5,7–10,20,22–27]. Computationally expensive programs in general spend most of their time executing loops; extracting parallelism from the loops of an ordinary program therefore has a considerable effect on speed-up. Many loop optimization methods have been developed, falling broadly into two classes: loop vectorization and loop parallelization [3,4,20,21,28–30]. In terms of reducing data dependences, most research concentrates on loop optimization in the front-end of vectorizing/parallelizing compilers, such as scalar renaming, scalar expansion, scalar forward-substitution and dead code elimination [20]. Studies on the back-end of vectorizing/parallelizing compilers primarily deal with separating the parallel execution from the sequential execution of statements. Relatively little attention has been given to data dependence elimination [1,9,11].

A recurrence is a type of π-block in a general nested loop, extracted at the time of loop distribution, a back-end phase [17]. The statement(s) involved in a recurrence are strongly connected via various dependence types. Well-known techniques, such as node splitting, thresholding and cycle shrinking, are able to eliminate the data dependences of a recurrence to a certain extent, depending upon the specific dependence types [1,17,18,20,21].

In this paper, we study parallelism exploitation for loops with dependence cycles on the basis of breaking dependence links. The breaking strategies for dependence cycles were formerly surveyed for a single loop [11]; that method is extended here to break the dependence links in a nest of loops. In Section 2, the concept of data dependence is reviewed. In Section 3, an analysis of the formation of dependence cycles is provided, and three dependence link patterns are derived on the basis of the breaking strategy. For each link pattern, its features, breaking techniques and applications are introduced in detail. An algorithm is developed for the resolution of a general multi-statement recurrence. Experimental results showing the advantages of the proposed method are given in Section 4. Finally, Section 5 contains a conclusion.

2. Data dependence

Assume that two statements appear within a general loop, which is presumed to contain n common loops; the statements are thus embedded in n common loops. Suppose an array A appears simultaneously in both statements. If a statement S2 uses an element of the array A defined first by another statement S1, then S2 is true-dependent on S1. If a statement S2 defines an element of the array A used first by another statement S1, then S2 is anti-dependent on S1. If a statement S2 redefines an element of the array A defined first by another statement S1, then S2 is output-dependent on S1. Another dependence, control dependence, which arises from control statements, is not addressed in this paper.

Each iteration of a general loop is identified by an iteration vector whose elements are the values of the iteration variables for that iteration. For example, the instance of the statement S1 during iteration $\vec{i} = (i_1, \ldots, i_n)$ is denoted $S1(\vec{i})$; the instance of the statement S2 during iteration $\vec{j} = (j_1, \ldots, j_n)$ is denoted $S2(\vec{j})$. If $(i_1, \ldots, i_n)$ is identical to $(j_1, \ldots, j_n)$, or $(i_1, \ldots, i_n)$ precedes $(j_1, \ldots, j_n)$ lexicographically, then $S1(\vec{i})$ is said to precede $S2(\vec{j})$, denoted $S1(\vec{i}) < S2(\vec{j})$. Otherwise, $S2(\vec{j})$ is said to precede $S1(\vec{i})$, denoted $S1(\vec{i}) > S2(\vec{j})$. Definitions 2.1–2.7 below, cited from [3,4,11], will be used later.
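
The lexicographic precedence used here is easy to state concretely. The following tiny helper (ours, not the paper's) checks it for iteration vectors:

```python
def precedes(i_vec, j_vec):
    """True if iteration vector i_vec is identical to j_vec or
    precedes it lexicographically (Python tuples compare this way)."""
    return tuple(i_vec) <= tuple(j_vec)

# S1((1, 2)) precedes S2((1, 3)); S1((2, 1)) does not precede S2((1, 9)).
assert precedes((1, 2), (1, 3)) and not precedes((2, 1), (1, 9))
```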


Definition 2.1. Loop-independent dependence refers to dependence confined within each single iteration. Loop-independent dependences include loop-independent true-dependence (denoted $d^t$), loop-independent anti-dependence (denoted $d^a$) and loop-independent output-dependence (denoted $d^o$). These relations are represented by the set $D = \{d^t, d^a, d^o\}$.

Definition 2.2. Consistent loop-carried dependence refers to dependence occurring across iteration boundaries. Consistent loop-carried dependences include consistent loop-carried true-dependence (denoted $[d^t]$), consistent loop-carried anti-dependence (denoted $[d^a]$) and consistent loop-carried output-dependence (denoted $[d^o]$). These relations are represented by the set $[D] = \{[d^t], [d^a], [d^o]\}$.

Definition 2.3. A vector of the form $\vec{\theta} = (\theta_1, \ldots, \theta_n)$ is termed a direction vector. The direction vector $(\theta_1, \ldots, \theta_n)$ is said to be the direction vector from $S1(\vec{i})$ to $S2(\vec{j})$ if, for $1 \le k \le n$, $i_k \,\theta_k\, j_k$ holds, i.e., the relation $\theta_k$ is defined by

$$\theta_k = \begin{cases} <, & \text{if } i_k < j_k,\\ =, & \text{if } i_k = j_k,\\ >, & \text{if } i_k > j_k,\\ *, & \text{if the relation of } i_k \text{ and } j_k \text{ can be ignored, i.e., can be any one of } \{<, =, >\}. \end{cases}$$

We write $S1 \, d_{\vec{\theta}} \, S2$, where $d \in D \cup [D]$ and $\vec{\theta} = (\theta_1, \theta_2, \ldots, \theta_n)$.

Definition 2.4. The dependence distance vector from $S1(\vec{i})$ to $S2(\vec{j})$ is $\mathrm{dist}(\vec{i}, \vec{j}) = (j_1 - i_1, \ldots, j_n - i_n)$.

Definition 2.5. The dependence distance matrix of nested loops is the matrix whose columns are the dependence distance vectors of all the dependences in the nested loops.

Definition 2.6. For an inter-statement or intra-statement dependence, the source variable of the dependence refers to the instance of an indexed variable that is accessed first; the sink variable of the dependence refers to the instance of an indexed variable that is accessed later.

Definition 2.7. The direction vector matrix of nested loops is the matrix whose columns are the direction vectors of all the dependences in the nested loops.
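
To make Definitions 2.3–2.5 and 2.7 concrete, here is a small illustrative sketch (ours, not from the paper) that derives the distance and direction vector of one dependence from a pair of iteration vectors:

```python
def distance_vector(i_vec, j_vec):
    # Definition 2.4: dist(i, j) = (j1 - i1, ..., jn - in)
    return tuple(j - i for i, j in zip(i_vec, j_vec))

def direction_vector(i_vec, j_vec):
    # Definition 2.3: '<' if ik < jk, '=' if ik == jk, '>' otherwise
    return tuple('<' if i < j else '=' if i == j else '>'
                 for i, j in zip(i_vec, j_vec))

# A dependence from S1((1, 1)) to S2((2, 2)) has distance (1, 1) and
# direction ('<', '<'); stacking such vectors column-wise yields the
# distance and direction vector matrices of Definitions 2.5 and 2.7.
assert distance_vector((1, 1), (2, 2)) == (1, 1)
assert direction_vector((1, 1), (2, 2)) == ('<', '<')
```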

Lemma 2.1 below, cited from [12], introduces a dependence relation for which direct vectorization of the code is available. Lemma 2.2, cited from [1,21], emphasizes an important fact: no matter how complicated the dependence relations among statements are, as long as the dependence relations do not form a cycle, the statements can always be vectorized via statement reordering.

Lemma 2.1. A nest of loops without inconsistently dependent statement(s) can be fully vectorized directly if the following two conditions hold.

1. There does not exist a single statement S1 such that $S1(\vec{i})\,[d^t]\,S1(\vec{j})$.
2. There does not exist a pair of statements S1 and S2, where S1 < S2, such that $S2(\vec{j})\,d\,S1(\vec{i})$, where $d \in [D]$.

Lemma 2.2. A nest of loops with two statements S1 and S2, where S1 < S2 and $S2(\vec{j})\,d\,S1(\vec{i})$ with $d \in [D]$ is the only dependence relation between the statements, can be vectorized via the statement reordering technique, i.e., by reordering S1 and S2 to become S2 < S1.

The vectorization of statements in a nest of loops is inhibited if the dependence relations among the statements form a dependence cycle. Statements involved in a dependence cycle are strongly connected by various dependence edges; there exists at least one path between any pair of statements in the dependence graph. The vectorizability of the dependence cycle(s) depends on whether the cycle(s) can be broken, or on the level of dependence links that can be eliminated [1,6,11,21].

3. The breaking strategy and exploitation of parallelism

In general, we can exploit the parallelism of a nest of loops as long as one of the existing dependence links in a dependence cycle is breakable. If we break a dependence cycle and the new dependence links satisfy the condition of Lemma 2.1, then the statements involved in the cycle can be vectorized. If we break a dependence cycle and the new dependence links satisfy the condition of Lemma 2.2, then the statements involved in the cycle can be vectorized via statement reordering. So, to deal with the parallelism exploitation of a dependence cycle, we only need to consider the breakability of its dependence links.

If we examine the possible dependence links from a statement $S2(\vec{j})$ to a statement $S1(\vec{i})$, we find that seven kinds of dependence links can exist: (1) true-dependence, (2) anti-dependence, (3) output-dependence, (4) true- and anti-dependences, (5) true- and output-dependences, (6) anti- and output-dependences and (7) true-, anti- and output-dependences [11]. With respect to the strategy for breaking a dependence link of a dependence cycle in a nest of loops, the dependence types of a link can be classified into three patterns: (1) an anti-dependence link, (2) an output-dependence link and (3) true-dependence links, including any dependence link that contains a true-dependence. The first two link patterns can be broken, while the third pattern is unbreakable. In order to prove the correctness of the breaking strategy, we use a general dependence cycle to study the breaking strategy of the first two link patterns. Suppose we have n statements $S_0, S_1, \ldots, S_{n-1}$ in a nest of loops, where $S_0 < S_1 < \cdots < S_{n-1}$. These n statements are involved in a dependence cycle in the nest of loops. If there exists a pair of statements Sa and Sb, where $0 \le a, b \le n-1$ and $a \ne b$, and the dependence link from Sa to Sb is one of the first two link patterns, the breaking techniques and applications are described below.
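
As an illustration of this classification (a sketch of ours, not part of the paper), a link carrying any set of dependence types maps to its pattern as follows:

```python
def link_pattern(types):
    """Map a dependence link, given as a set drawn from
    {'true', 'anti', 'output'}, to its breaking pattern."""
    if 'true' in types:
        return 'III'      # unbreakable; partial vectorization only
    if types == {'anti'}:
        return 'I'        # breakable by sink variable renaming
    return 'II'           # 'output' or 'anti'+'output': breakable

assert link_pattern({'anti'}) == 'I'
assert link_pattern({'anti', 'output'}) == 'II'
assert link_pattern({'true', 'anti'}) == 'III'
```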

3.1. Pattern I––anti-dependence link

For an anti-dependence link of a dependence cycle in a single loop, a sink variable renaming algorithm that breaks the dependence cycle was developed in [11,12]. We extend that algorithm to break such a dependence link of a dependence cycle in a nest of loops. The algorithm is described below.

Algorithm 1. The Breaking Strategy of Link Pattern I

Input:
(1) A nest of loops L = {l1, l2, ..., ln}.
(2) A set of statements S = {S0, S1, ..., S_{n-1}} that are involved in a dependence cycle.
(3) The dependence link from Sa to Sb is an anti-dependence link.
(4) A direction vector matrix $D = (\vec{d}_1, \vec{d}_2, \ldots, \vec{d}_p)$.

Output: New dependence links that do not form a new dependence cycle.

Method:
/*
Let $S_a^{src}$ and $S_b^{sink}$ represent the source and sink variables in the dependence link from Sa to Sb, respectively.
Let $\vec{d}_b$ represent the direction vector from Sa to Sb.
Let $S_a^{index}$ and $S_b^{index}$ represent the indexed expressions of the source and sink variables in the dependence link from Sa to Sb, respectively.
*/

1. If one of the elements in the direction vector matrix D includes the direction '>', then the transformation is exited and all of the statements are preserved. Otherwise, the two variables X and index represent the sink variable name and the indexed expression of the sink variable in the anti-dependence between the statements Sa and Sb, respectively. Simultaneously, a Boolean variable is set to false.
2. Determine which variables are true-dependent on the sink variable X of the anti-dependence. If a variable $S_d^{sink}$ in a statement Sd is true-dependent on the sink variable X and $S_d^{sink}$ is not the sink variable of another true-dependence, then the name of $S_d^{sink}$ is changed and the Boolean variable is set to true. If $S_d^{sink}$ is the sink variable of another true-dependence, then the transformation is exited and all of the statements are preserved.
3. If the Boolean variable is true, then the name of the sink variable of the anti-dependence relation is changed, and the value of each element of an extra temporary variable T refers to the value of the same element of the sink variable X. Otherwise, the name of the sink variable in the statement Sb is changed.
4. Add an extra assignment statement Sc, where Sc > Sa and Sc > Sb.
5. Sc takes the extra temporary variable T as its input variable and assigns the values computed in the nest of loops to the sink variable X.

Consider the example in Fig. 1(a). The data dependences in this example are $S1\,[d^t]\,S2\,[d^t]\,S3\,[d^a]\,S1$, and the direction vectors are (<,<), (<,<) and (<,<). Algorithm 1 is applied to Fig. 1(a) to generate the code shown in Fig. 1(b). The data dependences in Fig. 1(b) are: (1) $S1\,[d^t]\,S2\,[d^t]\,S3\,[d^a]\,Sc$ and (2) $S1\,d^t\,Sc$. The dependence relations in Fig. 1(b) satisfy the conditions of Lemma 2.1; therefore, the code in Fig. 1(b) can be fully vectorized directly. An array section is a subset of the elements of an array, or is itself an array; we can refer to an array section by specifying a range of subscript values with a lower bound, an upper bound and an increment. We use this notation to describe the code shown in Fig. 1(b). Fig. 1(c) shows the form of the vectorized code. We thus can derive the following theorem.

Fig. 1. (a) A dependence cycle with an anti-dependence link. (b) The codes of transformation. (c) The vectorized codes.
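
Fig. 1 itself is an image and is not reproduced here. The sketch below uses a hypothetical loop body of our own, chosen only to match the stated pattern $S1\,[d^t]\,S2\,[d^t]\,S3\,[d^a]\,S1$; it illustrates the kind of transformation Algorithm 1 performs (rename the sink of the anti-dependence, follow the renaming through true-dependent reads, append a copy-back statement Sc):

```python
import numpy as np

N = 8
X = np.arange(N * N, dtype=float).reshape(N, N)
A = np.ones((N, N)); B = np.zeros((N, N)); C = np.zeros((N, N))

# Original cycle: S1 [dt] S2 [dt] S3 [da] S1, directions all (<,<).
def original(A, B, C):
    A, B, C = A.copy(), B.copy(), C.copy()
    for i in range(1, N - 1):
        for j in range(1, N - 1):
            A[i][j] = X[i][j]                            # S1: defines A
            B[i][j] = A[i - 1][j - 1]                    # S2: S1 [dt] S2
            C[i][j] = B[i - 1][j - 1] + A[i + 1][j + 1]  # S3: S2 [dt] S3, S3 [da] S1
    return A, B, C

# After Algorithm 1: the sink variable of the anti-dependence (A, written
# by S1) is renamed to T, the read true-dependent on it (in S2) follows
# the new name, and copy-back Sc runs after the nest (Sc > S1, Sc > S3).
def transformed(A, B, C):
    A, B, C = A.copy(), B.copy(), C.copy()
    T = A.copy()                           # T starts with A's old values
    for i in range(1, N - 1):
        for j in range(1, N - 1):
            T[i][j] = X[i][j]                            # S1 (renamed)
            B[i][j] = T[i - 1][j - 1]                    # S2
            C[i][j] = B[i - 1][j - 1] + A[i + 1][j + 1]  # S3 reads old A
    for i in range(1, N - 1):              # Sc: copy T back into A
        for j in range(1, N - 1):
            A[i][j] = T[i][j]
    # New relations: S1 [dt] S2 [dt] S3 [da] Sc and S1 dt Sc -- no cycle.
    return A, B, C

for x, y in zip(original(A, B, C), transformed(A, B, C)):
    assert np.array_equal(x, y)            # cycle broken, semantics preserved
```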

Theorem 3.1. An anti-dependence link of a dependence cycle in a nest of loops is breakable via Algorithm 1.



Proof. To prove that the link is breakable we need to show:

(1) All of the dependence relations are preserved by Algorithm 1.
(2) The original cycle no longer exists and no new cycle forms.

If Sb is anti-dependent on Sa, then by definition there exist some common (indexed) variable(s) in Sa and Sb which are used first by Sa and then defined by Sb. This relation is determined by the index expressions and the relative positions (left-hand side or right-hand side of the assignment operator) of the variables occurring in these two statements. It is also related to the lexical order of the statements if it is a loop-independent anti-dependence.

If there do not exist any variables which are true-dependent on the output variable of Sb, then, since Sc > Sa, Sc > Sb, and Sc uses the new output variable of Sb to define the original output variable of Sb, the order of defining the common variable(s) by Sa and Sb is actually unchanged. Moreover, in terms of the new output variable of Sb, the true-dependence relation between Sb and Sc also ensures that Sb uses the same values to define the common variable(s). So the renaming technique preserves the original anti-dependence.

If there exist any variables which are true-dependent on the output variable of Sb, then the renaming technique in steps 2 and 3 of Algorithm 1 also preserves that relationship. In case those variables are themselves also the source variables of other anti-dependences, renaming does not semantically change those anti-dependences. Since there do not exist any dead statements, we need not consider the case of a variable that is loop-independently output-dependent on the output variable of Sb. Other dependences are obviously not changed by Algorithm 1. Therefore, (1) holds.

The proof of (2) is straightforward. As there is no dependence arc from Sc to Sb, the original dependence cycle no longer exists and no new cycle forms. Both (1) and (2) hold, thus the link is breakable. □

3.2. Pattern II––output-dependence or anti- and output-dependence link

For an output-dependence link, or anti- and output-dependence links, in a dependence cycle in a single loop, a sink variable renaming algorithm that breaks the dependence cycle was developed in [11,12]. We extend that algorithm to break such dependence links in a dependence cycle in a nest of loops. The algorithm is described below.

Algorithm 2. The Breaking Strategy of Link Pattern II

Input:
(1) A nest of loops L = {l1, l2, ..., ln}.
(2) A set of statements S = {S0, S1, ..., S_{n-1}} that are involved in a dependence cycle.
(3) The dependence link from Sa to Sb may be either an output-dependence link or anti- and output-dependence links.
(4) A direction vector matrix $D = (\vec{d}_1, \vec{d}_2, \ldots, \vec{d}_p)$.

Output: New dependence links that do not form a new dependence cycle.

Method:
/*
Let $S_a^{src}$ and $S_b^{sink}$ represent the source and sink variables in the dependence link from Sa to Sb, respectively.
Let $\vec{d}_b$ represent the direction vector from Sa to Sb.
Let $S_a^{index}$ and $S_b^{index}$ represent the indexed expressions of the source and sink variables in the dependence link from Sa to Sb, respectively.
*/

1. If one of the elements in the direction vector matrix D includes the direction '>', then the transformation is exited and all of the statements are preserved. Otherwise, X and index represent the output variable name and the indexed expression of the sink variable in either the output-dependence or the anti- and output-dependences between the statements Sa and Sb, respectively. Simultaneously, a Boolean variable is set to false.
2. Determine which variables are true-dependent on the variable X. If a variable $S_d^{sink}$ in a statement Sd is true-dependent on the variable X and $S_d^{sink}$ is not the sink variable of another true-dependence, then the name of $S_d^{sink}$ is changed and the Boolean variable is set to true. If $S_d^{sink}$ is the sink variable of another true-dependence, then the transformation is exited and all of the statements are preserved.
3. If the Boolean variable is true, then the name of the output variable in the statement Sb is changed, and the value of each element of an extra temporary variable T refers to the value of the same element of the output variable X. Otherwise, the name of the output variable in the statement Sb is changed.
4. Add an extra assignment statement Sc, where Sc > Sa and Sc > Sb.
5. Sc takes the extra temporary variable T as its input variable and assigns the values computed in the nest of loops to the sink variable X.

Consider the example in Fig. 2(a). The data dependences in this example are (1) $S1\,[d^o]\,S2\,[d^a]\,S2$ and (2) $S2\,[d^a]\,S1$, and the direction vectors are (<,<), (<,<) and (<,<). Algorithm 2 is applied to Fig. 2(a) to produce the code shown in Fig. 2(b). The data dependences in Fig. 2(b) are: (1) $S2\,[d^a]\,S1\,[d^o]\,Sc$ and (2) $S2\,[d^a]\,Sc$ and $S2\,d^t\,Sc$. The dependence relations in Fig. 2(b) satisfy the condition of Lemma 2.2; therefore, the code in Fig. 2(b) can be vectorized via the statement reordering technique, i.e., by reordering S1 and S2 to become S2 < S1. Fig. 2(c) shows the result of this transformation. Fig. 2(d) shows the form of the vectorized code. We thus can derive the following theorem.

Fig. 2. (a) A dependence cycle with output-dependence links. (b) The codes of transformation. (c) The result of the statement reordering technique. (d) The vectorized codes.
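
Fig. 2 is likewise not reproduced. The sketch below uses a hypothetical loop body of our own that matches the stated pattern ($S1\,[d^o]\,S2\,[d^a]\,S2$ and $S2\,[d^a]\,S1$, no true-dependence inside the cycle): Algorithm 2 renames S2's output variable to T and appends a copy-back Sc, and the remaining carried anti-dependence is handled by reordering S2 before S1 (Lemma 2.2), after which each statement vectorizes into one array-section assignment:

```python
import numpy as np

N = 10
rng = np.random.default_rng(0)
A0 = rng.random((N, N)); B = rng.random((N, N)); C = rng.random((N, N))

# Original cycle: S1 [do] S2, S2 [da] S2 and S2 [da] S1, directions (<,<).
def original():
    A = A0.copy()
    for i in range(1, N - 3):
        for j in range(1, N - 3):
            A[i + 1][j + 1] = B[i][j]                # S1
            A[i][j] = A[i + 3][j + 3] + C[i][j]      # S2 (reads old A)
    return A

# After Algorithm 2 plus reordering (S2 < S1), fully vectorized:
def transformed():
    A = A0.copy()
    T = np.empty((N, N))
    lo, hi = 1, N - 3
    T[lo:hi, lo:hi] = A[lo + 3:hi + 3, lo + 3:hi + 3] + C[lo:hi, lo:hi]  # S2
    A[lo + 1:hi + 1, lo + 1:hi + 1] = B[lo:hi, lo:hi]                    # S1
    A[lo:hi, lo:hi] = T[lo:hi, lo:hi]                                    # Sc
    return A

assert np.allclose(original(), transformed())
```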

Theorem 3.2. An output-dependence link, or anti- and output-dependence links, of a dependence cycle in a nest of loops is breakable via Algorithm 2.

Proof. Similarly, to prove that the link is breakable we need to show:

(1) All of the dependence relations are preserved by Algorithm 2.
(2) The original cycle no longer exists and no new cycle forms.

The output-dependence relation is proved here; for the anti- and output-dependence relations the proof is similar.

If Sb is output-dependent on Sa, then by definition there exist some common (indexed) variable(s) in Sa and Sb which are defined first by Sa and then defined by Sb. This relation is determined by the index expressions and the relative positions (left-hand side or right-hand side of the assignment operator) of the variables occurring in these two statements, and is also related to the lexical order of the statements if it is a loop-independent output-dependence.

In case there do not exist any variables which are true-dependent on the output variable of Sb, then, since Sc > Sa, Sc > Sb, and Sc uses the new output variable of Sb to define the original variable of Sb, the order of defining the common variable(s) by Sa and Sb is actually unchanged. Moreover, in terms of the new output variable of Sb, the true-dependence relation between Sb and Sc also ensures that Sb uses the same values to define the common variable(s). So the renaming technique preserves the original output-dependence.

If there exist any variables which are true-dependent on the output variable of Sb, then the renaming technique in steps 2 and 3 of Algorithm 2 also preserves that relationship. In case those variables are themselves also the source variables of other anti-dependences, renaming semantically preserves those anti-dependences. Since there do not exist any dead statements, we need not consider the case of a variable that is loop-independently output-dependent on the output variable of Sb. Other dependences are obviously not changed by Algorithm 2. Therefore, (1) holds.

The proof of (2) is straightforward: because there does not exist a dependence arc from Sc to Sb, the original dependence cycle no longer exists and no new cycle forms. Both (1) and (2) hold, thus the link is breakable. □

3.3. Pattern III––true-dependence or other links

A dependence cycle formed by links of pattern III can only be executed in partial vector mode. We first discuss the resolving strategy for a dependence cycle formed only by true-dependences, and then extend the results to dependence cycles involving the other types of pattern-III links. Partial vectorization based on static dependence distance is somewhat conservative in terms of the parallelism exploited in a dependence cycle; to expose the parallelism more fully, we need to incorporate the concept of dynamic dependence distance.

Definition 3.1. If a dependence cycle has no dependence sub-cycle, then it is called a simple dependence cycle.

Definition 3.2. Static dependence distance in statements refers to the dependence distance found at compile-time; dynamic dependence distance in statements refers to the dependence distance found at run-time.

Let us use an n-statement simple dependence cycle, i.e., a dependence cycle which does not contain any sub-cycles, to state the concept of dynamic dependence distance. Suppose we have n statements $S_0, S_1, \ldots, S_{n-1}$ in a nest of loops, with $S_0 < S_1 < \cdots < S_{n-1}$ and $S_0 \, d^+ \, S_1 \, d^+ \cdots S_{n-1} \, d^+ \, S_0$, where $d^+ \in D \cup [D]$. The static dependence distance for $S_{(i+1) \bmod n}$ on $S_i$ is $k_i$ for all $i$, where $0 \le i \le n-1$. Since the static dependence distance in the nest of loops for $S_0$ on $S_{n-1}$ is $k_{n-1}$, $S_0$ is free of dependence for $k_{n-1}$ iterations; so $S_0$ can be executed in vector mode for $k_{n-1}$ iterations at run-time. After $S_0$ has been executed for $k_{n-1}$ iterations, the dynamic dependence distance for $S_1$ on $S_0$ becomes $k_{n-1} + k_0$, where $k_0$ is the static dependence distance for $S_1$ on $S_0$; that is, $S_1$ is free of dependence for $k_{n-1} + k_0$ iterations at run-time instead of the $k_0$ iterations seen at compile-time. Continuing in this manner, in the first cycle of execution of the simple dependence cycle the dynamic dependence distance for $S_i$, $0 \le i \le n-1$, is

$$\Big(\sum_{k=0}^{i-1} k_k + k_{n-1}\Big) \times \prod_{b=2}^{n} (U_b - L_b + 1),$$

where $L_b$ and $U_b$ denote the lower and upper bounds of the $b$th loop. Since the dynamic indirect dependence distance of each statement on itself in the dependence cycle is $\sum_{k=0}^{n-1} k_k$, after the first cycle of execution every statement can be executed in vector mode for $(\sum_{k=0}^{n-1} k_k) \times \prod_{b=2}^{n} (U_b - L_b + 1)$ iterations sequentially until the upper bound is reached. These observations are summarized in the following definition.

Definition 3.3. In a simple recurrence, the dynamic indirect dependence distance of each statement on itself is the sum of all the static direct dependence distances between each successive pair of statements on the cycle's path. In the first execution, the dynamic dependence distance of each statement is the sum of the static direct dependence distances of all the statements lexically in front of the particular statement (plus $k_{n-1}$, as derived above).
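
A quick numeric illustration (ours, not the paper's): for a three-statement simple cycle in a single loop with static distances $k_0 = 1$, $k_1 = 2$, $k_2 = 3$, the first-cycle dynamic distances and the steady-state vector length follow directly (the product over inner loop bounds is 1 for a single loop):

```python
# Static distances around a hypothetical simple cycle S0 -> S1 -> S2 -> S0.
k = [1, 2, 3]                      # k0, k1, k2; k[-1] carries S2 -> S0
n = len(k)

# First cycle: dynamic distance of Si = sum(k0..k_{i-1}) + k_{n-1}
first_cycle = [sum(k[:i]) + k[-1] for i in range(n)]
print(first_cycle)                 # [3, 4, 6]: vector lengths of S0, S1, S2

# Afterwards every statement runs in vector mode for sum(k) iterations.
print(sum(k))                      # 6
```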

Based on these descriptions, in order to partially vectorize the statements in a simple dependence cycle via the dynamic dependence strategy, the statements must be aligned according to their dynamic indirect dependence distances. We use the following algorithm to partially vectorize a simple dependence cycle.

Algorithm 3. A Partial Vectorization Method

Input:
(1) A nest of loops L = {l1, l2, ..., ln}.
(2) A set of statements S = {S0, S1, ..., S_{n-1}} that are involved in a dependence cycle.

Output: The partially vectorizable code.

Method:
1. Compute the index upper bound of the outermost loop, $\sum_{k=0}^{n-1} k_k$. Generate the outermost loop with the new index upper bound $\sum_{k=0}^{n-1} k_k$.
2. Generate the other loops with the original index bounds.
3. Align the statements in the above loops.
4. Adjust the execution times of each statement based on its dynamic indirect dependence distance, using control statements.
5. Decompose the outermost loop into an inner loop and an outer loop.
6. The index upper bounds of the inner loop and the outer loop are $\sum_{k=0}^{n-1} k_k$ and $\lceil N / \sum_{k=0}^{n-1} k_k \rceil - 1$, respectively, where N is the upper bound of the original outermost loop. Generate the outer loop with the new index upper bound $\lceil N / \sum_{k=0}^{n-1} k_k \rceil - 1$, and the inner loop with the new index upper bound $\sum_{k=0}^{n-1} k_k$.
7. Generate the other loops with the original index bounds.
8. Modify the index expression of each indexed variable in the statements so that it yields the same values as the original index expression.
9. Compute the index upper bound of the outermost loop, $\sum_{k=0}^{n-1} k_k - k_{n-1}$. Generate the outermost loop with the new index upper bound $\sum_{k=0}^{n-1} k_k - k_{n-1}$.
10. Generate the other loops with the original index bounds.
11. Adjust the iteration counts of each statement based on its remaining iterations, using control statements.

Consider the example in Fig. 3(a). Algorithm 3 is applied to Fig. 3(a) to generate the code shown in Fig. 3(b), and Fig. 3(c) shows the form of the vectorized code. This approach is significantly efficient, although it may involve some difficulties in the modification of the code. The following theorem thus can be derived.

Fig. 3. (a) A simple dependence cycle with true-dependences. (b) Partial vectorization via dynamic dependence distance. (c) The vectorized codes.
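
Since Fig. 3 is an image that is not reproduced, the sketch below shows the dynamic-distance schedule on a hypothetical two-statement simple cycle of our own (S1 [dt] S0 with static distance $k_1 = 1$; S0 [dt] S1 with static distance $k_0 = 2$): after a prologue of $k_1$ iterations of S0, each statement advances $k_0 + k_1$ iterations per round in vector mode. The slice offsets hard-code the assumed distances:

```python
import numpy as np

def sequential(A, B, n):
    # Reference semantics of the hypothetical recurrence:
    #   S0: A[i] = B[i-1] + 1      (S1 [dt] S0, static distance k1 = 1)
    #   S1: B[i] = A[i-2] * 2      (S0 [dt] S1, static distance k0 = 2)
    for i in range(2, n):
        A[i] = B[i - 1] + 1
        B[i] = A[i - 2] * 2
    return A, B

def partially_vectorized(A, B, n, k0=2, k1=1):
    K = k0 + k1               # dynamic distance of each statement on itself
    s0, s1 = 2, 2             # next unexecuted iteration of S0 and S1
    # Prologue: S0 alone is dependence-free for k1 iterations.
    A[s0:s0 + k1] = B[s0 - 1:s0 + k1 - 1] + 1
    s0 += k1
    # Steady state: each statement runs K iterations per round in vector mode.
    while s1 < n:
        j = min(s1 + K, n)
        B[s1:j] = A[s1 - 2:j - 2] * 2      # S1 needs A only up to index j-3
        s1 = j
        if s0 < n:
            j = min(s0 + K, n)
            A[s0:j] = B[s0 - 1:j - 1] + 1  # S0 needs B only up to index j-2
            s0 = j
    return A, B

n = 20
A0 = np.zeros(n); B0 = np.ones(n)
a_ref, b_ref = sequential(A0.copy(), B0.copy(), n)
a_vec, b_vec = partially_vectorized(A0.copy(), B0.copy(), n)
assert np.allclose(a_ref, a_vec) and np.allclose(b_ref, b_vec)
```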

Theorem 3.3. A simple dependence cycle can be partially vectorized via Algorithm 3.

Proof. Refer to Theorems 3.1 and 3.2. □

3.4. An algorithm for resolution of a general recurrence

All of the above results are aimed at developing an efficient algorithm to deal with a general recurrence. In the algorithm presented below, the term π-block is used. A π-block is a node of the dependence graph derived by applying the loop distribution algorithm [14] to general dependence graphs; the node is either a single statement or a set of statement(s) cyclically connected by data dependences.

Suppose there exists a general recurrence consisting of n statements. A recursive algorithm used to resolve it follows.

1. Starting from the first statement, examine one link connecting it. If it is a pattern-I link, follow the strategy described in Section 3.1. If it is a pattern-II link, use the sink variable renaming technique to break it. For a pattern-III link, ignore it temporarily and examine another link connecting the statement. If no breakable link is found, do the same for the succeeding link. Once a link is broken, go to step 2. If the last link of the last statement has been examined and still no breakable link has been found, go to step 6.
2. After breaking a link, transformed code with new relations is generated. Use a global dependence-testing algorithm to determine the new dependence relations.
3. Use Tarjan's depth-first search technique [19] to find the π-blocks.
4. Use a topological sorting algorithm [16] to reorder the sequence of π-blocks (a sketch of steps 3 and 4 follows this list).
5. Examine the reordered π-blocks in order and do the following steps.
   (1) If it is a vectorizable π-block, output the code and go back to step 5.
   (2) If it is an unvectorizable π-block, go to step 1, then go back to step 5.
   (3) If no other π-blocks are left, go to step 7.
6. Use the partial vectorization algorithm to transform the code. For a simple recurrence, use a variation of the cycle shrinking technique; for other patterns of recurrence, use the general partial vectorization strategy. After code transformation, output the code.
7. Stop.
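
Steps 3 and 4 rely on standard algorithms. In the compact sketch below (ours; the statement dependence graph is hypothetical), the π-blocks are the strongly connected components found by Tarjan's algorithm [19]. A convenient property of Tarjan's algorithm is that it emits components in reverse topological order, so reversing its output yields the ordering step 4 needs:

```python
def tarjan_sccs(graph):
    """Tarjan's algorithm: returns the strongly connected components
    (the pi-blocks) in reverse topological order."""
    index, low, on_stack = {}, {}, set()
    stack, sccs, counter = [], [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop(); on_stack.discard(w); scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs

# Hypothetical dependence graph: S0 -> S1 -> S2 -> S0 is a cycle
# (one pi-block); S3 depends on the cycle and is a singleton pi-block.
deps = {'S0': ['S1'], 'S1': ['S2'], 'S2': ['S0', 'S3'], 'S3': []}
pi_blocks = list(reversed(tarjan_sccs(deps)))   # topological order (step 4)
print(pi_blocks)    # [['S2', 'S1', 'S0'], ['S3']]
```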

This algorithm may need a code optimization phase and a performance evaluation phase at its end. The former removes redundant code from the transformed code; for example, at the time of breaking anti-dependences (or output-dependences), extra independent loops may be generated, and those loops may be combinable to make the resulting code cleaner and more efficient. The performance evaluation phase evaluates the performance of the transformed code in vector mode and decides whether the original code merits vectorization.

4. Experimental results

Both the original and the transformed versions of the programs were tested to evaluate the performance of the proposed scheme. As the test machine we chose the DECmpp model 12000, which has 1024 data processors, in our environment [13]. Vector loops [15] was used as the benchmark. The original programs in the benchmark were executed in scalar mode (sequential mode); the transformed versions of the same programs were run in parallel mode.

Suppose that kSM and kPM are the execution times of the original programs and of the transformed programs, respectively. The speed-up in Table 1 is defined as kSM/kPM. Each row in Table 1 shows how many times longer the execution of the original program took than the execution of the transformed program. For example, the first row shows that there was one subroutine, called S231, in which the execution of the original code took 41 times longer than that of the transformed code.

Table 1
The performance of the proposed methods on loops tested in Vector loops

Benchmark      Loop name    Scalar mode (s)    Parallel mode (s)    Speed-up
Vector loops   S231          3.485              0.085                41
Vector loops   S232          4.656              0.097                48
Vector loops   S221          5.559              0.109                51
Vector loops   S212          6.435              0.117                55
Vector loops   S211          6.897              0.121                57
Vector loops   S243          9.258              0.149                62
Vector loops   S241          8.875              0.136                65
Vector loops   S244         11.243              0.170                66
Vector loops   S222         14.697              0.213                69


For all of the subroutines in our experiments, the execution time of the original programs was from 41 to 69 times longer than the execution time of the transformed programs. This indicates that the proposed scheme is very significant in terms of speed-up, ranging from 41 to 69.

5. Conclusion

Parallelism exploitation for statements with dependence cycles in a nest of loops is necessary. The vectorizability of a dependence cycle depends primarily on its dependence links. There exist seven possible dependence relations for one statement on another. A dependence cycle with an anti-dependence link is breakable via node splitting [17]; in case the source variable of the anti-dependence relation is itself the sink variable of another true-dependence, a corrective strategy should be incorporated into the node splitting algorithm. An output-dependence link, or an anti- and output-dependence link, in a dependence cycle is breakable via a sink variable renaming technique; this technique is also applicable to breaking an anti-dependence link. In practice, the node splitting and sink variable renaming techniques should be used in a complementary manner. For a dependence cycle formed only by links of true-dependence, or by all the other dependences, a general, simple, but less efficient partial vectorization algorithm is available; to improve its efficiency, the dynamic dependence concept and the technique derived from it are powerful. This approach is particularly efficient for dealing with a simple dependence cycle. All of the recurrence-resolving strategies can be integrated with the dependence testing technique, Tarjan's depth-first search algorithm, and topological sorting to develop an automatic recurrence-resolving system.

References

[1] R. Allen, K. Kennedy, Automatic translation of Fortran programs to vector form, ACM Transactions on Programming Languages and Systems 9 (4) (1987) 491–542.
[2] U. Banerjee, Dependence Analysis, Kluwer Academic Publishers, Norwell, MA, 1997.
[3] U. Banerjee, Loop Parallelization, Kluwer Academic Publishers, 1994.
[4] U. Banerjee, Loop Transformations for Restructuring Compilers: The Foundations, Kluwer Academic Publishers, 1993.
[5] W. Blume, R. Eigenmann, Nonlinear and symbolic data dependence testing, IEEE Transactions on Parallel and Distributed Systems 9 (12) (1998) 1180–1194.
[6] P.-Y. Calland, A. Darte, Y. Robert, Plugging anti and output dependence removal techniques into loop parallelization algorithms, Parallel Computing 23 (1997) 251–266.
[7] W.-L. Chang, C.-P. Chu, The extension of the interval test, Parallel Computing 24 (14) (1998) 2101–2127.
[8] W.-L. Chang, C.-P. Chu, The generalized direction vector I test, Parallel Computing 27 (8) (2001) 1117–1144.
[9] W.-L. Chang, C.-P. Chu, J. Wu, The generalized lambda test: a multi-dimensional version of Banerjee's algorithm, International Journal of Parallel and Distributed Systems and Networks 2 (2) (1999) 69–78.
[10] W.-L. Chang, C.-P. Chu, The infinity lambda test: a multi-dimensional version of the Banerjee infinity test, Parallel Computing 26 (10) (2000) 1275–1295.
[11] C.-P. Chu, D.L. Carver, An analysis of recurrence relations in Fortran do-loops for vector processing, in: Proc. Fifth Parallel Processing Symposium, IEEE CS Press, Los Alamitos, CA, 1991, pp. 619–625.
[12] C.-P. Chu, D.L. Carver, Exploitation of parallelism in Fortran do-loops for vector processing, Technical Report 91-004, Department of Computer Science, LSU, 1991.
[13] MasPar Group, Parallel Programming Language, Digital Equipment Corporation, Part number: AA-PV4QA-TE, 1993.
[14] D.J. Kuck, R.H. Kuhn, D.A. Padua, B. Leasure, M. Wolfe, Dependence graphs and compiler optimizations, in: Conference Record of the 8th ACM Symposium on Principles of Programming Languages, 1981, pp. 207–218.
[15] D. Levine, D. Callahan, J. Dongarra, A comparative study of automatic vectorizing compilers, Parallel Computing 17 (1991) 1223–1244.
[16] D.E. Knuth, The Art of Computer Programming, vol. 1: Fundamental Algorithms, Addison-Wesley, Reading, MA, 1973.
[17] C.D. Polychronopoulos, Advanced loop optimizations for parallel computers, in: Proceedings of the 1987 International Conference on Supercomputing, 1987, pp. 255–277.
[18] W. Shang, M. O'Keefe, J. Fortes, On loop transformations for generalized cycle shrinking, IEEE Transactions on Parallel and Distributed Systems 5 (2) (1994) 193–204.
[19] R. Tarjan, Depth-first search and linear graph algorithms, SIAM Journal on Computing 1 (2) (1972) 146–160.
[20] M. Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley Publishing Company, Redwood City, CA, 1996.
[21] H. Zima, B. Chapman, Supercompilers for Parallel and Vector Computers, Addison-Wesley Publishing Company, 1991.
[22] W.-L. Chang, C.-P. Chu, J.-H. Wu, A precise dependence analysis for multi-dimensional arrays under specific dependence directions, The Journal of Systems and Software 63 (2) (2002) 99–107.
[23] W.-L. Chang, C.-P. Chu, J. Wu, A multi-dimensional version of the I test, Parallel Computing 27 (13) (2001) 1783–1799.
[24] W.-L. Chang, C.-P. Chu, J.-H. Wu, A polynomial-time dependence test for determining integer-valued solutions in multi-dimensional arrays under variable bounds, Journal of Supercomputing, in press.
[25] M. Guo, W.-L. Chang, J. Cao, The non-continuous direction vector I test, accepted at the International Symposium on Parallel Architectures, Algorithms, and Networks (I-SPAN 2004), Hong Kong, 2004.
[26] W.-L. Chang, J.-W. Huang, C.-P. Chu, The non-continuous I test: an improved dependence test for reducing the complexity of source-level debugging for parallel compilers, in: The Third International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'02), Kanazawa Bunka Hall, Kanazawa, Japan, 4–6 September 2002, pp. 455–462.
[27] W.-L. Chang, B.-H. Chen, A proof method for the correctness of the interval test applied to determining whether there are integer-valued solutions for one-dimensional arrays with subscripts formed by induction variables, in: The Third International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'02), Kanazawa Bunka Hall, Kanazawa, Japan, 4–6 September 2002, pp. 52–57.
[28] A. Bik, M. Girkar, P. Grey, X. Tian, Efficient exploitation of parallelism on Pentium III and Pentium 4 processor-based systems, Intel Technology Journal Q1 (March) (2001) 1–9.
[29] D. Petkov, R. Harr, S. Amarasinghe, Efficient pipelining of nested loops: unroll-and-squash, in: 16th International Parallel and Distributed Processing Symposium, Fort Lauderdale, Florida, April 2002.
[30] W.-L. Chang, C.-P. Chu, J.-H. Wu, A simple and general approach to parallelize loops with arbitrary control flow and uniform data dependence distance, The Journal of Systems and Software 63 (2) (2002) 91–98.

Weng-Long Chang received his Ph.D. degree in Computer Science and Information Engineering from National Cheng Kung University, Taiwan, Republic of China, in 1999. He is currently an Assistant Professor in the Department of Information Management, Southern Taiwan University of Technology, Tainan County 710, Taiwan, Republic of China. Dr. Chang's research interests include molecular computing (including bio-molecular cryptography, bio-molecular computational complexity and bio-molecular computational theory), parallel and distributed processing, and parallelizing compilers.

Chih-Ping Chu received a B.S. degree in agricultural chemistry from National Chung Hsing University, Taiwan, an M.S. degree in computer science from the University of California, Riverside, and a Ph.D. degree in computer science from Louisiana State University. He is currently a professor in the Department of Computer Science and Information Engineering of National Cheng Kung University, Taiwan. His research interests include parallelizing compilers, parallel computing, parallel processing, internet computing, and software engineering.

Michael (Shan-Hui) Ho, Associate Professor at Southern Taiwan University of Technology, has 25 years of industrial and academic experience in the computing field. He worked as a senior software engineer and project manager developing B2B/B2C/C2C web and multimedia applications, and as a senior database administrator for SQL clustered servers, Oracle and DB2 databases, including network/system LAN/WAN administration, for major US corporations and government organizations such as the GAS Company, Fox, Fidelity, ARCO and the US LA Hall Records, etc. He is also certified as an Oracle DBA/Developer (OCP), CCNA/CCNP, MCSE, Unix Solaris system SA, and Dell server/computer technician engineer. Dr. Ho has more than 10 years of college teaching and research experience as an Assistant Professor and Research Analyst at Central Missouri State University, the University of Texas at Austin, and BPC International Institute. He earned a Ph.D. in IS/CS with minors in Management and Accounting from the University of Texas at Austin, and an M.S. degree from St. Mary's University. His research interests include algorithm and computation theories, software engineering, databases and data mining, parallel computing, quantum computing and DNA computing.