
Performance Bounds for Column-Block Partitioning of Parallel Gaussian Elimination and Gauss-Jordan Methods*

Apostolos Gerasoulis
Department of Computer Science
Rutgers University
New Brunswick, NJ 08903, [email protected]

Tao Yang
Department of Computer Science
University of California
Santa Barbara, CA 93106, [email protected]

Abstract

Column-block partitioning is commonly used in the parallelization of Gaussian-Elimination (GE) and Gauss-Jordan (GJ) algorithms. It is therefore of interest to know performance bounds of such partitioning on scalable distributed-memory parallel architectures. In this paper, we use a graph-theoretic approach in deriving asymptotic performance lower bounds of column-block partitioning for both GE and GJ.

* The work presented here was supported in part by ARPA contract DABT-63-93-C-0064, by the Office of Naval Research under grant N000149310114, and by a startup fund and faculty fellowship from the University of California at Santa Barbara. The content of the information herein does not necessarily reflect the position of the Government and official endorsement should not be inferred.

The new contribution is the incorporation of communication cost in the analysis, which results in the derivation of sharper lower bounds. We use our scheduling system PYRROS to experimentally compare the actual run-time performance with that derived from these lower bounds on the nCUBE-2 hypercube parallel machine.

1 Introduction

Gaussian-Elimination (GE) and Gauss-Jordan (GJ) algorithms are widely used in the solution of linear algebraic systems and in the inversion of matrices. Parallel algorithms have been proposed for both GE and GJ [1, 3, 5, 8, 12, 14]. Consider a linear algebraic system $Ax = b$, where $A = (a_{i,j})$ is a non-singular $n \times n$ dense matrix and $b$ is the right-hand side. A very natural partitioning for GE and GJ when solving this system is a block of columns, where operations are performed between blocks of columns of size $r$, e.g. Golub and Ortega [8], Robert et al. [12]. Deriving performance lower bounds for such partitioning is of great interest since they can provide performance information for parallel programs, e.g. Cosnard [3], Ipsen et al. [11], Saad [14]. Most results in the literature for lower bound analysis ignore communication cost because it is quite difficult to incorporate such cost. In this paper, we use graph scheduling theory and some of our recent results on the granularity of task graphs [7] to derive lower bounds that incorporate communication delays. We provide experimental evidence with our scheduling and code generation system PYRROS on the nCUBE-2 parallel machine to demonstrate the sharpness of the bounds. Cappello [1] gives an asymptotic analysis of GE with block partitioning. Our work studies the optimal parallel time for any method with column-block partitioning, and our analysis is more exact.

2 Background

There are two fundamental steps in program parallelization:

1. Partitioning of the program and data, and identifying the parallelism. The parallelism of a program can be modeled as a directed acyclic task graph (DAG).

2. Scheduling the task graph on a parallel machine.

We provide a summary of results and definitions that will be needed in the later analysis.

A directed acyclic weighted task graph (DAG) is defined by a tuple $G = (V, E, C, T)$ where $V = \{n_j,\ j = 1:v\}$ is the set of task nodes and $v = |V|$ is the number of nodes, $E$ is the set of communication edges and $e = |E|$ is the number of edges, $C$ is the set of edge communication costs, and $T$ is the set of node computation costs. The value $c_{i,j} \in C$ is the communication cost incurred along the edge $e_{i,j} = (n_i, n_j) \in E$, which is zero if both nodes are mapped to the same processor. The value $\tau_i \in T$ is the execution time of node $n_i \in V$.

A task is an indivisible unit of computation, which may be an assignment statement, a subroutine, or even an entire program. We assume that tasks are convex, which means that once a task starts its execution it can run to completion without interrupting for communications, Sarkar [13]. In the task computation model, a task waits to receive all data in parallel before it starts its execution. As soon as the task completes its execution it sends the output data to all successors in parallel.

Scheduling is defined by a processor assignment mapping $PA(n_j)$ of the tasks onto the $p$ processors and by a starting time mapping $ST(n_j)$ of all nodes onto the set of positive real numbers. $CT(n_j) = ST(n_j) + \tau_j$ is defined as the completion time of task $n_j$ in this schedule.

Fig. 1(a) shows a weighted DAG with all computation weights assumed to be equal to 1. Fig. 1(b) shows a processor assignment using 2 processors. Fig. 1(c) shows a Gantt chart of a schedule for this DAG. The Gantt chart completely describes the schedule since it defines both $PA(n_j)$ and $ST(n_j)$. The scheduling problem with communication delay has been shown to be NP-complete for a general task graph in most cases, Chretienne [2], Papadimitriou and Yannakakis [10], Sarkar [13].

Performance bounds. Let $PT_{opt}(G)$ be the length of an optimal schedule for task graph $G$. Then

$$PT_{opt}(G) \ge \max\left(CP,\ \frac{Seq}{p}\right) \qquad (1)$$

where $p$ is the number of processors, $CP$ is the length of the critical path assuming zero communication along the edges, and $Seq$ is the sequential execution time, which is the summation of all task weights.

Figure 1: (a) A DAG with node weights equal to 1. (b) A processor assignment of nodes. (c) The Gantt chart of a schedule.

Granularity. To incorporate communication in the bounds we will need the concept of granularity. Let $SUCC(n_x)$ be the set of successor tasks of task $n_x$ and $PRED(n_x)$ be the set of predecessor tasks of task $n_x$. In [7], we define the grain of a task as follows:

$$g(n_x) = \min\left\{\frac{\min_{n_k \in PRED(n_x)}\{\tau_k\}}{\max_{n_k \in PRED(n_x)}\{c_{k,x}\}},\ \frac{\min_{n_k \in SUCC(n_x)}\{\tau_k\}}{\max_{n_k \in SUCC(n_x)}\{c_{x,k}\}}\right\}$$

Then the granularity of $G$ is

$$g(G) = \min_{n_x \in V}\{g(n_x)\}.$$

We call a DAG coarse grain if $g(G) \ge 1$. For a coarse grain DAG, each task receives or sends a small amount of communication compared to the computation of its neighboring tasks. When there is a sufficient number of processors, we say that a schedule uses nonlinear clustering if it assigns independent tasks to the same processor, and linear clustering otherwise. In [7], we prove the following theorem:

Theorem 1 If $g(G) \ge 1$, a schedule that uses nonlinear clustering can be transformed into a schedule that uses linear clustering with equal or shorter parallel time.
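To illustrate these definitions, the following Python sketch (our own illustration, not code from the paper) computes the grain $g(n_x)$, the granularity $g(G)$, and the general lower bound $\max(CP, Seq/p)$ of Inequality (1) for a small weighted DAG; the eight-node graph and its edge costs are hypothetical stand-ins for the DAG of Fig. 1, and the function names are ours.

    # A minimal sketch (not from the paper): grain, granularity, and the
    # lower bound max(CP, Seq/p) for a weighted DAG G = (V, E, C, T).
    tau = {n: 1.0 for n in range(1, 9)}        # node weights (all 1, as in Fig. 1)
    c = {(1, 2): 2, (1, 3): 2, (1, 4): 2,      # hypothetical edge communication costs
         (2, 5): 1, (3, 6): 1, (4, 7): 1,
         (5, 8): 2, (6, 8): 2, (7, 8): 2}

    pred = {n: [i for (i, j) in c if j == n] for n in tau}
    succ = {n: [j for (i, j) in c if i == n] for n in tau}

    def grain(x):
        # g(n_x): min over the predecessor and successor sides of (min tau)/(max c)
        sides = []
        if pred[x]:
            sides.append(min(tau[k] for k in pred[x]) / max(c[(k, x)] for k in pred[x]))
        if succ[x]:
            sides.append(min(tau[k] for k in succ[x]) / max(c[(x, k)] for k in succ[x]))
        return min(sides) if sides else float("inf")

    g_G = min(grain(x) for x in tau)           # granularity g(G); 0.5 here (fine grain)

    def critical_path():
        # longest path counting node weights only (zero communication);
        # nodes are assumed to be numbered in topological order
        longest = {}
        for n in sorted(tau):
            longest[n] = tau[n] + max((longest[i] for i in pred[n]), default=0.0)
        return max(longest.values())

    p = 2
    seq = sum(tau.values())
    bound = max(critical_path(), seq / p)      # Inequality (1)
    print(g_G, bound)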

Thus there exists a schedule with linear clustering that attains the optimal solution. This result will be useful in the derivation of lower bounds for GE and GJ later on.

In this paper, we consider asymptotic lower bounds instead of exact ones for the sake of simplicity. By asymptotic bounds we mean that certain lower-order terms of problem parameters, e.g. the size of the matrix $n$, are ignored when computing the lower bounds. Thus the lower bounds are approximations which converge to the exact lower bound when the parameter, say the size of the matrix $n$, is sufficiently large.

We also assume that the processors are fully connected. This assumption implies that the performance bounds are valid for any processor topology.

3 Parallelization of the Gauss-Jordan Algorithm

We first discuss the column-oriented partitioning of the Gauss-Jordan algorithm and then a submatrix-based column-block partitioning.

3.1 Column partitioning

    Gauss-Jordan kji form
    for k = 1 : n
      for j = k + 1 : n + 1
        T_k^j : { a(k,j) = a(k,j) / a(k,k)
                  for i = 1 : n and i != k
                    a(i,j) = a(i,j) - a(i,k) * a(k,j)
                  end }
      end
    end

Figure 2: The kji form of GJ with interior loop task partitioning.

Figure 2 shows the kji form of the GJ algorithm [5], with interior loop partitioning. The interior loop defines task $T_k^j$, which uses column $k$ to operate on (modify) column $j$ of matrix $A$.

We assume that the matrix is partitioned into column data units to be consistent with the data accessing pattern of the tasks. The dependence task graph is given in Figure 3 [3, 9, 12]. Task $T_k^{k+1}$ is a broadcasting node, sending the same column $k+1$ to all $T_{k+1}^j$, $j = k+2 : n+1$. This DAG has a degree of parallelism (the width of the DAG) equal to $n$.

The communication and computation weights can be estimated as follows. For $T_k^j$ there are about $W = n\omega$ computation operations per task, where $\omega$ is the time for performing $a_{i,j} = a_{i,j} - a_{i,k} \cdot a_{k,j}$. If $\alpha$ is the startup time for processor communication and $\beta$ is the transmission time per word, then the communication delay of sending a column between two neighboring processors is $C = \alpha + n\beta$, see Dunigan [6]. Therefore, the granularity of the GJ DAG in Figure 3 is

$$g(GJ) = \frac{W}{C} = \frac{n\omega}{\alpha + n\beta} = \frac{\omega}{\alpha/n + \beta} \approx \frac{\omega}{\beta}$$

where the "$\approx$" assumes that $n$ is large enough.
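For concreteness, here is a small NumPy sketch (our own, not from the paper) that executes the tasks $T_k^j$ of the kji form in Figure 2 sequentially; the right-hand side $b$ is stored as column $n+1$ of an augmented matrix, indices are 0-based in the code while the text uses 1-based indices, and no pivoting is performed, so non-zero pivots are assumed.

    import numpy as np

    def gj_task(M, k, j):
        # Task T_k^j of the kji GJ form (Fig. 2): use column k to modify column j
        M[k, j] = M[k, j] / M[k, k]
        for i in range(M.shape[0]):
            if i != k:
                M[i, j] -= M[i, k] * M[k, j]

    def gauss_jordan_kji(A, b):
        # Sequential execution of the GJ task graph of Fig. 3 (no pivoting)
        n = A.shape[0]
        M = np.hstack([A.astype(float), b.reshape(-1, 1).astype(float)])
        for k in range(n):                  # layer k of the DAG
            for j in range(k + 1, n + 1):   # tasks T_k^j; column n holds b
                gj_task(M, k, j)
        return M[:, n]                      # after full elimination, b holds x

    # usage:
    # A = np.array([[2.0, 1.0], [1.0, 3.0]]); b = np.array([3.0, 4.0])
    # print(gauss_jordan_kji(A, b))         # -> [1. 1.]

Each call gj_task(M, k, j) performs about $n$ multiply-add updates, which matches the task weight $W = n\omega$ used above.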

Figure 3: The GJ DAG.

We can easily see that the critical path length without communication is $CP = nW$, and the sequential time is $Seq \approx \frac{Wn^2}{2}$.

Then a general lower bound based on Inequality (1) is

$$PT_{opt}(GJ) \ge nW \ \text{ if } p \ge \frac{n}{2}, \qquad PT_{opt}(GJ) \ge \frac{n^2 W}{2p} \ \text{ if } p \le \frac{n}{2}.$$

We now derive tighter bounds by considering communication delays and using our result on linear clustering given in Theorem 1.

Theorem 2

$$PT_{opt}(GJ) \ge nW + (n-1)C \quad \text{if } W \ge C,$$
$$PT_{opt}(GJ) \ge (2n-1)W \quad \text{if } W < C.$$

Proof. If $W \ge C$, then $g(GJ) \ge 1$. We derive this bound assuming that there is an unlimited number of processors. From Theorem 1 we can assume that an optimal solution uses linear clustering.

We define layer $k$ of the GJ DAG in Fig. 3 as the set of tasks $\{T_k^{k+1}, \ldots, T_k^{n+1}\}$. We will prove that the completion time of each task $T_k^j$ at layer $k$ satisfies

$$CT(T_k^j) \ge kW + (k-1)C, \quad j = k+1, \ldots, n+1.$$

This is trivial for $k = 1$. Suppose it is true for all tasks at layer $k-1$. We will prove that it is also true for the tasks at layer $k$. We examine the completion time of each task $T_k^j$ at layer $k$. Since each task has two incoming edges from tasks at layer $k-1$, and linear clustering zeros at most one edge, task $T_k^j$ has to wait at least $C = \alpha + n\beta$ time to receive the message from one of its two predecessors, say $T_{k-1}^r$, at layer $k-1$. Therefore

$$CT(T_k^j) \ge CT(T_{k-1}^r) + C + W.$$

From the induction hypothesis we have that

$$CT(T_{k-1}^r) \ge (k-1)W + (k-2)C,$$

which implies $CT(T_k^j) \ge kW + (k-1)C$. Since $PT_{opt}(GJ) = CT(T_n^{n+1})$, the statement is true.

If $W < C$, then we can construct a graph called GJ* with all computation and communication weights equal to $W$.

Then any optimal solution for the GJ DAG is a legal schedule for the GJ* DAG, and an optimal schedule of GJ* has a length shorter than or equal to that for GJ. For the GJ* DAG, by using the same argument as in the case $W \ge C$, we have $PT_{opt}(GJ^*) \ge nW + (n-1)W$. This is also a lower bound for the GJ DAG.

Next we provide another lower bound for the case of a limited number of processors $p$.

Theorem 3

$$PT_{opt}(GJ) \ge \frac{n^2 W}{2p} + \frac{(W+C)^2 p}{2(W+2C)} \quad \text{if } W \ge C,$$
$$PT_{opt}(GJ) \ge \frac{n^2 W}{2p} + \frac{2Wp}{3} \quad \text{if } W < C.$$

Proof. Consider an optimum schedule and its parallel time $PT_{opt}$. In Fig. 3, we select an integer $k$ ($1 < k < n$). Assume that task $T_k^{k+1}$ has starting time $ST_k$ in the optimal schedule. Let $H_k = PT_{opt}(GJ) - ST_k$.

Figure 4: Examining $n-k+1$ paths. Each path has a computational length of $nW$.

The task $T_{k-1}^k$ and all predecessors of $T_{k-1}^k$ have to be executed in the interval between $0$ and $ST_k$. The total amount of arithmetic work for these tasks is $\frac{k^2 W}{2} + O(k)$.

Let us examine the $n-k+1$ paths starting from $T_1^{k+1}, T_1^{k+2}, \ldots$ and $T_1^{n+1}$, as depicted in Fig. 4.

a computational length of nW . For each path, assume that an amount t ofits computational work executes in the interval between STk and PTopt andnW � t in the interval between 0 and STk. Since t must be less than or equalto Hk, then the part of computation work at each path that must be �nishedbefore STk is at least nW �Hk if nW � Hk or 0 otherwise.Thus PTopt(GJ) � Hk + k2W=2 + (n� k + 1)(nW �Hk)p� n2W + (n� k)2W2p +Hk(1 � n� kp ):If W=C � 1, then using the result of Theorem 2, we can show thatHk � (n� k + 1)W + (n� k)C � (n� k)(W + C):Set K1 = n� k, thenPTopt(GJ) � B(K1) = (n2 +K21 )W2p +K1(W + C)(1� K1p ):We �nd K1 that maximizes B(K1) by di�erentiating B(K1) when K1 � p.We get the maximum point achieved when 0 � K1 = (W+C)pW+2C � p. ThenPTopt(GJ) � B((W + C)pW + 2C ) = n2W2p + (W + C)2p2(W + 2C):When W < C, we construct a DAG called GJ� from GJ by making allcomputation and communication weights in the GJ graph equal to W . Sinceany schedule for the GJ DAG is valid for the GJ� DAG then the lower boundis: PTopt(GJ) � PTopt(GJ�) � B(2Wp3W ) = n2W2p + 2Wp3 ):Let us now look at the parameters of some commercially available messagepassing architectures to see for what values of n, g(GJ) = W=C � !=� � 1holds true. Table 1 lists the parameters of several message passing architec-tures for single precision arithmetic.Observe that for the �rst generation of hypercube machines (Intel iPSC/2and nCUBE-1), the column-based partitioning is coarse grain for a reasonable9

                                  iPSC/2    iPSC/860    nCUBE-1    nCUBE-2
    alpha (microseconds)          697       136         383.6      200
    beta (microseconds/word)      1.6       1.6         10.4       2.4
    omega (microseconds/flop)     11.4      0.2         35.1       1.6
    g(GJ) >= 1 when               n >= 72   never       n >= 16    never
    omega/beta = lim g(GJ)        7.1       0.125       3.4        0.667

Table 1: The granularity of the GJ DAG in different architectures.

For the second generation of hypercubes, the nCUBE-2 and iPSC/860, the processor speed is very high compared to the communication speed, and column-based partitioning does not produce coarse grain tasks for any size of $n$. There are many approaches to increase the granularity of the GJ DAG and we describe one next.

3.2 Block partitioning

A popular approach for increasing the granularity of GJ is BLAS-3 block partitioning, see Dongarra et al. [5]. The $n \times n$ matrix is divided into $N \times N$ submatrices and each submatrix has size $r \times r$, where $N = n/r$. Each task $T_k^j$ in the block GJ DAG operates on a block of columns composed of $N$ submatrices.

The task definition for block GJ without pivoting can be described as:

    T_k^j : { A(k,j) = A(k,k)^{-1} * A(k,j)
              for i = 1 : N and i != k
                A(i,j) = A(i,j) - A(i,k) * A(k,j)
              end }

where $k = 1, \ldots, N$ and $j = k+1, \ldots, N+1$. Notice that task $T_k^{N+1}$ operates on the right-hand-side column $b$. Thus the definition of $T_k^{N+1}$, $k = 1, 2, \ldots$, involves only matrix-vector multiplication. For simplicity, we will exclude the tasks $T_k^{N+1}$, $k = 1, 2, \ldots$, when computing lower bounds. Such exclusion does not affect the correctness of the asymptotic lower bound calculation.
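The following NumPy sketch (our own illustration, not from the paper) runs the block GJ tasks $T_k^j$ defined above sequentially and without pivoting; the 4-dimensional block layout Ablk[i, k], the use of solve in place of an explicit inverse, and the omission of the right-hand-side block column are our choices.

    import numpy as np

    def block_gj_task(Ablk, k, j):
        # Block GJ task T_k^j: A[k,j] = inv(A[k,k]) * A[k,j], then
        # A[i,j] = A[i,j] - A[i,k] * A[k,j] for all i != k
        N = Ablk.shape[0]
        Ablk[k, j] = np.linalg.solve(Ablk[k, k], Ablk[k, j])  # avoids forming the inverse
        for i in range(N):
            if i != k:
                Ablk[i, j] -= Ablk[i, k] @ Ablk[k, j]

    def block_gauss_jordan(A, r):
        # Run the BGJ task graph sequentially on an n x n matrix, n = N * r
        n = A.shape[0]
        N = n // r
        # Ablk[i, k] is the r x r submatrix A[i*r:(i+1)*r, k*r:(k+1)*r]
        Ablk = A.reshape(N, r, N, r).swapaxes(1, 2).copy()
        for k in range(N):
            for j in range(k + 1, N):       # the T_k^{N+1} tasks on b are omitted
                block_gj_task(Ablk, k, j)
        return Ablk.swapaxes(1, 2).reshape(n, n)

Each task performs roughly $N$ multiplications of $r \times r$ blocks, i.e. about $Nr^3$ floating-point operations, which corresponds to the task weight $W = Nr^3\omega$ used in the analysis that follows.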

The dependence structure of the block GJ DAG remains the same as in Figure 3, but the degree of parallelism is reduced from $n$ to $N+1$, where $N = n/r$. Excluding the tasks $T_k^{N+1}$, the computation size of $T_k^j$ increases to about $W = Nr^3\omega$, and a message communicated between tasks is a block column of size $Nr^2$. Thus the communication delay is $C = \alpha + Nr^2\beta$.

We call the block GJ graph the BGJ DAG. The granularity $g(BGJ)$ is asymptotically equal to $\frac{r\omega}{\beta}$, an increase by a factor of $r$ compared to that of the column GJ DAG. To make $g(BGJ) \ge 1$, the minimum submatrix size is $r = \lceil \beta/\omega \rceil$, which is 2 for the nCUBE-2 and 8 for the Intel iPSC/860.

The lower bound analysis for the BGJ DAG is similar to that for the column-oriented GJ DAG. We summarize these results as follows:

Theorem 4 Let $W = Nr^3\omega$ and $C = \alpha + Nr^2\beta$.

$$PT_{opt}(BGJ) \ge \max\left((N-1)W + (N-2)C,\ \frac{N^2 W}{2p} + \frac{(W+C)^2 p}{2(W+2C)}\right) \quad \text{if } W \ge C,$$
$$PT_{opt}(BGJ) \ge \max\left((2N-3)W,\ \frac{N^2 W}{2p} + \frac{2Wp}{3}\right) \quad \text{if } W < C.$$

3.3 Gauss-Jordan with partial pivoting

    for k = 1 to n
      T_k^k : { find |a(l,k)| = max{|a(k,k)|, ..., |a(n,k)|};
                piv(k) = l; swap(a(l,k), a(k,k)) }
      for j = k + 1 to n + 1
        T_k^j : { swap(a(piv(k),j), a(k,j))
                  a(k,j) = a(k,j) / a(k,k)
                  for i = 1 to n, i != k
                    a(i,j) = a(i,j) - a(i,k) * a(k,j)
                  end }
      end
    end

Figure 5: GJP: GJ with partial pivoting and its task partitioning.

A column-based partitioning for the Gauss-Jordan algorithm with pivoting is shown in Fig. 5.

Computing the lower bound for GJ with partial pivoting is complicated by the addition of the pivoting tasks $T_k^k$ ($k = 1, 2, \ldots$). We call this DAG GJP. This graph has a structure similar to Fig. 8. We take a simple approach in this paper by deleting the tasks $T_k^k$, $k = 1, 2, \ldots$, from the graph. We call this new graph the reduced graph GJP/R. It can be shown that an optimal schedule for GJP is a legal schedule for the GJP/R graph and that $PT_{opt}(GJP) \ge PT_{opt}(GJP/R)$. Thus we can conduct the lower bound analysis on the GJP/R graph. The asymptotic lower bounds computed in the previous subsections are then still valid for GJP/R and, as a result, for GJP.

4 Gaussian-Elimination

    for k = 1 to n-1
      T_k^k : { find |a(l,k)| = max{|a(k,k)|, ..., |a(n,k)|};
                piv(k) = l; swap(a(l,k), a(k,k)) }
      for j = k + 1 to n
        T_k^j : { swap(a(piv(k),j), a(k,j))
                  a(k,j) = a(k,j) / a(k,k)
                  for i = k + 1 to n
                    a(i,j) = a(i,j) - a(i,k) * a(k,j)
                  end }
      end
    end

Figure 6: LU with partial pivoting and its column-based partitioning.

We only consider the LU part of the Gaussian Elimination algorithm. The column-oriented LU factorization with pivoting is shown in Fig. 6. The block LU factorization without pivoting is shown in Figure 7. The $n \times n$ matrix is divided into $N \times N$ submatrices and each submatrix has size $r \times r$, where $N = n/r$. The resulting task graph is similar to that of the GJ algorithm, but the weight distribution differs. The dependence graph is shown in Fig. 8. Each task $T_k^j$ operates on a block column composed of $N$ elements, where each element is an $r \times r$ submatrix. Task $T_k^k$ has weight $(N-k+1)r^3\omega/2$ and it communicates $(N-k+1)$ submatrices to $T_k^j$.

Task $T_k^j$ has weight $(N-k+1)r^3\omega$ and it sends the part of column block $j$ of size $(N-k)$ submatrices to $T_{k+1}^j$. This graph is coarse grain in the top part but becomes fine grain at the bottom. The block LU algorithm with partial pivoting is relatively complicated [5]. The performance lower bound of block LU without partial pivoting can be used for block LU with partial pivoting.

    for k = 1 to N
      T_k^k : { factorize A(k,k) as L_k * U_k
                for i = k + 1 to N
                  A(i,k) = A(i,k) * U_k^{-1}
                end }
      for j = k + 1 to N
        T_k^j : { A(k,j) = L_k^{-1} * A(k,j)
                  for i = k + 1 to N
                    A(i,j) = A(i,j) - A(i,k) * A(k,j)
                  end }
      end
    end

Figure 7: Block LU factorization and its task partitioning.

We first derive the following general lower bound for mapping this block GE/LU graph (called BLU). The sequential time is $Seq \approx N^3 r^3 \omega / 3 = n^3\omega/3$. The critical path excluding communication is $\{T_1^1, T_1^2, T_2^2, \ldots, T_N^N\}$. Thus

$$CP = \sum_{k=1}^{N}\left(\frac{N-k+1}{2} + (N-k+1)\right)\omega r^3 \approx \frac{3N^2 r^3 \omega}{4}.$$

Thus

$$PT_{opt}(BLU) \ge \max\left(\frac{n^3\omega}{3p},\ \frac{3n^2 r\omega}{4}\right).$$

Next we incorporate communication overhead into the bound. If we delete all tasks $T_k^k$, $k = 1, \ldots, N$, the remaining graph has a structure similar to Fig. 3 except that the weights are monotonically decreasing from top to bottom. We first find the place where the local granularity becomes less than 1. The new graph has a local grain value

$$g(T_k^j) \approx \frac{(N-k)r^3\omega}{\alpha + (N-k)r^2\beta}.$$

Figure 8: The task graph of block LU factorization with or without pivoting.

When $N$ is very large, the condition to make $g(T_k^j) \ge 1$ is $r\omega/\beta \ge 1$ and $k \le N - \frac{\alpha}{r^2(r\omega - \beta)}$. Let $s = N - \frac{\alpha}{r^2(r\omega - \beta)}$.

Lemma 1 If $r\omega/\beta \ge 1$, then

$$PT_{opt}(BLU) \ge W_1 + \sum_{i=2}^{s}(W_i + C_i)$$

where $W_i = (N-i+1)r^3\omega$ and $C_i = \alpha + \beta(N-i+1)r^2$.

Proof. The proof is similar to that of Theorem 2. We exclude the bottom fine-grain part of this graph, i.e., the tasks $T_k^j$ with $k \ge s+1$. Since the granularity is greater than one, the optimum parallel time must be achieved by a linear clustering. We define layer $k$ in Fig. 8 as the set of tasks $\{T_k^{k+1}, \ldots, T_k^{N}\}$. We will prove that the completion time of each task $T_k^j$ at layer $k$ satisfies

$$CT(T_k^j) \ge W_1 + \sum_{i=2}^{k}(W_i + C_i).$$

This is trivial for $k = 1$. Suppose it is true for the tasks at layer $k-1$. We examine the completion time of each task $T_k^j$ at layer $k$. Since each task has two incoming edges from tasks at layer $k-1$, and linear clustering zeros at most one edge, task $T_k^j$ has to wait at least $C_k$ to receive the message from one of its two predecessors, say $T_{k-1}^r$, at layer $k-1$. Therefore

$$CT(T_k^j) \ge CT(T_{k-1}^r) + W_k + C_k.$$

Then using the induction hypothesis we prove the statement.

Notice that when $N$ is very large or $r$ is large, $s \approx N$, and then we can simplify the bound as

$$W_1 + \sum_{i=2}^{N-1}(W_i + C_i) \approx \frac{N^2 r^2}{2}(r\omega + \beta).$$

For the case $r\omega < \beta$, we can transform the BLU graph into a graph BLU* by replacing each weight $C_i$ with $W_i$. Thus we have the following theorem:

Theorem 5 If $r\omega \ge \beta$,

$$PT_{opt}(BLU) \ge \frac{N^2 r^2}{2}(r\omega + \beta).$$

If $r\omega < \beta$,

$$PT_{opt}(BLU) \ge N^2 r^3 \omega.$$

We further estimate a lower bound for block LU when we are given a finite number of processors $p$.

Theorem 6 Let $W = r^3\omega$ and $C = r^2\beta$.

If $r\omega \ge \beta$,
$$PT_{opt}(BLU) \ge \frac{N^3 W}{3p} + \frac{(W+C)^3 p^2}{6(W+1.5C)^2}.$$

If $r\omega < \beta$,
$$PT_{opt}(BLU) \ge \frac{N^3 W}{3p} + \frac{16Wp^2}{75}.$$

Proof. For Fig. 8, we select an integer $k$ ($1 < k < N$) and use a technique similar to that of Theorem 3 to prove the statement. Assume that task $T_k^{k+1}$ has starting time $ST_k$ in an optimal schedule. Let $H_k = PT_{opt}(BLU) - ST_k$.

Thus $T_{k-1}^k$ and all predecessors of $T_{k-1}^k$ have to be executed in the interval between $0$ and $ST_k$. The total amount of work in these tasks is about

$$\sum_{i=1}^{k-1}(k-i+1)(N-i+1)W \approx \frac{Nk^2 W}{2} - \frac{k^3 W}{6}.$$

We examine the $N-k$ paths starting from $T_1^{k+1}, \ldots, T_1^{N}$ and ending at $T_N^N$. For each of these paths, the total amount of computational work that has to be performed before time $ST_k$ is

$$\sum_{i=1}^{N}(N-i+1)W - H_k \approx \frac{N^2 W}{2} - H_k.$$

Thus

$$PT_{opt}(BLU) \ge H_k + \frac{\frac{Nk^2 W}{2} - \frac{k^3 W}{6}}{p} + \frac{(N-k)\left(\frac{N^2 W}{2} - H_k\right)}{p}.$$

Let $K_1 = N - k$. Then we can simplify the above expression as

$$PT_{opt}(BLU) \ge \frac{N^3 W}{3p} + \frac{K_1^3 W}{6p} + H_k\left(1 - \frac{K_1}{p}\right).$$

If $W \ge C$ (i.e. $r\omega \ge \beta$), we choose $K_1 < p$. By Lemma 1, $H_k \ge K_1^2(W+C)/2$ and

$$PT_{opt}(BLU) \ge B(K_1) = \frac{N^3 W}{3p} + \frac{K_1^3 W}{6p} + \frac{K_1^2(W+C)}{2}\left(1 - \frac{K_1}{p}\right).$$

We differentiate $B(K_1)$:

$$\frac{d B(K_1)}{d K_1} = \frac{0.5 K_1^2 W}{p} + K_1(W+C) - \frac{1.5 K_1^2(W+C)}{p},$$

and

$$\frac{d^2 B(K_1)}{d K_1^2} = (W+C) - \frac{(2W+3C)K_1}{p}.$$

To make $\frac{d^2 B(K_1)}{d K_1^2} < 0$, we need $K_1 > \frac{(W+C)p}{2W+3C}$. Setting $\frac{d B(K_1)}{d K_1} = 0$ gives $K_1 = \frac{(W+C)p}{W+1.5C}$. Clearly $\frac{(W+C)p}{W+1.5C} > \frac{(W+C)p}{2W+3C}$. Thus

$$PT_{opt}(BLU) \ge \frac{N^3 W}{3p} + \frac{(W+C)^3 p^2}{6(W+1.5C)^2}.$$

When $r\omega < \beta$, we construct a new graph from BLU in which the communication weights $C_i$ are replaced with the computation weights $W_i$. We derive

$$PT_{opt}(BLU) \ge B\left(\frac{2Wp}{2.5W}\right) = \frac{N^3 W}{3p} + \frac{16Wp^2}{75}.$$

5 Summary and Applications

We summarize our analysis for the block GJ and GE/LU algorithms in Table 2. We have ignored the lower-order terms, assuming that $N = n/r$ is large. Notice that when $n$ is large, the asymptotic lower bounds for column partitioning are a special case of column-block partitioning. The lower bounds for algorithms without partial pivoting are also valid for algorithms with partial pivoting.

For $r\omega \ge \beta$:

GJ: $\max\left(n^2(r\omega+\beta),\ 0.5\left(\frac{n^3\omega}{p} + \frac{nr(r\omega+\beta)^2 p}{r\omega+2\beta}\right)\right)$

LU: $\max\left(\frac{3n^2 r\omega}{4},\ 0.5\,n^2(r\omega+\beta),\ \frac{n^3\omega}{3p} + \frac{(r\omega+\beta)^3 r^2 p^2}{6(r\omega+1.5\beta)^2}\right)$

For $r\omega < \beta$:

GJ: $\max\left(2n^2 r\omega,\ \frac{n^3\omega}{2p} + \frac{2nr^2\omega p}{3}\right)$

LU: $\max\left(n^2 r\omega,\ \frac{n^3\omega}{3p} + \frac{16 r^3\omega p^2}{75}\right)$

Table 2: A summary of the lower bounds.

Next we compare these lower bound results with the performance of parallelized GJ/GE programs on the nCUBE-2 using PYRROS [15]. PYRROS is an automatic scheduling and code generation tool for mapping task graphs onto message-passing machines. Table 3 contains the results for two different block partitions, $r = 8$ and $r = 16$, which result in degrees of parallelism $N = 128$ and $N = 64$, respectively, when $n = 1024$. The Speedup-Bound column is the speedup upper bound $Seq/B$, where $B$ is our lower bound from Table 2. We use $\alpha = 160\,\mu s$, $\beta = 2.4\,\mu s/\mathrm{word}$, and $\omega = 2.4\,\mu s$ [6].

Table 4 lists the results for the block Gauss-Jordan (GJ) algorithm without pivoting, with $r = 10$ and $n = 1000$. The sequential time is about 1150. Again we calculate the speedup bound $Seq/B$, where $B$ is from Table 2. We also include the Speedup-Bound1 value, which is calculated based on Inequality (1). We can see that for this case our bounds, which incorporate communication delay, provide sharper results than those using Inequality (1).
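As a check on how the Speedup-Bound columns in Tables 3 and 4 can be obtained, the sketch below (our own Python, not part of PYRROS) evaluates the Table 2 lower bounds $B$ and the corresponding speedup bounds $Seq/B$ with the nCUBE-2 parameters quoted above; the function names and the approximations $Seq \approx n^3\omega/3$ for LU and $Seq \approx n^3\omega/2$ for GJ are ours.

    def gj_bound(n, r, p, beta, omega):
        # Table 2 lower bound B for block GJ (the startup time alpha, a lower-order
        # term, has already been dropped from these asymptotic formulas)
        if r * omega >= beta:
            return max(n ** 2 * (r * omega + beta),
                       0.5 * (n ** 3 * omega / p
                              + n * r * (r * omega + beta) ** 2 * p / (r * omega + 2 * beta)))
        return max(2 * n ** 2 * r * omega,
                   n ** 3 * omega / (2.0 * p) + 2 * n * r ** 2 * omega * p / 3.0)

    def lu_bound(n, r, p, beta, omega):
        # Table 2 lower bound B for block LU (alpha dropped as above)
        if r * omega >= beta:
            return max(3 * n ** 2 * r * omega / 4.0,
                       0.5 * n ** 2 * (r * omega + beta),
                       n ** 3 * omega / (3.0 * p)
                       + (r * omega + beta) ** 3 * r ** 2 * p ** 2
                       / (6.0 * (r * omega + 1.5 * beta) ** 2))
        return max(n ** 2 * r * omega,
                   n ** 3 * omega / (3.0 * p) + 16 * r ** 3 * omega * p ** 2 / 75.0)

    beta, omega = 2.4, 2.4                  # nCUBE-2 parameters used in this section
    for p in (4, 8, 16, 32, 64):
        lu_speedup = (1024 ** 3 * omega / 3.0) / lu_bound(1024, 8, p, beta, omega)
        gj_speedup = (1000 ** 3 * omega / 2.0) / gj_bound(1000, 10, p, beta, omega)
        print(p, round(lu_speedup, 2), round(gj_speedup, 2))

With these inputs the loop closely matches, to within rounding and the dropped lower-order terms, the Speedup-Bound column of Table 3 for r = 8 and that of Table 4.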

                 r = 8, N = 128                     r = 16, N = 64
    #Proc.   Speedup-Bound  PYRROS-Speedup     Speedup-Bound  PYRROS-Speedup
    p = 4    3.99           3.8                3.99           3.7
    p = 8    7.99           7.1                7.99           6.7
    p = 16   15.98          12.5               15.87          11.2
    p = 32   31.75          19.4               28.44          21.2
    p = 64   56.89          29.7               28.44          23.6

Table 3: The block LU with pivoting performance of PYRROS for n = 1024.

From these two experiments, we can see that the lower bound performance prediction is quite good.

                 r = 10, N = 100
    #Proc.   Speedup-Bound  PYRROS-Speedup  Speedup-Bound1
    p = 4    3.99           3.99            4
    p = 8    7.99           7.6             8
    p = 16   15.6           14.1            16
    p = 32   29.0           24.3            32
    p = 64   45.3           35.1            50

Table 4: The GJ PYRROS performance and the performance bounds.

References

[1] P. R. Cappello, Gaussian elimination on a hypercube automaton, Journal of Parallel and Distributed Computing 4 (1987), 288-308.

[2] P. Chretienne, Task Scheduling over Distributed Memory Machines, Proc. of the International Workshop on Parallel and Distributed Algorithms, North-Holland, 1989.

[3] M. Cosnard, B. Tourancheau, and G. Villard, Gaussian Elimination on Message Passing Architecture, Lecture Notes in Computer Science 297, Berlin: Springer-Verlag, 1987, pp. 611-628.

[4] G. Davis, Column LU Factorization with Pivoting on a Hypercube Multiprocessor, SIAM J. Algebraic and Discrete Methods, vol. 7, pp. 538-550, 1986.

[5] J. J. Dongarra, I. S. Duff, D. Sorensen, and H. A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, 1991.

[6] T. H. Dunigan, Performance of the Intel iPSC/860 and nCUBE 6400 Hypercubes, ORNL TM-11790, Oak Ridge National Laboratory, Oak Ridge, TN, Nov. 1991.

[7] A. Gerasoulis and T. Yang, On the Granularity and Clustering of Directed Acyclic Task Graphs, IEEE Trans. on Parallel and Distributed Systems, vol. 4, no. 6, June 1993, pp. 686-701.

[8] G. Golub and J. M. Ortega, Scientific Computing: An Introduction with Parallel Computing, Academic Press, 1993.

[9] R. E. Lord, J. S. Kowalik, and S. P. Kumar, Solving Linear Algebraic Equations on an MIMD Computer, Journal of the ACM, vol. 30, pp. 103-117, 1983.

[10] C. Papadimitriou and M. Yannakakis, Towards an Architecture-Independent Analysis of Parallel Algorithms, SIAM J. Comput., vol. 19, pp. 322-328, 1990.

[11] I. C. F. Ipsen, Y. Saad, and M. Schultz, Complexity of Dense Linear System Solution on a Multiprocessor Ring, Linear Algebra and Appl., vol. 77, pp. 205-239, 1986.

[12] Y. Robert, B. Tourancheau, and G. Villard, Data allocation strategies for the Gauss and Jordan algorithms on a ring of processors, Information Processing Letters 31 (1989), 21-29.

[13] V. Sarkar, Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors, The MIT Press, 1989.

[14] Y. Saad, Gaussian Elimination on Hypercubes, in Parallel Algorithms and Architectures, M. Cosnard et al., Eds., Elsevier Science Publishers, North-Holland, 1986.

[15] T. Yang and A. Gerasoulis, PYRROS: Static Task Scheduling and Code Generation for Message-Passing Multiprocessors, Proc. of the 6th ACM International Conference on Supercomputing, Washington, D.C., 1992, pp. 428-437.
