



Boundary Integral Domain Decomposition on hierarchical memory multiprocessors

E. Gallopoulos and D. Lee
Center for Supercomputing Research and Development
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801, U.S.A.

Abstract

A method, called Boundary Integral-based Domain Decomposition, was recently proposed for the solution of Laplace's equation. The method is characterized by the complete decoupling of the problem domain into subdomains, which is possible after integral equation based techniques are used for the calculation of the solution on the subdomain boundaries. We describe some theoretical and practical issues involved in the use of such methods on shared memory multiprocessors.

1 Introduction

A method, called Boundary Integral-based Domain Decomposition (BIDD), was recently proposed for the solution of Laplace's equation in a short note [16]. Given a bounded domain $D$ and a function $g$ continuous on $\partial D$, the Dirichlet problem is to find $u$ satisfying

$$\nabla^2 u(z) = 0 \quad\quad (1)$$

when $z \in D$, and $u(z) = g(z)$ when $z \in \partial D$. The method is based upon:

1. Partitioning the domain into subdomains;

2. Solving for the interface points separating the partitions;


3. Computing the solution at predefined points in $D$.

As mentioned in [16], such a formulation is quite abstract. An implementation requires the specification of i) a domain partitioning strategy, ii) a method to compute the interface values, and iii) the method(s) to compute the solution on the subdomains.

In this paper we discuss aspects of the method which are essential for its efficient application. We are particularly concerned with the issues that make the method attractive for implementation on multiprocessor systems with a memory hierarchy. Since this is a discussion of work in progress, we also raise some questions which we are actively investigating. As will become apparent from the discussion, there are many important issues that remain to be resolved. It should also be noted that, apart from the method of fundamental solutions which is used here to compute the interface values, there are other integral equation based schemes which could be used instead. A similar comment applies to the subdomain solver. In that sense, we really regard BIDD as a family of methods.

2 Background

It was a common characteristic of most of the earlier research into parallel algorithms that very little effort was expended in investigating issues such as data management. This was a natural consequence of the fact that very few parallel machines were in existence, and even then, access to them was very limited. The interest in demonstrating algorithms with good theoretical speedups, even if this meant making quite unrealistic assumptions, was far more urgent. The coming-of-age of multiprocessor manufacturing in the 1980s allowed the building of affordable systems. Now scientists also have to implement their algorithms, which in turn means that "real" issues such as communication, synchronization, load-balancing, and efficient memory management have to be dealt with. With many of the parallel systems currently available having either distributed or hierarchical memories, this last issue is becoming a crucial consideration in the design of efficient algorithms. As mentioned in the introduction, our focus is the class of parallel architectures with a non-trivial memory hierarchy. Examples of such systems are the Cray 2, the ETA 10, the Alliant FX/8 and the CEDAR system [32]. For example, it was clearly demonstrated in [14] for the case of the Alliant FX/8 - a vector multiprocessor with a memory hierarchy consisting of registers, a fast shared cache and slower memory, which is also a single cluster of CEDAR - that algorithms designed to have increased data locality by making intelligent use of the memory hierarchy demonstrate superior performance when compared with algorithms designed only to exploit the parallelism. We have thus experienced a flurry of activity in redesigning algorithms by means of blocking techniques to reduce the memory traffic, even if that implies an increase in the computational complexity. Finally we note that, in the case of systems like CEDAR, the increase of computational resources in the form of added processor clusters will further complicate the analysis by introducing more parameters in the algorithm design space. It is intuitive, however, that algorithms which can be mapped on the architecture so that intercluster communication is minimized are particularly desirable.

There is a very large body of literature examining methods for the solution of elliptic PDEs in two dimensions. For example, rapid elliptic solvers of computational complexity at most $O(n^2 \log n)$, where $n$ is the number of gridpoints per direction, can be applied for special domains when the operator is Poisson-like in one direction ([27,44] and references therein). Some of these methods have also been extended to a wider range of equations and regions [6,38]. A potential source of trouble when implementing such algorithms on machines with hierarchical memory is that the coefficient matrices arising in the solution of the problem are sparse. The simplest example is when $D$ in Problem 1 is a rectangular region and a 5-point stencil is used, in which case the resulting coefficient matrix is block tridiagonal, i.e. sparse with a very regular sparsity pattern. Although methods of low computational complexity exist for the solution of these systems on rectangular domains, these methods can suffer from a drastic loss of data locality. For example, when block cyclic reduction (BCR) is used ([45]), the kernel of the computation is the solution of multiple tridiagonal systems. On an $n \times m$ grid, solving each system requires $O(m)$ computations for $O(m)$ data, giving an $O(1)$ computation-to-memory-load ratio. This low ratio has a negative effect on the performance of the method on architectures with hierarchical memory ([18]). In fact, the difficulty of comparing the performance of rapid elliptic solvers had been observed much earlier ([26]), even for uniprocessor architectures. A conclusion - so close to our discussion - of that early study was that

"... the operation count is not necessarily an adequate figure-of-merit in comparing theoretically the value of algorithms in numerical analysis ... Other factors, such as ... the pattern in which memory banks of the computer are referenced, may be as important as the operation count in determining the speed of a program."

The technique of domain decomposition is naturally very attractive for implementation on a parallel architecture ([3,12,31,8]). It is also a natural technique to use in dealing with irregular domains. Partitioning the spatial domain into subdomains and solving a subproblem in each of them implies a corresponding decrease in the size of the computational domain to be handled by each task, and is thus an intuitive counterpart to blocking. An advantage is that this form of blocking has an immediate physical interpretation. This is one more example concurring with the comments made by Rice ([40]) and others that there is a high level of parallelism in the physical world to be exploited. For the popular approaches based on the Schwarz alternating procedure ([35,43]), some information exchange takes place between the processes handling each of the subdomains at every step of the Schwarz iteration. As a result, its implementation on a multiple cluster architecture can suffer a performance degradation.

Overall, it is fair to say that most of the research on the solution of (1) has concentrated around finite difference and finite element techniques. Our approach is slightly different in that we use an integral equation formulation of the problem. Integral equation (IE) based techniques have been in use for a long time for the solution of (1); we point to many of the references in Section 3.2. The idea of BIDD, however, is apparently new. One of the observations made by researchers was that IE techniques help reduce the dimensionality of the problem. This comes at the expense of having to solve dense systems of equations. For regular regions where rapid elliptic solvers can be used, comparisons have shown that IE methods were more efficient only if the solution had to be calculated at a small number of points [41].

Starting from these observations, we were led to BIDD, a hybrid method in which we use IE techniques to solve for only a small number of interface points, thus keeping down the cost of the IE technique, and then decouple the problem into independent boundary value problems defined on subdomains of any suitable shape and size. The former feature practically eliminates the need for communication between subproblems, while the latter means that the subdomains can be chosen to enhance data locality.

3 Description of the method

We next give a description of the method in the form in which we are currently developing it. In this paper we are more interested in the implementation issues, as opposed to the mathematical justification, which we leave for a forthcoming paper.

3.1 Domain partitioning

Although the domain shape may impose constraints, partitioning should be done with the following objectives in mind (a small sketch illustrating the first objective follows the list):

1. The workload required for each of the subdomains should be balanced.

2. The subdomains should be of shapes suitable for the application of a rapid elliptic solver.

3. The subdomains should be of sizes which minimize the use of the most distant units in the memory hierarchy, enhancing data locality.

4. The interface points (and subdomains) should not be so many that the computation of the solution on them overwhelms the remaining computational effort.
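As a minimal sketch of the first objective (our own illustration; the function name and interface are hypothetical, not from the paper), the rows of a grid can be split into $p$ strips whose heights differ by at most one:

    # Partition n grid rows into p horizontal strips of nearly equal size.
    # Returns (first_row, last_row) pairs, 0-based and inclusive.
    def strip_partition(n, p):
        base, extra = divmod(n, p)
        strips, start = [], 0
        for i in range(p):
            rows = base + (1 if i < extra else 0)  # first `extra` strips get one more row
            strips.append((start, start + rows - 1))
            start += rows
        return strips

    # Example: 118 internal rows on 8 clusters -> strip heights 15 and 14.
    print(strip_partition(118, 8))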

3.2 Computing the interface values

The integral equations of potential theory furnish us with a host of schemes for the calculation of interface values. We choose the method of fundamental solutions, also known as the charge simulation method, hereafter denoted CSM.

An approximation $\hat u$ to the solution $u$ is sought as a finite linear combination of a linearly independent set of particular solutions $\{\phi_1, \ldots, \phi_N\}$ of $\nabla^2 u = 0$:

$$\hat u(z) = \sum_{j=1}^{N} \sigma_j \phi_j(z). \quad\quad (2)$$

The method is characterized by choosing the $\phi_j$ to be fundamental solutions of Laplace's equation. We recall ([22]) that for a domain $D$, a function $v$ of two complex variables $(z, w) \in (D, D)$ is called a fundamental solution of Laplace's equation for $D$ if for each $w \in D$, $v$ as a function of $z$ is harmonic in $D \setminus \{w\}$ and at $z = w$ has an isolated singularity such that

$$\lim_{z \to w} v(z, w) = +\infty.$$

It follows from standard theory that

$$v(z, w) = \kappa \log \frac{1}{|z - w|} + h(z, w),$$

where $\kappa > 0$ and $h$ is a harmonic function of $z \neq w$ that can be continued harmonically at $w$. For our purpose, we choose the simple, normalized fundamental solutions

$$\phi_j(z) = -\frac{1}{2\pi} \log |z - w_j|. \quad\quad (3)$$

The singularities $w_j$ lie in the exterior $D^c$ of $\bar D = D \cup \partial D$. In physical terms, the method amounts to placing particle charges of strength $\sigma_j$ at the points $w_j$. Assuming that the law of attraction is that of inverse first power, each of these charges generates a logarithmic potential field ([30, p. 63]). In CSM we want the charges to combine to a potential $\hat u$ which approximates $u$ in $D$. In other words, $\hat u$ is the single-layer potential generated by means of a discrete set of monopoles. Because the charges are placed outside $\bar D$, computations with singular integrands are avoided, and thus one source of complication with boundary integral schemes is removed. Moreover, the same approach can be followed for any homogeneous problem with known fundamental solutions. We must next decide on

1. the number and location of the points $w_j$,

2. the strengths $\sigma_j$.

In [33] the combined problem of charge placement and strength determination was solved for various domains and boundary conditions. The method used was based on solving a non-linear least squares problem for the charge strengths and charge coordinates. It is clearly the case that such a general approach would result in a better approximation to the solution. A disadvantage is the extra cost of the non-linear least squares algorithm. Our approach here is to fix the location of the charges on a circle of radius $R$ enclosing $D$ and then compute the charge strengths. We note, however, that this approach is not always successful. The charge strengths must be determined from the boundary data. This can be done by calculating the boundary values at the discrete set of observation points$^1$ $T = \{z_1, \ldots, z_\nu\}$, where $T \subset \partial D$, and minimizing $|g(z) - \hat u(z)|$ for $z \in T$, with $\hat u$ as in Eqs. (2) and (3). The problem of best approximation with functions of this form is discussed in [33]. The choice of observation points has also been discussed in the context of boundary collocation methods for Problem 1 by [39].

The origins of the method are in [37]; [21,42,36,33,9,13,4] and others made valuable contributions. It is closely related to the methods described in [2,23,5,28,24,29,34,7]. [13] have drawn the link between boundary integral methods and the method of fundamental solutions.

In discrete form we are looking for $\sigma \in \Re^N$ satisfying

$$\min_\sigma \|u - G\sigma\|_p, \quad\quad (4)$$

where $G \in \Re^{\nu \times N}$ is an influence matrix with general term $-\frac{1}{2\pi} \log |z_k - w_j|$.

We distinguish the following cases:

$\nu = N$: This is equivalent to seeking a $\gamma$-polynomial ([25]) interpolating the $\nu$ boundary values. We thus collocate $G\sigma = u$.

$\nu > N$: The system is overdetermined and we distinguish two subcases:

- $p = 2$, and we minimize in a least squares sense.

- $p = \infty$, and we minimize in a Chebyshev sense.

Solving the interpolation problem may imply difficulties for special boundary conditions and boundary shapes. Although we do not discuss Chebyshev schemes in this paper, we note that they can be computationally expensive but can also offer a better approximation to the solution. Our choice was a QR factorization of $G$ and solution for both $\nu = N$ and $\nu > N$ with $p = 2$. The possible ill-conditioning of $G$ also makes the use of a QR method more appropriate than LU decomposition. This ill-conditioning was observed in [33] independently of the integral formulation of the problem.

$^1$The term is from [13].

Figure 1: BIDD for an irregular region

Letting $N \to \infty$, the ill-conditioning is traced to the continuous form of Eq. (2), which is an integral equation of the first kind. Christiansen's work is fundamental in deriving estimates for the condition of $G$ for certain regions [10,19]. We note, however, that the computed $\sigma$ is just an intermediate result of the computation. Since the final result is obtained after multiplying with $H$, the effect of a less accurate $\sigma$ might not be so serious.

Once the charges are available, it is possible to solve for any point of $\bar D$. If the solution must be computed at $\mu$ interface points $\zeta_k$, we form the influence matrix $H \in \Re^{\mu \times N}$ with general term $-\frac{1}{2\pi} \log |\zeta_k - w_j|$ and compute

$$\phi = H\sigma. \quad\quad (5)$$

Figure 1 depicts the relative locations of the charge, observation and interface points for an arbitrary closed region.
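To make Eqs. (2)-(5) concrete, the following minimal Python sketch runs the whole CSM pipeline on the unit square with the boundary data $g_A(x,y) = e^{ax}\sin(ay)$, $a = 2$, used later in Section 4.1. This is our own illustration, not the authors' implementation: numpy's lstsq (an orthogonalization-based solve) stands in for the QR routines of the paper, and all names and parameter values are assumptions.

    import numpy as np

    def influence_matrix(points, charges):
        # General term -(1/(2*pi)) log|z_k - w_j|; points and charges are complex.
        return -np.log(np.abs(points[:, None] - charges[None, :])) / (2.0 * np.pi)

    def gA(z):
        # Harmonic boundary data e^{ax} sin(ay) with a = 2.
        return np.exp(2 * z.real) * np.sin(2 * z.imag)

    # N charges equispaced on a circle of radius R = 2 enclosing D = [-1,1]^2.
    N, nu, R = 40, 60, 2.0
    w = R * np.exp(2j * np.pi * np.arange(N) / N)

    # nu observation points z_k equispaced along the boundary of the square.
    t = np.linspace(0.0, 4.0, nu, endpoint=False)
    side, s = np.floor(t).astype(int), t % 1.0
    corners = np.array([-1 - 1j, 1 - 1j, 1 + 1j, -1 + 1j])
    z = corners[side] + (corners[(side + 1) % 4] - corners[side]) * s

    # Eq. (4): charge strengths sigma from a linear least squares fit (p = 2).
    G = influence_matrix(z, w)
    sigma, *_ = np.linalg.lstsq(G, gA(z), rcond=None)

    # Eq. (5): evaluate the approximation at a few interface points inside D.
    zeta = np.linspace(-0.9, 0.9, 7) + 0.3j      # sample points on the line y = 0.3
    phi = influence_matrix(zeta, w) @ sigma
    print(np.max(np.abs(phi - gA(zeta))))        # compare against the exact solution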

3.3 Computing the solution in each subdomain

Once the interface values are available, the solution for each subdomain can proceed independently by means of any convenient method. Clearly the method to be used is intimately coupled with the subdomain shapes, with more regular shapes allowing the use of rapid elliptic solvers. We must not forget, however, that such a split is not always the most efficient from the point of view of load balancing.

3.4 Discussion

Ultimately the total time to solve the problem will depend upon

C1 The complexity of computing the charges and subsequently solving for the interface values;


C2 The complexity of solving for each of the subdomains.

Great care must be exercised so that the work involved in the former step does not overwhelm the work in the latter. Consider for example a domain composed of a small number of subdomains, for each of which a rapid elliptic solver can be applied. The complexity of the QR-based solver for $\sigma$ is $O(\nu N^2)$. The $\mu$ interface values are then obtained from Eq. (5) at a cost of $O(\mu N)$ operations. Finally, if there are $O(n^2)$ gridpoints per subdomain, Step C2 will require $O(n^2 \log n)$ operations. From these counts it is immediately clear that if the problem requires $N = O(\nu)$ and $n = O(\nu)$, the computational cost of Step C1 can dominate that of C2. From the same argument it also becomes clear why using only CSM to solve for all the gridpoints is expensive: in that case $\mu = O(n^2)$ and the cost of the matrix-vector multiplication in Eq. (5) will be $O(n^2 N)$, which can easily overwhelm the cost of a rapid elliptic solver.
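Collecting the counts just quoted into one expression (a restatement of the text above, with the interface count $\mu = 2m(p-1)$ of Section 4 substituted for concreteness):

$$\underbrace{O(\nu N^2)}_{\text{C1: charges}} \;+\; \underbrace{O(\mu N)}_{\text{C1: interfaces, Eq. (5)}} \;+\; \underbrace{O(p\, n^2 \log n)}_{\text{C2: subdomain solves}}, \qquad \mu = 2m(p-1).$$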

Despite the pessimistic predictions made from the complexity analysis, our experiments, as well as those in [33], show that the method can be very competitive. Some factors contributing to this are as follows. The use of a rapid elliptic solver (e.g. FFT, BCR or multigrid based) for the subdomains may not be possible. In that case slower methods may have to be used, in which case the gap in computational complexity between the subdomain solvers and CSM will diminish. We also point to the theoretical analysis in [33] of the simple case of concentric disks, where very high order convergence rates are demonstrated for boundary functions possessing high order derivatives which are Hölder continuous. This in turn means that $N$ can be kept as small as $n^{2/k}$ for some positive integer $k$. A full analysis for more general regions, however, remains to be done. Finally, as discussed in Section 2, the performance of the method will be influenced by many other factors in addition to the computational complexity.

Another issue of concern is related to the effectiveness of the method. It is not hard to construct examples where the algorithm does very badly in its current form, i.e. when the charge locations are fixed. This was observed as early as 1961 by Davis and Rabinowitz ([11]) for a slightly different formulation of the problem. We leave the more extensive discussion of these aspects for another paper [15]. We repeat, however, the comment in [11, p. 122] that the most favorable condition is that of $g$ coming from entire functions or from harmonic functions regular in large portions of the plane, with geometric convergence rates possible.

Figure 2: BIDD on square grid with p strips.

Boundary data corresponding to solutions which do not continue harmonically across the boundary, or which are of low continuity class, can cause problems. In that case the shape of $\partial D$ becomes important, with better behavior expected for analytic, convex boundaries or starlike regions. Results for the application of CSM on a racetrack with boundary function $g_B(x, y) = 0.25(x^2 + y^2)$ (torsion problem) can be found in [15]. In [17] we show that applying CSM on a disk when the charge locations lie on a circle concentric to the disk is equivalent to the discrete Poisson kernel method described in [22].

4 A multicluster BIDD algorithm

In this section we present an algorithm for the solution of Problem 1 with BIDD when $D$ is a rectangle. Although for this particular region many simplifications could be applied, both in BIDD (e.g. taking advantage of the symmetries in the region) and in the other methods, we do not take advantage of them, since the purpose is to show the structure of the algorithm in a manner suitable for implementation on a wide variety of regions.

We assume that the number of clusters $p$ is the same as the number of subdomains into which $D$ is partitioned. We are mostly interested in small values of $p$ and large values of $m$ and $n$, although our current implementation does not allow very fine grids because of memory limitations. The domain is partitioned into $p$ horizontal strips of $\frac{n}{p} \times m$ gridpoints each.

We next consider the computation of the interface values. In the most obvious partitioning strategy, subdomain $D_i$ would have two boundaries in the $x$-direction. Call them $\Gamma_{i-1}$ and $\Gamma_i$, with $\Gamma_0$ and $\Gamma_p$ lying along $y = -1$ and $y = 1$ respectively. Hence there would be $p - 1$ interface lines on which the solution must be computed by the $p$ clusters. Let subdomain $D_i$ and interface line $\Gamma_i$ be assigned to cluster $i$. Such a partition of the computation has two disadvantages. Firstly, one cluster will remain idle. More seriously, cluster $i$ will also need $m$ items of information from cluster $i - 1$ in the form of interface values at the $m$ gridpoints corresponding to $\Gamma_{i-1}$. If intercluster communication (e.g. through a global memory) is expensive, the strategy described next may be more efficient. This is based on subdividing $D$ in such a manner that the subdomain (really the discrete subregion) handled by each cluster is completely independent from any other subregion, in that even the interface points are disjoint. This is possible due to the ability of CSM to compute the solution at any point of $\bar D$ without referencing any neighboring points. As shown in Figure 2, adjacent strips $D_i$ and $D_{i+1}$ have as boundaries in the $x$-direction the gridpoints corresponding to $\Gamma_{i-1,2}$ and $\Gamma_{i,1}$. In this case, cluster $i$ computes the solution in $D_i$ after computing the $2m$ interface values on $\Gamma_{i-1,2}$ and $\Gamma_{i,1}$. As before, clusters 1 and $p$ will have less work to do, but the communication step is avoided at the expense of having to compute $2m$ instead of $m$ interface values per cluster. At the same time, however, a computational gain is made, since the boundary value problem to be solved in each cluster will be one gridpoint row smaller in size.

BIDD.RECT($n, m, \nu, g, R, N, w$)

Input: $m$ (internal) gridpoints in the $x$ direction; $n$ (internal) gridpoints in the $y$ direction; $\nu$ boundary "collocation" points; boundary values $g = [g_1, \ldots, g_\nu]$; $N$ charges on a circle of radius $R$; $w$ the vector of locations $w_j$ of the $N$ charges.

Output: Approximate solution $u$ computed on an $n \times m$ grid.

Comment: Compute the charge strengths.

A0 Compute the $\nu \times N$ elements of the influence matrix $G$ defined in Eq. (4).

A1 Compute the QR factorization of $G$.

A2 Compute $\sigma \in \Re^N$ satisfying $\min \|g - G\sigma\|_2$.

Comment: Compute the interface values.

do $i = 1, \ldots, p$

  B0 Compute the $2m \times N$ elements of the influence matrix $H_i$ from the charges to the gridpoints defined on $\Gamma_{i-1,2}$ and $\Gamma_{i,1}$.

  B1 Compute $\phi_i = H_i \sigma$; hence $\phi_i \in \Re^{2m}$ consists of the interface (grid) values on $\Gamma_{i-1,2}$ and $\Gamma_{i,1}$.

enddo

Comment: Compute the solution.

do $i = 1, \ldots, p$

  C0 Using the values of $\phi_i$, compute the solution at the $(\frac{n}{p} - 2) \times m$ gridpoints corresponding to $D_i$ with a rapid elliptic solver.

enddo
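The listing translates almost line for line into the following sketch, again our own and heavily simplified: a sparse direct 5-point solve stands in for the rapid elliptic solver of step C0, the boundary data is passed as a callable rather than the vector $g$ of the listing, and all names and parameter values are hypothetical.

    import numpy as np
    from scipy.sparse import diags, identity, kron
    from scipy.sparse.linalg import spsolve

    def influence(points, charges):
        # CSM influence matrix, general term -(1/(2*pi)) log|z_k - w_j|.
        return -np.log(np.abs(points[:, None] - charges[None, :])) / (2.0 * np.pi)

    def bidd_rect(n, m, nu, g, R, N, p):
        # D = [-1,1]^2 with n internal gridpoint rows (y) and m columns (x).
        hx, hy = 2.0 / (m + 1), 2.0 / (n + 1)
        xs = -1.0 + hx * np.arange(1, m + 1)
        ys = -1.0 + hy * np.arange(1, n + 1)

        # Steps A0-A2: charges on a circle of radius R, strengths by least squares.
        w = R * np.exp(2j * np.pi * np.arange(N) / N)
        t = np.linspace(0.0, 4.0, nu, endpoint=False)
        side, s = np.floor(t).astype(int), t % 1.0
        c = np.array([-1 - 1j, 1 - 1j, 1 + 1j, -1 + 1j])
        zb = c[side] + (c[(side + 1) % 4] - c[side]) * s
        sigma, *_ = np.linalg.lstsq(influence(zb, w), g(zb), rcond=None)

        u = np.empty((n, m))
        for idx in np.array_split(np.arange(n), p):     # p horizontal strips
            # Steps B0-B1: CSM values on the rows just below and above the strip.
            ylo, yhi = ys[idx[0]] - hy, ys[idx[-1]] + hy
            bot = influence(xs + 1j * ylo, w) @ sigma
            top = influence(xs + 1j * yhi, w) @ sigma
            # Step C0: 5-point Dirichlet solve on the strip; a sparse direct
            # solve stands in for the rapid elliptic solver of the paper.
            k = len(idx)
            Tx = diags([-1.0, 2.0, -1.0], [-1, 0, 1], (m, m)) / hx**2
            Ty = diags([-1.0, 2.0, -1.0], [-1, 0, 1], (k, k)) / hy**2
            A = kron(identity(k), Tx) + kron(Ty, identity(m))
            b = np.zeros((k, m))
            b[0, :] += bot / hy**2                      # interface below the strip
            b[-1, :] += top / hy**2                     # interface above the strip
            b[:, 0] += g(-1.0 + 1j * ys[idx]) / hx**2   # left physical boundary
            b[:, -1] += g(1.0 + 1j * ys[idx]) / hx**2   # right physical boundary
            u[idx, :] = spsolve(A.tocsc(), b.ravel()).reshape(k, m)
        return u

    gA = lambda z: np.exp(2 * z.real) * np.sin(2 * z.imag)   # a = 2, as in Section 4.1
    u = bidd_rect(n=30, m=30, nu=60, g=gA, R=2.0, N=40, p=4)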

4.1 Implementation

We first make the following observations.

1. In a multicluster environment, steps B0, B1 and C0 for each instance of the loop variable $i$ can be concatenated to form a task. This task can then be processed by one processor cluster.

2. Steps A0 and B0 are initialization steps, consisting of the computation of fundamental solutions. For Problem 1 this amounts to the fast computation of $N(\mu + \nu)$ logarithms, where $\mu = 2m(p - 1)$. It thus becomes desirable to have available intrinsic library functions which can exploit not only the parallelism but also the vector capabilities of the architecture [1]. In that case $H$ and $G$ would be evaluated by means of concurrent calls to the vector logarithm instruction.

3. Although for large values of $\nu$ it would be natural to spread the QR decomposition in Steps A1 and A2 across clusters, we have not done so at this stage of our experiments.

4. Different partitionings, e.g. along the $x$-direction as well, may offer better performance and are under investigation.


We have implemented the algorithm on an Alliant FX/8 multiprocessor in double precision arithmetic. The boundary function was $g_A(x, y) = e^{ax} \sin(ay)$. The exact solution for $g_A$ is given by the same function. We were interested in the expected performance of the algorithm on a multicluster architecture such as CEDAR. We note that the Alliant FX/8 has no vector logarithm instructions. Since it was chosen to be the CEDAR cluster, we could obtain estimates for the performance of the algorithm as follows: let $T_{chrg}$ be the measured time for step A0, and $T_{ch}$ the measured time for steps A1 and A2, all steps performed on a single cluster. Let $T_{sol}$ be the single cluster time for the $p$ occurrences of steps B0 together with B1. Let $T_{RES}$ be the single cluster time for the $p$ occurrences of step C0. Then the single cluster time for the algorithm is

$$T_1 = T_{chrg} + T_{ch} + T_{sol} + T_{RES},$$

whereas on $p$ clusters the time will be approximately

$$T_p = T_{chrg} + T_{ch} + \frac{T_{sol}}{p} + \frac{T_{RES}}{p}.$$

Moreover,

$$T_{CSM} = T_{chrg} + T_{ch} + \frac{T_{sol}}{p}$$

denotes the time spent on the CSM calculation of the interface values. What is missing in the above is of course the overhead associated with the multicluster implementation, in particular that due to the traffic between global memory and clusters.
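The model is trivial to evaluate; a helper such as the following (ours, with made-up timings) makes explicit that only the B and C steps are assumed to split across the $p$ clusters:

    def estimated_parallel_time(t_chrg, t_ch, t_sol, t_res, p):
        # T_p: steps A0-A2 remain on one cluster; B and C steps split p ways.
        return t_chrg + t_ch + t_sol / p + t_res / p

    # Hypothetical single-cluster measurements, in seconds.
    print(estimated_parallel_time(0.4, 0.9, 2.0, 16.0, p=8))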

The charge points are fixed at $w_j = R\,e^{2\pi i (j-1)/N}$ with $j = 1, \ldots, N$, and $R = 2$. The $\nu$ "collocation" points $z_k$ are equispaced on $\partial D$. Steps A1 and A2 use the Alliant Scientific Library subroutines DBQRDC and DBQRSL. These use block Householder transformations and are based on the work by Harrod ([20]). The matrix-vector multiplication step B1 uses the Cray-compatible BLAS2 subroutine MXV written by Gallivan, Jalby and Meier, included in the CSRD Scientific Library.

Step C0 is implemented with subroutine HWSCRT from FISHPACK ([45,46]). The original formulation of BCR suffers from a bottleneck, as it requires the solution of linear systems whose coefficient matrices are products of tridiagonal matrices. We have thus also used the algorithm described in [18], which overcomes the problem by using partial fraction expansions.

In all subsequent figures, the case $p = 1$ corresponds to no partitioning, i.e. using BCR for the entire domain. The times are in seconds.


Figure 3: Time $T_{ch}$ (sec) for the computation of the charges.


Figure 4: $\log_{10}$ of the maximum error for various partition levels.

Figure 3 shows the time $T_{ch}$ required for the computation of the charges as a function of the number of collocation points $\nu$, for a fixed number of charges, on one cluster. In summary, the module BIDD.RECT had parameter list $(n, 12, \nu, g_A, 2.0, 40, w)$. One sees the linear dependence in $\nu$ predicted from $T_{ch} = O(\nu N^2)$. Figure 4 shows the $\log_{10}$ of the maximum error in the solution as a function of the number of partitions. Each of the strips is solved by means of the parallel version of BCR. In summary, the module BIDD.RECT had parameter list $(118, 118, 60, g_A, 2.0, 40, w)$ and $a = 2$ in $g_A$; hence $N$ and $\nu$ are relatively small. We observe that the decomposition of the problem in this case results in higher accuracy. This behavior was observed for all resolutions tried in our experiments. We note in this context that the reduction in size of the coefficient matrix for the subproblems, relative to the entire problem, implies that each subproblem will enjoy a smaller condition number.
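This last remark can be made quantitative under an assumption not spelled out in the paper, namely the standard 5-point discretization: on a strip of $k \times m$ internal gridpoints the eigenvalues of the discrete Dirichlet Laplacian are known in closed form, giving

$$\kappa_2(A) = \frac{4\sin^2\frac{k\pi}{2(k+1)} + 4\sin^2\frac{m\pi}{2(m+1)}}{4\sin^2\frac{\pi}{2(k+1)} + 4\sin^2\frac{\pi}{2(m+1)}},$$

which decreases as the strip height $k = n/p$ decreases while $m$ is held fixed.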


Figure 5: Estimated parallel time $T_p$ (sec), $\nu = 60$, $n = 418$.


Figure 7: Estimated parallel time $T_p$ (sec), $\nu = 1{,}680$, $n = 418$.


Figure 6: Percentage of parallel time spent on CSM, $\nu = 60$, $n = 418$.

For the next set of figures, our interest is in the reduction in time offered by the multicluster BIDD algorithm versus BCR. Figure 5 shows $T_p$ for the original BCR algorithm as a function of $p$. The parameter list this time is $(418, 418, 60, g_A, 2.0, 40, w)$. Figure 6 shows $\frac{T_{CSM}}{T_p} \times 100$ for the same parameter values. Figures 7 and 8 depict the same quantities, but for $\nu = 1{,}680 = 4 \times 420$. The figures clearly show the very impressive improvements in time when BIDD is used in this case. Figure 7 corresponds to the case when the boundary values used in the collocation are sampled at the same density as the number of gridpoints. Although from Figure 4 it is easy to see that such a sampling rate is unnecessary for this case, BIDD still offers a great improvement in performance. The remaining figures depict the same quantities as before, but using the parallel version of BCR ([18]). Nevertheless, a comparison with a multicluster version of BCR remains to be done.

Figure 8: Percentage of parallel time spent on CSM, $\nu = 1{,}680$, $n = 418$.


Figure 9: Estimated parallel time $T_p$ (sec), $\nu = 60$, $n = 418$, with BCR from [18].


Figure 10: Percentage of parallel time spent on CSM, $\nu = 60$, $n = 418$, with BCR from [18].

Figure 11: Estimated parallel time $T_p$ (sec), $\nu = 1{,}680$, $n = 418$, with BCR from [18].


Figure 12: Percentage of parallel time spent on CSM, $\nu = 1{,}680$, $n = 418$, with BCR from [18].

In this case, the performance improvements are not as good. The reason is that BCR is much faster, and the effect of the cost of $T_{CSM}$, and of $T_{ch}$ in particular, becomes apparent. The degradation is particularly obvious when $\nu = 1{,}680$ in Fig. 11. It is clear that a multicluster QR decomposition algorithm would improve matters drastically. Moreover, it is worth considering partitionings along both directions in order to cut down the size of the tridiagonal systems which must be solved at each step of BCR.

We also experimented with the application of the method to the problem

$$\nabla^2 u(z) - \lambda u(z) = 0 \quad\quad (6)$$

when $z \in D$, and $u(z) = e^{a(x+y)}$ when $z \in \partial D$, with $a = \sqrt{\lambda/2}$, so that the true solution is also given by $u(z) = e^{a(x+y)}$. The fundamental solutions are then $\phi_j(z) = K_0(\sqrt{\lambda}\,|z - w_j|)$. $K_0$ (a modified Bessel function) was evaluated by calling function DBESK0 from the function library FNLIB developed by Wayne Fullerton. Our experiments showed an immediate surge in $T_{CSM}$, all of which is due to the overhead in computing the influence matrices $H$ and $G$, thus reinforcing the need for efficient vector/multiprocessor versions of libraries of mathematical functions.

4.2 Error estimates

Since the approximate solution $\hat u$ is by construction harmonic in the domain, the error $w = u - \hat u$ is also harmonic. By the maximum principle ([22]), $w$ attains its maximum on $\partial D$. Hence, when CSM is used by itself for Problem 1, good error estimates can be derived by evaluating $\|g - H\sigma\|_\infty$, where $H$ is the influence matrix from the $w_j$'s to $l \gg \nu$ points on $\partial D$.
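In the notation of the sketch in Section 3.2, this estimate reads as follows (our helper; influence_matrix, w and sigma are the hypothetical names used there, and the boundary is sampled at $l \gg \nu$ points):

    import numpy as np

    def max_error_estimate(g, influence_matrix, w, sigma, boundary_points):
        # The error u - u_hat is harmonic, so by the maximum principle its
        # maximum over D is attained on the boundary; a dense sampling of the
        # boundary residual g - H sigma therefore bounds the interior error.
        H = influence_matrix(boundary_points, w)   # l x N influence matrix, l >> nu
        return np.max(np.abs(g(boundary_points) - H @ sigma))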

5 Conclusions

We saw that Boundary Integral-based Domain Decomposition holds great promise for parallel architectures. The ability to completely decouple the problem helps to reduce the global memory traffic and synchronization on a multiple cluster, shared-memory system, and to reduce the communication complexity on a message-passing architecture. We have also presented many of the theoretical and practical questions associated with the method that are currently under investigation.


Acknowledgements

This research was supported by the National Science Foundation under Grants No. US NSF-MIP-8410110 and US NSF DCR85-09970, by the US Department of Energy under Grant No. DOE DEFG02-85ER25001, by the US Air Force under Contract AFOSR-85-0211, and by an IBM donation.

We would like to thank S. Eisenstat, K. Gallivan, Y. Saad, A. Sameh, and H. Wijshoff for helpful discussions, R. Skeel for his careful reading of the manuscript and comments, and M. Anderson for his help with LaTeX.

References

[1] R. C. Agarwal, J. W. Cooley, F. G. Gustavson, J. B. Shearer, G. Slishman, and B. Tuckerman. New scalar and vector elementary functions for the IBM System/370. IBM J. Res. Develop., 30(2):126-144, March 1986.

[2] S. Bergman and J. G. Herriot. Numerical solution of boundary value problems by the method of integral operators. Numer. Math., 7:42-65, 1965.

[3] P. E. Bjørstad and O. Widlund. Iterative methods for the solution of elliptic problems on regions partitioned into substructures. SIAM J. Numer. Anal., 23(6):1097-1120, December 1986.

[4] A. Bogomolny. Fundamental solutions method for elliptic boundary value problems. SIAM J. Numer. Anal., 22(4):644-669, August 1985.

[5] C. Brebbia. Boundary integral formulations. In C. Brebbia, editor, Topics in Boundary Element Research: Basic Principles and Applications, pages 1-12, Springer-Verlag, 1984.

[6] B. Buzbee, F. Dorr, A. George, and G. Golub. The direct solution of the discrete Poisson equation on irregular regions. SIAM J. Numer. Anal., 8(4):722-736, December 1971.

[7] J. Carrier, L. Greengard, and V. Rokhlin. A fast adaptive multipole algorithm for particle simulations. Technical Report RR-496, Yale University, Department of Computer Science, September 1986.

[8] T. F. Chan and D. C. Resasco. A domain-decomposed fast Poisson solver on a rectangle. SIAM J. Sci. Stat. Comput., 8(1):s14-s26, January 1987.

[9] S. Christiansen. A comparison of various integral equations for treating the Dirichlet problem. In C. T. H. Baker and G. F. Miller, editors, Treatment of Integral Equations by Numerical Methods, pages 12-24, Academic Press, 1982.

[10] S. Christiansen. Condition number of matrices derived from two classes of integral equations. Math. Meth. Appl. Sci., 3:364-392, 1981.

[11] P. J. Davis and P. Rabinowitz. Advances in orthonormalizing computation. In F. L. Alt, editor, Advances in Computers, pages 55-133, Academic Press, 1961.

[12] Q. V. Dinh, R. Glowinski, and J. Périaux. Solving elliptic problems by domain decomposition methods with applications. Academic Press, 1984.

[13] G. Fairweather and L. Johnston. The method of fundamental solutions for problems in potential theory. In C. T. H. Baker and G. F. Miller, editors, Treatment of Integral Equations by Numerical Methods, pages 349-359, Academic Press, 1982.

[14] K. Gallivan, W. Jalby, U. Meier, and A. Sameh. The impact of hierarchical memory systems on linear algebra algorithm design. Technical Report 625, Center for Supercomputing Research and Development, September 1987.

[15] E. Gallopoulos and D. Lee. Boundary Integral Domain Decomposition: theory and practice. Technical Report, Center for Supercomputing Research and Development, in preparation.

[16] E. Gallopoulos and D. Lee. Fast Laplace solvers by Boundary Integral based Domain Decomposition. In Proc. Third SIAM Conf. Par. Proc. Scien. Comput., 1987.

[17] E. Gallopoulos and D. Lee. Method of fundamental solutions and the discrete Poisson kernel method. 1988. Unpublished manuscript.

[18] E. Gallopoulos and Y. Saad. Parallel block cyclic reduction algorithm for the fast solution of elliptic equations. Technical Report 659, Center for Supercomputing Research and Development, April 1987.

[19] P.-C. Hansen and S. Christiansen. An SVD analysis of linear algebraic equations derived from first kind integral equations. J. Comp. Appl. Math., 12-13:341-357, May 1985.

[20] W. J. Harrod. Programming with the BLAS. In L. H. Jamieson, D. Gannon, and R. J. Douglass, editors, The Characteristics of Parallel Algorithms, pages 253-276, The MIT Press, 1987.

[21] U. Heise. Numerical properties of integral equations in which the given boundary values and the sought solutions are defined on different curves. Computers and Structures, 8:199-205, 1978.

[22] P. Henrici. Applied and Computational Complex Analysis. Volume 3, Wiley, 1986.

[23] P. Henrici. A survey of I. N. Vekua's theory of elliptic partial differential equations with analytic coefficients. Z. Angew. Math. Phys., 8:169-203, 1957.

[24] J. L. Hess and A. M. O. Smith. Calculation of potential flow about arbitrary bodies. In D. Küchemann, editor, Progress in Aeronautical Sciences, pages 1-138, Pergamon Press, 1967.

[25] C. R. Hobby and J. R. Rice. Approximation from a curve of functions. Arch. Rational Mech. Anal., 24:91-106, 1967.

[26] R. W. Hockney. Computers, compilers, and Poisson solvers. In U. Schumann, editor, Computers, Fast Elliptic Solvers, and Applications: Proc. GAMM Workshop, 1977.

[27] R. W. Hockney. Rapid elliptic solvers. In B. Hunt, editor, Numerical Methods in Applied Fluid Dynamics, pages 1-48, Academic Press, London, 1980.

[28] G. C. Hsiao, P. Kopp, and W. L. Wendland. A Galerkin collocation method for some integral equations of the first kind. Computing, 25:89-130, 1980.

[29] M. A. Jaswon and G. T. Symm. Integral Equation Methods in Potential Theory and Elastostatics. Academic Press, 1977.

[30] O. D. Kellogg. Foundations of Potential Theory. Dover, 1953.

[31] D. E. Keyes and W. D. Gropp. A comparison of domain decomposition techniques for elliptic partial differential equations and their parallel implementation. SIAM J. Sci. Stat. Comput., 8(2):s166-s202, March 1987.

[32] D. J. Kuck, E. S. Davidson, D. H. Lawrie, and A. H. Sameh. Parallel supercomputing today and the Cedar approach. Science, 231:967-974, February 1986.

[33] R. Mathon and L. Johnston. The approximate solution of elliptic boundary-value problems by fundamental solutions. SIAM J. Numer. Anal., 14:638-650, September 1977.

[34] A. Mayo. Fast high-order solution of Laplace's equation on irregular regions. SIAM J. Sci. Stat. Comput., 6:144-157, January 1985.

[35] K. Miller. Numerical analogs to the Schwarz alternating procedure. Numer. Math., 7:91-103, 1965.

[36] S. Murashima and H. Kuhara. An approximate method to solve two-dimensional Laplace's equation by means of superposition of Green's functions on a Riemann surface. J. Information Processing, 3:127-139, 1980.

[37] E. R. Oliveira. Plane stress analysis by a general integral method. J. Engng. Mech. Div. ASCE, 94(EM 1):79-101, 1968.

[38] W. Proskurowski. Capacitance matrix methods - a brief survey. In M. Schultz, editor, Elliptic Problem Solvers, pages 391-398, Academic Press, 1981.

[39] L. Reichel. On the determination of boundary collocation points for solving some problems for the Laplace operator. J. Comput. Appl. Math., 11:175-196, October 1984.

[40] J. R. Rice. Parallel methods for partial differential equations. Technical Report TR-587, Computer Science Department, Purdue University, April 1986.

[41] V. Rokhlin. Rapid solution of integral equations of classical potential theory. J. Comp. Phys., 60:187-207, 1985.

[42] H. Singer, H. Steinbigler, and P. Weiss. A charge simulation method for the calculation of high voltage fields. IEEE Trans. Power Apparatus and Systems, PAS-93:1660-1668, September 1974.

[43] D. R. Stoutemyer. Numerical implementation of the Schwarz alternating procedure for elliptic partial differential equations. SIAM J. Numer. Anal., 10(2):308-326, April 1973.

[44] P. N. Swarztrauber. Fast Poisson solvers. In G. H. Golub, editor, Studies in Numerical Analysis, pages 319-369, Mathematical Association of America, 1984.

[45] P. N. Swarztrauber and R. A. Sweet. Algorithm 541: efficient Fortran subprograms for the solution of separable elliptic partial differential equations. ACM TOMS, 5:352-364, September 1979.

[46] R. A. Sweet. A cyclic reduction algorithm for solving block tridiagonal systems of arbitrary dimension. SIAM J. Numer. Anal., 14(4):707-720, September 1977.
