Parallel Globally Adaptive Algorithms for Multi-dimensional Integration

J. M. Bull* & T. L. Freeman*†
Centre for Novel Computing & Department of Mathematics,
University of Manchester, Manchester, M13 9PL, U.K.

August 15, 1994

* Both authors acknowledge the support of the EEC Esprit Basic Research Action Programme, Project 6634 (APPARC).
† The second author acknowledges the support of the NATO Collaborative Research Grant 920037.

Abstract

We address the problem of implementing globally adaptive algorithms for multi-dimensional numerical integration on parallel computers. By adapting and extending algorithms which we have developed for one-dimensional quadrature, we develop algorithms for the multi-dimensional case. The algorithms are targeted at the latest generation of parallel computers, and are therefore independent of the network topology. Numerical results on a Kendall Square Research KSR-1 are reported.

KEY WORDS: Numerical integration, multi-dimensional quadrature (cubature), parallel algorithms, globally adaptive algorithms.

1 Introduction

The problem considered in this paper is the approximation of the definite multi-dimensional integral

    I = \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_n}^{b_n} f(x_1, x_2, \ldots, x_n) \, dx_1 \, dx_2 \cdots dx_n

to some absolute accuracy ε. We are particularly interested in implementing globally adaptive algorithms on parallel computers. In previous papers ([2] and [3]) we have investigated parallel globally adaptive quadrature algorithms for one-dimensional problems, and here we seek to extend and adapt our ideas to the multi-dimensional case.

In recent years a number of authors have considered parallel algorithms for numerical integration. Some suggest algorithms that impose an initial static partitioning of the region (interval) of integration and treat the resulting subproblems as independent, and therefore capable of concurrent solution. These algorithms often include a mechanism for detecting load imbalance and redistributing work to neighbouring processors; see, for example, [1], [4], [18] for the one-dimensional case, and [5], [6], [16] for the multi-dimensional case.

In the light of recent developments in communication networks, which have given rise to parallel architectures in which communication latencies are no longer strongly dependent on the distance (number of links) between communicating processors, it no longer seems appropriate to design algorithms which are restricted to neighbour-to-neighbour communication. Indeed, in a machine such as the KSR-1, there is no notion of neighbourhood in the programming model (see Section 4). Another recent trend in parallel machines is the emergence of single address space programming models for distributed memory machines, as typified by the KSR-1. The model is described in more detail in Section 4.1, and its exploitation is discussed in Section 4.2. However, we also discuss the implementation of our algorithms in the more traditional message-passing and shared-memory frameworks in Section 4.3.

In [10] Genz describes a parallel adaptive quadrature algorithm (for multi-dimensional problems) that does not depend on neighbour-to-neighbour communication. In [2] and [3] we develop similar parallel algorithms for one-dimensional quadrature based on the routine D01AKF in the NAG library [19], and we find that one of these algorithms regularly outperforms Genz's algorithm. In this paper we extend the ideas of [2] and [3] to cubature over a hyper-rectangle. The restriction to a hyper-rectangle is not unreasonable, since a problem over a convex domain can often be transformed to a problem over a hyper-rectangle. We base our multi-dimensional algorithms on the routine D01FCF in the NAG library. The underlying cubature rule pair is due to Genz and Malik [12] (see also [11]); it requires 2^d + 2d^2 + 2d + 1 evaluations of the integrand to estimate an integral over a d-dimensional hyper-rectangle. The routine also returns an estimate of the error in the approximation, and the dimension of the hyper-rectangle in which the integrand is most badly-behaved, based on fourth divided differences of the integrand. In Section 2 we review the parallel algorithms that we developed for one-dimensional quadrature in [2], [3], and consider the ways in which they can be adapted for multi-dimensional problems. One of the algorithms (the DS algorithm) requires a strategy for selecting intervals (hyper-rectangles) for further subdivision. Section 3 describes some possible strategies for multi-dimensional problems. In Section 4 we describe the implementation of these parallel algorithms, and numerical results for the KSR-1 are presented in Section 5.

2 Parallel Algorithms

Firstly we describe the parallel algorithms for one-dimensional quadrature that were implemented in [3]. There are two levels at which parallelism can be exploited in globally adaptive quadrature. Each application of the quadrature rule pair to an interval requires a number of evaluations of the integrand which can be executed in parallel; this fine-grain parallelism is well-suited to SIMD machines and vector processors: Genz [9] considers the approximation of multiple integrals on an ICL DAP and Mascagni [17] considers the same problem on a Connection Machine; Gladwell [13] considers the vectorisation of the NAG routine D01AKF on a CRAY-1. However, this parallelism may be too fine-grained (depending, of course, on the cost of integrand evaluation) for efficient implementation on a MIMD machine.
It is limited in extent by the number of points in the quadrature rule pair, so that the number of processors that could be exploited effectively is also limited.
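As a quick check of this limit (our illustration, not from the paper), the Genz-Malik point count 2^d + 2d^2 + 2d + 1 grows only slowly with the dimension d; the values below match the 17 and 57 evaluations per region quoted for the two- and four-dimensional problems of Section 5.

    def genz_malik_points(d):
        # Number of integrand evaluations used by the Genz-Malik rule
        # pair on a d-dimensional hyper-rectangle.
        return 2**d + 2*d*d + 2*d + 1

    for d in range(1, 6):
        print(d, genz_malik_points(d))   # d = 2 -> 17, d = 4 -> 57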

For these reasons it is worth seeking to exploit the coarser-grained parallelism that results from concurrent applications of the quadrature rule pairs to different intervals. The standard sequential algorithm that is used in the NAG routines has very limited parallelism at this level, as only one interval is selected for bisection at any stage. In [3] we describe two algorithms that exploit the coarser-grained parallelism by identifying a number of intervals to be bisected, and that also attempt to overcome some of the deficiencies of the coarse-grained parallel algorithm developed by Genz in [10]. In the first of these algorithms we allow (pairs of) processors to proceed asynchronously: each pair of processors bisects the interval with the largest error estimate that is not currently being bisected by any other pair of processors. This algorithm is referred to as the Dynamic Asynchronous (DA) algorithm and is described in pseudocode as follows (throughout we assume that the parallel machine has p processors):

Algorithm DA:
    p-sect interval [a, b]
    do all
        apply quadrature rule to subintervals
    end do all
    mark all intervals inactive
    in parallel, each pair of processors executes:
    do while (error > ε)
        find interval with largest error estimate which is inactive
        mark it active
        bisect it
        apply quadrature rules to both subintervals in parallel
        remove `old' interval from list
        add two `new' intervals to list
        mark them inactive
        update the approximation to the integral and the error estimate
    end do
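The following Python sketch (ours; the authors' implementation is in Fortran with KSR extensions) illustrates the control structure of the DA algorithm: a shared interval list protected by a lock, with each worker claiming the inactive interval of largest error estimate. For brevity, one thread stands in for each pair of processors and a Simpson/trapezium pair stands in for the quadrature rule pair; CPython threads give no real speedup, so this is a sketch of the logic only.

    import threading

    def apply_rule(f, a, b):
        # Stand-in for a quadrature rule pair: Simpson's rule, with the
        # trapezium rule used to form a crude error estimate.
        m = 0.5 * (a + b)
        simpson = (b - a) * (f(a) + 4.0 * f(m) + f(b)) / 6.0
        trapezium = 0.5 * (b - a) * (f(a) + f(b))
        return {'a': a, 'b': b, 'estimate': simpson,
                'error': abs(simpson - trapezium), 'active': False}

    def da_integrate(f, a, b, eps, p=4):
        # Dynamic Asynchronous (DA) algorithm: p workers share one list of
        # intervals, protected by a mutex lock (cf. Section 4.2.1).
        lock = threading.Lock()
        h = (b - a) / p
        intervals = [apply_rule(f, a + i * h, a + (i + 1) * h)
                     for i in range(p)]                  # initial p-section
        state = {'integral': sum(iv['estimate'] for iv in intervals),
                 'error': sum(iv['error'] for iv in intervals)}

        def worker():
            while True:
                with lock:
                    if state['error'] <= eps:
                        return
                    free = [iv for iv in intervals if not iv['active']]
                    if not free:
                        continue                         # all claimed; retry
                    worst = max(free, key=lambda iv: iv['error'])
                    worst['active'] = True               # claim the interval
                m = 0.5 * (worst['a'] + worst['b'])
                halves = [apply_rule(f, worst['a'], m),  # bisect and apply
                          apply_rule(f, m, worst['b'])]  # the rule pair
                with lock:                               # update shared state
                    intervals.remove(worst)
                    intervals.extend(halves)
                    state['integral'] += (sum(s['estimate'] for s in halves)
                                          - worst['estimate'])
                    state['error'] += (sum(s['error'] for s in halves)
                                       - worst['error'])

        threads = [threading.Thread(target=worker) for _ in range(p)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return state['integral'], state['error']

For example, da_integrate(lambda x: x**2, 0.0, 1.0, 1e-6) returns a value close to 1/3, together with the final error estimate.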

The second algorithm, which we refer to as the Dynamic Synchronous (DS) algorithm, retains the synchronisation points of Genz's algorithm, but at each stage attempts to identify the intervals which are likely to require further subdivision, and to subdivide them in such a way as to keep all processors usefully busy. (Genz's algorithm, which can be written as a particular example of the DS algorithm, always identifies p/2 intervals for subdivision at each stage.) If the number of intervals m selected for further subdivision is greater than or equal to p, all are bisected and the work is divided amongst the processors. If m is less than p, some intervals are divided into more than two pieces, allowing more than one processor to work on an interval. In addition, in order to generate a balanced load (so that the number of subintervals generated is divisible by p), an even finer subdivision is applied to some of the intervals.

Algorithm DS:
    p-sect interval [a, b]
    do all
        apply quadrature rule to subintervals
    end do all
    do while (error > ε)
        Interval Selection
        if (m ≥ p) then
            bisect each interval
        else if (p/2 < m < p) then
            ⌊2p/m⌋-sect each interval
        else if (m ≤ p/2) then
            ⌊p/m⌋-sect each interval
        end if
        do all
            apply quadrature rule to subintervals
        end do all
        remove `old' intervals from list
        add `new' subintervals to list
        update integral approximation and error estimate
    end do

We could extend these algorithms to cubature over a hyper-rectangle simply by changing the underlying rule. (Henceforth we will refer to a hyper-rectangle as a region.) We should, however, take some care as to how the initial p-secting of the region is performed. In one dimension there is only one way to do this, but in several dimensions there is some choice. We would like to divide the region in the dimensions in which the integrand is least well-behaved. On the initial stage the simplest way to obtain the necessary information is to apply the cubature rule to the entire region, since this provides the required divided difference estimates. The following heuristic is then used to decide the dimensions in which to divide. Suppose that the largest fourth divided difference is d_max, and that n_τ is the number of dimensions in which the fourth divided difference is larger than τ d_max, for some 0 < τ < 1. We subdivide each of the k = min(n_τ, ⌊log_2 p⌋) dimensions in which the fourth divided differences are largest into l pieces, where l is the greatest integer such that l^k ≤ p. To make the total number of subregions equal to p (if l^k < p), we further bisect some of these new subregions in the worst dimension. We use τ = 0.01 in the experiments of Section 5.
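A minimal sketch of this initial-subdivision heuristic (ours; `diffs` is assumed to hold the fourth divided differences, one per dimension, returned by applying the cubature rule to the whole region):

    from math import floor, log2

    def initial_subdivision(diffs, p, tau=0.01):
        # Choose the dimensions and number of pieces for the initial
        # p-section of a region, following the heuristic of Section 2.
        d_max = max(diffs)
        n_tau = sum(1 for d in diffs if d > tau * d_max)
        k = min(n_tau, floor(log2(p)))
        l = 1                            # greatest l with l**k <= p
        while (l + 1) ** k <= p:
            l += 1
        # indices of the k worst-behaved dimensions
        dims = sorted(range(len(diffs)), key=lambda i: -diffs[i])[:k]
        return dims, l

    # e.g. p = 8 processors, 4 dimensions:
    print(initial_subdivision([0.9, 0.5, 0.008, 0.002], 8))
    # -> ([0, 1], 2); since 2**2 = 4 < 8, some of the four subregions
    # are then bisected further in the worst dimension.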

In the next section we present a variety of interval selection strategies which could be used in this multi-dimensional DS algorithm.

3 Interval Selection Strategies for Algorithm DS

In [3] we reported the results of using a variety of different interval selection strategies for one-dimensional quadrature. Here we will consider all these strategies except the one based on local error estimates, which gave very poor performance. We also include a new selection strategy, which is an extension of the one suggested by Gladwell [13]. The interval selection strategies we consider are the following:

• Strategy 1 (Bull and Freeman)
  This strategy was suggested in [2].
  1. Search the list of intervals for the largest error estimate, say E_max, and
  2. identify all the intervals with error estimates > θE_max, for some θ satisfying 0 < θ < 1.

  At each stage of the algorithm Strategy 1 requires two passes through the list of intervals, one to identify the interval with the largest error estimate and a second to find those with error estimates within the fraction θ of the largest. In [3] it is reported that, for one-dimensional problems using a 30-point Gauss and 61-point Kronrod rule pair, this strategy is insensitive to the choice of θ in the range [10^{-5}, 10^{-1}]. However, in the multi-dimensional case we find that the strategy is rather more sensitive to the choice of θ, and a value of θ = 10^{-1} seems to give the best results. This difference in performance can be ascribed to the fact that the multi-dimensional rule has a much lower degree of precision than the one-dimensional rule (seven, as opposed to 91), and thus the spread of error estimates corresponding to intervals in the list tends to be much narrower in the multi-dimensional case.

• Strategy 2 (average tolerance)
  This is another simple strategy; the intervals for subdivision are identified as those with error estimates greater than a fraction of the global error target, with the fraction decreasing as the algorithm progresses. A natural way of achieving this is to take the reciprocal of the current number of intervals as the fraction, since this guarantees that at least one interval is always selected.
  1. Assume that currently there are s intervals in the list of intervals,
  2. select all intervals which have error estimates ≥ ε/s.

  Notice that this strategy requires just one pass through the list of intervals at each stage.

• Strategy 3 (Gladwell)
  This strategy was suggested by Gladwell in [13].
  1. Rank the list of intervals by error estimate so that ε_1 < ε_2 < ... < ε_s,
  2. calculate r such that

         \sum_{i=1}^{r-1} \epsilon_i < \epsilon  and  \sum_{i=1}^{r} \epsilon_i \geq \epsilon,

  3. identify all the intervals with error estimates ≥ ε_r.

  Notice that if we restrict all interval subdivisions to be bisections, this strategy selects only those intervals that would be selected by a sequential globally adaptive algorithm which identifies the interval with the largest error estimate at each stage. This strategy is the most expensive, since it requires the maintenance of an ordered list.

• Strategy 4 (cheap Gladwell)
  As we will see from the results in Section 5, Strategy 3 is very effective for multi-dimensional integration, but it is computationally expensive. Strategy 4 is an attempt to obtain a similar selection whilst removing the need to maintain an ordered list; a sketch in code is given after this list.
  1. Search the list of intervals for the largest and smallest error estimates (E_max and E_min).
  2. Divide the range [E_min, E_max] into b exponentially spaced sub-ranges. The i-th sub-range, i = 1, 2, ..., b, is given by

         [\exp(\log E_{min} + (i-1)l), \exp(\log E_{min} + il)],

     where l = (\log E_{max} - \log E_{min})/b.
  3. Determine the number of intervals n_i whose error estimates lie in the i-th sub-range.
  4. Find r such that

         \sum_{i=1}^{r-1} M_i n_i < \epsilon  and  \sum_{i=1}^{r} M_i n_i \geq \epsilon,

     where M_i = \exp(\log E_{min} + (i - 1/2)l) is the exponential midpoint of the i-th sub-range.
  5. Select all intervals with error estimates greater than \exp(\log E_{min} + (r-1)l), that is, all intervals whose error estimates lie in sub-ranges r, r+1, ..., b.

  This strategy requires three passes through the list of intervals: one to find E_max and E_min, one to determine n_i, i = 1, 2, ..., b, and a final one to find the indices of the selected intervals. In the results of Section 5 we use b = 50 for this strategy.

• Strategy 5 (Genz)
  Genz's algorithm [10] corresponds to the DS algorithm with the following interval selection strategy:
  1. Search the list of intervals to identify the p/2 intervals with largest error estimates.
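As Strategy 4 is the most intricate of the five, here is a sketch of it (ours, in Python, under the definitions above); `errors` holds the current error estimates and `eps` the global tolerance ε:

    from math import exp, log

    def strategy4_select(errors, eps, b=50):
        # Cheap Gladwell (Strategy 4): bin the error estimates into b
        # exponentially spaced sub-ranges of [E_min, E_max] and select
        # all intervals in sub-ranges r, r+1, ..., b.
        e_min, e_max = min(errors), max(errors)
        if e_min == e_max:                    # degenerate: select everything
            return list(range(len(errors)))
        l = (log(e_max) - log(e_min)) / b

        def bin_of(e):                        # sub-range index, 1..b
            return min(int((log(e) - log(e_min)) / l) + 1, b)

        n = [0] * (b + 1)                     # pass 2: counts per sub-range
        for e in errors:
            n[bin_of(e)] += 1
        # find r: smallest index whose cumulative sum of midpoint masses
        # M_i * n_i reaches the tolerance
        total, r = 0.0, b
        for i in range(1, b + 1):
            total += n[i] * exp(log(e_min) + (i - 0.5) * l)
            if total >= eps:
                r = i
                break
        threshold = exp(log(e_min) + (r - 1) * l)
        return [j for j, e in enumerate(errors) if e > threshold]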

4 Implementation

4.1 The Kendall Square KSR-1

We have implemented the algorithms described in Section 2 on the 32-processor Kendall Square Research KSR-1 computer installed at the University of Manchester. This is a virtual shared memory machine; it has physically distributed memories, but there is extensive hardware support which enables the programmer to view the memories of all the processors as a single address space. Each processor has a peak 64-bit floating point performance of 40 Mflop/s and 32 Mbytes of memory. The processors are connected by a uni-directional slotted ring network with a bandwidth of 1 Gbyte/s.

The memory system, called ALLCACHE, is a directory-based system which supports full cache coherency. Data movement is request driven; a memory read operation which cannot be satisfied by a processor's own memory generates a request which traverses the ring and returns a copy of the data item to the requesting processor; a memory write request which cannot be satisfied by a processor's own memory results in that processor obtaining exclusive ownership of the data item, and a message traverses the network invalidating all other copies of the item. The unit of data transfer in the system is a subpage, which consists of 128 bytes (16 8-byte words).

The machine has a Unix-compatible distributed operating system allowing multi-user operation. To obtain reliable execution times for the numerical experiments described in Section 5 we avoid using processors with special functions (such as I/O). This limits us to using no more than 28 processors.

Parallel programming is supported by extensions to Fortran consisting of directives and library calls, in a shared memory paradigm. Nested parallel constructs are permitted, and are used in the implementation of the algorithm DA.

4.2 Implementation on the KSR-1

4.2.1 DA algorithm

For the DA algorithm the list of regions is a shared data structure to which all processors have access. Out of every pair of processors, only one searches the list to determine which region to subdivide next. At any time only one processor can be allowed access to the list, either to search it or to update it. This synchronised access is controlled by a mutex lock (for more details see [15]). In practice the ALLCACHE memory system ensures that each processor will have a copy of the list in its own memory. Each time a processor searches the list, any subpage that has been updated by another processor will be found to be invalid, and a remote memory access is needed to bring the up-to-date copy of that subpage into the processor's memory. Thus only changes to the list are communicated, rather than the whole list.

4.2.2 DS algorithm

For the DS algorithm, in addition to the list of regions which is searched at each stage, we use a temporary list to store the new regions generated by the selection and subdivision procedures. It is advantageous to store all the data associated with a region on as few subpages as possible. In fact, for each region, two subpages are sufficient to store the end points, integral estimate, error estimate and worst-dimension data (largest fourth divided differences) for up to 14 dimensions.
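The arithmetic behind the two-subpage claim can be checked directly if we assume one 8-byte word per stored quantity (our assumption; the paper does not spell out the record layout):

    WORDS_PER_SUBPAGE = 16              # 128 bytes of 8-byte words

    def subpages_needed(d):
        # Words for a d-dimensional region record: 2d end points,
        # integral estimate, error estimate, worst-dimension data, and
        # one spare word (e.g. the stored logarithm of the error
        # estimate mentioned below).
        words = 2 * d + 4
        return -(-words // WORDS_PER_SUBPAGE)   # ceiling division

    print(subpages_needed(14))          # -> 2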

For Strategy 4 it is worth storing the logarithm of the error estimate in addition.

The main loop of the DS algorithm consists of four parts: region selection, region subdivision, application of the cubature rules, and updating of the list of regions. The KSR-1's memory system easily enables the subdivision of regions to be computed in parallel. Indeed, there is no need for synchronisation between computing the limits of the new subregions and applying the cubature rules, so that each processor can compute its limits in the same loop that contains the call to the cubature rule. Again, each processor only needs to make remote memory accesses for those regions in the list which have been updated by another processor within the last iteration of the main loop. Also, it is possible, and cost effective, to perform the updating of the lists in parallel. A processor inserts the subregions to which it has just applied the cubature rule into the list at the appropriate points, either replacing a region that has been subdivided, or at the end of the list. Thus the do while loop of Algorithm DS becomes:

    do while (error > ε)
        Region Selection
        do all
            compute subregion limits
            apply cubature rule to subregions
        end do all
        do all
            remove `old' regions from list
            add `new' subregions to list
            update integral approximation and error estimate
        end do all
    end do

Further, it is possible to parallelise the region selection strategies. For Strategies 1, 2 and 4 this is straightforward: the list can be divided into equal parts and the operations performed on the parts of the list concurrently. In Strategy 1 it is necessary to form a global maximum to determine the maximum error estimate. In Strategy 4 we need to find the maximum and minimum error estimates in the list, compute the sub-range in which each region lies, and then find the regions with the larger error estimates. This requires three synchronisation points. Strategy 3 is more difficult to parallelise, since it requires parallel sorting or parallel list insertion. However, in practice we find that the relatively high cost of synchronisation on the KSR-1 means that parallelisation of any of the selection strategies is not cost effective unless the list is very long. In the experiments of Section 5 the region selection is performed on a single processor.
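A minimal sketch (ours) of this loop structure, with region selection kept on a single processor as in the experiments of Section 5; `rule`, `select` and `subdivide` are hypothetical stand-ins for the Genz-Malik rule pair, a Section 3 selection strategy, and the region-splitting routine:

    from concurrent.futures import ThreadPoolExecutor

    def ds_main_loop(regions, rule, select, subdivide, eps, p):
        # One possible shape of the DS main loop: serial region
        # selection, parallel application of the cubature rule, then a
        # list update; regions are dicts with 'estimate' and 'error'.
        with ThreadPoolExecutor(max_workers=p) as pool:
            while sum(r['error'] for r in regions) > eps:
                chosen = select(regions, eps, p)   # a Section 3 strategy
                m = len(chosen)
                # number of pieces per region, as in Algorithm DS
                pieces = 2 if m >= p else (2 * p) // m if m > p // 2 else p // m
                limits = [s for r in chosen for s in subdivide(r, pieces)]
                new = list(pool.map(rule, limits))  # rule pair in parallel
                for r in chosen:                    # update the region list
                    regions.remove(r)
                regions.extend(new)
        return sum(r['estimate'] for r in regions)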

4.3 Implementation on other architectures

Both the DA and DS algorithms could readily be implemented on traditional shared-memory computers. Such machines generally have very fast synchronisation, so it may be worthwhile parallelising the region selection phase of the DS algorithm. The DS algorithm can also be implemented in a message-passing paradigm without much difficulty. However, it would be necessary for one `master' processor to hold the list of regions and for it to perform the region subdivision and list updating. It is likely that the expense of communicating error estimates would prevent effective parallelisation of the region selection. The master processor could either participate in the application of cubature rules, or it could simply send and receive results from the other processors and maintain the list of regions.

5 Results

5.1 Test Problems

We wish to design test problems that will expose differences between the different algorithms as clearly as possible. We have four choices to make in designing test problems: the number of dimensions, the integrand function, the tolerance and the length of time taken to evaluate the integrand. For the DS algorithm we have a number of choices for the region selection scheme. We wish to test both the effectiveness (the number of regions processed, and the number of times a selection step is required) and the efficiency (the execution time of each selection step) of the selection strategies. Therefore we concentrate on problems where the cost of integrand evaluation is moderate (if it is very expensive, the effectiveness of the region selection is the dominant factor in determining overall execution time; if it is very cheap, the efficiency is dominant). We also chose problems which require a significant number (several thousand) of regions to be processed; this tests the ability of the strategies to minimise the number of selection stages. We select four problems which exhibit a variety of integrand behaviour.

Problem 1

    \int_0^1 \int_0^1 \int_0^1 \int_0^1 \frac{4x_1 x_3^2 \exp(2x_1 x_3)}{(1 + x_2 + x_4)^2} \, dx_1 \, dx_2 \, dx_3 \, dx_4, \quad \epsilon = 10^{-7}

Problem 2

    \int_0^1 \int_0^1 \cos 50x_1 \cos 50x_2 \, dx_1 \, dx_2, \quad \epsilon = 10^{-6}

Problem 3

    \int_0^1 \int_0^1 (x_1 x_2)^{-0.99} \, dx_1 \, dx_2, \quad \epsilon = 10^{-9}

Problem 4

    \int_0^1 \int_0^1 \sum_{i=1}^{4} \left[ \left(x_1 - \frac{1}{\sqrt{i + 1/2}}\right)^2 + \left(x_2 - \frac{1}{\sqrt{i + 1/2}}\right)^2 \right]^{-0.5} \, dx_1 \, dx_2, \quad \epsilon = 10^{-6}

Problem 1 has no special features in the integrand. Problem 2 is a purely oscillatory problem, Problem 3 has a strong corner singularity, and Problem 4 has four internal point singularities. The evaluation times for the integrands are as follows: 7 μs for Problem 1, 23 μs for Problem 2, 49 μs for Problem 3, and 185 μs for Problem 4; however, we choose to add appropriate delays so that the evaluation times for the integrands become 100 μs for Problems 1–3 and 200 μs for Problem 4.
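For concreteness, the four integrands transcribed as Python functions (our transcription of the formulas above; tolerances shown as comments):

    from math import cos, exp, sqrt

    def f1(x1, x2, x3, x4):       # Problem 1, eps = 1e-7
        return 4.0 * x1 * x3**2 * exp(2.0 * x1 * x3) / (1.0 + x2 + x4)**2

    def f2(x1, x2):               # Problem 2 (oscillatory), eps = 1e-6
        return cos(50.0 * x1) * cos(50.0 * x2)

    def f3(x1, x2):               # Problem 3 (corner singularity), eps = 1e-9
        return (x1 * x2)**-0.99

    def f4(x1, x2):               # Problem 4 (point singularities), eps = 1e-6
        return sum(((x1 - 1.0 / sqrt(i + 0.5))**2 +
                    (x2 - 1.0 / sqrt(i + 0.5))**2)**-0.5
                   for i in range(1, 5))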

[Figure 1: Temporal performance on the KSR-1, Problem 1. The plot shows 1/execution time against number of processors for DS Strategies 1–5, the DA algorithm, D01FCF and the ideal line.]

Figures 1–4 show the results of applying the DA algorithm, and the DS algorithm with different region selection strategies, to Problems 1–4 respectively. Following the example of Hockney [14] we plot the temporal performance (the reciprocal of execution time) rather than speedup. Hockney argues that speedup can be a poor metric for comparing algorithms, unless the sequential time used as the basis for the comparisons represents an `optimal' sequential algorithm, whereas the temporal performance of a parallel algorithm is independent of the sequential algorithm. A temporal performance graph not only allows fair comparisons between algorithms, but also permits comparisons with other machines, while still retaining the visual impact of a speedup graph. Both execution time and speedup can be readily inferred from temporal performance if desired. The units of temporal performance are solutions per second. The `ideal' line represents the temporal performance of the NAG routine D01FCF (compiled with full optimisation and running on a single processor of the KSR-1) multiplied by the number of processors used (it corresponds to linear speed-up).

For Problem 1 we observe that the DS algorithm with selection strategies 1–4 performs much better than with selection strategy 5 (the Genz algorithm). Also, the DA algorithm performs reasonably well when p is small; as p increases beyond 14, however, performance rapidly deteriorates. This is because the computation becomes dominated by the list searching, which only one processor can do at a time. Of the versions of the DS algorithm, Strategy 2 consistently processes more regions than Strategies 1, 3 and 4. Strategy 3 processes the fewest regions, but the expense of the region selection means that it performs worse than Strategy 4. Strategy 1 gives rather mixed results, due to wide variations in the number of regions processed.

[Figure 2: Temporal performance on the KSR-1, Problem 2. The plot shows 1/execution time against number of processors for DS Strategies 1–5, the DA algorithm, D01FCF and the ideal line.]

The results for Problem 2 show an even wider disparity between the DS algorithm variants and the DA algorithm. This problem requires over 10^4 calls to the cubature routine, and whereas in Problem 1 the processing of each region required 57 evaluations of the integrand, in Problems 2–4 it requires only 17. The lower cost of applying the cubature rule and the length attained by the region list mean that the DA algorithm is always dominated by region selection. The DS algorithm with selection strategy 5 (the Genz algorithm) requires approximately 10^4/p region selection steps, whereas the other selection strategies require only 10 to 20 steps, depending on the particular selection strategy and the number of processors. Indeed, the DS algorithm sometimes gives better performance on one processor than the original NAG routine, and hence better than ideal performance. Comparing the selection strategies, we find that Strategy 4 is best for this problem. Strategy 1 gives mixed results; again the number of regions processed is unpredictable. Strategy 3 suffers because of the expense of maintaining a long ordered list.

On Problem 3 the DA algorithm again gives poor performance. For the DS algorithm, Strategy 3 and Strategy 5 (the Genz algorithm) are also relatively poor: for Strategy 3 not only is region selection expensive, but on this problem it requires more region selection stages than Strategies 1, 2 and 4. Strategy 2 requires about 50% more calls to the cubature routine than the other strategies. Strategies 1 and 4 give the best performance on this problem; there is little to choose between them, and Strategy 1 is more consistent in the number of regions it selects than for the other problems.

The results for Problem 4 are similar to those for Problem 3. The DA algorithm does slightly better, owing to the greater cost of evaluating the integrand, and Strategy 5 also performs slightly better.

[Figure 3: Temporal performance on the KSR-1, Problem 3. The plot shows 1/execution time against number of processors for DS Strategies 1–5, the DA algorithm, D01FCF and the ideal line.]

Strategy 1 is a little less consistent in its choice of regions, and Strategy 3 performs somewhat better than on Problem 3 because of the costlier integrand evaluations, and because the number of selection steps is similar to that of the other strategies.

6 Conclusions

We have adapted two algorithms which we developed for one-dimensional problems to the multi-dimensional case. We find that the DA algorithm performs significantly less well for multi-dimensional problems. This is due to two factors: in the multi-dimensional case the number of integrand evaluations per region is smaller (for a small number of dimensions) than the 91 evaluations of the Gauss-Kronrod pair that we used in one dimension, and the length of the list of regions tends to be markedly longer for the multi-dimensional problems.

The DS algorithm is far more successful, though the behaviour of the different selection strategies is sometimes different from the one-dimensional case. Strategy 1 was very successful in one dimension. It is less successful in many dimensions, as it is more sensitive to the parameter θ, and the total number of regions processed is more unpredictable. Strategy 2 processes more intervals than the other strategies in one dimension, and in many dimensions this effect is amplified. However, it often requires fewer selection steps, and the selection strategy is computationally cheaper, than the other strategies. It could therefore be useful in a case where evaluation of the integrand is very cheap, but many regions have to be processed.

As in one dimension, Strategy 3 is both the most expensive, and selects the fewest regions.

[Figure 4: Temporal performance on the KSR-1, Problem 4. The plot shows 1/execution time against number of processors for DS Strategies 1–5, the DA algorithm, D01FCF and the ideal line.]

It may therefore be the best strategy when integrand evaluation is very expensive. For the problems we tested, however, Strategy 4 is clearly the most successful. The number of regions selected is normally only slightly higher than by Strategy 3, but the selection strategy is much cheaper to execute, and the algorithm often requires fewer selection steps than Strategy 3. We expect this strategy to give the best results over a wide range of multi-dimensional problems.

The implementation of the algorithms described in this paper was made much easier by our access to the virtual shared memory KSR-1. The machine is scalable, like a local memory (message-passing) machine, yet is almost as straightforward to exploit as a true shared-memory machine; however, in contrast to a true shared-memory machine, the user of the KSR-1 must bear in mind the expense of data communication.

Acknowledgements

The authors acknowledge Professor Ian Gladwell for his interest in this work and for stimulating discussions on the design of interval selection strategies.

References

[1] J. Berntsen and T. O. Espelid, (1988) A Parallel Global Adaptive Quadrature Algorithm for Hypercubes, Parallel Computing, 8, pp. 313–323.

[2] J. M. Bull and T. L. Freeman, (1993) Parallel Globally Adaptive Quadrature on the KSR-1, N.A. Report No. 228, Department of Mathematics, University of Manchester; submitted to Advances in Computational Mathematics.

[3] J. M. Bull and T. L. Freeman, (1993) Parallel Algorithms and Interval Selection Strategies for Globally Adaptive Quadrature, CNC Tech. Report No. CNC/1993/029, Department of Computer Science, University of Manchester; submitted to PARLE '94.

[4] K. Burrage, (1990) An Adaptive Numerical Integration Code for a Chain of Transputers, Parallel Computing, 16, pp. 305–312.

[5] E. de Doncker and J. Kapenga, (1990) Parallel Systems and Adaptive Integration, Contemporary Mathematics, 115, pp. 33–51.

[6] E. de Doncker and J. Kapenga, (1992) Parallel Cubature on Loosely Coupled Systems, pp. 317–327 of [7].

[7] T. O. Espelid and A. Genz (eds.), (1992) Numerical Integration, Kluwer Academic Publishers, Dordrecht.

[8] G. Fairweather and P. M. Keast (eds.), (1987) Numerical Integration. Recent Developments, Software and Applications, NATO ASI Series C203, D. Reidel, Dordrecht.

[9] A. Genz, (1982) Numerical Multiple Integration on Parallel Computers, Computer Physics Communications, 26, pp. 349–352.

[10] A. Genz, (1987) The Numerical Evaluation of Multiple Integrals on Parallel Computers, pp. 219–229 of [8].

[11] A. Genz, (1990) Subregion Adaptive Algorithms for Multiple Integrals, Contemporary Mathematics, 115, pp. 23–31.

[12] A. C. Genz and A. A. Malik, (1980) Remarks on Algorithm 006: An Adaptive Algorithm for Numerical Integration over an N-dimensional Rectangular Region, J. Comput. Appl. Math., 6, pp. 295–302.

[13] I. Gladwell, (1987) Vectorisation of One Dimensional Quadrature Codes, pp. 230–238 of [8].

[14] R. Hockney, (1992) A Framework for Benchmark Performance Analysis, Supercomputer, 48, pp. 9–22.

[15] K.S.R., (1991) KSR Fortran Programming, Kendall Square Research, Waltham, Mass.

[16] M. Lapegna and A. D'Alessio, (1993) A Scalable Parallel Algorithm for the Adaptive Multidimensional Quadrature, pp. 933–936 of [21].

[17] M. Mascagni, (1990) High-Dimensional Numerical Integration and Massively Parallel Computing, Contemporary Mathematics, 115, pp. 53–73.

[18] V. A. Miller and G. J. Davis, (1992) Adaptive Quadrature on a Message-Passing Multiprocessor, Journal of Parallel and Distributed Computing, 14, pp. 417–425.

[19] N.A.G., (1991) N.A.G. Fortran Library Manual, Mark 15, N.A.G. Ltd., Oxford.

[20] R. Piessens, E. de Doncker, C. Überhuber and D. Kahaner, (1983) QUADPACK, A Subroutine Package for Automatic Integration, Springer-Verlag, New York.

[21] R. F. Sincovec, D. E. Keyes, M. R. Leuze, L. R. Petzold and D. A. Reed (eds.), (1993) Proceedings of the Sixth SIAM Conference on Parallel Processing, SIAM, Philadelphia.
