
Parallel Computation of the MP2 Energy on Distributed Memory Computers

ANTONIO M. MÁRQUEZ* Departamento de Química-Física, Facultad de Química, Universidad de Sevilla, E-41012 Sevilla, Spain

MICHEL DUPUIS IBM Corporation, Department MLMA/MS 428, Neighborhood Road, Kingston, New York 12401

Received 13 December 1993; accepted 12 July 1994

A parallel distributed implementation of the second-order Møller-Plesset perturbation theory method, widely used in quantum chemistry, is presented. The parallelization strategy and the performance of the HONDO quantum chemistry program running on a network of Unix computers are also discussed. Superlinear speedups are obtained through a combined use of the CPU and memory of the different processors. Performance of the standard and direct algorithms is presented and discussed. A superdirect algorithm that eliminates the communication bottleneck during the integral transformation step is also proposed. © 1995 by John Wiley & Sons, Inc.

Introduction

The starting point for essentially all methods of ab initio molecular electronic structure theory is the Hartree-Fock or self-consistent field (SCF) method. Although it has long been recognized that, in many instances, it is necessary to go beyond the Hartree-Fock approximation and include the effects of electron correlation,1 the SCF method continues to be of immense value in its own right. The Møller-Plesset perturbation theory

*Author to whom all correspondence should be addressed.

methodology carried out to second order has proven to be an efficacious means of accounting for electron correlation and yields significantly improved computational results.2 The current trend is toward applying both methods to increasingly larger molecular systems.

This results in a dramatic increase in the demand for computer resources, including CPU time and input/output time as well as main memory and disk storage, because the computationally intensive steps in quantum chemistry programs scale nonlinearly with the size of the system. This implies that much effort is necessary to ensure that quantum chemistry applications make a definite real impact in the study of such large systems. The two to three orders of magnitude in computational power provided by the future teraflop Massively Parallel Processing (MPP) computers must be adequately combined with advances in the theoretical methods and innovation in the computational algorithms.

Parallelization of electronic structure codes began in the early 1980s, with the work of Clementi and co-workers3 on the loosely coupled array of processors (LCAP) at IBM Kingston. The integral and SCF modules of the IBMOL program were parallelized at the beginning of the LCAP experiment. Later, a similar strategy was adopted by Dupuis and co-workers in parallelizing the integral, SCF, and integral derivative modules4 and the MP2 code of the HONDO program,5 on the same LCAP system. However, the LCAP system was always an experimental computer, not available to the general scientific community.

A network of general-purpose computer systems interconnected by existing network and support services is an attractive and viable alternative to special-purpose hardware multiprocessors. Such systems provide a general-purpose programming environment. Their programming paradigm is based on procedure-call access to system facilities, limited interprocess communication within a machine, and network services built on unreliable data delivery.

Parallelization of SCF, CI, and coupled cluster programs is a field of increasing research effort in many groups.7-10 For the purpose of programming a parallel distributed implementation of the HONDO program for a network of computers, the PVM programming environment was chosen. PVM is a widely available and general-purpose concurrent computing system from ORNL11 that permits a network of heterogeneous Unix computers to be used as a single parallel computer. Application programs view PVM as a general and flexible parallel computing resource that supports a message-passing model of computation. Thus large computational problems can be solved by using the aggregate power of many computers.

Early work on the implementation of a parallel version of the HONDO program system in the PVM parallel environment began with the parallelization of the integral, SCF, integral derivative, and SCF (hyper)polarizability codes on a network of IBM RS6000 workstations.6

In this article we describe the parallelization of the SCF and MP2 energy modules of the program using the PVM system and its performance on representative samples. We describe the sequential algorithm used in the HONDO program for the computation of the MP2 energy and show the parallelism implicit in the basic equations. Then we discuss the parallel implementation of the computationally demanding steps and the expected performance. Finally, we present results on representative sample tests for the different algorithms implemented.

The Sequential Algorithm

After the computation of the SCF energy and wave function, the first step is the transformation of the two-electron repulsion integrals from the atomic orbital (AO) basis to the molecular orbital (MO) basis. In this program, the integral transformation is done without a presorting step; that is, the unordered list of AO integrals is taken and the first index transformation is done as follows:

(μν|λj) = Σ_σ (μν|λσ) C_σj    (1)

The loop over integrals runs only over the petite list of symmetry-distinct two-electron integrals. In the standard (disk-based) program, the nonzero two-electron integrals are read from disk as they were stored for the SCF program, without a presorting step.
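As an illustration of eq. (1), the following Python sketch performs the first index transformation starting from an unordered stream of nonzero AO integrals. It is not the HONDO code: numpy is used for brevity, the μν pair is not folded into a triangular combined index, and the handling of the petite (symmetry-distinct) list is omitted, so the input is assumed here to be the full list of nonzero integrals.

```python
import numpy as np

def quarter_transform(ao_integrals, C_occ, n):
    """First index transformation, eq. (1): (mu nu|lam j) = sum_sigma (mu nu|lam sig) C[sig, j].

    ao_integrals : iterable of (mu, nu, lam, sig, value) tuples, in arbitrary order,
                   assumed to contain every nonzero AO integral explicitly
                   (permutational/point-group symmetry handling is omitted).
    C_occ        : (n, nocc) array of occupied MO coefficients.
    """
    nocc = C_occ.shape[1]
    half = np.zeros((n, n, n, nocc))          # (mu nu|lam j); production code folds mu,nu
    for mu, nu, lam, sig, val in ao_integrals:
        half[mu, nu, lam, :] += val * C_occ[sig, :]
    return half
```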

However, for large molecules, the quartic storage of the two-electron integrals will impose the limit on the size of the calculation that can be performed using standard methods, and a direct approach should be used instead. Direct methods, originally introduced by Almlöf et al.12 for the SCF and MP2 levels of theory,13-15 constitute a major contribution toward the extension of those methods to large molecular systems.



Our direct MP2 code uses essentially the same algorithm as our standard code. Only symmetry-distinct two-electron AO integrals are computed, and the one-index transformation as expressed in eq. (1) is performed only for the nonvanishing ones among those integrals. This scheme allows us to exploit both sparsity and symmetry in the two-electron integral list, resulting in important reductions of the computation for extended and/or symmetric systems. This approach is similar to the one described by Almlöf et al.,13 except that we prefer to complete the first index transformation before transforming the rest of the indexes. This requires extra core storage with respect to their approach but allows us to use fast matrix-matrix multiplication routines in those steps.

At this point we consider the memory requirements to perform this step. To hold all (μν|λj) integrals in memory, we need n²(n+1)n_occ/2 words of memory, because the μν indexes are treated as a combined index. If less memory is available, the program proceeds in batches, transforming as many occupied orbitals (j) as possible for the given memory. The minimum memory required is thus n²(n+1)/2 words, if only one MO can be transformed in each batch. This does not seriously limit the range of application of our program, because molecular systems with up to 200 basis functions can be treated in memory sizes up to 32 Mb, available on most workstation configurations.
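A quick back-of-the-envelope check of these numbers (our own illustration, assuming 8-byte words) is:

```python
def words_per_occupied_mo(n):
    # (mu nu|lam j) for a single occupied orbital j, with the mu-nu pair stored
    # as a triangular combined index: n(n+1)/2 pairs times n values of lambda.
    return n * (n + 1) // 2 * n          # = n^2 (n + 1) / 2

n = 200                                   # basis functions
megabytes = words_per_occupied_mo(n) * 8 / 1024**2
print(f"{megabytes:.0f} MB per occupied orbital in the batch")   # about 31 MB
```

which is consistent with the statement that a 200-basis-function system fits within roughly 32 Mb when one occupied orbital is transformed per batch.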

After all (or some) of this first transformation has been accomplished, the rest of the indexes are transformed stepwise, to produce the full set (or a subset) of fully transformed integrals:

(μi|λj) = Σ_ν C_νi (μν|λj)

(ai|λj) = Σ_μ C_μa (μi|λj)

(ai|bj) = Σ_λ C_λb (ai|λj)    (2)
In this last step, point group symmetry can be effectively used to reduce the number of integrals to transform; only those for which Γ_a ⊗ Γ_i = Γ_b ⊗ Γ_j are transformed, where Γ_i refers to the irreducible representation of MO i. The rest of the integrals are zero by symmetry. Note that symmetry equivalence among the AO integrals is also used, according to Hollauer and Dupuis.16

All these operations are carried out in memory, in the space allocated for the one-index transformed integrals, without any integral being written to disk. In this way input/output is limited at most to the reading of the AO integrals from disk in the standard MP2 program, and no input/output is performed in the direct MP2 program. This is important because it is widely recognized that input/output seriously limits the performance of modern computers, because CPU performance has increased much more rapidly than external storage technology.

The transformed integrals, either the full (ai|bj) set or the subset of them for some j MOs, are not written to disk. Instead, they are used directly to compute the MP2 correction to the SCF energy or a contribution to this energy resulting from the available integrals. The following expression for the computation of the MP2 correction to the SCF energy,

E(2) = Σ_ij^occ Σ_ab^virt (ai|bj)[2(ai|bj) − (aj|bi)] / (ε_i + ε_j − ε_a − ε_b)    (3)

shows that the total MP2 correction to the SCF energy can be computed by adding contributions from transformed integrals that have the same last MO index (j). Thus it is possible to compute, for a given occupied orbital j, all of its contribution to the MP2 energy, independently of the computation of that from other occupied MOs. The last step in our sequential program is a direct implementation of the previous equation. If only some j MOs have been transformed, then the program goes back to the integral transformation program to process the next batch.
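A minimal Python sketch of this per-orbital accumulation (again for illustration only, with eq. (3) written out explicitly for a closed-shell reference) could look as follows; note that the exchange integral (aj|bi) equals (bi|aj) and is therefore available within the block of integrals whose last MO index is j.

```python
def mp2_contribution_of_j(ovov_j, j, eps_occ, eps_virt):
    """Contribution of one occupied orbital j to E(2), eq. (3).

    ovov_j : (nvirt, nocc, nvirt) array holding (a i|b j) for the fixed index j.
    """
    nvirt, nocc, _ = ovov_j.shape
    e2 = 0.0
    for i in range(nocc):
        for a in range(nvirt):
            for b in range(nvirt):
                aibj = ovov_j[a, i, b]
                ajbi = ovov_j[b, i, a]      # (a j|b i) = (b i|a j), same j block
                denom = eps_occ[i] + eps_occ[j] - eps_virt[a] - eps_virt[b]
                e2 += aibj * (2.0 * aibj - ajbi) / denom
    return e2       # summing this over all occupied j gives the total E(2)
```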

The Parallel Algorithm

In this section we describe the parallel implementation of the four most computationally demanding steps in the calculation of the MP2 energy: the two-electron integral evaluation, the Fock matrix construction, the two-electron integral transformation, and the evaluation of the MP2 correction to the SCF energy using eq. (3). The parallel implementation of the integral and SCF programs has been discussed elsewhere,6 and only a brief summary is given here.

TWO-ELECTRON INTEGRAL EVALUATION

In HONDO, as in other quantum chemical programs, the two-electron integral program is driven by loops over shells, irrespective of the details of how the blocks are computed. The evaluation of the integrals in one block is independent of that of other blocks, and the program can be easily parallelized by distributing the loop over shell blocks over the different processors. This technique produces a statistical distribution of the work over the processors and has been found to give excellent load balancing; consequently, the parallel efficiency is high.6 The scalability of the integral evaluation step up to a large number of processors has been demonstrated by Schmidt et al.10
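One simple way to realize such a distribution (a sketch of ours, not the actual HONDO scheme) is to flatten the canonical quadruple loop over symmetry-distinct shell quartets and assign every nproc-th block to a given process in round-robin fashion:

```python
def my_shell_quartets(nshell, my_id, nproc):
    """Yield the symmetry-distinct shell quartets (i >= j, ij >= kl) assigned to
    process my_id when blocks are dealt out round-robin over nproc processes."""
    count = 0
    for i in range(nshell):
        for j in range(i + 1):
            for k in range(i + 1):
                lmax = j if k == i else k
                for l in range(lmax + 1):
                    if count % nproc == my_id:
                        yield (i, j, k, l)
                    count += 1

# Example: distribute the quartets of 10 shells over 4 processes.
print(sum(1 for _ in my_shell_quartets(10, 0, 4)))
```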

FOCK MATRIX COMPUTATION

This step is also easily parallelized by having each processor compute a partial Fock matrix from its own integral sublist. The total Fock matrix is obtained by adding the contributions from each processor in a global add operation. In the standard implementation of the SCF method, each processor reads its own integral sublist from disk on each iteration and builds its partial Fock matrix; in the direct implementation, the computation of the partial Fock matrix on each node is done on the fly, as the integrals are computed on each iteration. The computation of the Fock matrix is asymptotically an O(n⁴) step, whereas diagonalization and the other steps in the SCF iterations are, at most, O(n³) steps. Experience shows that for reasonably large basis sets that are good candidates for parallelization, the performance of the program is not affected by the serial execution of those O(n³) steps, and no attempt has been made at further parallelization. This strategy uses the fully replicated data model, which requires that the nodes have enough memory to hold the full density and Fock matrices at the same time. With the current hardware trend of large memories in workstations, this is not a serious limitation.
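The pattern is easy to sketch. In the fragment below (an illustration only, not the HONDO/PVM code) mpi4py's Allreduce stands in for the global add, only the two-electron part of a closed-shell Fock matrix is formed, and permutational-symmetry bookkeeping for a symmetry-reduced integral list is omitted.

```python
from mpi4py import MPI
import numpy as np

def two_electron_fock(D, my_integral_sublist, n, comm=MPI.COMM_WORLD):
    """Replicated-data Fock build: each process contracts the (replicated) density
    matrix D with its own sublist of AO integrals, then the partial matrices are
    summed over all processes with a global add."""
    F_partial = np.zeros((n, n))
    for mu, nu, lam, sig, val in my_integral_sublist:
        F_partial[mu, nu] += 2.0 * D[lam, sig] * val     # Coulomb, F = h + 2J - K
        F_partial[mu, lam] -= D[nu, sig] * val           # exchange
    F_total = np.empty_like(F_partial)
    comm.Allreduce(F_partial, F_total, op=MPI.SUM)       # the "global add" step
    return F_total                                       # one-electron part added elsewhere
```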

TWO-ELECTRON INTEGRAL TRANSFORMATION

From the discussion of our serial integral transformation algorithm and the computation of the MP2 correction to the SCF energy, a simple way to parallelize the integral transformation is by distributing the loop over occupied j orbitals in eq. (1) over the available processors. Assuming that the number of processors (nproc) divides the number of occupied orbitals (nocc) exactly, each processor has nocc/nproc orbitals to transform in the first step. For processor number one, this step could be written as

loop over occupied orbitals (j = 1, nocc/nproc)

(μν|λj) = Σ_σ (μν|λσ) C_σj

(λσ|μj) = Σ_ν (μν|λσ) C_νj

(λσ|νj) = Σ_μ (μν|λσ) C_μj

end loop    (4)

However, because of the fourth-rank tensor nature of the integrals, the computation of a given (μν|λj) integral requires AO integrals (μν|λσ) with all possible values of σ, which can be distributed over different processors. Thus, each processor needs to process its own sublist of (μν|λσ) AO integrals, as well as those from the other processors. This is done by using a broadcast function: In sequence, each processor takes a buffer of nonzero integrals, broadcasts it to the other processors, and then reads nproc - 1 buffers from them, one buffer from each processor.
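The communication pattern can be sketched as follows (an illustration only; MPI broadcasts via mpi4py stand in for the PVM broadcast, and each process keeps only the occupied orbitals j assigned to it):

```python
from mpi4py import MPI
import numpy as np

def first_index_transform_parallel(my_buffers, C_occ, my_j_list, n, comm=MPI.COMM_WORLD):
    """Each process in turn broadcasts its buffers of nonzero AO integrals; every
    process folds each incoming buffer into the half-transformed integrals for
    its own subset of occupied orbitals (eq. 4, simplified)."""
    rank, nproc = comm.Get_rank(), comm.Get_size()
    half = np.zeros((n, n, n, len(my_j_list)))              # (mu nu|lam j), local j only
    for root in range(nproc):                               # one broadcast per process
        buffers = comm.bcast(my_buffers if root == rank else None, root=root)
        for buf in buffers:
            for mu, nu, lam, sig, val in buf:
                for jj, j in enumerate(my_j_list):
                    half[mu, nu, lam, jj] += val * C_occ[sig, j]
    return half
```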

The minimum memory requirements for this procedure are the same as discussed in the paragraph about the sequential algorithm. However, the fact that different processors are working on different occupied orbitals makes the global memory available to the program increase linearly with the number of processors. This is particularly important for calculations with large atomic basis sets. In most cases, the sequential program will need to process the AO integrals in many batches, because not enough memory is available to fit all (μν|λj) integrals. Increasing the number of processors will reduce the number of times that each processor needs to go through the AO integral list, making it possible to have more than 100% parallel efficiency in this step.

This statement can be rationalized as follows. To obtain the AO integral list, we have to perform an O(n⁴) step, whereas the transformation of eq. (1) itself is an O(n⁵) step. Thus the total time can be said to be

T1 = a·O(n⁴) + b·O(n⁵)    (5)

where a and b are constants that measure the relative importance of each operation on the total time. If the available memory allows for only one occupied orbital to be transformed in each batch, we can say that the total time will be

T2 = nocc·a·O(n⁴) + b·O(n⁵)    (6)

because the AO integral list will have to be read or computed nocc times, one per occupied MO; the O(n⁵) step is not affected. Having nproc processors to do the work, assuming that nproc divides nocc exactly and the same available memory per processor, the time to perform the same task is

T3 = (nocc/nproc)·a'·O(n⁴)/nproc + b·O(n⁵)/nproc    (7)

where the a' constant now incorporates interprocessor communication and other parallelization overhead. Indeed, each processor handles O(n⁴)/nproc integrals, and the j transformation is spread over nproc processors, resulting in a two-level parallelization of this step. In this case, the parallelization efficiency would be

%eff = [T_serial / (nproc·T_para)]·100 = [T2 / (nproc·T3)]·100

     = {[nocc·a·O(n⁴) + b·O(n⁵)] / [(nocc/nproc)·a'·O(n⁴) + b·O(n⁵)]}·100    (8)

In the standard MP2 implementation, in which the two-electron AO integrals are read from disk, the O(n⁵) step dominates the computation; that is, b is much bigger than either a or a', and the limiting efficiency for big basis sets will be 100%. In a direct MP2 implementation, the O(n⁴) step includes the repeated computation of the two-electron AO integrals, and this step determines the total time, resulting in increased efficiency with an increasing number of processors.
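Equations (6)-(8) are easy to evaluate numerically. The short script below is our own illustration; the constants a, a', and b are arbitrary and chosen only so that the O(n⁴) and O(n⁵) parts are of comparable size, roughly the direct MP2 regime, to show how the predicted efficiency climbs past 100% as processors are added.

```python
def predicted_efficiency(n, nocc, nproc, a, a_prime, b):
    """Parallel efficiency of the transformation step from eqs. (6)-(8), assuming
    memory for only one occupied orbital per batch on every process."""
    t_serial = nocc * a * n**4 + b * n**5                                    # eq. (6)
    t_parallel = (nocc / nproc) * a_prime * n**4 / nproc + b * n**5 / nproc  # eq. (7)
    return 100.0 * t_serial / (nproc * t_parallel)                           # eq. (8)

# Illustrative constants only (a' slightly above a to mimic communication overhead).
for p in (1, 2, 3, 4):
    eff = predicted_efficiency(n=180, nocc=30, nproc=p, a=6.0, a_prime=6.6, b=1.0)
    print(p, f"{eff:.0f}%")        # about 95%, 129%, 146%, 157%
```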

Implementation and Performance

Although not enforced, the programming paradigm implicit in the philosophy of the PVM system is host/node. In this paradigm, a master program (the host) starts other small programs (the nodes) to perform specific tasks. While the node programs are running, the host is usually in an idle state, waiting for results from the nodes in order to execute other, possibly serial, parts of the program. This was also the philosophy underlying the original LCAP system, although the terminology was somewhat different.3 Such an implementation would require many changes in the serial version of the program, related to the transfer of the necessary data (molecular geometry, basis sets, density matrices, etc.) between the nodes and the host. The logic of the program would be rather cumbersome, and problems associated with the maintenance of two different versions of the program would be created. To avoid this situation, one of the objectives of the parallel implementation was to keep the necessary changes to a minimum and to keep the routines added to the serial program few and compact. This has been accomplished so that changes in the serial part of the code do not affect its parallel version, and only a small number of subroutines, held together in a single FORTRAN module (pvm.f), are necessary to build the parallel version of the HONDO program. Another result of the implementation is that, apart from the communication routines, the implementation is independent of the parallel system used, because such details are hidden from the program.

Although we adapt the implementation to the host/node paradigm, there is only a single executable module that is host and node at the same time, and it performs different tasks only at selected points of the program, depending on its logical CPU identification (ID). In this way, the implementation is simple and easy to maintain. To clarify how the program works, Figure 1 shows schematically the structure of the computation of the SCF energy for both host and node.

When the user starts the host, the program determines, by reading the input file, how many processes are going to be used (nproc), enrolls into the PVM system, and starts nproc - 1 node programs. The node programs enroll in turn into the PVM system and wait to receive the input file from the host. The host reads the input file and, at the same time that it makes a working copy, sends it to the nodes. In this way, all processes have the necessary information about the molecule to perform the calculation. At this point, both host and nodes begin computation separately. They perform the necessary setup of data (initial guess orbitals, one-electron integrals, etc.) and compute independently their share of the two-electron repulsion integrals in INTGRL (the real name of the subroutine is different, but it does not matter for this discussion). Although some work is repeated in the setup step, no communication is necessary at this point, the structure of the program is kept simple, and the performance of the parallel code is not affected. After this task is completed, the SCF iterations begin.
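For illustration, a minimal Python/mpi4py sketch of this single-executable pattern is given below; MPI is used here purely as a stand-in for PVM enrollment and message passing, and the input data are reduced to a small dictionary.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
my_id, nproc = comm.Get_rank(), comm.Get_size()     # logical CPU ID; 0 plays the host

# The host alone reads the input; every node receives a copy, so that all
# processes hold the molecule/basis description before any computation starts.
input_data = {"molecule": "glycoluril", "basis": "6-31G"} if my_id == 0 else None
input_data = comm.bcast(input_data, root=0)

# Setup work (initial guess, one-electron integrals, ...) is simply repeated on
# every process: it is cheap and requires no communication.
print(f"process {my_id} of {nproc} set up for basis {input_data['basis']}")

# From here on, host and nodes execute the same code and diverge only at selected
# points (e.g., the host performs the global adds), keyed on my_id.
```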


FIGURE 1. Structure of the parallel computation of the SCF energy.

First, the one-electron density matrix is computed by each process (host and nodes), which then contracts its sublist of the two-electron integrals with this matrix to form an incomplete Fock matrix (routine FOCK). Then the node programs send their incomplete Fock matrices to the host, which adds all contributions together and sends the complete Fock matrix back to the nodes (routine LCPADD). The SCF energy is computed, convergence is checked, and a new SCF iteration begins, if necessary.

The structure of the computation of the MP2 correction to the SCF energy is schematically sketched in Figure 2. After the SCF iterations reach convergence, all processes have a copy of the SCF orbitals (the C matrix in Fig. 2), and they can begin the transformation of the two-electron repulsion integrals. The transformation of the first index is performed in routine ONETRF and requires the broadcast of the integrals in the AO basis to all processes. After this first index transformation has been finished, the (μν|λj) integrals are fully transformed to MO integrals (ai|bj) independently on each node (routine MP2TRF2), with no communication at this point.

FIGURE 2. Structure of the parallel computation of the MP2 energy.

This step is thus fully parallelized. Finally, the current batch of transformed MO integrals is used on each node to compute a partial contribution to the MP2 energy as in eq. (3), and all contributions are added together by the host (routine LCPADD). If necessary, depending on the available main memory, a new batch of MO integrals is processed, until all contributions to the MP2 correction to the energy have been computed.
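The overall batch loop of Figure 2 can be summarized in a few lines (a sketch only; the transformation and per-batch energy evaluation are passed in as callables, and an MPI reduction again stands in for the LCPADD global add):

```python
from mpi4py import MPI

def parallel_mp2_energy(my_batches, transform_batch, batch_energy, comm=MPI.COMM_WORLD):
    """Skeleton of the batched MP2 step: each process transforms its integrals
    locally (MP2TRF2-like, no communication) and accumulates a partial E(2) from
    eq. (3); the partial energies are finally summed on the host."""
    e2_partial = 0.0
    for batch in my_batches:                     # as many batches as memory requires
        mo_integrals = transform_batch(batch)    # purely local work on this process
        e2_partial += batch_energy(mo_integrals)
    return comm.reduce(e2_partial, op=MPI.SUM, root=0)   # total E(2) on the host
```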

As a first test sample, we ran MP2 calculations on the glycoluril urea monomer, whose molecular structure is depicted in Figure 3. In Tables I and II, timing results for a representative, small-sized calculation on this molecule are presented. A standard basis set of 6-31G type was used for this test, resulting in 102 atomic basis functions. Although the samples were run on a Convex C2 computer with plenty of memory, the program was limited to use 64 Mb per host/node, because this is a common configuration in most desktop and deskside workstation computers.

FIGURE 3. Molecular structure of glycoluril urea monomer.


TABLE I. Timing Results for Standard SCF Calculation.

Time in seconds (% efficiency)(a)

Number of processors    INTGRL        SCF(b)        Total SCF      Speedup
1                        98.62         31.36         758.80         1.00
2                        52.40 (94)    19.22 (82)    438.63 (86)    1.73
3                        35.57 (92)    14.32 (73)    314.68 (80)    2.41
4                        27.77 (89)    12.80 (61)    272.35 (69)    2.79

(a) Time in seconds on a C2 Convex computer.
(b) Last iteration. Twenty-two iterations were necessary for convergence.


Table I presents timing results for the SCF step of the computation of the MP2 energy. The number of processors ranged from one to four, and the efficiencies of parallelization for the computation of the two-electron repulsion integrals, the SCF iterations, and the total SCF step are given in parentheses. These results show that the efficiency of parallelization is high (94-89%) even for this small-sized sample. The speedup is nearly linear in this step, as can be seen graphically in Figure 4. The parallel efficiency of the SCF iterations is affected by the serial execution of the O(n³) steps, as previously discussed. The efficiency of parallelization drops to 61% for four processors, and the speedup is not as good as in the integral computation step. The speedup of the total SCF step is an average of both steps, because they dominate the computation.

Timing results for the two-electron integral transformation and the computation of the MP2 energy are presented in Table II; again, the efficiency of parallelization for the different steps is given in parentheses. The efficiency of parallelization of routine ONETRF (first index transformation) is affected by the overhead produced by the broadcast of the atomic integrals between the nodes; it comes down to only 54% with three nodes. Basically, for each node, some integrals that are read from disk in the sequential code are read from the network. This is a more expensive operation, and the result is a degraded performance for this step. The transformation of the three other indexes and the computation of the contribution to the MP2 energy show a high efficiency of parallelization, as expected (93-95%). Because it is the ONETRF step that dominates the computation, the speedup of the total MP2 step is only a little better than that for the SCF step, as can be seen from Tables I and II and graphically in Figures 4 and 5.

However, for larger molecular systems, the available disk space will impose the limit on the size of the calculation that can be performed using the standard techniques, and a direct approach should be used instead. There has recently been much work on extending the direct approach to the evaluation of MP2 energies13-15 and gradients.17 In essence, a direct code simply recomputes the two-electron repulsion integrals when needed, avoiding quartic disk space requirements at the expense of increased computational effort.

TABLE II. Timing Results for Standard MP2 Energy Calculation.

Time in seconds (% efficiency)(a)

Number of processors    ONETRF         MP2TRF2        Total MP2      SCF + MP2       Speedup
1                       237.32         196.02         436.20         1195.00          1.00
2                       160.93 (74)    103.12 (95)    266.98 (82)     705.62 (85)     1.69
3                       153.13 (52)     70.22 (93)    226.32 (64)     541.00 (74)     2.21
4                        77.80 (76)     51.37 (95)    132.07 (83)     404.42 (74)     2.95

(a) Time in seconds on a C2 Convex computer.


FIGURE 4. Speedup of sample tests for standard SCF calculation.

Direct methods have become highly competitive due to improvements in integral evaluation codes and the rapid development of CPU technology.

Timing results for a direct MP2 energy calculation are presented in Table III; the same sample molecular system as for Tables I and II has been used, but in this case a standard 6-31G** basis set has been employed, resulting in a basis of 180 atomic orbitals. Even for this medium-sized sample, the parallelization of the SCF step is encouraging (95-84%), and the O(n³) steps do not much affect the overall efficiency. As discussed, the parallel efficiency of the ONETRF step is larger than 100% due to the combined use of the main memory of the different processors. This is an aspect of parallelism that, to our knowledge, has not been previously used to improve efficiency in a quantum chemistry application.

FIGURE 5. Speedup of sample tests for standard MP2 calculation.

The parallel efficiency of the rest of the integral transformation and of the MP2 energy itself is nearly the same as in the standard approach and has not been included. The overall efficiency of the implementation can be qualified as excellent, because the speedups exceed the number of processors. The performance of the parallelization of the SCF, of ONETRF, and of the total run can be seen graphically in Figure 6.

As discussed, during the ONETRF step the different nodes have to exchange the two-electron repulsion integrals, and this broadcast of data can impose a serious load on the network. In places where the communication speed is not that of an FDDI line, it may be preferable to reduce the amount of data that needs to be exchanged, even at the expense of increased computation.

TABLE III. Timing Results for Direct SCF and MP2 Energy Calculation.

Time in minutes (% efficiency)(a)

Number of processors    SCF          ONETRF         Total MP2      Total          Speedup
1                       97.3         392.4          412.1          509.4           1.00
2                       51.2 (95)    159.8 (123)    167.9 (123)    219.0 (116)     2.32
3                       37.5 (87)     96.0 (136)    102.6 (134)    140.0 (121)     3.64
4                       28.7 (85)     62.9 (156)     68.4 (151)     97.4 (131)     5.23

(a) Time in minutes on a C2 Convex computer.


FIGURE 6. Speedup of sample tests for direct SCF and MP2 calculation.

Our last algorithm implements a superdirect approach to this problem, in the sense that each processor computes all the two-electron integrals, thus completely eliminating the broadcast problem. This implies that the O(n⁴) steps in ONETRF are not parallelized, but the transformation itself is still done in parallel, and we expect good performance from this implementation as well. In a sense, this code eliminates disk I/O as well as network I/O in the parallel computation, and this is why we call it superdirect.
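A schematic version of this superdirect first index transformation follows (our illustration only; the direct integral generator is passed in as a callable, and only the occupied orbitals assigned to this process are transformed):

```python
import numpy as np

def superdirect_first_index_transform(compute_all_integrals, C_occ, my_j_list, n):
    """Superdirect scheme: every process (re)computes the full AO integral list
    itself, so no integral buffers cross the network; the O(n^5) transformation
    work remains distributed because each process handles only its own j's.

    compute_all_integrals() -> iterable of (mu, nu, lam, sig, value) tuples.
    """
    half = np.zeros((n, n, n, len(my_j_list)))
    for mu, nu, lam, sig, val in compute_all_integrals():    # O(n^4), replicated work
        for jj, j in enumerate(my_j_list):
            half[mu, nu, lam, jj] += val * C_occ[sig, j]     # O(n^5)/nproc, parallel work
    return half
```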

Table IV shows timing results for a series of superdirect MP2 energy calculations on the same sample used for Table III. The performance of the SCF step is the same as in Table III and has not been included. The ONETRF step shows a good parallel efficiency, although in this case it cannot be larger than 100%, because there is only one level of parallelization in the algorithm. The results are presented graphically in Figure 7, where a nearly linear speedup with the number of processors is observed.

FIGURE 7. Speedup of sample tests for superdirect MP2 calculation.

Concluding Remarks

These preliminary results show that even for small and medium-size calculations, our parallel implementation of the computation of the MP2 energy in the HONDO program system performs extremely well. Two points need further work: the reduction of communication in the integral transformation and the reduction of the cubic memory requirement for this step. As presented, our code can be used on networks of a few workstations or on multiprocessor computers with a small number of CPUs. Progress is being made in writing a new code that will eliminate the cubic memory requirement and will be suitable for performing large-scale calculations on the next generation of MPP computers.

TABLE IV. Timing Results for Superdirect MP2 Energy Calculation.

Time in minutes (% efficiency)(a)

Number of processors    ONETRF         Total MP2      Total          Speedup
1                       392.4          412.1          509.4           1.00
2                       205.0 (96)     215.7 (96)     266.9 (95)      1.91
3                       143.4 (91)     151.5 (91)     189.0 (90)      2.70
4                       102.4 (96)     107.8 (96)     136.6 (93)      3.73

(a) Time in minutes on a C2 Convex computer.


Acknowledgment

A. Márquez wishes to thank the staff of the Centro Informático Científico de Andalucía for computational facilities. This work was financially supported by the Dirección General Científica y Técnica, Project No. PB92-0662.

References

1. H. F. Schaefer, The Electronic Structure of Atoms and Molecules: A Survey of Rigorous Quantum Mechanical Results, Addison-Wesley, Reading, MA, 1972.

2. J. A. Pople, J. S. Binkley, and R. Seeger, Int. J. Quantum Chem., 5, 280 (1984).

3. E. Clementi, S. Chin, G. Corongiu, J. Detrich, M. Dupuis, L. J. Evans, D. Folsom, D. Frye, G. C. Lie, D. Logan, D. Meck, and V. Sonnad, In Modern Techniques in Computational Chemistry, E. Clementi, Ed., ESCOM, Leiden (The Netherlands), 1991, pg. 1191; and references therein.

4. M. Dupuis and J. D. Watts, Theor. Chim. Acta, 71, 9 (1987); see also J. D. Watts, M. Dupuis, and H. O. Villar, IBM Technical Report KGN-78, IBM Corporation, Kingston, NY, 1986.

5. M. Dupuis, D. Spangler, and J. J. Wendolowski, NRCC Software Catalog, Vol. 1, Program Number QG01, 1980.

6. M. Dupuis, S. Chin, and A. Márquez, In Relativistic and Electron Correlation Effects in Molecules and Clusters, G. L. Malli, Ed., NATO ASI Series, Plenum Press, New York, 1994, pg. 107.

7. H. P. Lüthi and J. Almlöf, University of Minnesota Supercomputer Institute Research Report UMSI 91/249, Minneapolis, MN, 1992.

8. M. Schüler, T. Kovar, H. Lischka, R. Shepard, and R. J. Harrison, Theor. Chim. Acta, 84, 489 (1993).

9. S. Brode, H. Horn, M. Ehrig, D. Moldrup, J. E. Rice, and R. Ahlrichs, J. Comp. Chem., 14, 1142 (1993).

10. M. W. Schmidt, K. K. Baldridge, J. A. Boatz, S. T. Elbert, M. S. Gordon, J. H. Jensen, S. Koseki, N. Matsunaga, K. A. Nguyen, S. Su, T. L. Windus, M. Dupuis, and J. A. Montgomery, Jr., J. Comp. Chem., 14, 1347 (1993).

11. (a) V. S. Sunderam, Concurrency: Practice and Experience, 2, 315 (1990); (b) V. Sunderam, Concurrent Computing with PVM, Cluster Computing Workshop, Florida State University, 1992.

12. (a) J. Almlöf, K. Faegri, Jr., and K. Korsell, J. Comp. Chem., 3, 385 (1982); (b) J. Almlöf and P. R. Taylor, In Advanced Theories and Computational Approaches to the Electronic Structure of Molecules, C. E. Dykstra, Ed., Reidel, Dordrecht, The Netherlands, 1984, pg. 315.

13. S. Sæbø and J. Almlöf, Chem. Phys. Lett., 154, 83 (1989).

14. M. Head-Gordon, J. A. Pople, and M. J. Frisch, Chem. Phys. Lett., 153, 503 (1988).

15. R. Ahlrichs, M. Bär, M. Häser, H. Horn, and Ch. Kölmel, Chem. Phys. Lett., 31, 521 (1987).

16. E. Hollauer and M. Dupuis, J. Chem. Phys., 96, 5220 (1992).

17. (a) M. J. Frisch, M. Head-Gordon, and J. A. Pople, Chem. Phys. Lett., 166, 275 (1990); (b) Chem. Phys. Lett., 166, 281 (1990).
