

Reactive Molecular Dynamics on Massively Parallel Heterogeneous Architectures

Sudhir B. Kylasa, Hasan Metin Aktulga, Ananth Y. Grama

Abstract—We present a parallel implementation of the ReaxFF force field on massively parallel heterogeneous architectures, called PuReMD-Hybrid. PuReMD, on which this work is based, along with its integration into LAMMPS, is currently used by a large number of research groups worldwide. Accelerating this important community codebase that implements a complex reactive force field poses a number of algorithmic, design, and optimization challenges, as we discuss in detail. In particular, different computational kernels are best suited to different computing substrates – CPUs or GPUs. Scheduling these computations requires complex resource management, as well as minimizing data movement across CPUs and GPUs. Integrating powerful nodes, each with multiple CPUs and GPUs, into clusters and utilizing the immense compute power of these clusters requires significant optimizations for minimizing communication and, potentially, redundant computations. From a programming model perspective, PuReMD-Hybrid relies on MPI across nodes, pthreads across cores, and CUDA on the GPUs to address these challenges. Using a variety of innovative algorithms and optimizations, we demonstrate that our code can achieve over 565-fold speedup compared to a single-core implementation on a cluster of 36 state-of-the-art GPUs for complex systems. In terms of application performance, our code enables simulations of over 1.8M atoms in under 0.68 seconds per simulation time step.

Index Terms—Reactive Molecular Dynamics; Parallel GPU Implementations; Material Simulations;


1 INTRODUCTION

There have been significant efforts aimed at atomistic modeling of diverse systems – ranging from materials processes to biophysical phenomena. Classical molecular dynamics (MD) techniques typically rely on static bonds and fixed partial charges associated with atoms [1]–[4]. These constraints limit their applicability to non-reactive systems. Quantum mechanical ab–initio methods have been used to simulate chemical reactions in reactive systems [5]–[7]. These simulations are typically limited to sub-nanometer length and picosecond time scales because of their high computational cost. For this reason, ab–initio approaches are unable to simultaneously describe bulk properties of systems and the reactive subdomains. Attempts have been made to bridge this gap between non-reactive bulk systems and reactive subdomains using hybrid simulation techniques, whereby the surface sites are simulated using quantum calculations and bulk sections are simulated using classical MD [8]–[10]. This approach has potential drawbacks due to the interface of the ab–initio and MD sections of the system. Classical force fields must be tuned not only to fit experimental results, but also to interface with the ab–initio calculations.

• S. Kylasa is with the Department of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana 47907. E-mail: [email protected]

• H. M. Aktulga is with Michigan State University, 428 S. Shaw Lane, Room 3115, East Lansing, MI 48824 and also with Lawrence Berkeley National Laboratory, 1 Cyclotron Rd, MS 50F-1650, Berkeley, CA 94720. E-mail: [email protected]

• A. Y. Grama is with the Department of Computer Science, Purdue University, West Lafayette, Indiana 47907. E-mail: [email protected].

This work was supported by the National Science Foundation Grant CCF 1533795.

Inconsistencies between MD force fields and quantum calculations can result in unwanted changes in the structure of the system [11].

In this paper, we focus on a reactive atomistic simulation method called ReaxFF, developed by van Duin et al. [12]–[15]. This method relies on the development of empirical force fields, similar to classical MD, that mimic the quantum mechanical variation of bond order. ReaxFF replaces the harmonic bonds of MD with bond orders and energies that depend on inter-atomic distances. The satisfaction of valencies, explicitly satisfied in MD simulations, necessitates many-body calculations in ReaxFF. This approach allows bond orders and all bonded interactions to decay (or emerge) smoothly as bonds break (or form), allowing chemical reactions within a conventional molecular dynamics framework. Consequently, ReaxFF can overcome many of the limitations inherent to conventional MD, while retaining, to a great extent, the desirable scalability. Furthermore, the flexibility and transferability of the force field allows ReaxFF to be applied to diverse systems of interest [12], [16]–[22]. In our prior work, we have demonstrated novel simulations and insights from reactive simulations of silica-water interfaces (Figure 1(a)) and oxidative stress on biomembranes (Figure 1(b)).

The critical enabling technology for such simulations is the PuReMD (Purdue Reactive Molecular Dynamics) software suite [23]–[25]. This software is available as a standalone package, as well as a plugin (User-ReaxC package [26]) to LAMMPS [27]. Serial and parallel (MPI) versions of PuReMD serve as community standards for ReaxFF simulations, with over a thousand downloads worldwide, and an active user and secondary developer community.


Fig. 1: Examples of ReaxFF simulations: (a) silica-water interface modeling; and (b) oxidative stress on biomembranes (lipid bilayers).

The accuracy and performance of PuReMD have been comprehensively demonstrated by us and others in the context of diverse applications.

The large number of time-steps and size of typical systems pose challenges for ReaxFF simulations. Time-steps in ReaxFF are of the order of tenths of femtoseconds – an order of magnitude smaller than corresponding conventional MD simulations, due to the need for modeling bond activity. Physical simulations often span nanoseconds (10^7 time-steps) and beyond, where interesting physical and chemical phenomena can be observed. Systems with millions of atoms are often necessary for eliminating size effects. PuReMD incorporates several algorithmic and numerical innovations to address these challenges posed by ReaxFF simulations on CPU-based systems [23], [24]. It achieves excellent per-time-step execution times, enabling nanosecond-scale simulations of large reactive systems.

As general-purpose GPU accelerators become increasingly common on large clusters, an important next step in the development of PuReMD, presented in this paper, is the efficient use of multiple GPUs (along with all available processing cores) to enable significant new simulation capabilities. Our serial GPU implementation of PuReMD provides up to a sixteen-fold speedup with respect to a single CPU core (Intel Xeon E5606) on NVIDIA C2075 GPUs for test systems (bulk water) [25]. This formulation, however, does not make good use of the CPU resources on the node. Nodes in conventional clusters include powerful CPUs, often comprising tens of compute cores, along with one or more GPUs. Efficiently utilizing all of these compute resources poses a number of challenges.

In partitioning work within a node between the CPU cores and GPUs, one must consider suitability of the compute kernels to the properties of the processor architecture. Furthermore, such a partitioning must also minimize data movement across CPU and GPU memories, while minimizing synchronization overheads. Across multiple such nodes, one must pay particular attention to the disparate tradeoffs in computation to communication speed. In particular, if the processing speed of compute nodes on a cluster is increased by over an order of magnitude without changing the capability of the

communication fabric, one can potentially expect a significant loss in parallel efficiency and scalability, unless suitable algorithms, optimizations, and implementations are developed. To address these challenges, this paper has the following goals: (i) to develop efficient computation and communication techniques for ReaxFF on GPU clusters; (ii) to develop efficient work distribution techniques across processing cores and GPUs in a node; and (iii) to demonstrate, in the context of a production code, scalable performance and effective resource utilization simultaneously for reactive MD simulations.

A parallel GPU implementation of a large, sophisticated code such as PuReMD poses significant challenges: (i) the highly dynamic nature of interactions and the memory footprint, (ii) the diversity of kernels underlying non-bonded and bonded interactions, (iii) the complexity of functions describing the interactions, (iv) the charge equilibration (QEq) procedure, which requires the solution of a large system of linear equations, and (v) high numerical accuracy requirements. All of these require careful algorithm design and implementation choices. Effective use of shared memory to avoid frequent global memory accesses, and of the configurable cache to exploit spatial locality during scattered memory operations, is essential to the performance of various kernels on individual GPUs. These kernels are also optimized to utilize the GPUs' capability to spawn thousands of threads, and coalesced memory operations are used to enhance the performance of specific kernels. The high cost of double precision arithmetic on conventional GPUs must be effectively amortized/masked through these optimizations. These requirements are traded off against an increased memory footprint to further enhance performance. To deal with significantly higher node computation rates, we present a sequence of design trade-offs for communication and redundant storage, along with alternate algorithmic choices for key kernels.

We describe the design and implementation of all phases of PuReMD-Hybrid. Comprehensive experiments on a state-of-the-art GPU cluster are presented to quantify accuracy as well as performance of PuReMD-Hybrid. Our experiments show over 565-fold improvement in runtime on a cluster of 36 GPU-equipped nodes, compared to a highly optimized CPU-only PuReMD implementation on model systems (bulk water). These speedups have significant implications for diverse scientific simulations.

The rest of the paper is organized as follows: Section 2 discusses related work on parallel ReaxFF. Section 3 presents a brief overview of our prior research in this area. Section 4 provides an in-depth analysis of the design and implementation choices made during the development of PuReMD-Hybrid. We comprehensively evaluate the performance of PuReMD-Hybrid in Section 5. Section 6 provides our concluding remarks and future work.


2 RELATED WORK

The first-generation ReaxFF implementation of van Duin et al. [12] established the utility of the force field in the context of various applications. Thompson et al. used this serial Fortran-77 implementation as the base code for a parallel integration of ReaxFF into the LAMMPS software [27]. This effort resulted in the LAMMPS/Reax package [28], the first publicly available parallel implementation of ReaxFF. Nomura et al. demonstrated a parallel ReaxFF implementation used to simulate large RDX systems in [29]–[31]. Their implementation yields high efficiencies; however, it is not designed for GPUs.

Our previous work on ReaxFF resulted in the PuReMD codebase [32], which includes three different packages targeted to different architectures: sPuReMD [23], PuReMD [24], and PuReMD-GPU [25]. sPuReMD, our serial implementation of ReaxFF, introduced novel algorithms and numerical techniques to achieve high performance, and a dynamic memory management scheme to minimize its memory footprint. Today, sPuReMD is being used as the ReaxFF backend in force field optimization calculations [33], where fast serial computations of small molecular systems are crucial for extending the applicability of ReaxFF to new chemical systems. PuReMD is an MPI-based parallel implementation of ReaxFF that exhibits excellent scalability. It has been shown to achieve up to 3-5× speedup over the parallel Reax package in LAMMPS [28] on identical machine configurations of hundreds of processors [24]. PuReMD-GPU, a GP-GPU implementation of ReaxFF, achieves a 16× speedup on an NVIDIA Tesla C2075 GPU over a single processing core (an Intel Xeon E5606 core) [25]. PuReMD-GPU is the only publicly available GPU implementation of ReaxFF.

Zheng et al. developed a GPU implementation of ReaxFF, called GMD-Reax [34]. GMD-Reax is reported to be up to 6 times faster than the USER-REAXC package on small systems, and about 2.5 times faster on typical workloads, using a quad-core Intel Xeon CPU. However, this performance is significantly predicated on the use of single-precision arithmetic operations and low-precision transcendental functions, which can potentially lead to significant energy drifts in long NVE simulations (the PuReMD codebase is fully double precision and adheres to the IEEE-754 floating point standard). GMD-Reax can run on single GPUs only, and is not publicly available.

A ReaxFF implementation on clusters of GPUs, which is the main focus of this paper, has not been reported before.

3 BACKGROUND: REAXFF IMPLEMENTATIONS IN PUREMD

Our hybrid parallel GPU implementation utilizes several algorithms and computational kernels from existing packages in PuReMD. For completeness of discussion, we provide a brief summary of these implementations and underlying algorithms. Please refer to [23]–[25] for complete descriptions. A summary of these implementations in the context of the work presented here is shown in Table 1.

3.1 sPuReMD: Serial ReaxFF Implementation

sPuReMD, our serial ReaxFF implementation, offers high modeling accuracy and linear scaling in the number of atoms, memory, and run-time [23]. Key components of sPuReMD include neighbor generation for atoms, computation of interactions to obtain energy and forces, and Verlet integration of atoms [35] under the effect of the total force for a desired number of time-steps. Each interaction kernel in ReaxFF is considerably more complex in terms of its mathematical formulation and associated operation counts, compared to a non-reactive (classical) MD implementation, due to the dynamic bond structure and charge equilibration in ReaxFF. ReaxFF relies on truncated bonded and non-bonded interactions. Consequently, a modified form of the cell linked-list method is used for neighbor generation [36], [37]. Forces in ReaxFF arise from bonded and non-bonded interactions. To compute bonded interactions, we first compute the bond structure using a bond order, which quantifies the likelihood of existence of a bond between the atoms, based on their distance and types. The bond structure is computed using these uncorrected bond orders and the valency of atoms.
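To make the cell linked-list idea concrete, the following is a minimal C sketch of the technique under simplified assumptions (a non-periodic cubic box, random coordinates, and a pair count instead of an actual neighbor list); it is not PuReMD's implementation, which additionally handles periodic images, atom types, and list storage.

/* Minimal cell linked-list neighbor search (illustrative sketch only). */
#include <stdio.h>
#include <stdlib.h>

typedef struct { double x, y, z; } vec3;

int main(void) {
    const int n = 1000;                 /* number of atoms (toy example)      */
    const double box  = 20.0;           /* cubic box edge length              */
    const double rcut = 5.0;            /* neighbor cutoff                    */
    const int nc = (int)(box / rcut);   /* cells per dimension (edge >= rcut) */
    const double cell = box / nc;

    vec3 *pos  = malloc(n * sizeof *pos);
    int  *head = malloc(nc * nc * nc * sizeof *head);   /* first atom per cell */
    int  *next = malloc(n * sizeof *next);               /* chained atoms      */
    for (int i = 0; i < n; ++i) {
        pos[i].x = box * (rand() / (RAND_MAX + 1.0));
        pos[i].y = box * (rand() / (RAND_MAX + 1.0));
        pos[i].z = box * (rand() / (RAND_MAX + 1.0));
    }

    /* Bin atoms into cells. */
    for (int c = 0; c < nc * nc * nc; ++c) head[c] = -1;
    for (int i = 0; i < n; ++i) {
        int cx = (int)(pos[i].x / cell), cy = (int)(pos[i].y / cell),
            cz = (int)(pos[i].z / cell);
        int c = (cx * nc + cy) * nc + cz;
        next[i] = head[c];              /* push atom i onto cell c's list      */
        head[c] = i;
    }

    /* For each atom, scan only its own cell and the adjacent cells. */
    long pairs = 0;
    for (int i = 0; i < n; ++i) {
        int cx = (int)(pos[i].x / cell), cy = (int)(pos[i].y / cell),
            cz = (int)(pos[i].z / cell);
        for (int dx = -1; dx <= 1; ++dx)
        for (int dy = -1; dy <= 1; ++dy)
        for (int dz = -1; dz <= 1; ++dz) {
            int ax = cx + dx, ay = cy + dy, az = cz + dz;
            if (ax < 0 || ax >= nc || ay < 0 || ay >= nc ||
                az < 0 || az >= nc) continue;             /* non-periodic box   */
            for (int j = head[(ax * nc + ay) * nc + az]; j >= 0; j = next[j]) {
                if (j <= i) continue;                     /* count each pair once */
                double ddx = pos[i].x - pos[j].x, ddy = pos[i].y - pos[j].y,
                       ddz = pos[i].z - pos[j].z;
                if (ddx * ddx + ddy * ddy + ddz * ddz <= rcut * rcut) ++pairs;
            }
        }
    }
    printf("neighbor pairs within cutoff: %ld\n", pairs);
    free(pos); free(head); free(next);
    return 0;
}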

Another important quantity required for computing forces on atoms is the partial charge on each atom. Unlike conventional MD simulations, where partial charges on atoms are invariant over time, ReaxFF computes partial charges at each time-step by minimizing the electrostatic energy of the system, subject to charge neutrality constraints (i.e., charge equilibration). This procedure translates to the solution of a large sparse linear system of equations. sPuReMD relies on a GMRES solver with an incomplete LU (ILU) factorization preconditioner for fast convergence [38], [39]. Once the bond structure and partial charges have been computed, bonded and non-bonded force terms can be computed. Non-bonded force computations are relatively expensive because of the longer cutoff radii associated with these interactions. For this reason, sPuReMD optionally uses a cubic spline interpolation to compute fast approximations of non-bonded energies and forces [40].
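As an illustration of the table-lookup idea, the sketch below evaluates a tabulated pairwise energy from per-interval cubic coefficients; the structure and names are hypothetical, not sPuReMD's actual tables, and the derivative of the same polynomial would supply the corresponding force.

/* Hypothetical cubic-spline lookup for a tabulated pairwise energy term:
 * for r in [k*dr, (k+1)*dr), E(r) is approximated by
 * a + b*t + c*t^2 + d*t^3 with t = r - k*dr. */
typedef struct { double a, b, c, d; } spline_coef;

static inline double spline_energy(const spline_coef *tab, double dr, double r)
{
    int k = (int)(r / dr);                 /* interval index                  */
    double t = r - k * dr;                 /* offset within the interval      */
    const spline_coef *s = &tab[k];
    return s->a + t * (s->b + t * (s->c + t * s->d));   /* Horner evaluation  */
}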

3.2 PuReMD: Parallel ReaxFF Implementation

Our message-passing parallel ReaxFF implementation, PuReMD, relies on a 3D spatial partitioning of the physical domain, along with a corresponding partitioning of atoms among processors, to distribute load across processing cores. Since all interactions in ReaxFF (with the exception of the charge equilibration procedure) are range limited, processors need to communicate only with logically neighboring processors to exchange data corresponding to atoms at the boundaries. In PuReMD, the full-shell method [41] is used to communicate atomic positions that are needed to compute various interactions, and to communicate back the resulting partial forces.


For each implementation we list the platform, programming model, key optimizations (built on the previous implementations), and demonstrated water-system scale:

• sPuReMD (single core, C). Key optimizations: memory-efficient data structures and algorithms. Water system scale: demonstrated ≈ 1.2 million atom system (limited only by resources).

• PuReMD (multi-core, C/MPI). Key optimizations: efficient communication substrate and solvers. Water system scale: 6540 atoms/core, demonstrated up to 3375 cores (22 million atoms).

• PuReMD-GPU (single GPU, C/CUDA). Key optimizations: efficient GPU implementation (virtually all computation on the GPU). Water system scale: 36K atoms on one C2075 GPU.

• PuReMD-PGPU and PuReMD-Hybrid (multi-GPU and multi-core, C/CUDA/MPI/pthreads). Key optimizations: efficient replication/communication, GPU implementation, optimizations specific to Kepler-class GPUs, GPU/CPU work partitioning, and pthread optimizations. Water system scale: 40K atoms/(GPU + 1 core) on 18 nodes (1.4 million atoms), 50K atoms/(GPU + 10 cores) on 18 nodes (1.8 million atoms), and demonstrated complex mixes of cores and GPUs using K20 cards.

TABLE 1: Key features and capabilities of the various PuReMD implementations.

Although the communication volume is doubled with the use of a full-shell scheme, this is necessary for correctly identifying dynamic bonds, as well as for efficiently handling bonded interactions straddling long distances across processor boundaries. Data between neighboring processes is communicated in a staged manner: all processes first communicate in the x-dimension and consolidate messages, followed by the y-dimension, and finally the z-dimension. This staged communication reduces message start-up overheads significantly.
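A minimal sketch of such a staged exchange using pairwise MPI_Sendrecv calls is shown below; the neighbor-rank array, buffer handling, and tags are illustrative placeholders rather than PuReMD's actual communication routines.

/* Sketch of a staged boundary exchange (illustrative only): one send/receive
 * pair per direction of each dimension; data received in the x stage is
 * merged into the send buffers before the y stage, and so on. */
#include <mpi.h>

enum { X = 0, Y = 1, Z = 2 };

/* nbr[d][0] / nbr[d][1]: ranks of the -/+ neighbors along dimension d,
 * as given by the 3D Cartesian decomposition (hypothetical layout). */
void exchange_boundary_atoms(MPI_Comm comm, int nbr[3][2],
                             double *send_buf, int send_cnt[3][2],
                             double *recv_buf, int recv_max)
{
    MPI_Status st;
    for (int d = X; d <= Z; ++d) {          /* staged: x, then y, then z     */
        for (int dir = 0; dir < 2; ++dir) { /* - and + directions            */
            MPI_Sendrecv(send_buf, send_cnt[d][dir], MPI_DOUBLE,
                         nbr[d][dir], /*tag=*/d,
                         recv_buf, recv_max, MPI_DOUBLE,
                         nbr[d][1 - dir], /*tag=*/d, comm, &st);
            /* ...unpack recv_buf; ghost atoms received here are consolidated
             * into the send buffers used in the next dimension. */
        }
    }
}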

The charge equilibration (QEq) solver in PuReMD is a diagonally scaled conjugate gradient (CG) solver [42]. This choice is motivated by the superior parallel scalability properties of a diagonal preconditioner over an ILU preconditioner, despite the former having slower convergence. When combined with an effective extrapolation scheme for obtaining good initial guesses, the diagonally scaled CG solver exhibits good overall performance. PuReMD has been demonstrated to scale to more than 3K computational cores under weak-scaling scenarios, yielding close to 80% efficiency. Bonded and non-bonded interactions in PuReMD scale particularly well, amortizing the higher cost of the charge equilibration solver.
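For reference, a serial sketch of a diagonally scaled (Jacobi-preconditioned) CG iteration for the charge-equilibration system H q = b is given below, assuming CSR storage; this is the textbook algorithm, not PuReMD's distributed solver, in which the dot products become MPI reductions and the matrix-vector product requires boundary exchanges.

/* Jacobi-preconditioned CG sketch for H q = b (CSR storage), serial version. */
#include <math.h>

static double dot(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

static void spmv(const int *rowptr, const int *col, const double *val,
                 const double *x, double *y, int n) {
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) s += val[k] * x[col[k]];
        y[i] = s;
    }
}

/* diag[i] holds H[i][i]; q holds the (extrapolated) initial guess on entry.
 * r, z, p, Hp are caller-provided work vectors of length n.               */
int qeq_cg(const int *rowptr, const int *col, const double *val,
           const double *diag, const double *b, double *q, int n,
           double tol, int max_it, double *r, double *z, double *p, double *Hp)
{
    spmv(rowptr, col, val, q, Hp, n);
    for (int i = 0; i < n; ++i) { r[i] = b[i] - Hp[i]; z[i] = r[i] / diag[i]; p[i] = z[i]; }
    double rz = dot(r, z, n), bnorm = sqrt(dot(b, b, n));
    for (int it = 0; it < max_it; ++it) {
        if (sqrt(dot(r, r, n)) <= tol * bnorm) return it;   /* converged      */
        spmv(rowptr, col, val, p, Hp, n);
        double alpha = rz / dot(p, Hp, n);
        for (int i = 0; i < n; ++i) { q[i] += alpha * p[i]; r[i] -= alpha * Hp[i]; }
        for (int i = 0; i < n; ++i) z[i] = r[i] / diag[i];  /* diagonal scaling */
        double rz_new = dot(r, z, n);
        double beta = rz_new / rz;
        rz = rz_new;
        for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
    return max_it;
}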

3.3 PuReMD-GPU: ReaxFF on GPUs

PuReMD-GPU [25] is the GPU port of sPuReMD, primarily targeted towards platforms in which the GPU is the primary compute resource. PuReMD-GPU maintains redundant data structures (neighbor lists, bond lists, and the QEq matrix) to exploit the GPU's Single Instruction Multiple Thread (SIMT) programming model. Data structures are also modified to avoid (most) floating point atomic operations on GPUs, yielding significant speedup at the expense of additional memory usage. Using a number of optimizations for concurrency, memory access, and minimizing serialization, PuReMD-GPU delivers over an order of magnitude improvement in performance on conventional GPUs, compared to state-of-the-art single processing cores [25]. PuReMD-GPU has been comprehensively validated for accuracy by independent research groups and is publicly available [32].

4 REAXFF ON HETEROGENEOUS ARCHITECTURES

PuReMD-Hybrid, like PuReMD, adopts a 3D spatial domain decomposition with full-shell regions for partitioning the input system into sub-domains, and staged communication for exchanging boundary atoms between neighboring processes. Unlike PuReMD, sub-domains in PuReMD-Hybrid are mapped onto heterogeneous compute nodes, which may contain multiple sockets and GPUs, with each socket containing multiple cores. Our validation platform, for instance, contains nodes with two GPUs and two Xeon processors, each with ten processing cores. We describe a sequence of implementations of increasing complexity that incorporate a variety of algorithmic choices and optimizations.

4.1 PuReMD-PGPU: GPU-Only Parallel Implementation

We initiate our discussion with an implementation, called PuReMD-PGPU, targeted towards platforms in which GPUs represent the primary computational capability. We subsequently describe our hybrid implementation that utilizes all CPU resources. GPU-only implementations are suited to platforms in which nodes have a small number of cores (e.g., dual- or quad-cores). On such platforms, a GPU cluster implementation can be described in the following steps (a structural sketch of the resulting time-step loop is given after the list):

1) Initialize each process according to the problem decomposition structure.

2) Create the communication layer between neighboring processes.

3) Exchange boundary atoms between neighboring processes.

4) Move data from the CPU to the GPU.

5) Compute the neighbor-list, bond-list, hydrogen-bond-list, and QEq matrix.

6) Compute bonded and non-bonded forces.

7) Compute total force and energy terms and update the position of each atom under the influence of the net force.

8) Move data back from the GPU to the CPU.


9) Repeat from Step 3 for the required number of time-steps.
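The loop structure implied by Steps 3–8 can be summarized by the host-side sketch below; all phase functions are hypothetical stubs and the atom layout is illustrative, so the sketch shows only how host/device data movement brackets the GPU work in each time-step.

/* Structural sketch of the PuReMD-PGPU time-step loop (Steps 3-8); the phase
 * functions are empty stubs standing in for the real kernels and MPI calls. */
#include <stdlib.h>
#include <cuda_runtime.h>

typedef struct { double x, y, z, vx, vy, vz, q; } atom_t;

static void exchange_boundary_atoms(atom_t *h, int n) { (void)h; (void)n; } /* Step 3 */
static void generate_lists_on_gpu(atom_t *d, int n)   { (void)d; (void)n; } /* Step 5 */
static void compute_forces_on_gpu(atom_t *d, int n)   { (void)d; (void)n; } /* Step 6 */
static void integrate_on_gpu(atom_t *d, int n)        { (void)d; (void)n; } /* Step 7 */

int main(void) {
    const int n = 36000, nsteps = 100;
    atom_t *h_atoms = (atom_t *)malloc(n * sizeof(atom_t));
    atom_t *d_atoms;
    cudaMalloc((void **)&d_atoms, n * sizeof(atom_t));

    for (int step = 0; step < nsteps; ++step) {
        exchange_boundary_atoms(h_atoms, n);                        /* Step 3 */
        cudaMemcpy(d_atoms, h_atoms, n * sizeof(atom_t),
                   cudaMemcpyHostToDevice);                         /* Step 4 */
        generate_lists_on_gpu(d_atoms, n);                          /* Step 5 */
        compute_forces_on_gpu(d_atoms, n);                          /* Step 6 */
        integrate_on_gpu(d_atoms, n);                               /* Step 7 */
        cudaMemcpy(h_atoms, d_atoms, n * sizeof(atom_t),
                   cudaMemcpyDeviceToHost);                         /* Step 8 */
    }
    cudaFree(d_atoms);
    free(h_atoms);
    return 0;
}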

The computation on CPUs and GPUs, and the communication associated with each of these steps, is shown in Table 2. Step 1 partitions the problem space and Step 2 creates the communication layer among the processes. Both of these steps are performed on the CPU. Step 3 communicates boundary atoms across neighboring processes. In the first time-step this data is already available in CPU memory. In subsequent time-steps, however, this data resides in GPU memory, since the associated computations are performed by the GPU. This data movement from GPU memory to CPU memory is performed at the end of the time-step (Step 8).

Step 4 copies data from CPU memory to GPU memory. This data primarily consists of positions, velocities, charges, and other data related to the process's assigned atoms. Step 5 builds all the auxiliary data structures used during the bonded and non-bonded computations. The full-shell sub-domain scheme dictates that, when building these data structures, atoms within the current sub-domain as well as in the boundary regions must be examined. This increases the number of atoms in each of the sub-domains, and hence the aggregate memory footprint of applications is larger in PuReMD-PGPU compared to PuReMD-GPU. For instance, while PuReMD-PGPU can accommodate water systems of up to 40K atoms per GPU, PuReMD-GPU can simulate water systems as large as 50K atoms on a single GPU with 5GB of global memory space.

Step 6, which computes bonded and non-bonded interactions in PuReMD-PGPU, represents the computational core of the algorithm. It is performed entirely on the GPU, and employs a multiple-threads-per-atom implementation for neighbor generation, hydrogen-bond interactions, and non-bonded force computations. This choice is motivated by the amount of computation involved, as well as the resources used by the respective kernels. Atomic operations are avoided in all kernels except for four-body interactions, at the expense of additional memory, to significantly reduce the total time per time-step [25]. With the higher compute capability supported by modern GPUs, reliance on shared memory is reduced considerably, with shuffle instructions yielding better performance.
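As an illustration of the multiple-threads-per-atom pattern, the generic kernel below lets K threads cooperate on one atom's neighbor entries and combines their partial sums with warp shuffles instead of shared memory or atomics. It is a sketch, not one of PuReMD's kernels: the list layout is hypothetical, and it uses the modern __shfl_down_sync intrinsic (CUDA 6.0-era code would use the older __shfl_down).

/* Generic multiple-threads-per-atom kernel: K threads (K a power of two,
 * K <= 32) cooperate on one atom's neighbor list and combine their partial
 * results with warp shuffles, so no shared memory or atomics are needed. */
#include <cuda_runtime.h>

template <int K>
__global__ void per_atom_sum(const int *nbr_start, const int *nbr_end,
                             const double *pair_val, double *atom_sum,
                             int n_atoms)
{
    int lane = threadIdx.x % K;                      /* position within the group */
    int atom = (blockIdx.x * blockDim.x + threadIdx.x) / K;
    if (atom >= n_atoms) return;

    double s = 0.0;
    /* Threads of the group stride through this atom's neighbor entries. */
    for (int j = nbr_start[atom] + lane; j < nbr_end[atom]; j += K)
        s += pair_val[j];

    /* Reduction within the K-thread group using warp shuffles. */
    unsigned mask = __activemask();
    for (int off = K / 2; off > 0; off /= 2)
        s += __shfl_down_sync(mask, s, off, K);

    if (lane == 0) atom_sum[atom] = s;               /* one write per atom */
}

Such a kernel would be launched as, e.g., per_atom_sum<8><<<(n_atoms * 8 + 255) / 256, 256>>>(...), mirroring the threads-per-atom and block-size parameters discussed in the experimental section.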

Step 7 computes the total energy and force terms, and updates atom positions and velocities. This computation is also performed on the GPUs. As mentioned, this data may potentially need to be communicated to neighboring processes. For this reason, the data is moved back from GPU memory to CPU memory in Step 8. PuReMD-PGPU can also be configured to analyze molecules, bonds, etc., and output this information at specified intervals. This also requires movement of data from GPU memory to CPU memory. Steps 3 through 8 are repeated for the required number of time-steps.

4.2 PuReMD-UVM: Virtual GPU Implementation

Our initial implementation of PuReMD-PGPU, described above, uses one CPU core and one GPU per MPI process. However, high-end clusters often have multiple GPUs and multi-core CPUs on each node. On such platforms, it may be possible to achieve increased resource utilization.

One possible way is through NVIDIA's Unified Virtual Memory (UVM) addressing. UVM is designed to simplify programming by using a single address space for both the CPU and the GPU. Internally, the CUDA runtime handles synchronization of data between the CPU and GPU when kernels are executed on the GPU. Since the available memory on the CPU is often larger than the global memory on the GPU, we can run larger systems in this configuration than when executing the same simulations directly on the GPU. When the application runs directly on the GPU, all memory management and data synchronization are handled by the application (as done by PuReMD-PGPU). The additional cost of copying data and synchronizing between the CPU and GPU memories causes potential performance overheads when using UVM [43]. Indeed, as we demonstrate in our experimental results, the performance of our virtual GPU implementation is not competitive with our best implementation, described in the next section.
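The pattern is sketched below with a hypothetical kernel: a single managed allocation is touched by both host and device code, and the runtime migrates the data, in contrast to the explicit cudaMemcpy staging used by PuReMD-PGPU.

/* Managed-memory sketch: the same pointer is dereferenced by host and device
 * code, and the CUDA runtime migrates the data (illustrative kernel only). */
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale_charges(double *q, int n, double s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) q[i] *= s;
}

int main(void) {
    const int n = 36000;
    double *q;
    cudaMallocManaged(&q, n * sizeof(double));   /* visible to CPU and GPU   */
    for (int i = 0; i < n; ++i) q[i] = 1.0;      /* initialized on the CPU   */

    scale_charges<<<(n + 255) / 256, 256>>>(q, n, 0.5);
    cudaDeviceSynchronize();                     /* ensure GPU work is done  */

    printf("q[0] = %f\n", q[0]);                 /* host reads the GPU result */
    cudaFree(q);
    return 0;
}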

Another way to achieve this capability is using NVIDIA's Multi-Process Service (MPS). MPS is a client-server implementation that interleaves (schedules) kernels from multiple MPI processes onto a GPU, with low overhead for mapping kernels from different MPI processes onto the GPU. If the GPU is completely saturated (all GPU resources such as memory, available cores, and registers are in use) by one (or a few) MPI process(es), kernels from other MPI processes requesting the same GPU are serialized. In this implementation, one may therefore observe increasing total time per time-step if each MPI process saturates the GPU completely. Furthermore, global memory management on the GPU must still be handled by the MPI processes themselves, which potentially increases execution time due to movement of data between the CPU and GPU address spaces. Finally, the CUDA 6.0 implementation of MPS does not support multi-GPU nodes. For these reasons, we do not use MPS in our simulations.

4.3 PuReMD-Hybrid Implementation

PuReMD-Hybrid leverages all available cores as well as GPUs to maximize resource utilization. Figure 2(a) shows the data dependency graph of a typical ReaxFF simulation time-step. Here, boxes correspond to computational tasks, while the data consumed by these tasks is shown in ovals. For instance, the nbrs computation depends on atom information to generate the neighbors-list; the init computation takes the neighbors-list as input and generates the bond-list, hydrogen-bond-list, and the QEq matrix; and so on. Data corresponding to atoms (force, velocity, and position vectors) is updated using results from the bonded, nonb, and QEq computations. Boundary atoms are exchanged with neighboring nodes at the beginning of each time-step.


• Step 1, process initialization. CPU computation: creation of MPI processes and allocation of data structures on the CPU and GPU.

• Step 2, communication layer creation. Communication: MPI initialization. CPU computation: creation of MPI processes and allocation of MPI buffers.

• Step 3, comm. Communication: exchange of boundary atoms between neighboring MPI processes.

• Step 4, GPU initialization. GPU computation: copy data from the CPU address space to the GPU's global memory; atom information such as position and velocity vectors is copied into the GPU address space.

• Step 5, nbrs and init. GPU computation: generate the neighbor-list for each atom, and generate the bond-list, hydrogen-bond-list, and QEq matrix used during later phases of the computation on the GPU.

• Step 6, bonded, nonb, and QEq. Communication: the linear solve during charge equilibration involves communication for operations like global reductions. GPU computation: computation of bonded force terms (bond-order, bond-energy, lone-pair, three-body, four-body, and hydrogen-bond terms); non-bonded computations include Coulomb and van der Waals force computation and charge equilibration (QEq).

• Step 7, total force and energy computation, and updating atom positions under the influence of the net force. GPU computation: compute the total force and energy and update the position of each atom depending on the configured ensemble.

• Step 8, update CPU with atom information. GPU computation: copy the atom information from the GPU's global memory to the CPU address space.

• Step 9, repetition of Steps 3–8 for the required number of time-steps. CPU computation: may involve journaling of output information, such as the detailed current state of the system (atom positions, new molecules formed, etc.) and analysis of bonds.

TABLE 2: Communication, CPU computation, and GPU computation associated with each step of a typical ReaxFF simulation time-step in PuReMD-PGPU.

Taking a task-parallel view of the data flow, one candidate work-splitting scheme between the CPU and the GPU is as follows:

• perform the nbrs and init computations on the GPU;

• move the bond-list and hydrogen-bond-list to the CPU; and

• perform the bonded computation on the CPU, and the nonb and QEq computations on the GPU.

This partitioning is motivated by the load characteristics of the tasks, the suitability of each task to the platform to which it is mapped, as well as the data transfers between CPU and GPU memory. For instance, in our benchmarks using the water-36K system, nbrs yields a speedup of 4.6× on a GPU, when compared to its single-core performance (PuReMD-one-core-per-node); init achieves a speedup of 5.3×; bonded achieves a speedup of 13.5×; while the nonb and QEq computations achieve speedups of 85.2× and 4.7×, respectively. Note that nonb is a highly parallelizable computation without need for synchronization among the CUDA threads during the entire computation, which explains its excellent GPU performance.

Fig. 2: Data dependency graphs of MD iterations: (a) data dependency graph of a typical MD iteration; (b) data dependency graph associated with the enhanced task-parallel implementation, with tasks partitioned between CPU computation and GPU computation.



Data movement between the GPU and CPU represents a significant overhead [44], [45]. For instance, for a water simulation with 36 thousand atoms, using PuReMD-PGPU on a single GPU, the bond-list and hydrogen-bond-list data structures require ≈ 800 MB. Copying these two lists from the GPU to the CPU address space takes ≈ 160 milliseconds, compared to 540 milliseconds for an average time-step on an NVIDIA K20 GPU. Note that the bond-list (296 MB) is smaller than the hydrogen-bond-list (500 MB) because the cutoff for bonds is 4-5 Å, while that for hydrogen bonds is 7-8 Å.

A second consideration in mapping tasks in the dependency graph to GPU and CPU cores relates to synchronization and the associated idling. For instance, CPU cores are idle while nbrs and init are being computed by the GPU. For reference, nbrs consumes about 75 milliseconds for the above-mentioned water system, which is significantly less than the corresponding data movement overhead. Motivated by these observations, PuReMD-Hybrid adopts the task-parallel implementation shown in Figure 2(b). In this model we split the work as follows:

• the CPU computes the nbrs, init-cpu, and bonded tasks, and

• the GPU computes the nbrs, init-gpu, nonb, and QEq tasks.

init-cpu generates the bond-list and hydrogen-bond-list, while init-gpu computes the QEq matrix. Both computation paths depend only on atom position information, which is a significantly smaller data structure compared to the bond-list and hydrogen-bond-list. After this synchronization, both the CPU and the GPU can execute their respective tasks without any further dependencies. It is important to note that, depending on the relative mix of CPU cores and GPUs, tasks can be mapped back and forth in this assignment.

Tasks assigned to CPU cores are parallelized using POSIX threads (pthreads). We adopt a data-parallel work-splitting mechanism across pthreads so that the workload on each of these threads is approximately equal. The total number of atoms is evenly split across the available pthreads (whose count is defined at the beginning of the simulation), so that a contiguous chunk of atoms is assigned to each thread. Assignment of contiguous chunks in this way increases data locality and reduces scheduling overheads.
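A minimal sketch of this static, contiguous partition is given below; process_atom() is a placeholder for the per-atom bonded work, and the thread and atom counts are illustrative.

/* Static, contiguous partition of atoms across pthreads (illustrative sketch). */
#include <pthread.h>

#define N_THREADS 10
#define N_ATOMS   36000

typedef struct { int begin, end; } range_t;

static void process_atom(int i) { (void)i; /* bonded-interaction work here */ }

static void *worker(void *arg) {
    range_t *r = (range_t *)arg;
    for (int i = r->begin; i < r->end; ++i)    /* contiguous chunk: good locality */
        process_atom(i);
    return NULL;
}

int main(void) {
    pthread_t tid[N_THREADS];
    range_t rng[N_THREADS];
    int chunk = (N_ATOMS + N_THREADS - 1) / N_THREADS;

    for (int t = 0; t < N_THREADS; ++t) {
        rng[t].begin = t * chunk;
        rng[t].end   = (t + 1) * chunk < N_ATOMS ? (t + 1) * chunk : N_ATOMS;
        pthread_create(&tid[t], NULL, worker, &rng[t]);
    }
    for (int t = 0; t < N_THREADS; ++t)
        pthread_join(tid[t], NULL);            /* synchronize before atom update */
    return 0;
}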

Figure 3 describes the implementation of ReaxFF iterations in PuReMD-Hybrid. At the beginning of the simulation, the main thread on the CPU reads the input data, initializes the GPU context, and creates the pthreads on the CPU. Within each iteration, boundary atoms are communicated to neighboring processes, and the CPU and the GPU initiate their respective tasks, which are executed concurrently. The main thread waits for the CPU and GPU to complete their tasks, and updates atom information at the end of each iteration.

Fig. 3: Illustration of a typical MD simulation iteration in the PuReMD-Hybrid implementation. In the SimInit phase, the main thread initializes the GPU context, creates pthreads on the CPU, allocates the necessary data structures, and creates the 3-D grid structure. In each iteration, boundary atoms are exchanged with neighboring processes; the CPU pthreads execute the neighbor, initialization, and bonded computations, while the GPU executes the neighbor, initialization, non-bonded, and charge equilibration computations; both report back to the main thread, which then updates atomic forces and the total energy.

5 EXPERIMENTAL RESULTS

We report our comprehensive evaluation of the performance of the PuReMD-PGPU and PuReMD-Hybrid packages. All reported simulations were run on the Intel14 GPU cluster at the High Performance Computing Center (HPCC) at Michigan State University. Nodes in this cluster contain two sockets, each of which is equipped with a 2.5 GHz 10-core Intel Xeon E5-2670v2 processor, 64GB of RAM, and an NVIDIA Tesla K20 GPU (with 5GB of device memory). Nodes are interconnected using an FDR InfiniBand interconnect and run Red Hat Enterprise Linux 6.3. Even though there are 40 such nodes, the maximum number of nodes that a user can utilize in a single job is restricted; simulations using up to 18 nodes (36 sockets/GPUs) are therefore reported in our performance evaluations.

We use water systems of various sizes for in-depth analysis of performance, since water simulations represent diverse stress points for the code. For weak scaling tests, we use water systems with 18,000 atoms and 40,000 atoms per socket, and for strong scaling analysis, we use water systems with 80,000 and 200,000 atoms. Note that 40,000 atoms per socket is equivalent to 80,000 atoms per node, and the terms water-80K per node and water-40K per socket refer to the same system (similarly, water-18K/socket is equivalent to water-36K/node). All our simulations use a time-step of 0.25 femtoseconds, a tolerance of 10^-6 for the QEq solver, and a Berendsen NVT ensemble. All simulations are run for 100 time-steps and repeated ten times. Averages from these ten sets of performance data are reported in the plots and tables presented.

Table 3 presents the acronyms associated with the various implementations, along with the computing resources and MPI processes used by each implementation. PuReMD, the parallel CPU version, uses all 10 cores on each socket with 10 MPI processes (20 MPI ranks/node). PuReMD-PGPU, the parallel GPU version, uses 1 CPU core and 1 GPU per socket (2 MPI ranks/node), where the CPU cores essentially act as drivers for the GPU computations.


• PuReMD: 10 CPU cores per socket; 10 MPI processes per socket.

• PuReMD-PGPU: 1 CPU core and 1 GPU per socket; 1 MPI process per socket.

• PuReMD-UVM: 10 CPU cores and 1 GPU per socket; 10 MPI processes per socket.

• PuReMD-Hybrid: 10 CPU cores and 1 GPU per socket; 1 MPI process per socket.

TABLE 3: Codebases used for the simulations, along with the computing resources and number of MPI processes used per socket.

In PuReMD-UVM, a virtualized GPU implementation using UVM, a single GPU is virtualized 10 ways across the 10 computing cores of a socket. It uses both GPUs on a node by creating 20 MPI ranks per node. Our most advanced implementation, PuReMD-Hybrid, creates only 2 MPI ranks per node, and each MPI process utilizes all 10 cores available on a CPU socket using pthreads, in addition to the GPU. This hybrid code uses 16 pthreads per MPI rank to process tasks assigned to a CPU socket in parallel.

GNU version 4.8.2, OpenMPI version 1.6.5, and the CUDA 6.0 development toolkit were used for compiling and running all versions. The following flags were used for PuReMD: "-O3 -funroll-loops -fstrict-aliasing"; PuReMD-PGPU and PuReMD-Hybrid are compiled with the following flags: "-arch=sm_35 -funroll-loops -O3".

An important performance optimization criterion for the GPU kernels is the number of threads used per atom, which can be tuned based on the input system. For bulk water simulations, these parameters have been determined as follows: 8 threads/atom for the nbrs kernel, 32 threads/atom for the hydrogen-bond kernel, 16 threads/atom for the nonb kernel, and 32 threads/atom for the matrix-vector products. The thread-block size for all kernels is set to 256 unless mentioned otherwise.
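The corresponding thread-block count follows directly from these parameters; the helper below is an illustrative calculation (not a PuReMD routine), and reproduces, for example, the 594 blocks quoted later for ≈ 19,000 atoms with 8 threads/atom and 256-thread blocks.

/* Illustrative launch-configuration helper: number of thread blocks needed
 * when threads_per_atom threads cooperate on each atom. */
#include <stdio.h>

static int num_blocks(int n_atoms, int threads_per_atom, int block_size) {
    long total_threads = (long)n_atoms * threads_per_atom;
    return (int)((total_threads + block_size - 1) / block_size);   /* ceiling */
}

int main(void) {
    /* e.g., the nbrs kernel: 19,000 atoms x 8 threads/atom, 256-thread blocks */
    printf("nbrs blocks: %d\n", num_blocks(19000, 8, 256));   /* prints 594 */
    return 0;
}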

To better interpret the performance results, we identify six key components of the simulations – comm, init, nbrs, bonded, nonb, and QEq – as listed in Table 2. Each of these parts has different characteristics: some are compute-bound, some are memory-bound, while others are communication-bound. Together they comprise ≈ 99% of the total computation time for typical ReaxFF simulations. We perform a detailed analysis of these major components to understand how the PuReMD-PGPU and PuReMD-Hybrid implementations respond to increasing system sizes and numbers of nodes.

5.1 Brief description of the NVIDIA K20 GPU

The Tesla K20 GPUs in the HPCC cluster have 13 Streaming Multiprocessors (SMs), each with 192 cores, for a total of 2496 cores. Each SM has 64KB of shared memory space that can be configured as L1 cache space or software-managed shared memory. Shared memory operations are significantly faster than global memory operations due to the proximity of shared memory to the SMs.

Fig. 4: Total time per time step comparison for all implementations (sPuReMD, PuReMD, PuReMD-UVM, PuReMD-PGPU, and PuReMD-Hybrid) using the water-80K/node system, as a function of the number of sockets.

Each SM also has a register file of 65,536 registers. An SM can schedule a maximum of 16 thread-blocks at any instant of time. In this context, occupancy is defined as the ratio of the number of active thread blocks to the maximum number of thread blocks allowed on a specific architecture. Occupancy is affected by the availability of shared resources on the GPU, such as registers and shared memory. At 100% occupancy, each thread can use up to 32 registers (assuming other resources are available), which implies that 2048 threads can be active on each SM at its peak utilization; note that 2048 × 32 is 64K, which is the size of the register file on each SM.

5.2 PuReMD-Hybrid Performance Improvements

In this section, we summarize the highlights of the speedups achieved by our PuReMD-Hybrid implementation using water at 40K atoms per socket. Figure 4 plots the total time per time step for all implementations. Note that the water system used by sPuReMD is scaled accordingly; for instance, a 1.44M-atom water system is used by PuReMD-Hybrid on 36 sockets, and the same system is used by sPuReMD for comparison purposes. The speedups of PuReMD-Hybrid on 36 sockets are:

• 565-fold compared to sPuReMD, a CPU-only code run on a single core;

• 2.41-fold compared to PuReMD, with 360 MPI ranks on 36 sockets;

• 1.29-fold compared to PuReMD-PGPU, using 36 sockets/GPUs; and

• 38-fold compared to PuReMD-UVM, using 36 sockets, each with 10 processes and 10 virtual GPUs.

The excellent speedup of PuReMD-Hybrid w.r.t. sPuReMD, 565-fold, can be attributed to the efficient parallelization of ReaxFF, highly parallelizable computations such as nonb, nbrs, and hydrogen-bond (as elaborated on below), and the SIMT execution semantics of the CUDA runtime. PuReMD-Hybrid is 2.41× faster than PuReMD and 1.29× faster than PuReMD-PGPU. Comparing the performance of PuReMD-PGPU to PuReMD, we observe that a single K20 GPU is equal in performance to ≈ 19 CPU cores of an Intel Xeon E5-2670v2 processor.


Fig. 5: Strong scaling results for the various parallel implementations: (a) time per time step, water-80K; (b) time per time step, water-200K; (c) speedup of PuReMD-Hybrid over PuReMD and PuReMD-PGPU, water-80K; (d) speedup of PuReMD-Hybrid over PuReMD and PuReMD-PGPU, water-200K.

Therefore one would expect a peak performance improvement of ≈ 2.9-fold for PuReMD-Hybrid over PuReMD on a per-node basis: each GPU contributes a speedup of 0.95 relative to the 20 CPU cores (19/20), a node has two GPUs, and the 20 CPU cores themselves contribute a factor of one, giving 2 × 0.95 + 1 ≈ 2.9. In the results below on larger systems, we show that PuReMD-Hybrid in fact approaches and even surpasses this estimated speedup. These results suggest that PuReMD-Hybrid has excellent resource utilization.

In PuReMD-UVM, each GPU is virtualized 10 ways with the help of UVM, which maps CUDA kernels from each of the MPI processes to the available GPU resources (global memory, streaming processors, etc.). For the water-40K/socket system using 36 sockets, each process handles ≈ 19,000 atoms (local as well as ghost atoms), and the nbrs kernel spawns 594 (= 19000 × 8 / 256) thread blocks. The nbrs kernel's occupancy is 50%, which means the GPU can schedule 104 (= 13 × 8) thread blocks at any instant of time. Just one process is thus able to completely saturate the GPU resources. Because of the resulting resource contention, which leads to serialization of CUDA kernels interlaced with data synchronization between the GPU and CPU, the virtualized GPU approach is not a good choice for our application. In Figure 4, we notice that the timings for PuReMD-UVM are considerably higher compared to the codes optimized for GPUs. For this reason, we do not include PuReMD-UVM results in the detailed comparisons presented later in this section.

5.3 Strong Scaling Results

Figures 5(a) and 5(b) present the wall-clock time per time-step for the various implementations under strong scaling scenarios for the water-80K and water-200K systems. As the number of sockets increases, the number of atoms processed per socket decreases, resulting in decreased wall-clock times. For PuReMD, the rate of decrease in the wall-clock time per step saturates at larger numbers of sockets. For the water-80K system, beyond 24 sockets this time starts increasing, due to the increased cost of communication with respect to computation. On the other hand, we notice a consistent decrease in execution times with increasing numbers of sockets for the PuReMD-PGPU and

PuReMD-Hybrid codes. Note that since the number of MPI ranks in the GPU versions is only 1 per socket (as opposed to 10 with PuReMD), the sub-domain size per process is 10× larger compared to PuReMD. For this reason, the relative size of the ghost regions with respect to the actual simulation domain, and hence the ratio of communication to computation cost, is significantly lower in the PuReMD-PGPU and PuReMD-Hybrid versions, explaining the better strong scalability observed with the GPU implementations.

In PuReMD-PGPU, the decrease in time from 16 to 36 sockets is not as pronounced for the water-80K system as it is for water-200K. This is because the number of thread blocks created by the init and bonded kernels is not sufficient to saturate the SMs on the GPU. For instance, in a 16-socket simulation, each MPI rank handles ≈ 19,000 atoms, and PuReMD-PGPU spawns only 75 thread blocks for the init and bonded kernels (except the hydrogen-bonds kernel). However, at 100% occupancy, 208 thread blocks would be needed to saturate the SMs on the GPU; even at 50% occupancy, 104 thread blocks can be scheduled by the K20 GPU at any instant of time.

On the other hand, the use of multiple threads per atom by the nbrs, hydrogen-bond, nonb, and QEq kernels results in a large number of thread blocks. For example, nbrs spawns 594 thread blocks, and the SMs remain saturated. However, since the number of thread blocks decreases with an increased number of sockets, the SMs' saturation level drops for these kernels, too; this results in lower parallel efficiency in PuReMD-PGPU beyond 16 sockets.

The kernels that typically achieve lower speedups on GPUs, i.e., init and bonded, are mapped onto the CPU cores in PuReMD-Hybrid. To reduce the total device memory requirements, the hydrogen-bond kernel has also been mapped to the CPU cores in PuReMD-Hybrid. As can be seen in Figures 5(a) and 5(b), this results in better overall performance and scalability by PuReMD-Hybrid compared to the PuReMD and PuReMD-PGPU codes. Figures 5(c) and 5(d) summarize the speedups achieved by the PuReMD-Hybrid implementation. PuReMD-Hybrid is between 1.15× and 1.54× faster compared to the PuReMD-PGPU code for the water-80K system and between


[Figure 6: five panels, (a) nbrs, (b) init, (c) bonded, (d) nonb, (e) QEq, each plotting time per step (seconds, log scale) against the number of sockets for PuReMD, PuReMD-PGPU, and PuReMD-Hybrid, with separate CPU and GPU curves in panels (a) and (b).]

Fig. 6: Timings of each of the key components of ReaxFF for strong scaling scenarios using Water-80K system.

[Figure 7: speedup (log scale) versus number of nodes for the Water-80K and Water-200K systems, with the ideal-speedup curve for reference.]

Fig. 7: Strong scaling speedup of PuReMD-Hybrid, when compared to its performance on a single node.

For both of these systems, the speedup achieved by PuReMD-Hybrid compared to PuReMD is around 2x for small numbers of sockets, but increases steadily with the number of sockets, reaching around 3x on 36 sockets.

Figure 7 presents the strong scaling speedup of PuReMD-Hybrid with respect to its performance on a single node. As expected, we notice a consistent increase in the effective speedup of the hybrid formulation when compared to its time per time step on a single node. The speedup achieved for the larger water system is higher than that for the smaller system, and is closer to the ideal speedup curve. This can be attributed to a more favorable tradeoff between computation and communication costs as the number of atoms per process increases.

5.3.1 Performance of key ReaxFF kernels

Figure 6 presents the performance of the key ReaxFF kernels for the various implementations in the water-80K simulation. Since the nbrs computation is duplicated on the CPUs and GPUs in PuReMD-Hybrid, and the init computations are split between the CPUs and GPUs, we show two separate lines, PuReMD-Hybrid-GPU and PuReMD-Hybrid-CPU, for this implementation in Figures 6(a) and 6(b).

The plot for nbrs in Figure 6(a) shows a consistent drop in its runtime as the number of sockets is increased for all implementations. As expected, the rate of the drop slows down for larger numbers of sockets, because there is not enough parallelism. The neighbor generation kernel using the cell-list method involves mainly pointer dereferencing, index lookups, and branches; therefore this kernel is performed more efficiently on the CPU (see the PuReMD nbrs time). The timings of this component for PuReMD-Hybrid-GPU and PuReMD-PGPU are almost identical because the same kernel is executed in these two implementations. The nbrs computation on the CPU in PuReMD-Hybrid reports higher timings than PuReMD, because neighbor list pairs are computed redundantly in this case to exploit pthreads parallelism. Note also that the sub-domain size per process in PuReMD-Hybrid is 10× larger than in PuReMD. However, the nbrs kernel constitutes a small percentage of the overall execution time, and these computations are overlapped with GPU computations in PuReMD-Hybrid. Consequently, suboptimal performance in this part does not severely impact the overall performance of PuReMD-Hybrid.
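One way to organize the redundant pair generation mentioned above is to let each pthread own a block of atoms and build the full neighbor list of its own atoms, so that a pair (i, j) is discovered twice but each thread writes only to the rows it owns and no locking is needed. The sketch below illustrates this idea; it is not the actual PuReMD-Hybrid data structure, and a brute-force distance loop stands in for the cell-list traversal.

```cpp
// Redundant neighbor-list generation with pthreads: atoms are partitioned
// among threads and each thread builds the *full* list of its own atoms.
// Distance computations are duplicated, but there are no shared writes.
#include <pthread.h>
#include <vector>

struct Atom { double x, y, z; };

static std::vector<Atom> atoms;                  // local + ghost atoms
static std::vector<std::vector<int>> nbr_list;   // one full list per atom
static double rcut2;                             // squared cutoff
static int nthreads;

static void* build_neighbors(void* arg) {
    long tid = (long)arg;
    int n = (int)atoms.size();
    // Block partition: thread tid owns atoms in [lo, hi).
    int lo = (int)((long long)n * tid / nthreads);
    int hi = (int)((long long)n * (tid + 1) / nthreads);
    for (int i = lo; i < hi; ++i) {
        for (int j = 0; j < n; ++j) {            // full scan: redundant work
            if (j == i) continue;
            double dx = atoms[i].x - atoms[j].x;
            double dy = atoms[i].y - atoms[j].y;
            double dz = atoms[i].z - atoms[j].z;
            if (dx*dx + dy*dy + dz*dz <= rcut2)
                nbr_list[i].push_back(j);        // only thread tid writes row i
        }
    }
    return nullptr;
}

void generate_neighbor_lists(int num_threads, double rcut) {
    nthreads = num_threads;
    rcut2 = rcut * rcut;
    nbr_list.assign(atoms.size(), std::vector<int>());
    std::vector<pthread_t> threads(num_threads);
    for (long t = 0; t < num_threads; ++t)
        pthread_create(&threads[t], nullptr, build_neighbors, (void*)t);
    for (long t = 0; t < num_threads; ++t)
        pthread_join(threads[t], nullptr);
}
```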


[Figure 8: four panels plotting, against the number of sockets, (a) total time per time step for water-18K atoms/socket, (b) total time per time step for water-40K atoms/socket, (c) achieved speedups for water-18K atoms/socket, and (d) achieved speedups for water-40K atoms/socket, comparing PuReMD, PuReMD-PGPU, and PuReMD-Hybrid.]

Fig. 8: Weak scaling results, time per time-step and speedup, using water systems for parallel implementations.

Figures 6(b), 6(c) and 6(d) present the timings of the init, bonded, and nonb computations, respectively. The init computations involve the generation of the bond and hydrogen-bond lists, as well as the QEq matrix. These tasks are split between the GPU and CPU in PuReMD-Hybrid; therefore PuReMD-Hybrid significantly outperforms PuReMD and PuReMD-PGPU in this part. However, PuReMD-Hybrid's timings for the bonded computation are higher than those of PuReMD-PGPU, as well as PuReMD. This is a result of its pthreads implementation on the CPU, which contains redundancies to leverage thread parallelism. Again, this computation is completely overlapped with GPU computations in PuReMD-Hybrid and is not expected to represent a significant overhead in general. As expected, the GPU implementation of the nonb kernel yields significantly better performance than its CPU counterpart, because of the highly parallel nature of this computation. The overall performance of PuReMD-Hybrid is identical to that of PuReMD-PGPU for nonb computations.

Figure 6(e) plots the timing for the charge equilibration part, QEq. The QEq part is one of the most expensive parts of PuReMD, as it involves four communication operations (two message exchanges and two global reductions) in every iteration of the linear solve. With an increasing number of MPI processes (and a decreasing sub-domain size per process), the relative volume of the ghost regions between neighboring MPI processes increases. This leads to increased communication volumes, and since the charge equilibration solver typically requires 15-20 iterations, communication overheads can become pronounced on large numbers of sockets. These overheads are very significant for the PuReMD code, which uses 360-way MPI parallelism on 36 sockets. The PuReMD-Hybrid and PuReMD-PGPU versions exhibit superior performance for this kernel as a result of accelerated sparse matrix-vector multiplications on GPUs and relatively lower communication volumes.
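The communication pattern of the solver can be sketched as a generic CG-style iteration: a halo exchange of the iterate before the local sparse matrix-vector product, followed by a global reduction for the dot products. Executed once per right-hand side of QEq, this pattern would account for the two exchanges and two reductions per iteration mentioned above; the data structures and helper routines below are placeholders, not PuReMD code.

```cpp
// Communication-pattern sketch for one iteration of the QEq linear solve.
#include <mpi.h>
#include <numeric>
#include <vector>

// local: owned entries of a distributed vector; ghost: copies of neighbors'
// boundary entries. For brevity the whole local vector is exchanged below;
// the real code exchanges only boundary atoms.
struct DistVector { std::vector<double> local, ghost; };

void exchange_ghosts(DistVector& x, const std::vector<int>& nbrs, MPI_Comm comm) {
    x.ghost.resize(x.local.size());
    for (int nbr : nbrs) {
        MPI_Sendrecv(x.local.data(), (int)x.local.size(), MPI_DOUBLE, nbr, 0,
                     x.ghost.data(), (int)x.ghost.size(), MPI_DOUBLE, nbr, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

// Stand-in for the local sparse matvec q = H * p (done on the GPU in the
// PGPU/Hybrid codes); trivial here so the sketch is self-contained.
void local_spmv(const DistVector& p, DistVector& q) { q.local = p.local; }

double local_dot(const std::vector<double>& a, const std::vector<double>& b) {
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}

// One CG-style iteration: ghost exchange, local matvec, global reduction
// of the dot products needed to update the iterate.
double qeq_iteration(DistVector& p, DistVector& q,
                     const std::vector<int>& nbrs, MPI_Comm comm) {
    exchange_ghosts(p, nbrs, comm);              // message exchange
    local_spmv(p, q);                            // local matvec
    double partial[2] = { local_dot(p.local, q.local),
                          local_dot(q.local, q.local) };
    double global[2];
    MPI_Allreduce(partial, global, 2, MPI_DOUBLE, MPI_SUM, comm);  // reduction
    return global[0];                            // used in the step-length update
}
```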

5.4 Weak Scaling Results

We use water systems with 18K and 40K atoms per socket for benchmarking the different implementations under weak scaling scenarios. Figures 8(a) through 8(d) present the wall-clock times per time step and the speedups for the two water systems. PuReMD-Hybrid outperforms PuReMD-PGPU by a factor of about 1.29x for both systems, and both codes exhibit similar weak-scaling properties. As expected, the PuReMD implementation exhibits the slowest execution times for both systems, and its weak scaling efficiency is worse than that of the GPU implementations due to the increased communication overheads (especially during the charge equilibration phase) associated with the high number of MPI ranks it uses. We obtain speedups ranging from 1.9x (on small numbers of sockets) to 2.4x (on large numbers of sockets) with PuReMD-Hybrid over PuReMD for the 18K atoms/socket system. Similarly, speedups ranging from 2.1x up to 2.4x are obtained for the 40K atoms/socket water system.

5.4.1 Performance of key ReaxFF kernels

Figure 9 presents the performance (runtime) of the individual components of ReaxFF under weak scaling scenarios for all implementations when systems with 40K atoms per socket are used.

While PuReMD-Hybrid's nbrs time on the GPU is similar to that of PuReMD-PGPU, our pthreads-based implementation for the CPU socket contains redundant computations and performs worse than the original CPU implementation with MPI parallelism. On the other hand, since the work in the init computations is split, PuReMD-Hybrid performs better than either implementation for this expensive kernel. Timings for the bonded computation are plotted in Figure 9(c). Similar to the nbrs computations, PuReMD-Hybrid takes more time than PuReMD and PuReMD-PGPU because this computation involves redundancies to resolve race conditions among threads in the pthreads implementation. However, nbrs, the bond-related parts of init, and the bonded computations run on the CPU in tandem with the GPU computations in PuReMD-Hybrid. The completely asynchronous execution of these kernels and their relatively low computational expense in comparison to the non-bonded and QEq computations give PuReMD-Hybrid its performance advantage.
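A minimal sketch of this overlap is given below: CUDA kernel launches return control to the host immediately, the CPU-side kernels run in the meantime on the multicore host, and the two sides are joined with a stream synchronization before forces are combined. Kernel and function names are illustrative placeholders rather than PuReMD symbols.

```cpp
// CPU/GPU overlap within one time step of a hybrid code.
#include <cuda_runtime.h>

__global__ void nonbonded_forces_kernel() { /* GPU-side non-bonded forces */ }
__global__ void qeq_spmv_kernel()         { /* GPU-side QEq matvec */ }

// CPU-side kernels, internally parallelized with pthreads (see earlier sketch).
void cpu_generate_neighbors() { /* neighbor and bond list generation */ }
void cpu_bonded_forces()      { /* bonded and hydrogen-bond interactions */ }

void hybrid_time_step(cudaStream_t stream) {
    // 1. Kernel launches are asynchronous with respect to the host, so the
    //    CPU work below overlaps with GPU execution. The launch configuration
    //    echoes the 594-block, 256-thread setup discussed earlier.
    nonbonded_forces_kernel<<<594, 256, 0, stream>>>();
    qeq_spmv_kernel<<<594, 256, 0, stream>>>();

    // 2. CPU-side kernels run in tandem on the multicore host.
    cpu_generate_neighbors();
    cpu_bonded_forces();

    // 3. Join the two sides before forces are combined and atoms are moved.
    cudaStreamSynchronize(stream);
}
```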


[Figure 9: five panels, (a) nbrs, (b) init, (c) bonded, (d) nonb, (e) QEq, each plotting time per step (seconds, log scale) against the number of sockets for PuReMD, PuReMD-PGPU, and PuReMD-Hybrid, with separate CPU and GPU curves in panels (a) and (b).]

Fig. 9: Timings of each of the key components of ReaxFF for Water-40K atoms/socket system during weak scaling.

Results for the nonb computation are plotted in Figure 9(d). This computation simply requires iterating over the neighbor list, and for each pair a significant number of floating-point operations is needed to compute the energies and forces, which makes this a compute-bound kernel. The effective use of the SMs by the multiple-threads-per-atom kernel implementation, resulting in coalesced memory accesses and higher memory throughput, together with the high arithmetic intensity of the non-bonded interactions, contributes to the excellent performance of PuReMD-PGPU and PuReMD-Hybrid (which execute the same kernel) over PuReMD. The speedup achieved by using GPUs for this kernel is close to 10× over the CPU-only version.
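The multiple-threads-per-atom organization can be sketched as follows: K threads cooperate on one atom, striding through its neighbor list so that adjacent threads read adjacent entries (coalesced loads), and the partial results are combined with a warp-level reduction. The pair term and data layout below are placeholders, not the ReaxFF van der Waals and Coulomb expressions or the actual PuReMD-Hybrid structures.

```cpp
// Multiple-threads-per-atom non-bonded kernel sketch (K threads per atom).
#include <cuda_runtime.h>

#define K 8  // threads cooperating on one atom, as in the nonb/nbrs launches

__global__ void nonb_multi_thread_per_atom(const int*   nbr_start,  // CSR offsets
                                           const int*   nbr_list,   // neighbor ids
                                           const float* dist,       // pair distances
                                           float*       energy,     // per-atom output
                                           int          n_atoms) {
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;
    int atom = gid / K;          // which atom this thread helps with
    int lane = gid % K;          // this thread's offset within the atom's group

    float e = 0.0f;
    if (atom < n_atoms) {
        // Threads of the same group read consecutive neighbor entries (coalesced).
        for (int j = nbr_start[atom] + lane; j < nbr_start[atom + 1]; j += K) {
            float r = dist[j];                    // precomputed pair distance
            e += 1.0f / (r * r * r * r * r * r);  // placeholder pair term
            (void)nbr_list[j];                    // real code gathers the partner atom
        }
    }
    // Combine the K partial sums within each group of K lanes of the warp.
    for (int offset = K / 2; offset > 0; offset /= 2)
        e += __shfl_down_sync(0xffffffffu, e, offset, K);
    if (atom < n_atoms && lane == 0) energy[atom] = e;
}
```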

Figure 9(e) shows the timings for charge equilibration in the weak scaling scenarios. The QEq computation is dominated by matrix-vector products and communication during the linear solve (two local message exchanges and two global reductions). We can see the impact of communication cost by comparing the performance of PuReMD with the GPU-based implementations. For a small number of nodes, PuReMD's performance is comparable to that of the other implementations. However, as the number of sockets is increased, PuReMD becomes significantly slower (by almost 2x at 36 sockets) due to increased communication costs.

To fully characterize the weak scaling performance of PuReMD-Hybrid, we execute larger simulations with 50K water atoms per socket. Figures 10(a) and 10(b) present the speedup and the total time per time step for this large water system. From these results, we note that PuReMD-Hybrid achieves an effective speedup of 3.46x compared to PuReMD.

[Figure 10: (a) speedup of PuReMD-Hybrid over PuReMD and (b) total time per time step for PuReMD and PuReMD-Hybrid, both plotted against the number of sockets.]

Fig. 10: Weak scaling results for water systems with 100K atoms per node.

As discussed above, this figure is close to optimal in terms of utilizing all compute resources on the node. Furthermore, note that PuReMD-Hybrid also includes a number of memory optimizations (storage of the bond, three-body, and hydrogen-bond lists in main memory instead of device memory) that allow scaling to such large systems; this is not possible with PuReMD-PGPU. In terms of application performance, this corresponds to a 1.8 million atom water system running at ≈ 0.68 seconds per time step, compared to 2.36 seconds for PuReMD using 360 MPI ranks.

5.4.2 Weak Scaling Efficiencies

We define efficiency as the ratio of the time consumed per atom on a single node to the time consumed per atom on multiple nodes under weak scaling. Table 4 presents efficiency results for the PuReMD codes using 40K/socket water systems.


        PuReMD-PGPU         PuReMD-Hybrid       PuReMD
Nodes   Time   Efficiency   Time   Efficiency   Time    Efficiency
1       7.68   100.00       5.96   100.00       12.90   100.00
2       7.72    99.50       6.07    98.11       13.27    97.17
4       7.99    96.21       6.18    96.42       14.03    91.93
8       8.05    95.47       6.30    94.56       14.47    89.12
12      8.15    94.25       6.35    93.89       14.54    88.69
18      8.39    91.62       6.48    91.94       15.62    82.55

TABLE 4: Efficiency results of the 40K/socket water system for weak scaling simulations. Time is measured in microseconds and indicates the time spent per atom; efficiency is given in percent.
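For reference, the efficiency column of Table 4 follows directly from the per-atom times: it is the single-node time divided by the time on p nodes, expressed as a percentage. The small check below recomputes the 18-node entries; agreement with the table is up to the rounding of the reported times.

```cpp
// Weak-scaling efficiency as defined above, recomputed from Table 4 values.
#include <cstdio>

double weak_scaling_efficiency(double t_one_node_us, double t_p_nodes_us) {
    return 100.0 * t_one_node_us / t_p_nodes_us;
}

int main() {
    // Per-atom times (microseconds) on 1 and 18 nodes, taken from Table 4.
    printf("PuReMD-Hybrid, 18 nodes: %.1f%%\n", weak_scaling_efficiency(5.96, 6.48));
    printf("PuReMD-PGPU,   18 nodes: %.1f%%\n", weak_scaling_efficiency(7.68, 8.39));
    printf("PuReMD,        18 nodes: %.1f%%\n", weak_scaling_efficiency(12.90, 15.62));
    return 0;
}
```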

Our first observation is that PuReMD-Hybrid yields significantly better speed and efficiency than PuReMD, and better performance than PuReMD-PGPU as well. This observation is grounded in two distinct facts: (i) the GPU implementation increases the computation speed at the nodes significantly, while the communication substrate stays the same, which potentially leads to poorer scaling characteristics; and (ii) the subdomain size per MPI process in the parallel GPU implementations is much larger, which, combined with higher intra-node bandwidth, potentially yields better scaling characteristics. On balance, we observe that PuReMD-Hybrid and PuReMD-PGPU both benefit from the second fact, resulting in better scalability.

6 CONCLUSION AND FUTURE WORK

In this paper, we presented an efficient and scalable parallel implementation of ReaxFF using MPI, pthreads, and CUDA. Our ReaxFF implementation for heterogeneous architectures, PuReMD-Hybrid, is shown to achieve up to 565× speedup compared to a single-CPU implementation, 3.46× speedup compared to the parallel CPU implementation PuReMD, and 1.29× speedup compared to the parallel GPU-only implementation PuReMD-PGPU on 36 sockets of CPUs and GPUs under weak-scaling scenarios. The accuracy of the resulting implementation has been verified against the benchmark production PuReMD code by comparing various energy and force terms over large numbers of time steps under diverse application scenarios and systems.

Our ongoing work focuses on the use of this software in a variety of applications, ranging from simulations of energetic materials to biophysical systems. ReaxFF represents a unique simulation capability, combining modeling fidelity approaching that of ab-initio simulations with the superior runtime characteristics of conventional molecular dynamics techniques. With respect to software development, this code can be augmented with techniques such as accelerated and replica dynamics to enable long simulations. Furthermore, the identification of significant events by post-processing PuReMD trajectories presents significant challenges and opportunities. To this end, inlining analyses into PuReMD would further enhance its application scope.

ACKNOWLEDGMENTS

We thank Adri van Duin for significant help in validating our software on a variety of systems. We also thank Joe Fogarty at the University of South Florida for constructing model systems for testing and validation, and Michigan State University for providing us access to their GPU cluster for benchmarking our application.


Sudhir B. Kylasa is currently a PhD student at Purdue University. Prior to joining Purdue University he worked in the software industry in various roles, primarily in the telecom market. His research focus is in the area of distributed and parallel computing, data analytics, and machine learning.

Hasan Metin Aktulga is an Assistant Professor in the Department of Computer Science and Engineering at Michigan State University. His research interests are in the areas of high performance computing, applications of parallel computing, big data analytics, and numerical linear algebra. Dr. Aktulga received his B.S. degree from Bilkent University in 2004, and M.S. and Ph.D. degrees from Purdue University in 2009 and 2010, respectively, all in Computer Science. Before joining MSU, he was a postdoctoral researcher in the Scientific Computing Group at the Lawrence Berkeley National Laboratory (LBNL).

Ananth Y. Grama is a Professor of Computer Science at Purdue University. He also serves as the Associate Director of the recently initiated Science and Technology Center from the National Science Foundation on the Science of Information, as well as the Center for Prediction of Reliability, Integrity, and Survivability of Microsystems (PRISM) of the US Department of Energy. He was a Purdue University Faculty Scholar from 2002 to 2007. Ananth received his Ph.D. in Computer Science from the University of Minnesota in 1996. His interests span broad areas of Parallel and Distributed Systems and Computational Science and Engineering.