AMReX: Block-Structured Adaptive Mesh Refinement for Multiphysics Applications

Journal Title XX(X):1–16 © The Author(s) 2020

Reprints and permission: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/ToBeAssigned www.sagepub.com/


Weiqun Zhang1, Andrew Myers1, Kevin Gott2, Ann Almgren1, John Bell1

1CCSE, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
2NERSC, Lawrence Berkeley National Laboratory, Berkeley, CA, USA

Corresponding author: Weiqun Zhang, MS 50A-3111, Lawrence Berkeley National Laboratory, Berkeley, CA 94720. Email: [email protected]

Abstract
Block-structured adaptive mesh refinement (AMR) provides the basis for the temporal and spatial discretization strategy for a number of ECP applications in the areas of accelerator design, additive manufacturing, astrophysics, combustion, cosmology, multiphase flow, and wind plant modelling. AMReX is a software framework that provides a unified infrastructure with the functionality needed for these and other AMR applications to be able to effectively and efficiently utilize machines from laptops to exascale architectures. AMR reduces the computational cost and memory footprint compared to a uniform mesh while preserving accurate descriptions of different physical processes in complex multi-physics algorithms. AMReX supports algorithms that solve systems of partial differential equations (PDEs) in simple or complex geometries, and those that use particles and/or particle-mesh operations to represent component physical processes. In this paper, we will discuss the core elements of the AMReX framework such as data containers and iterators as well as several specialized operations to meet the needs of the application projects. In addition we will highlight the strategy that the AMReX team is pursuing to achieve highly performant code across a range of accelerator-based architectures for a variety of different applications.

Keywords
Adaptive mesh refinement; structured mesh/grids; particles; co-design; performance portability

Introduction
AMReX, a software framework developed and supported by the U.S. DOE Exascale Computing Project's AMReX Co-Design Center, supports the development of block-structured adaptive mesh refinement (AMR) algorithms for solving systems of partial differential equations, in simple or complex geometries, on machines from laptops to exascale architectures. Block-structured AMR provides the basis for the temporal and spatial discretization strategy for ECP applications in accelerator design, additive manufacturing, astrophysics, combustion, cosmology, multiphase flow and wind energy. This suite of applications represents a broad range of different physical processes with a wide range of algorithmic requirements.

Fundamental to block-structured AMR algorithms is a hierarchical representation of the solution at multiple levels of resolution. At each level the solution is defined on the union of data containers at that resolution, each of which represents the solution over a logically rectangular subregion of the domain. For mesh-based algorithms, AMReX provides support for both explicit and implicit discretizations. Support for solving linear systems includes native geometric multigrid solvers as well as the ability to link to external linear algebra software. For problems that include stiff systems of ordinary differential equations that represent single-point processes such as chemical kinetics or nucleosynthesis, AMReX provides an interface to ODE solvers provided by SUNDIALS. AMReX supports algorithms that utilize particles and/or particle-mesh operations to represent component physical processes. AMReX also provides support for embedded boundary (cut cell) representations of complex geometries. Both particles and embedded boundary representations introduce additional irregularity and complexity in the way data is stored and operated on, requiring special attention in the presence of dynamically changing hierarchical mesh structures and AMR time stepping approaches.

There are a number of different open-source AMR frameworks; see the 2014 survey paper by Dubey et al. (2014a) for those available at the time. Most of these frameworks originally used MPI for basic parallelization where data is distributed to MPI ranks with each rank performing operations on its own data. More recently, most of these frameworks, including BoxLib, the precursor to AMReX, introduced some level of hierarchical parallelism through the use of OpenMP directives. However, as we have been moving towards the exascale era, node architectures have been changing dramatically, with the advent and increasingly widespread use of many-core CPU-only nodes and hybrid nodes with GPU accelerators. Further complicating the landscape is the fact that different types of GPUs have different programming models. A key driver behind the development of AMReX is to provide a high performance framework for applications to run on a variety of current- and next-generation systems without substantial architecture-specific code development/modification costs for each new platform.

In the remainder of the paper, we first define what we mean by block-structured AMR followed by a discussion of the basic design issues considered in the development of AMReX. We then discuss the ECP applications that are using AMReX and briefly describe their computational requirements. Next we discuss the data structures used to describe the grid hierarchy and the data containers used to hold mesh data. We then discuss additional features of AMReX needed to support particle-based algorithms and embedded boundary discretizations. We also discuss linear solvers needed for implicit discretizations. We close with a brief discussion of software engineering practice and some concluding remarks.

Block-Structured AMR
In the context of this paper, block-structured AMR is considered to have the following defining features:

• The mesh covering the computational domain is decomposed spatially into structured patches (grids) that each cover a logically rectangular region of the domain.
• Patches with the same mesh spacing are disjoint; the union of such patches is referred to as a level. Only the coarsest level (ℓ = 0) is required to cover the domain, though finer levels can cover it as well.
• The complete mesh hierarchy on which field variables are defined is the union of all the levels. Proper nesting is enforced, i.e. the union of grids at level ℓ > 0 is strictly contained within the union of grids at level ℓ − 1. We note, however, that AMReX does not impose proper nesting at the individual grid level; i.e., a level ℓ grid can span multiple grids at level ℓ − 1.
• The decomposition of the physical region covered by each level can be different for mesh data vs. particle data; we refer to this as a "dual grid" approach.
• The mesh hierarchy can change dynamically throughout a simulation.

We note the contrast to block-structured AMR frameworks that use a quadtree/octree structure, such as FLASH (Fryxell et al. 2000; Dubey et al. 2014b), where the patches have a uniform size, each grid at level ℓ > 0 has a unique parent grid at level ℓ − 1, and algorithms are typically constrained to use the same time step at all levels. We note that these grid patterns and algorithms are a subset of the AMReX capabilities.

Design Philosophy
There are two major factors that inform the design of the AMReX framework. First, AMReX should be able to support a wide range of multiphysics applications with different performance characteristics. Second, AMReX should not impose restrictions on how application developers construct their algorithms. Consequently, AMReX must provide a rich set of tools with sufficient flexibility that the software can meet the algorithmic requirements of many different applications without sacrificing the performance of any. To achieve these goals, careful attention has been paid to separating the design of the data structures and basic operations from the algorithms that use those data structures. The core software components provide the flexibility to support the exploration, development and implementation of new algorithms that might generate additional performance gains.

The AMReX design allows application developers to interact with the software at several different levels of abstraction. It is possible to simply use the AMReX data containers and iterators and none of the higher-level functionality. A more popular approach is to use the data structures and iterators for single- and multi-level operations but retain complete control over the time evolution algorithm, i.e., the ordering of algorithmic components at each level and across levels. In an alternative approach, the developer exploits additional functionality in AMReX that is designed specifically to support traditional subcycling-in-time algorithms. In this approach, stubs are provided for the necessary operations such as advancing the solution on a level, correcting coarse grid fluxes with time- and space-averaged fine grid fluxes, averaging data from fine to coarse and interpolating in both space and time from coarse to fine. This layered design provides users with the ability to have complete control over their algorithm or to utilize an application template that can provide higher-level functionality.

The other major factor influencing the AMReX design is performance portability. As we enter the exascale era, a variety of new architectures are being developed with different capabilities and programming models. Our goal is to isolate applications from any particular architecture and programming model without sacrificing performance. To achieve this goal, we introduced a lightweight abstraction layer that effectively hides the details of the architecture from the application. This layer provides constructs that allow the user to specify what operations they want to perform on a block of data without specifying how those operations are carried out. AMReX then maps those operations onto the hardware at compile time so that the hardware is utilized effectively. For example, on a many-core node, an operation would be mapped onto a tiled execution model using OpenMP to guarantee good cache performance, while on a different architecture the same operation might be mapped to a kernel launch appropriate to a particular GPU.

Similar in spirit to the other aspects of the design, applications are not required to use the AMReX-provided kernel launch functionality. Applications can use OpenMP, OpenACC or native programming models to implement their own kernel launches if desired. Furthermore, as part of the AMReX design, specific language requirements are not imposed on users (although much of the AMReX internal performance portability capability requires C++). Specifically, the project supports application modules written in Fortran, C, C++ or other languages that can be linked to C++.


AMReX-based ECP Applications

Seven ECP application projects, as well as numerous non-ECP projects, include codes based on AMReX. Here we briefly review the ECP applications and how they use AMReX.

WarpX (Vay et al. 2018) is a multilevel electromagnetic PIC code for simulation of plasma accelerators; electrons are modeled as AMReX particles while the electric and magnetic fields are defined (at edges and faces, respectively) on the hierarchical mesh. The PIC algorithm heavily leverages the particle-mesh functionality with particle data deposited onto the mesh and mesh data interpolated to particles in every time step.

The Castro (Almgren et al. 2010) code for compressible astrophysics is part of the ExaStar project and uses the built-in support for subcycling in time. The basic time advance includes explicit hydrodynamics, solution of the Poisson equation for self-gravity, and time integration of stiff nuclear reaction networks.

Nyx (Almgren et al. 2013) also solves the equations of compressible hydrodynamics with self-gravity, but here coupled with an N-body representation of dark matter in the context of an evolving universe. Dark matter particles interact with each other and with the mesh-based fluid only through gravitational forcing. Time integration of the stiff source terms is typically done with a call to CVODE, a solver provided by the SUNDIALS project (Hindmarsh et al. 2005), one of the ECP Software Technology projects.

The MFiX-Exa multiphase modeling code also heavily leverages both the mesh and particle functionality; the particles represent solid particles within a gas. Unlike the applications mentioned above, in MFiX-Exa particle/particle interactions play a major role in the dynamics. Typical MFiX-Exa applications take place inside non-rectangular geometries represented using the EB data structures. Multigrid solvers are used to solve for the dynamic pressure field in a projection formulation and for the implicit treatment of viscous terms.

The compressible combustion code, PeleC, and the low Mach number combustion modeling code, PeleLM, are both based on AMReX. Both use the EB methodology to represent the problem geometry, and possibly CVODE to evolve the chemical kinetics. Like MFiX-Exa, PeleLM uses the multigrid solvers to solve for the dynamic pressure field in a projection formulation and for the semi-implicit treatment of viscous terms. Particles can be used both as tracer particles and to represent sprays.

The ExaWind project combines AMR-Wind, an AMReX-based multilevel structured flow solver, with Nalu-Wind, an unstructured flow solver. The flow solvers are coupled using an overset mesh approach handled by the TIOGA library. Both Nalu-Wind and AMR-Wind solve the incompressible Navier-Stokes equations with additional physics to model atmospheric boundary layers. In a wind-farm simulation Nalu-Wind is designed to resolve the complicated geometry and flow near the wind turbine blades while AMR-Wind solves for the flow in the full domain away from the turbines.

One of the codes in the ExaAM project, TruchasPBF, is based on AMReX. TruchasPBF is a finite volume code that models free surface flows with heat transfer, phase change and species diffusion. TruchasPBF is used for continuum modeling of melt pool physics.

Creating and managing the grid hierarchy
Block-structured AMR codes define the solution across a hierarchical representation of different levels of resolution. At each level, the solution is defined over a region represented by the union of a collection of non-overlapping rectilinear regions with associated data containers. These data containers can hold mesh data, particles, information describing an embedded boundary geometry or other data used to represent the solution. In this section, we introduce the data structures that describe the grid hierarchy and how the hierarchy is generated.

Basic data structures
AMReX uses a number of data abstractions to represent the grids at a given level. The basic objects are:

1. IntVect: a dimension-sized list of integers representing a spatial location in index space. The use of IntVect allows the basic grid description to be dimension-independent.

2. Box: a logically rectangular (or rectangular cuboid, in 3D) region of cells defined by its upper and lower corner indices. A Box also has an IndexType, a specialized IntVect that describes whether the indices refer to cell centers, faces, edges or nodes.

3. BoxArray: a collection of non-intersecting Boxes at the same resolution. Each individual Box in a BoxArray can be stored, accessed, copied and moved independently of the other boxes.

4. DistributionMapping: a vector that lists the MPI rank that the data associated with each Box in a BoxArray will be defined on.

In an AMReX application, the BoxArray and DistributionMapping at each level describe how the data at that level is decomposed and where it is allocated. The relationship between boxes at different levels, and thus the resolution of different levels, is defined by a refinement ratio. In a single application, the refinement ratio between any two levels may differ, but it must be prescribed at runtime. To facilitate efficient communication across MPI ranks, the BoxArray and DistributionMapping at each level, also referred to as the metadata, are stored on every MPI rank.
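As a concrete illustration (a minimal sketch, not taken from an application code; the 3D domain size and the 32-cell maximum box size are arbitrary choices), the following shows how these objects are typically composed to define the layout of a single level.

#include <AMReX_Box.H>
#include <AMReX_BoxArray.H>
#include <AMReX_DistributionMapping.H>

using namespace amrex;

void build_level_layout ()
{
    // Index space for one level: cells (0,0,0) through (127,127,127).
    Box domain(IntVect(0,0,0), IntVect(127,127,127));

    // Decompose the domain into a union of non-overlapping Boxes,
    // each no longer than 32 cells in any direction.
    BoxArray ba(domain);
    ba.maxSize(32);

    // Assign each Box in the BoxArray to an MPI rank (default strategy).
    DistributionMapping dm(ba);
}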

AMReX also includes distinct objects commonly used to describe other critical simulation parameters. Of particular importance are:

• Geometry: stores and calculates parameters that describe the relationship between index space and physical space. The Geometry object holds information that is the same at all levels, such as periodicity and coordinate system, but also level-specific information such as mesh size. A multi-level AMReX-based simulation will typically have a Geometry object at each level (see the sketch following this list).


• BCRec: a multi-dimensional integer array that defines the physical boundary conditions for the domain. A multi-level AMReX-based simulation will only need one BCRec.
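A hedged sketch of constructing a Geometry for one level is shown below; the physical extent, the periodicity choices, and the exact constructor overload (which has varied slightly across AMReX releases) are illustrative assumptions, and a 3D build is assumed.

#include <AMReX_Geometry.H>

using namespace amrex;

// Map the index-space "domain" Box onto the physical unit cube,
// periodic in x and y only.
Geometry make_level_geometry (const Box& domain)
{
    RealBox real_box({0.0, 0.0, 0.0}, {1.0, 1.0, 1.0});  // physical extent
    Array<int,3> is_periodic{1, 1, 0};                    // periodic in x and y
    const int coord = 0;                                  // 0 = Cartesian
    return Geometry(domain, real_box, coord, is_periodic);
}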

Regridding and load balancing
AMR algorithms dynamically change the resolution of the mesh to reflect the changing requirements of the problem being solved. AMReX provides support for the creation of new grids or the modification of existing grids (regridding) at level ℓ + 1 and above at user-specified intervals. To do so, an error estimate is used to tag individual cells above a given tolerance at level ℓ, where the error is defined by user-specified routines. Typical error criteria might include first or second derivatives of the state variables or some type of application-specific heuristic.

The tagged cells are grouped into rectangular grids at level ℓ using the clustering algorithm given in Berger and Rigoutsos (1991). These rectangular patches are refined to form the grids at level ℓ + 1. Large patches are broken into smaller patches for distribution to multiple processors based on a user-specified max_grid_size parameter; AMReX enforces that no grid is longer in any direction than max_grid_size cells. AMReX also provides the option to enforce that the dimensions of each newly created grid are divisible by a specific factor, blocking_factor; this is used to improve multigrid efficiency for single-level solves. (In this case the grid creation algorithm is slightly different; we refer the interested reader to the AMReX documentation: Zhang et al. (2019, 2020).) The refinement process is repeated until either an error tolerance criterion is satisfied or a specified maximum level is reached. Finally, we impose the proper nesting requirements that enforce that level ℓ + 1 is strictly contained in level ℓ except at physical boundaries. A similar procedure is used to generate the initial grid hierarchy. Once created, the BoxArray of new grids at each level is broadcast to every MPI rank.
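As an illustration of the user-specified tagging step, the sketch below shows the general shape of an ErrorEst override in an AmrCore-derived class; the per-level density MultiFab, the class name MyAmr, and the threshold value are hypothetical, and a real criterion would be application specific.

void MyAmr::ErrorEst (int lev, TagBoxArray& tags, Real /*time*/, int /*ngrow*/)
{
    const Real threshold = 1.0;  // hypothetical tagging threshold

    // "density" is assumed to be a Vector<MultiFab> member, one per level.
    for (MFIter mfi(density[lev]); mfi.isValid(); ++mfi)
    {
        const Box& bx      = mfi.validbox();
        auto const& rho    = density[lev].const_array(mfi);
        auto const& tagarr = tags.array(mfi);

        ParallelFor(bx, [=] AMREX_GPU_DEVICE (int i, int j, int k)
        {
            // Tag cells whose value exceeds the threshold for refinement.
            if (rho(i,j,k) > threshold) {
                tagarr(i,j,k) = TagBox::SET;
            }
        });
    }
}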

When creating a DistributionMapping associated with a BoxArray at a level it is important to equidistribute the work to the nodes. Some type of cost function that is assumed or prescribed by the user is used to estimate the work. The default methodology assumes a cost function proportional to the number of cells per grid, and implements a Morton-ordering space filling curve (SFC) algorithm to define an initial data distribution across the MPI ranks. Once data has been allocated, applications can calculate an improved DistributionMapping throughout runtime based on any desired load-balancing parameter, such as data size, an estimation of the work, a user-defined cost function or a timer.

Prescribed cost functions are often less than optimal. For GPUs, both CUPTI timers and efficient hand-written timers have been developed to accurately capture device kernel times. This excludes launch delays and other latencies from the resulting timings, which would otherwise prevent an accurate distribution.

AMReX provides a knapsack algorithm in addition to a space-filling curve algorithm for load balancing simulations across MPI ranks. The knapsack algorithm creates the best distribution possible with respect to the load-balancing parameter, while the SFC algorithm tries to build contiguous spatial regions with near-uniform costs. See Figure 1 for an illustration of how the two approaches distribute data.
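The sketch below illustrates how an application might construct a new DistributionMapping from a vector of per-box costs using either strategy; the cost vector is a stand-in for whatever measurement the application collects, and the exact helper signatures may differ slightly between AMReX versions.

// "ba" is the level's BoxArray; cost[i] holds the measured or estimated
// work for box i (how it is filled is application specific).
Vector<Real> cost(ba.size(), 1.0);

// Pack boxes onto ranks so that the summed cost per rank is as even as
// possible (knapsack) ...
DistributionMapping dm_knapsack = DistributionMapping::makeKnapSack(cost);

// ... or keep spatially adjacent boxes on the same rank while balancing
// cost along a space-filling curve (SFC).
DistributionMapping dm_sfc = DistributionMapping::makeSFC(cost, ba);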

Figure 1. Comparison of the data distribution from the knapsack load balancing algorithm (left) and SFC (right). Image courtesy of Michael Rowan.

Mesh data and iterators
At the core of the AMReX software is a flexible set of data structures that can be used to represent block-structured mesh data in a distributed memory environment. Multilevel data on the AMR grid hierarchy are stored in a vector of single level data structures. In this section we describe the data containers for mesh data. We also discuss operations supported on these data structures including iterators for operations at a level, a communications layer to handle ghost cell exchange and data distribution, and tools for operations between levels.

Containers for mesh data
The basic single level data structure in AMReX is a C++ class template, FabArray<FAB>, a distributed data structure for a collection of FABs, one for each grid. The template parameter FAB is usually a multidimensional array class such as FArrayBox for floating point numbers. Most AMReX applications use a specialization of FabArray, called a MultiFab, that is used for storing floating point data on single level grids. However, FabArray can also be used to store integers, complex numbers or any user defined data structures.

An FArrayBox contains the Box that defines its valid region, and a pointer to a multidimensional array. For three-dimensional calculations, the data accessed by the pointer in an FArrayBox or its base class BaseFab<T> is a four-dimensional array. The first three dimensions correspond to the spatial dimensions; the fourth dimension is for components. An important feature of an FArrayBox is that extra space is allocated to provide space for ghost cell data, which is often required for efficient stencil operations. Ghost cell data are typically filled by interpolating from coarser data, copying from other FArrayBoxes at the same level, or by imposing boundary conditions outside the domain. The data in an FArrayBox (including ghost cells) are stored in a contiguous chunk of memory with the layout of the so-called struct of arrays (SoA) (i.e., struct of 3D arrays, one for each component).

AMReX provides a class template, Array4<T>, that can be used to access the data in an FArrayBox or other similar objects. This class does not own the data and is not responsible for allocating or freeing memory; it simply provides a reference to the data. This property makes it suitable for being captured by a C++ lambda function without memory management concerns. The Array4 class has an operator() that allows the user to access it with Fortran-like multi-dimensional array syntax. The SoA layout of FArrayBox is typically a good choice for most applications because it provides unit stride memory access for common operations. Nevertheless, some algorithms (e.g., discontinuous Galerkin methods) perform computations that involve relatively large matrices within a single element, and an array of structs (AoS) is a better layout for performance. This can be achieved with BaseFab<T>, where T is a struct.

Iterators
AMReX employs an owner-computes rule to operate on single AMR level data structures such as MultiFabs and FabArrays. A data iterator, MFIter, is provided to loop over the single box data structures such as FArrayBoxes and BaseFabs in a MultiFab / FabArray with the option of logical tiling. The iterator can be used to access data from multiple MultiFabs / FabArrays built on the same BoxArray with the same DistributionMapping.

For CPU computing, tiling provides the benefit of cache blocking and parallelism for OpenMP threading. The iterator performs a loop transformation, thereby improving data locality and relieving the user from the burden of manually tiling (Zhang et al. 2016). For GPU computing, tiling is turned off by default because exposing more fine-grained parallelism at the cell level and reducing kernel launch overhead are more important than cache blocking.

Operations on a level
Many AMReX-based applications use stencil operations that require access to the data on neighboring and/or nearby cells. When that data lies outside the valid region of the patch being operated on, pre-filling ghost cells is an efficient way to optimize communication. The Boxes stored in FabArrays determine the "valid" region. The multidimensional array associated with each FArrayBox covers a region larger than the "valid" region by nGrow cells on the low and high sides in each direction, where nGrow is specified when the MultiFab is created. AMReX provides basic communication functions for ghost cell exchanges of data in the same FabArray. It also provides routines for copying data between two different FabArrays at the same level. These routines use the metadata discussed in the previous section to compute where the data is located. Data between different MPI ranks are organized into buffers to reduce the number of messages that need to be sent. Techniques to optimize the communication are discussed in more detail below.
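A minimal sketch of these two communication operations is shown below; mf, dst, src, and geom are assumed to be a ghosted MultiFab, destination and source MultiFabs at the same level, and the level's Geometry, and the exact overloads (ParallelCopy was named copy in older releases) may vary between versions.

#include <AMReX_MultiFab.H>
#include <AMReX_Geometry.H>

using namespace amrex;

void exchange_and_copy (MultiFab& mf, MultiFab& dst, const MultiFab& src,
                        const Geometry& geom)
{
    // Fill mf's ghost cells from valid data elsewhere on the same level,
    // honoring periodic boundaries.
    mf.FillBoundary(geom.periodicity());

    // Copy overlapping valid data from src into dst; any required MPI
    // communication is handled internally.
    dst.ParallelCopy(src);
}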

Operations between levels
AMReX also provides functions for common communication patterns associated with operations between adjacent levels. The three most common types of inter-level operations are interpolation from coarse data to fine, restriction of fine data to coarse, and explicit refluxing at coarse/fine boundaries.

Interpolation is needed to fill fine level ghost cells that are inside the domain but can't be filled by copying from another patch at the same level, and is also needed in regridding to fill regions at the fine level that weren't previously contained in fine level patches. Note that to avoid unnecessary communication we only obtain coarse data to interpolate to the fine level when fine data is not available. Restriction is typically used to synchronize data between levels, replacing data at a coarser level with an average or injection of fine data where possible. A number of interpolation and restriction operations are provided within AMReX for cell-centered, face-centered or nodal data; a developer can also create their own application-specific interpolation or restriction operation. For these operations, necessary communication is done first and then interpolation/restriction is done locally, making it straightforward for users to provide their own functions. AMReX only needs to know the width of the stencil to gather the necessary amount of data.
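As a concrete example of the restriction step, the sketch below uses the average_down utility to replace coarse data with a conservative average of the overlying fine data; the component count and the refinement ratio of 2 are illustrative, and a 3D build is assumed.

#include <AMReX_MultiFabUtil.H>

using namespace amrex;

void restrict_level (const MultiFab& fine, MultiFab& crse, int ncomp)
{
    const IntVect ratio(2, 2, 2);         // refinement ratio between the two levels
    amrex::average_down(fine, crse,       // fine source, coarse destination
                        0, ncomp, ratio); // starting component, #components, ratio
}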

Refluxing effectively synchronizes explicit fluxes across coarse/fine boundaries, and is relevant only for cell-centered data being updated with face-centered fluxes. Specialized data structures, referred to as FluxRegisters, hold the differences between fluxes on a coarse level and time- and space-averaged fine level fluxes. Once the coarse and fine level have reached the same simulation time, the data in the FluxRegisters are used to update the data in the coarse cells immediately next to the coarse-fine boundaries but not covered by fine cells. This operation is common in multi-level algorithms with explicit flux updates. FluxRegisters are also used to hold data needed to synchronize implicit discretizations of parabolic and elliptic equations.

Communication
AMReX grids can have a complicated layout, which makes the communication meta-data needed for ghost cell exchange and inter-level communication non-trivial to construct. AMReX uses a hash-based algorithm to perform intersections between Boxes. The hash is constructed in an O(N) operation, where N is the number of global Boxes, and cached for later use. It then takes only O(n) operations to construct the list of source and destination Boxes for communication, where n is the number of local Boxes. Furthermore, the communication meta-data is also cached for reuse.

AMReX supports GPU-aware MPI on GPU machines, if available. To reduce latency, the data that need to be communicated through MPI are aggregated into communication buffers. Packing the buffer requires copying data from slices of multi-dimensional arrays to the one-dimensional buffer, whereas unpacking the buffer copies data from the one-dimensional buffer back to slices of multi-dimensional arrays. The straightforward approach to this copying is to launch a GPU kernel for each slice. Unfortunately this simple approach is very expensive because there are often hundreds of very small GPU kernels. In AMReX, we have implemented a fusing mechanism that merges all these small GPU kernels into one kernel, thus significantly reducing kernel launch overhead.

Performance portability

Performance portability across different architectures has been identified as a major concern for the ECP. Most AMReX-based application codes either use or have plans to use a variety of platforms, including individual workstations, many-core HPC platforms such as NERSC's Cori, and accelerator-based supercomputers such as Summit and Frontier at OLCF and Aurora at ALCF. In this section we discuss various features that enable applications to obtain high performance on different architectures without substantial recoding.

On-node parallelism
AMReX is based on a hierarchical parallelism model. At a coarse-grained level, the basic AMReX paradigm is based on distribution of one or more patches of data to each node with an owner-computes rule to allocate tasks between nodes. For many use cases, a node is divided into a small number of MPI ranks and the coarse-grained distribution is over MPI ranks. For example, on a system with 6 GPUs per node, the node would typically have 6 MPI ranks. The nodes on modern architectures all have hardware for parallel execution within the node but there is considerable variability in the details of intranode parallel hardware. For code executing on CPUs, AMReX supports logical tiling for cache re-use using OpenMP threading. Tile size can be adjusted at run time to improve cache performance; tile size can also vary between operations. AMReX includes both a standard synchronous strategy for scheduling tiles as well as an asynchronous scheduling methodology. AMReX also provides extensive support for kernel launching on GPU accelerators (using C++ lambda functions) and for the effective use of managed memory, which allows users to control where their data is stored. While much of the internal AMReX functionality currently uses CUDA/HIP/DPC++ for maximum performance on current machines, AMReX supports the use of CUDA, HIP, DPC++, OpenMP or OpenACC in AMReX-based applications. Specific architecture-dependent aspects of the software for GPUs are highly localized, enabling AMReX to easily support other GPU architectures.

To isolate application code from the nuances of a particular architecture, AMReX provides an abstraction layer consisting of a number of ParallelFor looping constructs, similar to those provided by Kokkos (Carter Edwards et al. 2014) or RAJA (Beckingsale et al. 2019) but tailored to the needs of block-structured AMR applications. Users provide a loop body as a C++ lambda function that defines the task to be performed over a set of cells or particles. The lambda function is then used in a kernel with a launching mechanism provided by CUDA, HIP or DPC++ for GPU or standard C++ for CPU. The details of how to loop over the set of objects, and in particular how to map iterations of the loop to hardware resources such as GPU threads, are hidden from users. By separating these details from the loop body, the same user code can run on multiple platforms. In addition to 1D loops, users can also specify 3D and 4D loops by passing in an amrex::Box as the loop bounds.

Internally, the ParallelFor construct in AMReX is implemented using CUDA, HIP, and DPC++ as the parallel backend. A key difference between AMReX's ParallelFor functions and those provided by Kokkos or RAJA is the way OpenMP is handled. When using multi-threading, the AMReX ParallelFor does not include any OpenMP directives (it does, however, incorporate vectorization through #pragma simd). Rather, OpenMP is incorporated at the MFIter level, as shown in Listing 1. When this code is compiled to execute on CPUs with OpenMP, the MFIter loop includes tiling and the ParallelFor translates to a serial loop over the cells in a tile. However, when compiling for GPU platforms, tiling at the MFIter level is switched off, and the ParallelFor translates to a GPU kernel launch.

#pragma omp parallel if (Gpu::notInLaunchRegion())
for (MFIter mfi(mf, true); mfi.isValid(); ++mfi) {
    Array4<Real> const& fab = mf.array(mfi);
    ParallelFor(mfi.tilebox(),
    [=] AMREX_GPU_DEVICE (int i, int j, int k) {
        fab(i,j,k) *= 3.0;
    });
}

Listing 1: Example of ParallelFor. This code can be compiled to run on CPU with OpenMP or GPU with CUDA, HIP, or DPC++.

Parallel Reductions
Parallel reductions are commonly needed in scientific computing. For example, in an explicit hydrodynamics solver, one needs to calculate the time step by examining the CFL constraint in each cell and then finding the minimum over all cells. Similarly, in linear solvers, one needs to measure norms of the global residual. Performing reductions efficiently in parallel in a way that is portable between CPU and various GPU platforms is non-trivial. To aid in this task, AMReX provides generic functions for performing reduction operations in a performance-portable way. These functions can be used at the level of contiguous arrays by passing in data pointers, or they can work on higher-level AMReX data containers to e.g. perform reductions over all the cells in a MultiFab, or all the particles in a ParticleContainer (see Particles for more information about particles in AMReX).

A feature of the AMReX implementation of parallel reduce is that we provide an API for performing multiple reductions in one pass using a ReduceTuple data type. See Listing 2 for a code example. The code in this snippet computes the Sum, Min, and Max of all the Real data in a MultiFab, as well as the Sum of all the int data in an iMultiFab (Long is used in the reduction to avoid overflow). Any arbitrary combination of data types and reduction operators can be applied in this manner. When running on GPUs, all these operations would be done in a single kernel launch.

ReduceOps<ReduceOpSum, ReduceOpMin, ReduceOpMax, ReduceOpSum> reduce_op;
ReduceData<Real, Real, Real, Long> reduce_data(reduce_op);
using ReduceTuple = typename decltype(reduce_data)::Type;

for (MFIter mfi(mf); mfi.isValid(); ++mfi)
{
    const Box& bx = mfi.validbox();
    auto const& fab = mf.array(mfi);
    auto const& ifab = imf.array(mfi);
    reduce_op.eval(bx, reduce_data,
    [=] AMREX_GPU_DEVICE (int i, int j, int k) -> ReduceTuple
    {
        Real x = fab(i,j,k);
        Long ix = static_cast<Long>(ifab(i,j,k));
        return {x, x, x, ix};
    });
}

Listing 2: Performing parallel reduce on mixed data types.

Note that the code in the above snippet does not perform MPI-level reductions. If that is needed, for convenience AMReX provides wrappers such as ParallelDescriptor::ReduceRealMin and ParallelDescriptor::ReduceLongSum, so that the same code can work whether or not it is compiled with MPI support.
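For completeness, a hedged sketch of finishing the reduction from Listing 2 is shown below: the per-rank results are fetched on the host and then combined across MPI ranks. Accessor names such as value() may differ slightly between AMReX versions.

// Assumes the reduce_op / reduce_data / ReduceTuple objects from Listing 2.
ReduceTuple host_tuple = reduce_data.value();  // waits for the device work

Real sum = amrex::get<0>(host_tuple);
Real mn  = amrex::get<1>(host_tuple);
Real mx  = amrex::get<2>(host_tuple);
Long cnt = amrex::get<3>(host_tuple);

// Combine the per-rank results; these calls are no-ops without MPI.
ParallelDescriptor::ReduceRealSum(sum);
ParallelDescriptor::ReduceRealMin(mn);
ParallelDescriptor::ReduceRealMax(mx);
ParallelDescriptor::ReduceLongSum(cnt);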

Memory management
Allocating and deallocating memory can be extremely costly, especially on CPU-GPU platforms where data needs to be copied back and forth between multiple memory spaces. Furthermore, different architectures support different types of memory models. AMReX includes a variety of memory pools, referred to as Arenas, to improve the performance of memory activities and properly track and handle memory allocations. Arenas allocate a large block of memory during initialization and provide data pointers to pieces of that pre-allocated space as needed. This minimizes expensive allocation calls, provides a portability wrapper around memory operations and allows tracking of memory use across a simulation. In most cases, Arenas are used automatically according to AMReX's memory strategy. For example, when creating a new MultiFab, the mesh data is allocated with the default AMReX Arena, The_Arena, unless otherwise specified by the user.

Based on the system architecture, AMReX predefines an Arena for each available memory type during initialization. A variety of Arena types are provided, including slab allocation, buddy memory and best-fit, and they can be targeted to any desired memory type, which currently includes CPU memory, pinned memory, GPU device memory and GPU managed memory. Users can build their own Arenas, either to track a particular subset of memory usage or to manage memory according to a user-specific strategy. AMReX also provides specialized std::vector variations that track particle memory use as well as implement an improved memory allocation strategy.

AMReX's default memory management strategy for CPU-GPU systems is based on placing floating point data on the device and moving it as little as possible. Using this paradigm, metadata is usually allocated on the CPU to facilitate efficient host side solution control and pinned memory is used for temporary data that will be transferred between host and device. This memory management strategy is implemented automatically, and users only interact directly with Arenas when it becomes necessary to deviate from this strategy. AMReX Arenas include a robust API, so manually controlling such deviations is simple. For example, by default mesh and particle data are allocated in managed memory, due to its convenience to developers and negligible overhead. But this default can be overridden for any specific MultiFab, iMultiFab or ParticleContainer object during the object's construction. The default memory pool can also be changed to traditional device memory through a runtime flag. Additional runtime options include setting the initial allocation sizes of the Arenas and aborting if device memory is oversubscribed, allowing users to adjust the AMReX memory management paradigm for their unique needs.
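The sketch below shows the kind of override described above: one MultiFab allocated through the default Arena and one placed explicitly in pinned host memory via MFInfo. The component and ghost-cell counts are arbitrary, and the BoxArray and DistributionMapping are assumed to exist.

#include <AMReX_MultiFab.H>
#include <AMReX_Arena.H>

using namespace amrex;

void allocate_fields (const BoxArray& ba, const DistributionMapping& dm)
{
    const int ncomp = 4;
    const int ngrow = 2;

    // Uses the default Arena (The_Arena) chosen by AMReX's memory strategy.
    MultiFab state(ba, dm, ncomp, ngrow);

    // Explicitly place this MultiFab in pinned (page-locked) host memory,
    // e.g. as a staging buffer for host/device transfers.
    MultiFab staging(ba, dm, ncomp, ngrow,
                     MFInfo().SetArena(The_Pinned_Arena()));
}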

Example
As an illustration of the performance of AMReX on a simple example, we carried out a weak scaling study on Summit (https://www.olcf.ornl.gov/summit/) using an explicit hydrodynamics code for solving the compressible Euler equations. The time-stepping strategy uses a second-order Runge-Kutta method on each level with subcycling in time between levels. A piecewise linear reconstruction method with a monotonized central slope limiter is used to extrapolate the primitive variables to cell faces, and an approximate Riemann solver based on the two-shock approximation is employed to compute fluxes. We have performed a series of tests simulating the Rayleigh-Taylor instability. The runs have a base level and two refined levels, each with a refinement ratio of 2. On each node, there are 128 × 128 × 256 cells on level 0. For multi-node runs, the domain is replicated in the x and y-directions. In the initial model, levels 1 and 2 occupy 12.5% and 6.25% of the domain, respectively. Regridding is performed every 2 steps on each fine level.

Figure 2 shows the results of a series of runs on Summit. We define the speedup as the time per step using CPUs only divided by the time per step using 6 GPUs per node. For single node runs, there is a ∼60× speedup for the computational kernels themselves, and a 19× overall speedup with GPUs. For the runs using 4096 nodes (24576 GPUs), a 4.4× speedup is obtained. The reduced speedup is due to increased communication costs relative to compute costs when using GPU accelerators; the effect is so large here because relatively little floating-point work is needed to advance the solution in each cell.

Figure 2. Weak scaling study of a compressible hydrodynamics solver. Time per step is shown for a series of GPU and CPU runs (x-axis: number of nodes, from 1 to 4096; y-axis: time per step). The AMR runs have three total levels.

Particles
In addition to mesh data, many AMReX applications use particles to model at least some portion of the physical system under consideration. At their simplest, particles can be used as passive tracers that follow a flow field, tracking changes to properties such as chemical composition but not feeding back into the mesh-based solution. In most cases, however, the particles in AMReX applications actively influence the evolution of the mesh variables, through, for example, drag or gravitational forces. Finally, particles can directly influence each other without an intervening mesh, for example through particle-particle collisions or DSMC-style random interactions. AMReX provides data containers, iterators, parallel communication routines, and other tools for implementing the above operations on particle data.

Particles present a different set of challenges than structured grid data for several reasons. First, particle data is inherently dynamic - even between AMR regridding operations, particles are constantly changing their positions. Additionally, the connectivity of particle data (e.g. nearest neighbors) is not as simple and in general cannot be inferred from metadata alone - rather, the particle data itself must be examined and processed. Finally, data access patterns are more irregular, in that (unless special sorting is employed) particles that are next to each other in memory do not necessarily interact with the same set of grid cells or with the same set of other particles. In what follows, we describe the particle data structures and supporting operations provided by AMReX, and how they help users manage these challenges.

Data structures and iterators
AMReX provides a ParticleContainer class for storing particle data in a distributed fashion. Particles in AMReX are associated with a block-structured mesh refinement hierarchy based on their position coordinates. Internally, this hierarchy is represented by a Vector<BoxArray> that describes the grid patches on each level, as well as a Vector<DistributionMapping> that describes how those grids are distributed onto MPI ranks. Particles are then assigned to levels, grids and MPI ranks spatially, by binning them on a given level using the appropriate cell spacing and physical domain offset. Storing the particles in this fashion allows them to be accessed and iterated over level-by-level, which is convenient for many AMR time-stepping approaches. Additionally, splitting particles onto grids and then assigning multiple grids per MPI rank allows load balancing to be performed by exchanging grids among the processes until an approximately even number of particles per rank is achieved. Note also that although particles in AMReX always live on a set of AMR grids, this does not need to be the same set of grids that is used to store the mesh data; in principle, the particle grids can be different and can be load balanced separately. See Dual grid approach for load balancing for more information.

Once the particles are stored in the ParticleContainer, they can be iterated over in parallel in a similar manner to the MFIter for mesh data. See Listing 3 for a code sample. Specifically, when an MPI rank executes a ParIter loop, it executes the loop body once for each grid (or tile, if tiling is enabled, see Data Layout) on the level in question. By launching the same compute kernel on each grid, all the particles on the level can be updated in a manner that separates the numerical operations from the details of the domain decomposition. Inside the ParIter, the same ParallelFor constructs can be used to operate on particles in a portable way. The 1D version of ParallelFor, which takes a number of items to iterate over instead of a Box, is particularly useful for particle data.

#pragma omp parallel if (Gpu::notInLaunchRegion())
for (ParIter pti(pc, lev, TilingIfNotGPU());
     pti.isValid(); ++pti)
{
    auto& tile = pti.GetParticleTile();
    const auto np = tile.numParticles();
    auto pdata = tile.getParticleTileData();

    amrex::ParallelFor(np,
    [=] AMREX_GPU_DEVICE (int i)
    {
        some_function(pdata, i);
    });
}

Listing 3: Example of ParIter loop

Data Layout
AMReX supports particle types that are arbitrary mixtures of real (i.e., floating point data of either single or double precision) and integer components. For example, a particle used to represent a star in a 3D N-body calculation might have four real components: a mass and three velocity components, while a particle that represents an ion in a fuel cell might have an integer component that tracks its ionization level. In general, the particles in a given container must all have the same type, but there is no limit to the number of different types of ParticleContainers that can be in the same simulation. For example, a single calculation might have one type of particle that represents collisionless dark matter, another that represents neutrinos, and another that represents active galactic nuclei, each of which would model different physics and exhibit different behavior.

There is also the question of how the various components for the particles on a single grid should be laid out in memory. In an Array-of-Structs (AoS) representation, all the components for particle 1 are next to each other, followed by all the components for particle 2, and so on. In a Struct-of-Arrays (SoA) representation, all the data over all the particles for component 1 would be next to each other, followed by all the data for component 2, and so on. AMReX allows users to specify a mixture of AoS and SoA data when defining the particle type. For example, consider an application where, in addition to position and velocity, each particle stores the mass fraction of a large number of chemical components. For many operations, such as advancing the particle positions in time or assigning particles to the proper grid in the AMR hierarchy, the values of the chemical concentration are irrelevant. If all of these components were stored in the particle struct, the data for the position and velocity of all the particles would be spread out in memory, resulting in an inefficient use of cache for these common operations. By separating out these components from the core particle struct, we can avoid reading them into cache for operations in which they are not needed.
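A sketch of such a mixed layout is shown below; the component counts are purely illustrative (mass plus three velocity components in the particle struct, eight mass fractions kept in SoA arrays), and the alias name is hypothetical.

#include <AMReX_Particles.H>

using namespace amrex;

// Template parameters: <NStructReal, NStructInt, NArrayReal, NArrayInt>,
// i.e. extra real/int components stored in the particle struct (AoS) and
// real/int components stored in separate per-component arrays (SoA).
// Positions (and id/cpu) are always part of the struct.
using SprayParticleContainer = ParticleContainer<4, 0, 8, 0>;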

The AMReX particle containers also support a tiling option. When tiling is enabled, instead of each grid having its own separate set of particle data structures, the grids will be further subdivided into tiles with a runtime-specified size. Unlike with mesh data, with particle data the tiles are not just logical, they actually change the way the data is laid out in memory, with each tile getting its own set of std::vector-like data structures. This tiling strategy can make working sets smaller and more cache-friendly, as well as enable the use of OpenMP for particle-mesh deposition operations, in which race conditions need to be considered (see Particle-Mesh Operations for more information).

Particle Communication
There are two main parallel communication patterns that arise when implementing particle methods with AMReX. The first is redistribution. Suppose that at a given time in a simulation, the particles are all associated with the correct level, grid, and MPI rank. The particle positions are then updated according to some time-stepping procedure. At their new positions, the particles may no longer be assigned to the correct location in the AMR hierarchy. The Redistribute() method in AMReX takes care of assigning the correct level, grid, and tile to each particle and moving it to the correct place in the ParticleContainer, including any necessary MPI communication. We provide two forms of Redistribute(), one in which the particles have only moved a finite distance from the "correct" grids and thus are only exchanged between neighboring MPI ranks, and another where the particles can in principle move between any two ranks in the MPI communicator. The former is the version most often used during time evolution, while the latter is mostly useful after initialization, regridding, or performing a load-balancing operation.
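A minimal usage sketch is given below; move_particles stands in for the application's own particle push and is not an AMReX function.

// pc is a ParticleContainer; dt is the time step.
move_particles(pc, dt);   // hypothetical user routine that updates positions

// Reassign each particle to the correct level, grid, tile, and MPI rank,
// performing any necessary MPI communication.
pc.Redistribute();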

The process of particle/grid assignment is done as follows. First, consider a single level. Assigning a particle to a grid reduces to 1) binning the particle using the level's cell spacing, and 2) identifying the box in the BoxArray that contains the associated cell. Internally, this operation is implemented using a uniform binning strategy to avoid a direct search over all the boxes on the level. By binning boxes using the maximum box size as the bin size, only neighboring bins need to be searched to find the possible intersections for a given cell. Additionally, this operation can be performed purely on the GPU by using a flattened map data structure to store the binned BoxArray.

To extend this operation to multiple levels, we consider each level in a loop, from finest to coarsest. Usually, we associate the particles with the finest level possible. The exception is when subcycling in time is involved. In this case, we provide the option to exclude certain levels from the Redistribute() operation, and to leave the particles alone if they are less than a specified number of cells outside the coarse/fine boundary.

The second communication pattern that comes up for particle data is the halo pattern, similar to filling ghost cells for mesh data. In this operation, each grid or tile obtains copies of particles that are within some number of cells outside its valid region. This is commonly needed in applications with direct particle-particle collisions, since there is no intermediate grid that can be used to communicate accumulated values (in contrast to the pattern described in Particle-Mesh Operations). The AMReX terminology for these copies of particles from other grids is NeighborParticles. As with Redistribute(), any necessary parallel communication is handled automatically, and when AMReX is compiled with GPU support the operation runs entirely on the device.

When using neighbor particles, it is common to compute a halo of particles (i.e., identify which particles need to be copied to which other grids) once, and then reuse it for several time steps before computing a new halo. To enable this, AMReX separates computing the halo and filling it with data for the first time (the FillNeighbors operation) from reusing the existing halo but replacing the data with the most recent values from the "valid" particles on other grids (the UpdateNeighbors operation). Additionally, AMReX provides a SumNeighbors function that takes data from the ghosted particles and adds the contribution back to the corresponding "valid" particles.

The particle communication routines in AMReX have been optimized for Summit and can weak scale up to nearly the full machine. Figure 3 shows the results of a particle redistribution scaling study on up to 4096 Summit nodes, with and without mesh refinement. In this test, particles are initialized uniformly throughout the domain and then moved in a random direction for 500 time steps, calling Redistribute() after each one. The initial drop in weak scaling efficiency from 1 to ∼16 nodes is due to the fact that in 3D, the average number of communication partners per MPI rank grows until you have ≈27 neighbors per grid. After this point, the scaling is basically flat up to 4096 nodes (24576 MPI tasks). Finally, although this benchmark tests Redistribute() rather than FillNeighbors, the two routines share the same underlying communication code, so the scaling performance of each is similar.

Figure 3. Weak scaling study of an AMReX particle redistribution benchmark on Summit. The x-axis is the number of nodes, while the y-axis shows parallel efficiency, i.e. the run time on 1 node divided by the run time on a scaled up version of the benchmark on N nodes. The total number of nodes on Summit is indicated by the red vertical line. All runs used 6 MPI ranks per node with 1 GPU per MPI rank. Results are shown for both a 1-level and a 3-level version of the benchmark.

Particle-Mesh Operations

Many of the application codes that use AMReX's particle data structures implement some form of particle-mesh method, in which quantities are interpolated between the particle positions and the cells, edges, faces, or nodes of a structured mesh. Example use-cases include Lagrangian tracers (in which this interpolation is one-way only), models of drag forces between a fluid and colloidal particles, and electrostatic and electromagnetic Particle-in-Cell schemes.

To perform these operations in a distributed setting, some type of parallel communication is required. We opt to communicate the mesh data rather than the particle data for these operations, because the connectivity of the mesh cells is simpler and thus the operation can be made more efficient. However, this may not be the best choice when the particles are very sparse and/or the interpolation stencil is very wide. Depending on the type and order of interpolation employed, the mesh data involved in these interpolations requires some number of ghost cells to ensure that each grid has enough data to fully perform the interpolation for the particles inside it. For mesh-to-particle interpolation, we use FillBoundary to update the ghost cells on each grid prior to performing the interpolation. For particle-to-mesh (i.e. deposition), we use the analogous SumBoundary operation after performing a deposition, which adds the values from overlapping ghost cells back to the corresponding valid cell, ensuring that all the particle's contributions are accounted for. We provide generic, lambda-function based ParticleToMesh and MeshToParticle functions for performing general interpolations. Additionally, users can always write their own MFIter loops to implement these operations in a more customized way.

For particle-to-mesh interpolation, an additional factor is that race conditions need to be taken into account when multiple CPU or GPU threads are in use. Our approach to this depends on whether AMReX is compiled for a CPU or GPU platform. When built for CPUs, we use AMReX's tiling infrastructure to assign different tiles to different OpenMP threads. Each thread then deposits its particles into a thread-local buffer, for which no atomics are needed. We then perform a second reduction step where each tile's local buffer is atomically added to the MultiFab storing the result of the deposition. For GPUs, no local buffer is used; each thread performs the atomic writes directly into global memory. This strategy works surprisingly well for NVIDIA V100 GPUs such as those on Summit, provided that we periodically sort the particles on each grid to better exploit the GPU's memory hierarchy.
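The sketch below illustrates the GPU path as a hand-written deposition loop of the kind described above. ParIterType, the container pc, the Geometry geom, the density MultiFab rho, the level lev, and the use of the particle's first real component are all application-side assumptions, and the nearest-grid-point stencil is chosen only for brevity.

const auto plo = geom.ProbLoArray();
const auto dxi = geom.InvCellSizeArray();
for (ParIterType pti(pc, lev); pti.isValid(); ++pti) {
    const auto& aos     = pti.GetArrayOfStructs();
    const auto* pstruct = aos().data();
    const int   np      = pti.numParticles();
    auto rho_arr = rho.array(pti);
    amrex::ParallelFor(np, [=] AMREX_GPU_DEVICE (int i) noexcept
    {
        const auto& p = pstruct[i];
        int ci = static_cast<int>(amrex::Math::floor((p.pos(0) - plo[0]) * dxi[0]));
        int cj = static_cast<int>(amrex::Math::floor((p.pos(1) - plo[1]) * dxi[1]));
        int ck = static_cast<int>(amrex::Math::floor((p.pos(2) - plo[2]) * dxi[2]));
        // atomic update avoids races when many GPU threads deposit into the same cell
        amrex::Gpu::Atomic::AddNoRet(&rho_arr(ci, cj, ck), p.rdata(0));
    });
}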

Figure 4 shows the results of a weak scaling study on a uniform plasma benchmark from the AMReX-based code WarpX (Vay et al. 2018). Of all AMReX users, WarpX places the most stress on the particle-mesh operations, since the deposition of current from macro-particles is often the dominant cost. This benchmark also stresses several parallel communication routines, including particle redistribution, FillBoundary, and SumBoundary.

Additionally, the WarpX project has identified a plasma-acceleration run as its ECP KPP benchmark. At the time of this writing, the figure of merit from this benchmark, measured on 4263 Summit nodes and extrapolated to the full machine, is over 100 times larger than the baseline, which was measured using the original Warp code on Cori.

Figure 4. Weak scaling study of a WarpX uniform plasma benchmark on up to 2048 nodes on Summit. The x-axis is the number of nodes, while the y-axis displays a type of figure of merit: the total number of particles advanced per step per ns. Perfect weak scaling would be indicated by dark dashed lines. Results are shown for both CPU-only and GPU-accelerated runs. The CPU-only runs used all 42 cores per compute node, while the GPU-accelerated runs used all 6 GPUs per node. The overall speedup from using the accelerators is ≈30 at all problem sizes.

Particle-Particle operations

Another common need in AMReX particle applications (for example, MFiX-Exa and the fluctuating hydrodynamics code DISCOS) is to pre-compute a neighbor list, or set of possible collision partners for each particle, over the next N time steps. AMReX provides algorithms for constructing these lists that run on both the CPU and the GPU, based on a user-provided decision function that returns whether a pair of particles belong in each other's lists or not. These algorithms employ a spatial binning method, similar to that used by Redistribute(), that avoids an N² search over the particles on a grid. Tools for iterating over neighbor lists are also provided; see Listing 4 for example code.

amrex::ParallelFor(np,
    [=] AMREX_GPU_DEVICE (int i)
    {
        ParticleType& p1 = parts[i];

        for (const auto& p2 : nlist.getNeighbors(i))
        {
            Real dx = p1.pos(0) - p2.pos(0);
            Real dy = p1.pos(1) - p2.pos(1);
            Real dz = p1.pos(2) - p2.pos(2);
            ...
        }
    });

Listing 4: Iterating over particles in a neighbor list
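For completeness, a sketch of a user-provided decision function is given below; the cutoff radius, the container npc and the exact form of the buildNeighborList interface are assumptions and may differ from the current AMReX API.

amrex::Real cutoff = 1.0e-2;  // assumed interaction radius
auto check_pair = [=] AMREX_GPU_DEVICE (const ParticleType& p1,
                                        const ParticleType& p2) -> bool
{
    amrex::Real dx = p1.pos(0) - p2.pos(0);
    amrex::Real dy = p1.pos(1) - p2.pos(1);
    amrex::Real dz = p1.pos(2) - p2.pos(2);
    return dx*dx + dy*dy + dz*dz <= cutoff*cutoff;  // pair if within the cutoff
};
npc.buildNeighborList(check_pair);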

Parallel Scan

When running on GPUs, many particle operations can be expressed in terms of a parallel prefix sum, which computes a running tally of the values in an array. For example, a commonly-needed operation is to compute a permutation array that puts the particles on a grid into a bin-sorted order based on some user-specified bin size. This array can then be used to re-order the particles in memory, or it can be used directly to access physically nearby particles in an out-of-order fashion. To perform this operation in AMReX, we use a counting sort algorithm, which involves as one of its stages an exclusive prefix sum. Other operations that can be expressed in terms of a prefix sum include partitioning the particles in a grid, various stream compaction operations (filtering, scattering, gathering), and constructing neighbor lists.

AMReX provides an implementation of prefix sum based on Merrill and Garland (2016) that works with CUDA, HIP, and DPC++. Our API allows callables to be passed in that define arbitrary functions to be called when reading in and writing out data. See Listing 5 for an example. This allows for the expression of, e.g., a partition function that avoids unnecessary memory traffic for temporary variables.

PrefixSum<T>(n,
    [=] AMREX_GPU_DEVICE (int i) -> T
    { return in[i]; },
    [=] AMREX_GPU_DEVICE (int i, T const& x)
    { out[i] = x; },
    Type::exclusive);

Listing 5: API for PrefixSum.

Using this scan implementation, AMReX provides tools for performing the above binning and stream compaction operations. These tools are used internally in AMReX when, for example, redistributing particles or constructing neighbor lists. They are also used by AMReX applications, for example to implement DSMC-style collisions where particles in the same bin undergo random collisions, or to model a variety of particle creation processes in WarpX, such as ionization, pair production, and quantum synchrotron radiation.

Dual grid approach for load balancing

In AMReX-based applications that have both mesh data and particle data, the mesh work and particle work have very different requirements for load balancing. Rather than using a combined work estimate to create the same grids for mesh and particle data, AMReX supplies the option to pursue a dual grid approach. With this approach the mesh (MultiFab) and particle (ParticleContainer) data are allocated on different BoxArrays with different DistributionMappings. This enables separate load balancing strategies to be used for the mesh and particle work. An MFiX-Exa simulation, for example, may have very dense and very dilute particle distributions within the same domain, while the dominant cost of the fluid evolution – the linear solver for the projections – is evenly distributed throughout the domain. The cost of this strategy, of course, is the need to copy mesh data onto temporary MultiFabs defined on the particle BoxArrays when mesh-particle communication is required.
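The copy mentioned above can be sketched as follows; here pba and pdm are the particle grids and their distribution (produced by an application-side particle load balancer), and fluid_mf is the mesh data on the fluid grids.

amrex::MultiFab mesh_on_particle_grids(pba, pdm, ncomp, ngrow);
// parallel copy from the fluid-grid layout to the particle-grid layout
mesh_on_particle_grids.ParallelCopy(fluid_mf, 0, 0, ncomp);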

Embedded boundary representation

AMReX supports complex geometries with an embedded boundary (EB) approach that fits naturally within the block-structured AMR approach. In this approach, the problem geometry is represented as an interface that cuts through a regular computational mesh as shown in Figure 5. Thus, although the computational domain has an irregular shape, the underlying computational mesh is uniform at each level of the AMR hierarchy.

Figure 5. Sketch illustrating embedded boundary representation of geometry.

Data structures and algorithms

When an embedded boundary is present in the domain, there are three types of cells: regular, cut and covered, as seen in Figure 5. AMReX provides data structures for accessing EB information, and supports various algorithms and solvers for solving partial differential equations on AMR grids with complex geometries.

The EB information is precomputed and stored in a distributed database at the beginning of the calculation. It is available for AMR meshes at each level and for coarsened meshes in multigrid solvers. The information includes cell type, volume fraction, volume centroid, face area fractions and face centroids. For cut cells, the information also includes the centroid, normal and area of the EB face. Additionally, there is connectivity information between neighboring cells.

The implementation of the EB data structures uses C++ std::shared_ptrs. Regular data structures such as MultiFabs and FArrayBoxes have the option to store a shared copy of the EB data. One can query an FArrayBox to find out if it has any cut or covered cells within a given Box. This allows the user to adapt an existing non-EB code to support EB by adding support to handle regions with EB. Computations with EB are usually not balanced in work load: regions with cut cells are more expensive than regions with only regular cells, whereas the completely covered regions have no work. Applications can measure the wallclock time spent in different regions and use that information for load balancing when regridding is performed.
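One common query pattern is sketched below, assuming the MultiFab mf was built with an EB-aware factory; the cast and method names reflect our understanding of the EB interfaces and should be checked against the current API.

auto const& ebfact = dynamic_cast<amrex::EBFArrayBoxFactory const&>(mf.Factory());
auto const& flags  = ebfact.getMultiEBCellFlagFab();
for (amrex::MFIter mfi(mf); mfi.isValid(); ++mfi) {
    const amrex::Box& bx = mfi.validbox();
    auto t = flags[mfi].getType(bx);
    if (t == amrex::FabType::covered) {
        // box is entirely inside the body: no work
    } else if (t == amrex::FabType::regular) {
        // no cut cells in this box: use the regular kernel
    } else {
        // cut cells present: use the EB-aware kernel
    }
}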

Numerical algorithms that use an embedded boundary typically use specialized stencils in cells containing and near the boundary; in particular they may pay special attention to the treatment of the solution in "small cells" that can arise when the problem domain is intersected with the mesh.

For example, due to the stability constraint for cut cells with very small volume fractions, EB-aware algorithms often need to redistribute part of the update from small cut cells to their neighbors. Furthermore, to maintain conservation when relevant, both the fluxes and redistribution corrections at AMR coarse/fine boundaries need to be synchronized. AMReX provides a class EBFluxRegister for refluxing in solvers for hyperbolic conservation laws. AMReX also provides EB-aware interpolation functions for data on different AMR levels.

Geometry Generation

AMReX provides an implicit function approach for generating the geometry information. In this approach, there is an implicit function that describes the surface of the embedded object. It returns a positive value, a negative value or zero for a given position inside the body, inside the fluid, or on the boundary, respectively. Implicit functions for various simple shapes such as boxes, cylinders, spheres, etc., as well as a spline-based approach, are provided. Furthermore, basic operations in constructive solid geometry (CSG) such as union, intersection and difference are used to combine objects together. Geometry transformations (e.g., rotation and translation) can also be applied to these objects. Figure 6 shows an example of geometry generated with CSG from simple shapes using the implicit function defined in Listing 6. Besides the implicit function based approach, an application code can also use its own approach to generate geometry information and then store it in AMReX's EB database.

EB2::SphereIF sphere(0.5, {0.0, 0.0, 0.0}, false);
EB2::BoxIF cube({-0.4, -0.4, -0.4}, {0.4, 0.4, 0.4}, false);
auto cubesphere = EB2::makeIntersection(sphere, cube);
EB2::CylinderIF cylinder_x(0.25, 0, {0.0, 0.0, 0.0}, false);
EB2::CylinderIF cylinder_y(0.25, 1, {0.0, 0.0, 0.0}, false);
EB2::CylinderIF cylinder_z(0.25, 2, {0.0, 0.0, 0.0}, false);
auto three_cylinders = EB2::makeUnion(cylinder_x, cylinder_y, cylinder_z);
auto csg = EB2::makeDifference(cubesphere, three_cylinders);

Listing 6: Example of constructing a complex implicit function from simple objects (sphere, box and cylinder) combined with CSG operations (intersection, union and difference).
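As a follow-up to Listing 6 (not part of the original listing), the implicit function is typically handed to a geometry shop and then used to build the EB database; the coarsening-level arguments shown here are placeholders.

auto gshop = EB2::makeShop(csg);
EB2::Build(gshop, geom, 0, 30);  // geometry plus required/maximum coarsening levels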

Figure 6. Example of complex geometry generated with CSG from simple shapes.

Mesh pruning

When the simulation domain is contained by EB surfaces, "mesh pruning" can reduce the memory required by eliminating fully covered grids at each level (i.e. grids that are fully outside of the region in which the solution will be computed) from the BoxArray at that level. As long as the MultiFabs holding the solution are created after the BoxArray has been pruned, no data will be allocated in those regions. This capability has been demonstrated for a simplified Chemical Looping Reactor (CLR) geometry by the MFiX-Exa project; see Figure 7 for an example of how mesh pruning becomes even more effective as the base mesh is resolved.

Particle-Wall interactions

In cases where particles interact with EB surfaces, AMReX provides the functionality to build a level set, with values defined at nodes, that represents the distance function from the EB surface. Particle-wall collisions can then be computed based on the level set values at the nodes surrounding the cell holding the particle. In some applications, such as MFiX-Exa, the level set can be defined at a finer level than the rest of the simulation in order to exploit a finer representation of the geometry for calculating collisions.

Linear Solvers

A key feature of a number of applications based on AMReX is that they require the solution of one or more linear systems at each time step. AMReX includes native single-level and multi-AMR-level geometric multigrid and Krylov (CG and BiCG) solvers for nodal or cell-centered data, as well as interfaces to external solvers such as those in hypre (Falgout et al. 2006) and PETSc (Balay et al. 2019). The external solvers can be called at the finest multigrid level, or as a "bottom solve" where the native solver coarsens one or more levels, then calls the external solver to reduce the residual by a specified tolerance at that level.

Discretizations include variable coefficient Poisson and Helmholtz operators as well as the full viscous tensor for fluid dynamics (requiring solution of a multi-component system). The AMReX multigrid solvers include aggregation (merging boxes at a level in the multigrid hierarchy to enable additional coarsening within the V-cycle) and consolidation (reducing the number of ranks to reduce communication costs at coarser multigrid levels) strategies to reduce total cost.

Figure 7. Mesh pruning of a Chemical Looping Reactor geometry. Shown here is the EB geometry and the individual grids at level 0 for a relatively coarse simulation. Courtesy of MFiX-Exa team.

The linear solvers in AMReX also support complex geometries represented using the EB approach. For cell-centered data, the stencil uses EB information such as the boundary normal, and employs a geometric coarsening strategy. The nodal linear solver in AMReX is based on a finite-element approach. The construction of the stencil uses various pre-computed moment integrals (e.g., ∫∫∫ x^α y^β z^γ dx dy dz), with an algebraic multigrid approach to coarsening the operator.

Figure 8 shows a weak scaling study of the cell-centered linear solve for

aαφ − b∇·(β∇φ) = rhs,   (1)

where φ is the unknown defined at cell centers, a and b are scalars, α is defined at cell centers and may vary in space, and β is a variable coefficient defined on cell faces. The calculations were performed on Summit with 6 GPUs per node, and on Cori Haswell nodes with 32 CPU cores per node. The weak scaling tests had 256³ cells per node, so the 4096-node jobs on Summit contained 4096³ cells. On a single node, the GPU run was 6 times faster than the CPU run. The scaling behavior is consistent with the characterization that multigrid solvers are communication heavy, especially given the high floating-point throughput of the GPUs.
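A minimal single-level sketch of setting up and solving Eq. (1) with the native solvers is shown below; geom, grids, dmap, the coefficient data alpha (cell-centered) and beta (an array of face-centered MultiFabs), phi, rhs and the all-Dirichlet boundary choice are illustrative assumptions.

amrex::MLABecLaplacian mlabec({geom}, {grids}, {dmap});
mlabec.setDomainBC({AMREX_D_DECL(amrex::LinOpBCType::Dirichlet,
                                 amrex::LinOpBCType::Dirichlet,
                                 amrex::LinOpBCType::Dirichlet)},
                   {AMREX_D_DECL(amrex::LinOpBCType::Dirichlet,
                                 amrex::LinOpBCType::Dirichlet,
                                 amrex::LinOpBCType::Dirichlet)});
mlabec.setLevelBC(0, &phi);                              // boundary values on this level
mlabec.setScalars(a, b);                                 // the scalars in Eq. (1)
mlabec.setACoeffs(0, alpha);                             // cell-centered coefficient
mlabec.setBCoeffs(0, amrex::GetArrOfConstPtrs(beta));    // face-centered coefficients
amrex::MLMG mlmg(mlabec);
mlmg.solve({&phi}, {&rhs}, 1.e-10, 0.0);                 // relative and absolute tolerances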

Figure 8. Weak scaling study of the linear solver (solve time versus number of nodes) on Summit and on Cori Haswell. There are 256³ cells per node.

I/O

AMReX provides a native file format for plotfiles that store the solution at a given time step for visualization. Plotfiles use a well-defined format that is supported by a variety of third-party tools for visualization. AMReX provides functions that directly write plotfiles from mesh and particle objects. Mesh and particle plotfiles are written independently for improved flexibility and performance. AMReX also provides users the option to use HDF5 for data analysis and visualization.
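As an example of the plotfile interface, a single-level mesh plotfile can be written with one call; the file name, variable names, time and step below are placeholders, and mf is assumed to hold two components.

amrex::WriteSingleLevelPlotfile("plt00000", mf,
                                {"density", "pressure"},
                                geom, time, step);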

Writing a plotfile requires coordination between MPI ranks to prevent overwhelming the I/O system with too many simultaneous writes to the file system. AMReX has implemented multiple output methodologies to provide efficient I/O across a variety of applications and simulations. A static output pattern writes in a pre-determined order that eliminates unnecessary overhead, which is useful for well-balanced or small simulations. A dynamic output pattern improves write efficiency for complex cases by assigning ranks to coordinate the I/O in a task-like fashion. Finally, asynchronous output assigns the writing to a background thread, allowing the computation to continue uninterrupted while the write is completed on a stored copy of the data. The I/O output methodology and other standard features, such as the number of simultaneous writes, can be chosen through run-time and compile-time flags.

Asynchronous I/O is currently the targeted I/O method for exascale systems because it is a portable methodology that substantially reduces the I/O impact on total run-time. AMReX's native Async I/O has reduced write time by a factor of around 80 on full-scale Summit simulations, as shown in Figure 9. However, asynchronous I/O requires a scalable MPI_THREAD_MULTIPLE implementation to achieve the best results. Scalable Async I/O requires one thread passing zero-size messages while another thread performs its own communication with as little overhead as possible. On systems where MPI_THREAD_MULTIPLE implementations scale poorly, asynchronous I/O may not be the best choice.

Figure 9. Weak scaling study of possible MPI implementations for Async I/O. Async I/O requires a zero-size message between I/O ranks. A variety of approaches were tested while AMReX reduction calls (amrex::Min() and amrex::Max()) were invoked continuously on another thread to test MPI_THREAD_MULTIPLE overhead. The results show that only MPI_Ibarrier yields scalable results; MPI_Ibarrier's performance is equivalent to less than 2% overhead on full-scale Summit.

AMReX also provides the infrastructure, examples and documentation to allow applications to write checkpoint files for simulation restarts. Each application requires a unique set of data to restart successfully that is not known to AMReX, such as run-time determined constants or any pre-calculated data sets that cannot be easily reconstructed. Therefore, each application has to build its own checkpoint system. However, numerous examples and helpful functions are provided to help facilitate its construction.

Visualization

The AMReX plotfile format is supported by ParaView, VisIt and yt. AMReX also supports a native visualization tool, AMRVis, which users and developers find useful for testing, debugging and other quick-and-dirty visualization needs. The AMReX team is working with the ALPINE and Sensei teams, part of ECP Software Technologies, to provide support for in situ data analysis and visualization; see, e.g., Biswas et al. (2020). Preliminary interfaces to both exist in the AMReX git repo.

Software Engineering

AMReX is fully open source; source code, documentation, tutorials, and a gallery of examples are all hosted on GitHub (https://amrex-codes.github.io/amrex/). AMReX supports building applications with both GNU Make and CMake, and can also be built as a library. Development takes place using a fork-and-pull-request workflow, with stable tags released monthly. GitHub CI is used to verify builds with a variety of compilers on Windows, Linux and Mac for each pull request. In addition, a suite of regression tests is performed nightly, both on AMReX and on selected AMReX-based applications. These tests cover both CPU architectures and NVIDIA GPUs.

In addition, AMReX has been part of the xSDK releases starting in 2018, and is available through Spack. AMReX is also one of the software packages featured in the Argonne Training Program for Extreme-Scale Computing (ATPESC); see https://xsdk-project.github.io/MathPackagesTraining2020/lessons/amrex for the AMReX tutorial presented in 2020.

Profiling and Debugging

Profiling is a crucial part of developing and maintaining a successful scientific code, such as AMReX and its applications. The AMReX community uses a wide variety of compatible profilers, including VTune, CrayPat, ARM Forge, Amrvis and the Nsight suite of tools. AMReX includes its own lightweight instrumentation-based profiling tool, TinyProfiler. TinyProfiler consists of a few simple macros to insert scoped timers throughout the application. At the end of a simulation, TinyProfiler reports the total time spent in each region, the number of times it was called and the variation across MPI ranks. This tool is on by default in many AMReX applications and includes NVTX markers to allow instrumentation when using Nsight.
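A typical TinyProfiler usage looks like the following sketch, where the function being timed and the label passed to the macro are arbitrary:

#include <AMReX_BLProfiler.H>

void advance ()
{
    BL_PROFILE("advance()");  // scoped timer; totals, call counts and rank-to-rank
                              // variation appear in the end-of-run report
    // ... work to be timed ...
}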

AMReX's make system includes a debug mode with a variety of useful features for AMReX developers. Debug mode includes array bound checking, detailed built-in assertions, initialization of floating point numbers to NaNs, signal handling (e.g. floating point exceptions) and backtrace capability for crashes. This combination of features has proven extremely helpful for catching common bugs and reducing development time for AMReX users.

Conclusions and Future Work

In this paper we have discussed the main features of the AMReX framework for the development of block-structured AMR algorithms. In addition to the core capability of supporting data defined on a hierarchical mesh, AMReX also provides support for different particle and particle/mesh algorithms and for an embedded boundary representation of complex problem geometry. We have also discussed the linear solvers needed to support implicit discretizations, and AMReX's I/O capabilities used to write data for analysis and visualization and for checkpoint/restart.

A cross-cutting feature in all AMReX development is the need to provide performance portability. With the push toward exascale we have seen the emergence of a variety of different architectures with disparate hardware capabilities and programming models. AMReX incorporates an abstraction layer that isolates applications from the details of a particular architecture without sacrificing performance. The framework also provides custom support for parallel reductions and memory management. We have shown examples that illustrate that AMReX provides scalable, performant implementations for a range of tasks typically found in multiphysics applications.

There are a number of additional capabilities that could be added to AMReX. Currently, the framework is restricted to one, two or three spatial dimensions. Extending the methodology to higher dimensions would be relatively straightforward. In a similar vein, the framework could also be generalized to handle mixed dimensions in which different components of the physical model are solved in different dimensions.

The framework currently assumes grid cells are rectilinear with an embedded boundary representation for complex geometries. A useful generalization to increase geometric capabilities would be to incorporate curvilinear coordinates in which the grid is logically rectangular but the individual cells can be generic hexahedra. This type of capability could be further generalized to a mapped multiblock or overset mesh model in which multiple curvilinear meshes are coupled in a single simulation.

Finally, one potentially exciting area is interfacing AMReX with a code generation system. The basic idea here is to use a symbolic manipulation system that can generate compilable AMReX code directly from equations. This type of capability is particularly useful for problems that can be stated concisely mathematically but expand dramatically when translated into code. We have used a prototype code generation system for discretization of general relativity and a spectral representation of Boltzmann's equation. The ability to describe complex problems concisely has the potential to further increase the range of applications that can use AMReX.

Acknowledgements

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Exascale Computing Project under contract DE-AC02-05CH11231. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231; and resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

References

Almgren AS, Beckner VE, Bell JB, Day MS, Howell LH, Joggerst CC, Lijewski MJ, Nonaka A, Singer M and Zingale M (2010) CASTRO: A New Compressible Astrophysical Solver. I. Hydrodynamics and Self-gravity. Astrophysical Journal 715(2): 1221–1238. DOI:10.1088/0004-637X/715/2/1221.

Almgren AS, Bell JB, Lijewski MJ, Lukic Z and Andel EV (2013) Nyx: A massively parallel AMR code for computational cosmology. The Astrophysical Journal 765(1): 39. DOI:10.1088/0004-637x/765/1/39.

AMR-Wind (2020) AMR-Wind Documentation. URL https://amr-wind.readthedocs.io/en/latest/.

AMRVis (2020) AMRVis documentation. URL https://amrex-codes.github.io/amrex/docs_html/Visualization.html.

Balay S, Abhyankar S, Adams MF, Brown J, Brune P, Buschelman K, Dalcin L, Dener A, Eijkhout V, Gropp WD, Karpeyev D, Kaushik D, Knepley MG, May DA, McInnes LC, Mills RT, Munson T, Rupp K, Sanan P, Smith BF, Zampini S, Zhang H and Zhang H (2019) PETSc Web page. URL https://www.mcs.anl.gov/petsc.

Beckingsale DA, Burmark J, Hornung R, Jones H, Killian W, Kunen AJ, Pearce O, Robinson P, Ryujin BS and Scogland TR (2019) RAJA: Portable performance for large-scale scientific applications. In: 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). pp. 71–81.

Berger MJ and Rigoutsos J (1991) An algorithm for point clustering and grid generation. IEEE Transactions on Systems, Man, and Cybernetics 21: 1278–1286.

Biswas A, Ahrens J, Dutta S, Almgren A, Musser J and Turton T (2020) Feature analysis, tracking, and data reduction: An application to multiphase reactor simulation MFiX-Exa for in-situ use case. Computing in Science & Engineering DOI:10.1109/MCSE.2020.3016927.

Carter Edwards H, Trott CR and Sunderland D (2014) Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing 74(12): 3202–3216. DOI:10.1016/j.jpdc.2014.07.003. URL http://www.sciencedirect.com/science/article/pii/S0743731514001257. Domain-Specific Languages and High-Level Frameworks for High-Performance Computing.

Dubey A, Almgren A, Bell J, Berzins M, Brandt S, Bryan G, Colella P, Graves D, Lijewski M, Löffler F, O'Shea B, Schnetter E, Van Straalen B and Weide K (2014a) A survey of high level frameworks in block-structured adaptive mesh refinement packages. Journal of Parallel and Distributed Computing 74(12): 3217–3227.

Dubey A, Antypas K, Calder AC, Daley C, Fryxell B, Gallagher JB, Lamb DQ, Lee D, Olson K, Reid LB, Rich P, Ricker PM, Riley KM, Rosner R, Siegel A, Taylor NT, Weide K, Timmes FX, Vladimirova N and ZuHone J (2014b) Evolution of FLASH, a multi-physics scientific simulation code for high-performance computing. International Journal of High Performance Computing Applications 28(2): 225–237. DOI:10.1177/1094342013505656.

Falgout RD, Jones JE and Yang UM (2006) The design and implementation of hypre, a library of parallel high performance preconditioners. In: Bruaset AM and Tveito A (eds.) Numerical Solution of Partial Differential Equations on Parallel Computers. Springer-Verlag, pp. 267–294. Also available as LLNL technical report UCRL-JRNL-205459.

Fryxell B, Olson K, Ricker P, Timmes FX, Zingale M, Lamb DQ, MacNeice P, Rosner R, Truran JW and Tufo H (2000) FLASH: An adaptive mesh hydrodynamics code for modeling astrophysical thermonuclear flashes. Astrophysical Journal, Supplement 131: 273–334. DOI:10.1086/317361.

Hindmarsh AC, Brown PN, Grant KE, Lee SL, Serban R, Shumaker DE and Woodward CS (2005) SUNDIALS: Suite of nonlinear and differential/algebraic equation solvers. ACM Transactions on Mathematical Software (TOMS) 31(3): 363–396.

Merrill D and Garland M (2016) Single-pass parallel prefix scan with decoupled lookback. Technical Report NVR2016-001, NVIDIA Research. URL https://research.nvidia.com/sites/default/files/pubs/2016-03_Single-pass-Parallel-Prefix/nvr-2016-002.pdf.

MFiX-Exa (2020) MFiX-Exa Documentation. URL https://amrex-codes.github.io/MFIX-Exa/.

PeleC (2020) PeleC Documentation. URL https://pelec.readthedocs.io/en/latest/.

PeleLM (2020) PeleLM Documentation. URL https://amrex-combustion.github.io/PeleLM/.

Vay J, Almgren A, Bell J, Ge L, Grote D, Hogan M, Kononenko O, Lehe R, Myers A, Ng C, Park J, Ryne R, Shapoval O, Thévenet M and Zhang W (2018) Warp-X: A new exascale computing platform for beam-plasma simulations. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment DOI:10.1016/j.nima.2018.01.035.

Zhang W, Almgren A, Beckner V, Bell J, Blaschke J, Chan C, Day M, Friesen B, Gott K, Graves D, Katz M, Myers A, Nguyen T, Nonaka A, Rosso M, Williams S and Zingale M (2019) AMReX: a framework for block-structured adaptive mesh refinement. Journal of Open Source Software 4: 1370.

Zhang W, Almgren A, Beckner V, Bell J, Blaschke J, Chan C, Day M, Friesen B, Gott K, Graves D, Katz M, Myers A, Nguyen T, Nonaka A, Rosso M, Williams S and Zingale M (2020) AMReX documentation. URL https://amrex-codes.github.io/amrex/docs_html/.

Zhang W, Almgren A, Day M, Nguyen T, Shalf J and Unat D (2016) BoxLib with tiling: An adaptive mesh refinement software framework. SIAM Journal on Scientific Computing 38(5): S156–S172. DOI:10.1137/15M102616X.

Author biography

Weiqun Zhang is a Computer Systems Engineer at Lawrence Berkeley National Laboratory. He is interested in high-performance computing, computational physics, and programming in general. Currently, he works on the AMReX software framework and WarpX, an advanced electromagnetic Particle-in-Cell code.

Andrew Myers is a Computer Systems Engineer in the Center for Computational Sciences and Engineering at Lawrence Berkeley National Laboratory. His work focuses on scalable particle methods for emerging architectures in the context of adaptive mesh refinement. He is an active developer of the AMReX framework and of the electromagnetic Particle-in-Cell code WarpX.

Kevin Gott is an application performance specialist at NERSC. His primary focus is supporting the AMReX Co-Design Center as it develops performance portable solutions for next-generation exascale supercomputers. His interests include C++ programming, portable GPU code design, MPI performance and advanced parallel programming techniques.

Ann Almgren is a Senior Scientist at Lawrence Berkeley National Laboratory, and the Group Lead of LBL's Center for Computational Sciences and Engineering. Her primary research interest is in computational algorithms for solving PDEs in a variety of application areas. Her current projects include the development and implementation of new multiphysics algorithms in high-resolution adaptive mesh codes that are designed for the latest hybrid architectures. She is a Fellow of the Society for Industrial and Applied Mathematics and the Deputy Director of the AMReX Co-Design Center.

John Bell is a Senior Scientist at Lawrence Berkeley National Laboratory. His research focuses on the development and analysis of numerical methods for partial differential equations arising in science and engineering. He is a Fellow of the Society for Industrial and Applied Mathematics, a member of the National Academy of Sciences, and the Director of the AMReX Co-Design Center.
