

Massively Parallel Cosmological Simulations with ChaNGa

Pritish Jetley, Filippo Gioachin, Celso Mendes, Laxmikant V. Kale
Dep. of Computer Science, University of Illinois at Urbana-Champaign, USA

{pjetley2, gioachin, cmendes, kale}@uiuc.edu

Thomas Quinn
Dep. of Astronomy, University of Washington, USA

[email protected]

Abstract

Cosmological simulators are an important component in the study of the formation of galaxies and large-scale structures, and can help answer many important questions about the universe. Despite their utility, existing parallel simulators do not scale effectively on modern machines containing thousands of processors. In this paper we present ChaNGa, a recently released production simulator based on the Charm++ infrastructure. To achieve scalable performance, ChaNGa employs various optimizations that maximize the overlap between computation and communication. We present experimental results of ChaNGa simulations on machines with thousands of processors, including the IBM Blue Gene/L and the Cray XT3. The paper goes on to highlight efforts toward even more efficient and scalable cosmological simulations. In particular, novel load balancing schemes that base decisions on certain characteristics of tree-based particle codes are discussed. Further, the multistepping capabilities of ChaNGa are presented, as are solutions to the load imbalance that such multiphase simulations face. We outline key requirements for an effective practical implementation and conclude by discussing preliminary results from simulations run with our multiphase load balancer.

1 Introduction

The scientific case for performing cosmological simulations is compelling. Theories of structure formation built upon the nature of Dark Matter and Dark Energy need to be constrained with comparisons to the observed structure and distribution of galaxies. However, galaxies are extremely non-linear objects, hence the connection between the theory and the observations requires following the non-linear dynamics using numerical simulations. Thus, cosmological simulators are important components for studying the formation of galaxies and planets.

The simulation of galaxy formation is a challenging computational problem, requiring high resolution and a large dynamic range of timescales. As an example, to form a stable Milky Way-like galaxy, tens of millions of resolution elements must be simulated to the current epoch [5]. Locally adaptive timesteps may reduce the CPU work by orders of magnitude, but not evenly throughout the computational volume, thus posing a considerable challenge for parallel load balancing.

To address these issues, various cosmological simulators have been created recently. PKDGRAV [3], developed at the University of Washington, can be considered among the state of the art in that area. However, due to load imbalance, PKDGRAV does not scale efficiently on the newer machines with thousands of processors. In this work, we present a new N-body cosmological simulator that (like other simulators) utilizes the Barnes-Hut algorithm [1] to compute gravitational forces. Our recently released simulator, named ChaNGa, is based on the Charm++ infrastructure [9]. We leverage the object-based virtualization [8] and the data-driven style of computation inherent in Charm++ to obtain adaptive overlap of communication and computation, as well as to perform automatic measurement-based load balancing. ChaNGa advances the state of the art in N-body simulations by allowing the programmer to achieve higher levels of resource utilization on large systems.

The remainder of this paper is organized as follows. Section 2 presents an overview of previous work in the development of parallel simulators for cosmology. Section 3 describes the major components of ChaNGa. Section 4 presents scalability results obtained with ChaNGa on various parallel platforms, and Section 5 illustrates some of the more advanced features in ChaNGa. Finally, Section 6 has our conclusions and future work directions.

2 Related Work

The study of the evolution of interacting particles under the effects of Newtonian gravitational forces, also known as the N-body problem, has been extensively reported in the literature. A popular method to simulate such problems was proposed by Barnes and Hut [1]. That method organizes the particles into a hierarchical tree structure. By properly traversing the tree, one can compute the forces between a pair of particles or between a particle and a whole branch of the tree that is sufficiently far away. By using such an approximation, this method reduces the complexity of the problem from O(N^2) to O(N log N), where N is the number of particles. O(N) methods for calculating the gravitational force have also been developed, e.g. the Fast Multipole Method [6, 2]. However, the highly clustered nature of typical astrophysical datasets requires the organization of the data into some hierarchical structure. This is essentially a "sort", thus the overall complexity is still O(N log N).
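
To make the tree-based approximation concrete, the sketch below shows the classical opening test used in Barnes-Hut codes: a node whose extent appears small from the evaluation point (relative to an opening angle theta) contributes as a single aggregated mass at its center of mass; otherwise it is opened and its children are visited. The node layout, field names and the value of theta are illustrative assumptions, not ChaNGa's actual data structures.

#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

// Hypothetical tree node: an internal cell with children or a leaf (no
// children). 'size' is the spatial extent of the cell; 'com' and 'mass'
// are its aggregated center of mass and total mass.
struct Node {
    double mass;
    Vec3   com;
    double size;
    std::vector<Node*> children;   // empty for a leaf
};

static double dist(const Vec3 &a, const Vec3 &b) {
    double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Accumulate the acceleration at position 'p' by walking the tree.
// When a node subtends an angle smaller than 'theta' as seen from 'p',
// all particles under it are replaced by one aggregated contribution,
// which is what reduces the cost from O(N^2) to O(N log N).
Vec3 accelerationAt(const Vec3 &p, const Node *node, double theta = 0.7) {
    Vec3 acc{0.0, 0.0, 0.0};
    double d = dist(p, node->com);
    if (d == 0.0) return acc;                       // skip self-interaction
    if (node->children.empty() || node->size / d < theta) {
        double f = node->mass / (d * d * d);        // G = 1 units
        acc.x += f * (node->com.x - p.x);
        acc.y += f * (node->com.y - p.y);
        acc.z += f * (node->com.z - p.z);
    } else {
        for (const Node *c : node->children) {
            Vec3 a = accelerationAt(p, c, theta);
            acc.x += a.x; acc.y += a.y; acc.z += a.z;
        }
    }
    return acc;
}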

Hierarchical methods for N-body simulations have been adopted for quite some time by astronomers [12, 18]. A pioneering parallel tree code is the HOT code [19], which won the Gordon Bell prize in 1992 and 1997. This code does not have locally adaptive timesteps. Another widely used code in this area is PKDGRAV [3], a tree-based simulator that works on both shared-memory and distributed-memory parallel systems. However, despite its success in the past, PKDGRAV has shown limited potential to scale to larger machine configurations, due to load balancing constraints. This limitation makes PKDGRAV's effective use on future petascale systems very questionable.

Other recent cosmological simulators are GADGET [17], developed at the Max-Planck-Institut für Astrophysik, Garching, and falcON [2], developed at the Max-Planck-Institut für Astronomie, Heidelberg. falcON has shown good scalability with the number of particles, but it is a serial simulator and does not have locally adaptive timesteps. Attempts have been made to parallelize the algorithm, but only to 16 processors [13]. GADGET, now in its second version (GADGET-2), has a rich modeling capability. The published scalability results have been limited to 128 processors [15], but it has been used on at least 512 processors with 10 billion particles for published scientific results [16]. Direct comparisons with GADGET will be in our future work.

In addition, some special purpose machines have been developed in the past to simulate cosmological N-body problems. Notable examples are GRAPE-5 [11], which has 32 pipeline processors specialized for the gravitational force calculation, and an FPGA-based system [10] constructed as an add-in card connected to a host PC. Both systems achieve great price/performance points, but have severe memory constraints: the largest datasets that one can use on them are limited to two million particles.

3 Structure of the ChaNGa Code

ChaNGa is implemented as a Charm++ application, leveraging all the advanced features existing in the Charm++ runtime system. In this section, after reviewing the basic Charm++ characteristics, we describe ChaNGa's internal organization and point to its major optimizing features.

3.1 Charm++ Infrastructure

Charm++ is a parallel software infrastructure that implements the concept of processor virtualization [9]. In this abstraction, an application programmer decomposes the underlying problem into a large number of objects and the interactions among those objects. The Charm++ runtime system automatically maps objects, also called chares, to physical processors. Typically, the number of chares is much greater than the number of processors. This virtualization technique makes the number of chares (and therefore the problem of decomposition) independent of the number of available processors. With this separation between logical and physical abstractions, Charm++ provides higher programmer productivity. This scheme has allowed the creation of parallel applications that scale efficiently to thousands of processors, such as the molecular dynamics NAMD code [14], a winner of a Gordon Bell award in 2002.

Figure 1. Components for parallel calculation of gravitational forces in ChaNGa

Besides mapping and scheduling chares on available processors, the Charm++ runtime system can also dynamically migrate chares across processors. This capability is used to implement a powerful measurement-based load balancing mechanism in Charm++ [20]. Chares can be migrated based on observed values of various metrics, such as computation loads or communication patterns. This dynamic redistribution is critical to achieve good overall processor utilization.
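
The fragment below sketches what a migratable chare looks like from the programmer's point of view. The class and method names are hypothetical and the matching .ci interface file is omitted, but the CBase_ inheritance, the migration constructor and the pup() serialization routine are the standard Charm++ ingredients that allow the runtime to pack an object, move it to another processor, and resume it there.

// Sketch of a migratable Charm++ chare array element (hypothetical names).
// "mypiece.decl.h" is generated by charmc from the matching .ci interface
// file, which is omitted here for brevity.
#include "mypiece.decl.h"
#include "pup_stl.h"
#include <vector>

struct Particle { double pos[3], vel[3], mass; };
PUPbytes(Particle);   // plain-old-data: serialize as raw bytes

class MyPiece : public CBase_MyPiece {
    std::vector<Particle> particles;
public:
    MyPiece() { usesAtSync = true; }     // opt in to sync-based load balancing
    MyPiece(CkMigrateMessage *m) {}      // constructor invoked on migration

    // pup() describes the element's state so the runtime can pack it,
    // move it to another processor, and unpack it there.
    void pup(PUP::er &p) { p | particles; }

    void computeForces() {
        // ... gravity walk over local and cached remote tree nodes ...
        AtSync();                        // hand control to the load balancer
    }
    void ResumeFromSync() { /* start the next iteration */ }
};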

3.2 ChaNGa Organization

The major task in any cosmological simulator is to compute the gravitational forces generated by the interactions between particles and to integrate those forces over time to compute the movement of each particle. As we previously described in [4], ChaNGa accomplishes this task with an algorithm that uses a tree to represent the space. This tree is constructed globally over all particles in the dataset, and segmented into elements named TreePieces. These TreePieces are distributed by the Charm++ runtime system to the available processors for parallel computation of the gravitational forces. ChaNGa allows various distribution schemes, including Morton and Peano-Hilbert Space-Filling-Curve (SFC) and Oct-tree-based (Oct) methods. Each TreePiece is implemented in ChaNGa as a Charm++ chare.
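
As an illustration of SFC-based decomposition, the sketch below computes a Morton (Z-order) key by interleaving the bits of quantized particle coordinates; sorting particles by such keys and cutting the sorted sequence into chunks yields pieces that cover contiguous regions of space. The 21-bit key width and helper names are illustrative and do not reproduce ChaNGa's exact implementation.

#include <cstdint>

// Spread the lower 21 bits of v so that they occupy every third bit.
static std::uint64_t spreadBits(std::uint64_t v) {
    v &= 0x1fffff;                       // keep 21 bits (3 * 21 = 63-bit key)
    v = (v | v << 32) & 0x1f00000000ffffULL;
    v = (v | v << 16) & 0x1f0000ff0000ffULL;
    v = (v | v <<  8) & 0x100f00f00f00f00fULL;
    v = (v | v <<  4) & 0x10c30c30c30c30c3ULL;
    v = (v | v <<  2) & 0x1249249249249249ULL;
    return v;
}

// Morton (Z-order) key for a position already normalized to [0,1)^3.
// Sorting particles by this key orders them along a space-filling curve.
std::uint64_t mortonKey(double x, double y, double z) {
    const double scale = double(1u << 21);
    std::uint64_t ix = std::uint64_t(x * scale);
    std::uint64_t iy = std::uint64_t(y * scale);
    std::uint64_t iz = std::uint64_t(z * scale);
    return (spreadBits(ix) << 2) | (spreadBits(iy) << 1) | spreadBits(iz);
}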

At the lowest level of the tree, particles corresponding to tree leaves are grouped into buckets of a predefined maximal size, according to spatial proximity. To compute the forces on particles of a given bucket, one must traverse the tree to collect force contributions from all tree nodes. If a certain node is sufficiently far in space from that bucket, an aggregated contribution corresponding to all particles under that node is used; otherwise, the node is opened and recursively traversed.

The parallel implementation of gravity calculation in ChaNGa is represented in Figure 1. Each processor contains a group of TreePieces. To process a certain bucket, the processor needs to collect information from the entire tree. This might correspond to interactions with TreePieces in the same processor (local work) or with TreePieces from remote processors (global work). To optimize the access to identical remote information by buckets in the same processor, ChaNGa employs a particle caching mechanism. This cache is essential to achieve acceptable performance [4].

3.3 Major Optimizations in ChaNGa

Besides the particle caching mechanism mentioned in the previous subsection, ChaNGa implements various optimizations that contribute to its parallel scalability. Some of the major optimizations are the following (for more details, see [4]):

Data Prefetching: Given the availability of the particle caching mechanism, ChaNGa uses prefetching to further optimize the access to remote TreePieces. Before starting the gravity calculation for a certain bucket, a full tree traversal is done to estimate the remote data that will be needed in the regular tree traversal. The prefetch of this remote data overlaps with local computation, thereby masking latency.

Tree-in-Cache: To avoid the overhead of communication between chares when accessing data from a different TreePiece in the same processor, ChaNGa employs another optimization that exploits the cache. Before computing forces, each TreePiece registers its data with the software cache. Thus, a larger tree, corresponding to the union of all local TreePieces, is assembled in the local cache. When any piece of that tree is needed during force computation, it is immediately retrieved.
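
One way to picture this optimization is as a per-processor registry into which every local TreePiece publishes its nodes before the force phase begins; lookups during the tree walk then hit this shared structure directly instead of sending messages between co-resident chares. The key type and interface below are purely illustrative.

#include <cstdint>
#include <unordered_map>

struct TreeNode;                         // node of a TreePiece's local tree

// Hypothetical per-processor registry shared by all TreePieces on one
// processor. Keys identify tree nodes globally (e.g. by their position
// along the space-filling curve); values point into the owning
// TreePiece's memory, so a hit requires no message at all.
class NodeCache {
    std::unordered_map<std::uint64_t, TreeNode*> nodes;
public:
    // Called by each local TreePiece before the gravity phase starts.
    void registerNode(std::uint64_t key, TreeNode *n) { nodes[key] = n; }

    // Called during the tree walk: a miss means the node lives on a
    // remote processor and must be requested through the cache.
    TreeNode *lookup(std::uint64_t key) const {
        auto it = nodes.find(key);
        return it == nodes.end() ? nullptr : it->second;
    }
};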

Selectable Computation Granularity: ChaNGa accepts an input parameter that defines how much computation is performed before the processor is allowed to handle requests from remote processors. This enables a good tradeoff between responsiveness to communication requests and processor utilization.
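
Conceptually, this parameter acts as a self-yielding loop: a worker processes at most a fixed number of buckets, then returns control to the scheduler so that pending remote requests can be served before work resumes. The sketch below is a schematic rendering of that control flow, not ChaNGa's actual code.

#include <cstddef>

// Schematic: compute forces for at most 'grainsize' buckets per invocation,
// then yield so the runtime can service incoming remote-data requests.
// computeBucket() and enqueueSelf() stand in for the real entry methods.
class GravityWorker {
    std::size_t nextBucket = 0, numBuckets = 0, grainsize = 64;
public:
    GravityWorker(std::size_t buckets, std::size_t grain)
        : numBuckets(buckets), grainsize(grain) {}

    void doWork() {
        std::size_t end = nextBucket + grainsize;
        if (end > numBuckets) end = numBuckets;
        for (; nextBucket < end; ++nextBucket)
            computeBucket(nextBucket);
        if (nextBucket < numBuckets)
            enqueueSelf();       // re-schedule doWork() for the remaining buckets
    }
private:
    void computeBucket(std::size_t) { /* tree walk for one bucket */ }
    void enqueueSelf() { /* e.g. send this chare a continuation message */ }
};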

4 Scalability Experiments

To evaluate ChaNGa's effectiveness as a production simulator, we conducted a series of tests with real cosmological datasets. These tests intended both to assess the code's portability across different systems and to measure its performance scalability in each particular type of system. We used the three systems described in Table 1, and ran tests with the following datasets:

lambs: Final state of a simulation of a 71 Mpc^3 volume of the Universe with 30% dark matter and 70% dark energy. Nearly three million particles are used. This dataset is highly clustered on scales less than 5 Mpc, but becomes uniform on scales approaching the total volume. Three subsets of this dataset are obtained by taking random subsamples of size thirty thousand, three hundred thousand, and one million particles, respectively.

dwarf: Snapshot at z = 0.3 of a multi-resolution simulation of a dwarf galaxy forming in a 28.5 Mpc^3 volume of the Universe with 30% dark matter and 70% dark energy. Although the mass distribution in this dataset is uniform on scales approaching the volume size, the particle distribution is very centrally concentrated and therefore highly clustered on all scales above the resolution limit. The total dataset size is nearly five million particles, but the central regions have a resolution equivalent to 2048^3 particles in the entire volume.

hrwh lcdms: Final state of a 90 Mpc^3 volume of the Universe with 31% dark matter and 69% dark energy realized with 16 million particles. This dataset is used in [7], and is slightly more uniform than lambs.

dwarf-50M: Same physical model as dwarf except that it is realized with 50 million particles. The central regions have a resolution equivalent to 6144^3 particles in the entire volume.

lambb: Same physical model as lambs except that it is realized with 80 million particles.

drgas: Similar to lambs and lambb except that it is the high redshift (z = 99) state of the simulation, and it is realized with 730 million particles. The particle distribution is very uniform.

Figure 2. Pictorial view of datasets: (a) lambs dataset; (b) dwarf dataset.

To illustrate some of the features in these datasets, Figure 2(a) presents a pictorial view of lambs, which has a reasonably uniform particle distribution, whereas Figure 2(b) presents dwarf, containing a much more clustered distribution. In these pictures the color scale indicates the log of the mass density and covers six orders of magnitude.

Table 1. Characteristics of the parallel systems used in the experiments

System        Location     Number of    Processors   CPU       CPU       Memory     Type of
Name                       Processors   per Node     Type      Clock     per Node   Network
Tungsten      NCSA         2,560        2            Xeon      3.2 GHz   3 GB       Myrinet
Blue Gene/L   IBM-Watson   40,960       2            Power440  700 MHz   512 MB     Torus
Cray XT3      Pittsburgh   4,136        2            Opteron   2.6 GHz   2 GB       Torus

We conducted serial executions of ChaNGa and PKDGRAV on NCSA's Tungsten to compare scalability with varying numbers of particles. Figure 3(a) shows the results of this comparison using subsamples of lambs. As expected, the time per particle for both codes grows only slowly with the number of particles in this figure, consistent with the O(N log N) algorithm employed by both codes.

Despite their structural similarities, PKDGRAV's serial performance is about 20% better than ChaNGa's. PKDGRAV is written in C, whereas ChaNGa is a C++ code. The Intel icc compiler used for the tests uses different components to compile C and C++. As a result, similar code fragments in PKDGRAV and ChaNGa lead to differing object code, which accounts for the observed performance gap.

4.1 Parallel Performance on a Commodity Cluster

We conducted tests with parallel executions of ChaNGa on NCSA's Tungsten cluster, a typical representative of current Linux-based commodity clusters. We compared ChaNGa's performance on the gravity calculation phase to the performance of PKDGRAV. We used the lambs dataset, with 3 million particles, in a strong scaling test. Figure 3(b) shows the observed results. In this kind of plot, horizontal lines represent perfect scalability, while diagonal lines correspond to no scalability.

Figure 3(b) shows that ChaNGa's performance is lower than PKDGRAV's performance when the number of processors is small. This is due to the superior serial performance of PKDGRAV, as we had observed in Figure 3(a). However, at 8 or 16 processors the two codes already reach similar performance levels. Beyond that range, ChaNGa clearly has superior performance. Since we did not use any load balancers with ChaNGa in this test, the main causes for the better performance of ChaNGa are its various optimizations that overlap computation and communication.

We can also observe from Figure 3(b) that ChaNGa's scalability starts to degrade at the right extreme of the horizontal axis. This happens because the lambs dataset is not large enough to effectively use 256 or more Tungsten processors. However, a similar degradation occurs for a much smaller number of processors in PKDGRAV. Hence, even for this relatively small problem size, ChaNGa presents significantly better parallel scalability than PKDGRAV.

4.2 Parallel Performance on Blue Gene/L

In addition to commodity clusters, we considered the scalability of ChaNGa on high-end parallel machines. Figure 4(a) reports scalability on the IBM Blue Gene/L system. These results include a large variety of datasets, from the smallest dwarf (5 million particles) to the largest drgas (730 million particles). Due to Blue Gene/L memory limitations, larger datasets require larger minimal configurations. Given the difficulty in securing large numbers of Blue Gene/L processors for extended experimentation, we focused on running only the largest dataset on 32,768 processors. Near that level, the other datasets already start to show degraded scalability.

Over the entire range of simulations, we can see good scalability. Naturally, larger datasets scale better, having more computation to parallelize. In particular, on 8,192 processors, lambb performs better than dwarf-50M, despite lambb's bigger size. This occurs because lambb has a more uniform cosmological particle distribution. As in the case of NCSA's Tungsten, we did not consider advanced load balancing actions to improve performance in these tests. We will address this issue in Section 5.1.

4.3 Parallel Performance on the Cray XT3

Figure 3. Comparisons between ChaNGa and PKDGRAV on NCSA's Tungsten: (a) scalability with the dataset size, in serial mode, showing time per particle (ms) against number of particles; (b) scalability with the number of processors (lambs dataset, 3 million particles), showing number of processors x time per iteration (s) against number of processors.

We conducted executions of ChaNGa on the upgraded 2068-node Cray XT3 at PSC (bigben). Due to unavailability of the full machine for our tests, we could not use more than 2048 processors. Figure 4(b) illustrates the results of ChaNGa with three distinct datasets: dwarf (5 million particles), hrwh lcdms (16 million particles) and dwarf-50M (50 million particles). With the smallest dataset, we observe good scalability up to 256 processors. Beyond that point, performance degrades because the problem size is not large enough to utilize processors effectively. For the 16 million particle dataset, there is good scalability across the entire range of configurations used in the test. Remember that hrwh lcdms is a uniform dataset, so that the load is even across processors and is sufficiently high to keep processors busy in all configurations tested.

Meanwhile, in the case of the largest dataset (dwarf-50M), the scalability is good up to 2048 processors, and the data shows superlinear speedups from 128 to 512 processors. The program does not run on 64 or fewer processors due to memory limitations. We are still studying in detail the memory behavior of ChaNGa to understand and correct these anomalies. Nevertheless, these results show that ChaNGa achieves very good scalability on the Cray XT3 when adequate machine configurations are used for the various datasets.

Figure 4. Parallel performance of ChaNGa on the IBM Blue Gene/L and Cray XT3: (a) IBM Blue Gene/L (datasets drgas, lambb, dwarf-50M, hrwh_lcdms and dwarf); (b) Cray XT3 (datasets dwarf-50M, hrwh_lcdms and dwarf). Both panels show number of processors x time per iteration (s) against number of processors.

5 Towards Greater Scalability

The results from the previous section demonstrate that scaling simulations beyond a few thousand processors remains a challenge in some cases, even with the optimizations described in Section 3. In this section, we describe techniques that can further improve execution efficiency, especially on large system configurations. The first technique consists of specialized load balancers that take into account certain characteristics of particle codes. The second, multistepping, reduces the operation count.

5.1 Specialized Load Balancing

The results reported in the previous section used a very simple load balancing strategy. The only distribution of work among processors was that of particles among TreePieces, which were then assigned by Charm++ uniformly to each processor. Since density can vary vastly over the simulated space, different particles may require significantly different amounts of work to compute the force they are subject to. Thus, even SFC-based domain decompositions, which assign an equal number of particles to each TreePiece, fail to obtain good load balance. Figure 5 shows an example of this imbalance. The shaded regions in the figure represent processor activity during different iterations of execution. In the first iteration there is significant imbalance of load across processors.

Among the Charm++ suite of load balancers, GreedyLB is one that attempts to minimize load imbalance by continually assigning the heaviest remaining object to the least loaded processor. However, in our initial tests (not reported here), the results obtained with GreedyLB were not encouraging: in many cases, the execution time increased after load balancing, especially in large processor configurations. The reason for this lies in the CPU overhead to handle increased communication. In ChaNGa, due to the processor-level optimizations, TreePieces residing on the same processor do not need to exchange messages. Given that TreePieces containing particles close together in the simulation space need to exchange more data, a higher volume of communication will be generated if these TreePieces are all placed on different processors than in the case where some of them are coresident.
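
The greedy strategy itself is simple to state: repeatedly take the heaviest not-yet-assigned object and give it to the currently least loaded processor. The sketch below captures that logic with two heaps; the inputs are illustrative, whereas the real GreedyLB operates on the load database collected by the Charm++ runtime.

#include <functional>
#include <queue>
#include <utility>
#include <vector>

// objLoad[i] = measured load of object i; returns a processor per object.
std::vector<int> greedyAssign(const std::vector<double> &objLoad, int numProcs) {
    using ProcEntry = std::pair<double, int>;            // (current load, proc)
    std::priority_queue<ProcEntry, std::vector<ProcEntry>,
                        std::greater<ProcEntry>> procs;  // min-heap on load
    for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});

    // Max-heap of (load, objId): objects visited in decreasing load order.
    std::priority_queue<std::pair<double, int>> objs;
    for (int i = 0; i < (int)objLoad.size(); ++i) objs.push({objLoad[i], i});

    std::vector<int> assignment(objLoad.size(), -1);
    while (!objs.empty()) {
        auto [load, obj] = objs.top(); objs.pop();
        auto [pLoad, p] = procs.top(); procs.pop();
        assignment[obj] = p;             // heaviest object -> least loaded proc
        procs.push({pLoad + load, p});
    }
    return assignment;
}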

Figure 5. Overview of three iterations of dwarf simulation on 1,024 Blue Gene/L processors with OrbLB. Horizontal axis is time and each horizontal line represents activity on a processor.

To verify our reasoning, we executed ChaNGa with the dwarf dataset on 1,024 Blue Gene/L processors. Table 2 summarizes the observed number of messages and the amount of data transferred during one iteration, as well as the iteration time, including values both before and after load balancing. In addition to GreedyLB, we used other load balancing schemes that we describe later in this section. As shown, the communication volume after GreedyLB grows nearly three times.

The connection between this increase in communication volume and the reduced performance is shown graphically in the processor utilization view of Figure 6. Here, the orange and blue regions (the two largest) constitute the global and local force computations, respectively. The different components of communication overhead are shown in red and green. Notice that the second iteration, which begins after GreedyLB load balancing, suffers a higher CPU overhead for communication, as evidenced by the larger red and green regions. The numbers in Table 2 confirm this fact. Due to this overhead, processors are not fully utilized by the application, as communication is handled by the runtime system. On the other hand, the steeply sloped profile of iteration two reflects better load balance.

These results demonstrate that an effective load balancer must take communication into account. However, basing decisions on communication between objects is very difficult, since the particle cache, acting as a proxy between TreePieces, obscures this information. Furthermore, because communication depends on the location of the other TreePieces in the system, the communication data can only be estimated. To complicate matters further, creating a precise communication graph is expensive, and requires a large memory system.

Table 2. Communication volume and execution time with dwarf on 1,024 Blue Gene/L processors using different load balancing strategies

Perf. parameter (per iteration)   Original   GreedyLB   OrbLB   OrbRefineLB
Messages exchanged (x 1,000)      5,480      16,233     5,590   8,370
Bytes transferred (MB)            7,125      23,891     7,227   9,873
Execution time (s)                5.6        6.1        5.3     5.0

Figure 6. Two iterations of dwarf simulation on 1,024 Blue Gene/L processors, separated by a GreedyLB load balancing phase. Horizontal axis is time and vertical axis is combined processor utilization. Colors represent different activities.

Instead of estimating the point-to-point communication, we use SFC domain decomposition and the OrbLB load balancer in tandem to arrive at a heuristic for object placement. The SFC decomposition procedure assigns particles to TreePieces preserving an interesting property: it ensures that contiguous regions of the simulation space are represented by TreePieces having adjacent Charm++ identifiers (objId's). Further, OrbLB begins by sorting objects according to their objId's. It then proceeds recursively by assigning roughly balanced 'halves' of this sorted set to groups of processors, one 'half' to each. Therefore, when applied to TreePieces generated by such a decomposition, OrbLB implicitly bases its placement decisions on the simulation space coordinates of a TreePiece. In other words, OrbLB makes use of the admittedly rough correspondence between a TreePiece's objId and its location in the simulation space. In this manner, TreePieces likely to exchange large amounts of information are assigned to contiguously ranked processors. Our placement heuristic is a coarse one, since we use the position of TreePieces along a unidimensional SFC line, whereas the simulation space is three dimensional. The results for OrbLB in Table 2 show that the communication volume remains about the same, while the overall performance is improved. Figure 5 shows processor load in more detail, with execution iterations separated by OrbLB phases.
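
A minimal rendering of the recursive halving that OrbLB performs on the objId-sorted TreePieces is given below: at each level the load-weighted sequence is split into two roughly equal parts, one per half of the processor range, so that objects adjacent along the SFC end up on nearby processors. Names and the exact splitting rule are illustrative.

#include <numeric>
#include <vector>

// 'load' must already be ordered by objId (and hence, with SFC
// decomposition, roughly by position along the space-filling curve).
// Assigns a processor in [procLo, procHi) to each object in [lo, hi)
// so that load is split roughly evenly at every level of recursion.
void orbAssign(const std::vector<double> &load, int lo, int hi,
               int procLo, int procHi, std::vector<int> &assignment) {
    if (procHi - procLo == 1) {                 // one processor left: take all
        for (int i = lo; i < hi; ++i) assignment[i] = procLo;
        return;
    }
    double total = std::accumulate(load.begin() + lo, load.begin() + hi, 0.0);
    int procMid = (procLo + procHi) / 2;
    double target = total * double(procMid - procLo) / double(procHi - procLo);

    double acc = 0.0;
    int mid = lo;
    while (mid < hi && acc + load[mid] <= target) acc += load[mid++];

    orbAssign(load, lo, mid, procLo, procMid, assignment);    // first 'half'
    orbAssign(load, mid, hi, procMid, procHi, assignment);    // second 'half'
}

A call such as orbAssign(loads, 0, n, 0, numProcs, assignment) then fills assignment[i] with the processor chosen for the i-th TreePiece in objId order.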

From Figure 5, it is clear that while global load distribution is improved by OrbLB, processors still have very different loads. This observation motivates the use of a hybrid approach, which we name OrbRefineLB: the high level distribution of TreePieces among processors is performed by OrbLB, but neighboring processors (i.e. processors with contiguous ranks) exchange objects according to a greedy algorithm. This strategy yielded the results in the last column of Table 2. As can be seen, a small increase in communication volume results from the 'smoothing', but the gains in load balance offset this slight disadvantage. With this hybrid load balancer, we re-ran three simulations, dwarf, dwarf-50M and lambb, on Blue Gene/L. Figure 7 shows the corresponding reductions in execution times due to OrbRefineLB.

Although we obtain a reduction in execution time by using OrbRefineLB, the total communication volume increases. For this reason, we continue to look into more advanced load balancing techniques that will allow us to balance work across the system without disrupting the application's communication pattern. In particular, we are looking at load balancers that are aware of the spatial layout of objects in all three dimensions. In another direction, we are looking to reduce the cost per message exchanged by using the communication optimization framework of Charm++.


Figure 7. Scalability of ChaNGa before and after load balancing on Blue Gene/L: (a) dwarf-50M and lambb datasets; (b) dwarf dataset. Both panels show number of processors x time per iteration (s) against number of processors, with and without OrbRefineLB (curves with the -LB suffix).

5.2 Multistepping

Gravitational instabilities in the context of cosmology lead to a large dynamic range of timescales: from many gigayears on the largest scales to less than a million years in molecular clouds. In self-gravitating systems this dynamical time scales roughly as the square root of the reciprocal of the local density. Hence density contrasts of a million or more from the centers of galaxies to the mean density of the Universe correspond to a factor of 1000 in dynamical times. Indeed, large density contrasts are the primary reason to use tree-based gravity algorithms in cosmology.
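
In equation form, the scaling quoted above is the familiar dynamical (free-fall) time relation, from which the factor of 1000 follows directly for a density contrast of a million:

\[
t_{\mathrm{dyn}} \propto \frac{1}{\sqrt{G\rho}},
\qquad
\frac{t_{\mathrm{dyn}}(\bar\rho)}{t_{\mathrm{dyn}}(10^{6}\,\bar\rho)} = \sqrt{10^{6}} = 1000 .
\]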

Table 3. Ratio between singlestepped and multistepped gravity calculation times for dwarf simulations on NCSA's Tungsten

Processors   1      4      8      16     32     64
Ratio        3.29   2.65   2.11   1.79   1.39   1.55

Typically, only a small fraction of the particles in a cosmological simulation are in the densest environments; therefore, the majority of the particles do not need their forces evaluated on the smallest timestep. Large efficiencies can be gained by a multistep integration algorithm which on the smallest timestep updates only those particles in dense regions, and only infrequently calculates the forces on all the particles. This type of algorithm poses an obvious challenge for load balancing. Much of the time, work from a small spatial region of the simulation must be distributed across all the processors, but on the occasional large timestep, the work involves all the particles.
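
A hypothetical sketch of the bookkeeping behind such a multistep integrator is shown below: each particle is placed on the coarsest power-of-two subdivision (rung) of the big step whose interval does not exceed the particle's allowed timestep, and on a given substep only particles whose rung is fine enough are updated. The density-based timestep criterion mirrors the dynamical-time argument above; all names are illustrative rather than ChaNGa's.

#include <cmath>

// Assign a particle the coarsest rung r such that dtMax / 2^r does not
// exceed its individually allowed timestep (here tied to local density,
// following the dynamical-time scaling; 'eta' is an accuracy parameter).
int rungForParticle(double localDensity, double dtMax, double eta = 0.03) {
    double dt = eta / std::sqrt(localDensity);
    int rung = 0;
    while (rung < 30 && dtMax / double(1 << rung) > dt) ++rung;  // halve until small enough
    return rung;
}

// During substep 's' of a big step split into 2^maxRung pieces, a particle
// on rung r is active iff s is a multiple of 2^(maxRung - r): particles on
// the finest rung are updated every substep, those on rung 0 only once.
bool isActive(int rung, int substep, int maxRung) {
    return substep % (1 << (maxRung - rung)) == 0;
}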

To study the influence of multistepping on ChaNGa's performance, we ran simulations of the dwarf dataset on NCSA's Tungsten. We started our tests with two sequential executions, selecting appropriate values of the simulated timestep such that one execution would be in singlestepping mode and the other in multistepping mode. Both executions simulated the same amount of cosmological time. For our chosen parameters, four timesteps in the singlestepping case corresponded to one big step in the multistepping case. Next, we repeated the pair of executions, varying the number of Tungsten processors used in each case. Table 3 compares the execution times obtained under those two modes.

It becomes clear from Table 3 that multistepping can greatly accelerate a simulation. On a single processor, we observe more than a three-fold savings in execution time when multistepping is used. The serial execution performance gain comes exclusively from the algorithm used for multistepping. This result confirms the importance of multistepping capability for the computational performance of a cosmological simulator.

Table 3 indicates that multistepping provides performance gains across the entire range of processors tested. Although the gain with 64 processors is still quite significant, the table reveals a trend of slight decrease in the gain as the number of processors increases. We investigated this issue and found that load imbalance was the cause of that effect, as we explain in the next subsection.


Figure 8. Comparison of phase durations among (a) singlestepped (duration: 613.3 s), (b) multistepped (duration: 428.7 s) and (c) load balanced multistepped (duration: 228 s) simulations of the dwarf dataset on 32 Blue Gene/L processors.

5.3 Balancing Load in Multistepped Executions

To better understand the load imbalance problem in multistepped executions, we studied ChaNGa executions on 32 processors of Blue Gene/L. Figures 8(a) and 8(b) correspond to parts of the execution of the singlestepping and multistepping cases, respectively. In both figures, the horizontal axis represents time, and the vertical axis corresponds to processors (32 in both cases). Shaded horizontal lines represent processor activity during the simulation.

In Figure 8(a), where singlestepping was used, there are four steps displayed. Figure 8(b) represents a big step, corresponding to the same cosmological duration as the four steps of singlestepping. The computation for this big step is divided into four substeps, and only select particles are active in each. In turn, only the corresponding TreePieces are active, leaving most with no work to do at all! As Figures 8(a) and 8(b) show, multistepping affords significant performance gains.

However, this scheme is not without its drawbacks. The first three substeps of Figure 8(b) demonstrate severe load imbalance: only a few processors are busy, while the others remain mostly idle. Such imbalance is exacerbated by the characteristics of the dwarf dataset, which has a highly non-uniform particle distribution. As the number of processors grows, this imbalance increases, making parallelism less effective than in singlestepping.

From our discussion, it is clear that measurement-based load balancing in multistepped executions is significantly more challenging than for the singlestepped case. Since different numbers of particles are active in each substep, load information from an immediately preceding substep is unusable in the current one. We must, therefore, correlate load across instances of corresponding substeps.

Each substep can be viewed as a different phase of execution. In Figure 8(b), the first and third substeps correspond to phase 0, since they involve only the fastest particles. Similarly, substep two constitutes phase 1 and the last represents phase 2. While the principle of persistence does not apply to neighboring substeps, it does hold across different instances of any phase. That said, the set of particles involved in the force computation of a certain phase is not a static entity. It changes during the course of the simulation, with variations in particle velocities. Clearly, a load balancer must take the multiphase nature of the simulation into account when considering a placement strategy.

5.3.1 Balancing Techniques

Table 4. (a) Comparison between execution times for one big step of single- and multistepped runs on 512 and 1,024 processors of Blue Gene/L; (b) reduction in execution time with increase in the number of TreePieces on 512 Blue Gene/L processors.

(a)  Number of Processors     512        1,024
     Singlestepped time (s)   4,652.16   2,387.52
     Multistepped time (s)    2,023.98   1,286.37

(b)  Number of TreePieces     4,096      8,192      16,384
     Time (s)                 2,524.97   2,198.91   2,023.98

We now outline a few techniques we devised in meeting the challenges posed by the problem of severe load imbalance during multiphase simulations.

Different strategies for different phases. Different phases of a computation present the load balancer with vastly differing load profiles. In Figure 8(b), the utilization profile for the first substep shows that only a few TreePieces were active during this phase. On the other hand, the last substep involves all the TreePieces. In such an event, significant gains in performance can be achieved by using fast and simple schemes for lower ranked phases, i.e. those involving fewer TreePieces. The time taken by the load balancer to compute placement decisions for such phases is critical, since they occur very often. Using a fast, greedy load balancing algorithm for the smaller phases, and the OrbRefineLB strategy (Section 5.1) when all particles are active, we obtain promising reductions in execution time, as a comparison between Figures 8(b) and 8(c) illustrates. There are marked improvements both between corresponding phases and in the overall time.

Multiphase instrumentation. We measure and store object loads during different phases separately. This is required since certain TreePieces are active over multiple phases. Indeed, TreePieces with the quickest particles are active over all computation phases and might induce a different load in each.

Model-based load estimation. We noted earlier that the principle of persistence holds over multiple instances of the same phase. However, the quickest particles, in moving from one region of space to another, might switch across distinct TreePieces between iterations. This might lead to large discrepancies between estimated and actual loads of TreePieces, especially for lower-ranked phases. For this reason, we allow for a model-based estimation of phase load. In our multiphase load balancer, we estimate the load of a TreePiece during phase φ as the corresponding value from the immediately preceding instance of φ. If that information is unavailable, we use a fraction f of the load from the first step. This first step involves all particles and executes before any other. The fraction f is the ratio of active particles to total particles in a TreePiece.
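
The estimation rule just described reduces to a small amount of state per TreePiece: use the load measured in the previous instance of the same phase if one exists, otherwise scale the all-particle first-step load by the fraction of particles active in that phase. The structure below is an illustrative rendering of that rule, not the balancer's actual interface.

#include <vector>

// Per-TreePiece record kept by a hypothetical multiphase balancer.
// The two vectors are sized to the number of phases in the big step.
struct PieceLoadModel {
    double firstStepLoad = 0.0;           // measured with all particles active
    std::vector<double> lastPhaseLoad;    // last measured load, per phase
    std::vector<double> activeFraction;   // active / total particles, per phase

    // Estimated load of this TreePiece for the upcoming instance of 'phase'.
    double estimate(int phase) const {
        if (lastPhaseLoad[phase] > 0.0)   // a previous instance was measured
            return lastPhaseLoad[phase];
        return activeFraction[phase] * firstStepLoad;   // fall back to the model
    }

    // Called after each substep with the measured load for that phase.
    void record(int phase, double measuredLoad) { lastPhaseLoad[phase] = measuredLoad; }
};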

5.3.2 Preliminary Results

We now describe some preliminary results obtained by using the principles outlined previously. We simulated the 80 million particle dataset, lambb, using multistepping and our multiphase load balancer. For brevity, we only present results from 512 and 1,024 processor runs on Blue Gene/L.

Table 4(a) illustrates the complementary effects of multistepping and load balancing. It compares execution times of a load balanced multistepped simulation and the same simulation when run in singlestepped mode, without load balancing. There are savings of nearly 50% on 1,024 processors.

Table 4(b) brings a related issue to light, that of the degree and grain of parallelism. We noticed a significant reduction in computation time as the number of TreePieces was increased for a given number of processors. With more TreePieces in the system, the load balancer is able to maintain a more even balance of work in the most frequently repeated phases. This prevents such phases from becoming the execution bottleneck. Indeed, the overhead of finer grained work distribution is offset by the prospect of greater balance in system load. For this reason multistepped executions stand to benefit even more from overdecomposition than their singlestepped counterparts.

These results motivate more detailed analyses of multiphase simulations in the context of load balance. In particular, we continue to develop more sophisticated load balancing strategies. For instance, the ORB assignment of TreePieces to processors in the OrbRefineLB strategy could be three dimensional. Further, the exchange of objects therein could be sensitive to machine topology. Finally, multiphase computations could benefit from better heuristics and more precise load estimation models.

6 Conclusions and Future Work

In this paper, we described our highly scalable parallel gravity code ChaNGa, based on the well-known Barnes-Hut algorithm. Scaling this computation to tens of thousands of processors requires a sophisticated load balancing strategy. To enable such automatic load balancing schemes, we developed ChaNGa using the Charm++ parallel programming system and its adaptive Run-Time System. Since Charm++ requires overdecomposition to empower its load balancing strategies, ChaNGa decomposes the particles into a large number of TreePieces. Multiple TreePieces assigned to a single processor share a software cache that stores nodes of the tree requested from remote processors as well as local nodes. Requests for remote data are effectively pipelined to mask any communication latencies.


Although using Charm++ enables load balancing, an appropriate load balancer must be chosen or written for each application. A pure load-based balancer was not adequate for this problem. We showed that a combination of a coordinate-based balancer, along with a measurement-based refinement strategy, led to good load balance without increasing communication overhead significantly. We demonstrated unprecedented speedups in cosmological simulations based on real datasets, scaling well to over 8,000 processors.

A new multiple time-stepping scheme was described that reduces the operation count, but is even more challenging to load balance. We demonstrated a multiphase load balancing scheme that meets this challenge, on several hundred processors.

Our code is currently being used for gravity computations by astrophysicists. We intend to add hydrodynamics to it in the future. Alternative decomposition schemes to further reduce communication, and runtime optimizations to mitigate the impact of communication costs, are also planned.

Acknowledgments

This work was supported in part by NSF grant ITR-0205611 and by TeraGrid allocations ASC050039N and ASC050040N. We thank Shawn Brown for help with the Cray XT3 at PSC, and Glenn Martyna and Fred Mintzer at IBM Research for the use of BGW (Blue Gene/L at Watson).

References

[1] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324:446–449, December 1986.

[2] W. Dehnen. A hierarchical O(N) force calculation algorithm. Journal of Computational Physics, 179:27–42, 2002.

[3] M. D. Dikaiakos and J. Stadel. A performance study of cosmological simulations on message-passing and shared-memory multiprocessors. In Proceedings of the International Conference on Supercomputing - ICS'96, pages 94–101, Philadelphia, PA, December 1996.

[4] F. Gioachin, A. Sharma, S. Chakravorty, C. Mendes, L. V. Kale, and T. R. Quinn. Scalable cosmology simulations on parallel machines. In VECPAR 2006, LNCS 4395, pages 476–489, 2007.

[5] F. Governato, B. Willman, L. Mayer, A. Brooks, G. Stinson, O. Valenzuela, J. Wadsley, and T. Quinn. Forming disc galaxies in ΛCDM simulations. MNRAS, 374:1479–1494, Feb. 2007.

[6] L. Greengard and V. I. Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73, 1987.

[7] K. Heitmann, P. M. Ricker, M. S. Warren, and S. Habib. Robustness of cosmological simulations. I. Large-scale structure. ApJS, 160:28–58, Sept. 2005.

[8] L. V. Kale. Performance and productivity in parallel programming via processor virtualization. In Proc. of the First Intl. Workshop on Productivity and Performance in High-End Computing (at HPCA 10), Madrid, Spain, February 2004.

[9] L. V. Kale and S. Krishnan. Charm++: Parallel programming with message-driven objects. In G. V. Wilson and P. Lu, editors, Parallel Programming using C++, pages 175–213. MIT Press, 1996.

[10] A. Kawai and T. Fukushige. $158/Gflops astrophysical N-body simulation with a reconfigurable add-in card and a hierarchical tree algorithm. In SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, New York, NY, USA, 2006. ACM Press.

[11] A. Kawai, T. Fukushige, and J. Makino. $7.0/Mflops astrophysical N-body simulation with treecode on GRAPE-5. In SC '99: Proceedings of the 1999 ACM/IEEE Conference on Supercomputing, New York, NY, USA, 1999. ACM Press.

[12] G. Lake, N. Katz, and T. Quinn. Cosmological N-body simulation. In Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, pages 307–312, Philadelphia, PA, February 1995.

[13] P. Londrillo, C. Nipoti, and L. Ciotti. A parallel implementation of a new fast algorithm for N-body simulations. Memorie della Societa Astronomica Italiana Supplement, 1:18, 2003.

[14] J. C. Phillips, G. Zheng, S. Kumar, and L. V. Kale. NAMD: Biomolecular simulation on thousands of processors. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pages 1–18, Baltimore, MD, September 2002.

[15] V. Springel. The cosmological simulation code GADGET-2. MNRAS, 364:1105–1134, 2005.

[16] V. Springel, S. D. M. White, A. Jenkins, C. S. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly, J. A. Peacock, S. Cole, P. Thomas, H. Couchman, A. Evrard, J. Colberg, and F. Pearce. Simulations of the formation, evolution and clustering of galaxies and quasars. Nature, 435:629–636, June 2005.

[17] V. Springel, N. Yoshida, and S. White. GADGET: A code for collisionless and gasdynamical simulations. New Astronomy, 6:79–117, 2001.

[18] M. S. Warren and J. K. Salmon. Astrophysical N-body simulations using hierarchical tree data structures. In Proceedings of Supercomputing '92, Nov. 1992.

[19] M. S. Warren and J. K. Salmon. A parallel hashed oct-tree N-body algorithm. In Proceedings of Supercomputing '93, Nov. 1993.

[20] G. Zheng. Achieving High Performance on Extremely Large Parallel Machines: Performance Prediction and Load Balancing. PhD thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 2005.