
Enzo-P / Cello
Scalable Adaptive Mesh Refinement for Astrophysics and Cosmology

James Bordner¹  Michael L. Norman¹  Brian O'Shea²

¹ University of California, San Diego / San Diego Supercomputer Center
² Michigan State University, Department of Physics and Astronomy

Extreme Scaling Workshop 2012, Blue Waters / XSEDE
16 July 2012

Enzo overview

- Parallel astrophysics and cosmology
  - implemented in C++ / Fortran
  - approximately 150K SLOC
  - parallelized using MPI / OpenMP
- Vast range of scales
  - astrophysical fluid dynamics
  - hydrodynamic cosmology
- Adaptive mesh refinement (AMR)
- Growing development community  [Norman et al.]

Enzo's physics and algorithms

- Eulerian hydrodynamics: piecewise-parabolic method (PPM)
- Lagrangian dark matter: particle-mesh method (PM)
- Self-gravity: FFTs on the root-level grid, multigrid on refinement patches
- Local physics: heating, cooling, chemistry, etc.
- MHD, RHD (ray-tracing or implicit FLD)  [John Wise]

Enzo's pursuit of scalability

- Enzo was born in the early 1990s, when "extreme" meant 100 processors
- Continual scalability improvements: MPI/OpenMP parallelism, a "neighbor-finding" algorithm, I/O optimizations
- Further improvement is getting harder: scalability requirements keep increasing, and the easy improvements have already been made
- This motivates a concurrent rewrite: Enzo-P, a "petascale" Enzo fork, built on the Cello AMR framework

[Sam Skillman, Matt Turk]

Enzo's AMR data structure

[Figure: a root grid with refinement patches]

- Patch-based SAMR
- Each patch is a C++ object
- Patches are assigned to processes
- N_P root patches, with grid sizes ≈ 64³ to 128³; refinement patches are generally smaller
- Refinement patches are initially local to their parent
- Load balancing relocates refinement patches
- Patch data (grids, particles) are distributed
- The AMR hierarchy structure is replicated on every process (see the sketch below)

[Tom Abel, John Wise, Ralf Kaehler]
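A minimal sketch of the split this slide describes, using illustrative types rather than Enzo's actual classes: lightweight hierarchy metadata replicated on every process, with heavyweight patch data allocated only on the owner.

    #include <vector>

    struct PatchMeta {               // replicated on all processes
      int level, owner_rank;
      int lower[3], upper[3];        // patch extent in the level's index space
    };

    struct PatchData {               // allocated only on owner_rank
      std::vector<double> fields;    // grid arrays
      std::vector<double> particles; // particle attributes
    };

Because every process holds a PatchMeta for every patch in the hierarchy, the replicated memory grows with the total patch count; this is the non-scalable term the Cello slides later attack.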

Enzo's timestepping

- Adaptive timestepping, by level
- Parallel within each level
  + less computation, by O(N_L)
  − reduced parallel efficiency

[Figure: space-time diagram of level-adaptive timesteps for levels L=0 to L=3, with the coarse step dt0 subdivided at each finer level; numbered tasks show the serialized order of level updates]

  EvolveLevel(L, dt_{L-1})
    1 refresh level-L ghost zones
    2 compute timestep dt_L
    3 advance level L by dt_L
    4 EvolveLevel(L+1, dt_L)
    5 correct fluxes
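A direct C++ transcription of the pseudocode above, as a sketch: the helper names are assumptions standing in for Enzo's internals, the cap by dt_parent is an assumption about how a level stays in sync with its parent, and the real routine subcycles step 4 until the finer level catches up.

    #include <algorithm>

    void   refresh_ghosts(int L);      // assumed helpers, not Enzo's API
    double compute_timestep(int L);
    void   advance(int L, double dt);
    bool   has_level(int L);
    void   correct_fluxes(int L);

    void EvolveLevel(int L, double dt_parent) {
      refresh_ghosts(L);                       // 1: exchange level-L ghost zones
      double dt_L = std::min(dt_parent,        //    stay in sync with the parent
                             compute_timestep(L)); // 2: CFL-limited dt for level L
      advance(L, dt_L);                        // 3: advance all level-L patches
      if (has_level(L + 1))
        EvolveLevel(L + 1, dt_L);              // 4: recurse (subcycled in full code)
      correct_fluxes(L);                       // 5: match coarse and fine fluxes
    }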

Enzo's scaling issues

- Memory usage: the AMR structure is non-scalable; the ghost-zone layer is three zones deep; memory fragmentation
- Data locality: disrupted by load balancing
- Parallel task definition: widely varying patch sizes; granularity is determined by the AMR
- Parallel task scheduling: parallel within a level, with synchronization between levels

[Elizabeth Tasker]

Talk outline

1 Enzo
  1 overview
  2 design
  3 scaling issues
2 Talk outline
3 Enzo-P / Cello
  1 overview
  2 design
  3 scaling solutions
  4 implementation using Charm++
  5 recursively generated parallel data structures

Enzo-P / Cello overview

- Enzo-P is intended to be a "petascale Enzo"
- Cello is a scalable AMR framework
- Parallelism using Charm++, with MPI as a backup
- 25K SLOC
- Work in progress
  - PPM HD / PPML MHD on distributed Cartesian grids
  - prototyping Charm++ AMR implementations

Cello's AMR data structure

[Figure: Forest → Tree → Patch → Block]

- Tree-based SAMR
- Patches define the unit of refinement: cubical, of varying size
- Blocks define the parallel tasks: flexible size and shape
- One Block per Charm++ chare
- The Tree structure is optionally distributed
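A minimal sketch of the containment relationships just listed, with illustrative types rather than Cello's actual classes:

    #include <vector>

    struct Block  { /* field and particle data for one task (one chare) */ };
    struct Patch  { std::vector<Block> blocks;  };  // cubical unit of refinement
    struct Tree   { std::vector<Patch> patches; };  // one octree of Patches
    struct Forest { std::vector<Tree>  trees;   };  // root-level collection of Trees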

Cello's timestepping

- Optional adaptive timestepping, by Patch or Block
- Local synchronization
- Parallelism between levels

[Figure: space-time diagram of block-adaptive timesteps for levels L=0 to L=3, with quantized steps dt0, dt1, dt2]

- Quantized timesteps (see the sketch below)
  + avoids "sliver" timesteps dt ≈ 0
  + reduced numerical roundoff
  − but the dt's are smaller than optimal
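A minimal sketch of one plausible quantization rule; the slides do not spell out the scheme, so the power-of-two choice here is an assumption:

    #include <algorithm>
    #include <cmath>

    // Return the largest power-of-two subdivision of the root timestep dt0
    // that does not exceed the block's stability limit dt_max. Every block's
    // dt then divides dt0 evenly, so no "sliver" step dt ≈ 0 is ever needed.
    double quantize_dt(double dt0, double dt_max) {
      int k = std::max(0, (int)std::ceil(std::log2(dt0 / dt_max)));
      return dt0 / std::pow(2.0, k);  // quantized dt <= dt_max
    }

The cost named on the slide follows directly: rounding down to a power-of-two fraction can return a dt up to 2× smaller than the optimal dt_max.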

Cello's improvements to scaling: Reducing the replicated AMR structure

- Fewer replicated patches: the "patch-merging" technique truncates full subtrees, for a ≈ 2-3× reduction

[Figure: original vs. patch-merged tree]

- Smaller replicated patches: a replicated Enzo grid object is 1544 bytes, while a replicated Cello Node is 16 bytes, a ≈ 100× reduction (a sketch follows)
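A minimal sketch of how a tree node can fit in 16 bytes; the actual Cello Node layout is not given on the slides, so these fields are assumptions:

    #include <cstdint>

    struct Node {
      int32_t child;   // pool index of the first child, or -1 for a leaf
      int32_t data;    // for leaves: owning process / block-data index
      int64_t flags;   // refinement and boundary flags
    };
    static_assert(sizeof(Node) == 16,
                  "replicated cost per tree node stays at 16 bytes");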


Cello's improvements to scaling: Distributed AMR structure

- Forest of Trees: each Tree is assigned a process range
  + simple indexing
  − load-balancing issues
- Space-filling curve (sketched below)
  + improved load balancing
  − requires a global scan
  − greater surface area
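A minimal sketch of ordering blocks along a space-filling curve; a Morton (Z-order) curve is used here for simplicity, since the slides do not specify which curve Cello uses:

    #include <cstdint>

    // Interleave the bits of (x, y, z) into a single Morton key.
    uint64_t morton3(uint32_t x, uint32_t y, uint32_t z) {
      auto spread = [](uint64_t v) {   // insert two zero bits between bits
        v &= 0x1fffff;                 // 21 bits per coordinate
        v = (v | v << 32) & 0x1f00000000ffffULL;
        v = (v | v << 16) & 0x1f0000ff0000ffULL;
        v = (v | v <<  8) & 0x100f00f00f00f00fULL;
        v = (v | v <<  4) & 0x10c30c30c30c30c3ULL;
        v = (v | v <<  2) & 0x1249249249249249ULL;
        return v;
      };
      return spread(x) | spread(y) << 1 | spread(z) << 2;
    }

Sorting blocks by morton3(ix, iy, iz) and cutting the sorted list into equal-weight segments gives each process a spatially compact set of blocks, which is the locality property the slide claims; the cut points are what require the global scan.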


Cello's improvements to scaling: Dynamically load balancing AMR block data

- Use Charm++ load balancing
- Space-filling curves: equally distribute load; maintain data locality; no parent-child communication
- Use measured performance: computation, memory usage
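Since the Charm++ runtime does the migration, a Block only has to be serializable. A minimal sketch, assuming a Cello-like Block class (the class body is illustrative, not Cello's code); migration in Charm++ works through the PUP (pack/unpack) framework:

    #include "pup_stl.h"  // Charm++ PUP support, including STL containers
    #include <vector>

    class Block /* : public CBase_Block, declared in a .ci interface file */ {
      std::vector<double> field;  // local grid data
      double dt;                  // this block's current timestep
     public:
      void pup(PUP::er &p) {      // called by the runtime to size, pack, or unpack
        p | field;
        p | dt;
      }
    };

A measurement-based balancer is selected at launch, e.g. with the runtime flag +balancer GreedyLB; the runtime then uses the measured load to decide which chares to migrate via pup().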

Cello's improvements to scaling: Parallel tasks

Task definition

- Flexible choice of size / shape
- Reduced size variability
  + constant grid Block sizes
  − variable subcycling and particle counts

[Figure: task sizes in Enzo vs. Cello]

Task scheduling

- Charm++: asynchronous, data-driven
- Blocks advance as soon as their ghost zones are refreshed (see the sketch below)
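A minimal sketch of the data-driven rule, in plain C++ rather than as a Charm++ entry method; the face-counting scheme and names are assumptions:

    struct Block {
      int faces_expected = 6;  // e.g. one neighbor per face in 3D
      int faces_received = 0;

      void receive_ghosts(/* neighbor face data */) {
        // ...copy the neighbor's data into this block's ghost zones...
        if (++faces_received == faces_expected) {
          faces_received = 0;
          advance();           // no global barrier: each block proceeds alone
        }
      }
      void advance() { /* take one timestep on this block's grid */ }
    };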


Cello's implementation using Charm++: Charm++ program structure

[Figure: a Charm++ program, with a Main chare and chares ChareA, ChareB, ChareC invoking entry methods pV() through pZ()]

- Charm++ program
  - Charm++ objects are chares
  - chares invoke entry methods
  - chares communicate via messages
- Charm++ runtime system
  - maps chares to processors
  - schedules entry methods
  - migrates chares to balance load
- Additional scalability features
  - fault tolerance: checkpoint/restart
  - dynamic load balancing
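A minimal generic Charm++ program, sketched to make the structure above concrete (this is not Cello code). Chare types and entry methods are declared in a .ci interface file, which charmc translates into the generated headers included below:

    /* hello.ci:
       mainmodule hello {
         mainchare Main { entry Main(CkArgMsg* m); };
         chare ChareA   { entry ChareA(); entry void pX(); };
       };
    */
    #include "hello.decl.h"

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg* m) {
        delete m;
        CProxy_ChareA a = CProxy_ChareA::ckNew();  // runtime picks the processor
        a.pX();                                    // asynchronous message send
      }
    };

    class ChareA : public CBase_ChareA {
     public:
      void pX() { CkExit(); }  // entry method, invoked by the runtime scheduler
    };

    #include "hello.def.h"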

Cello's implementation using Charm++: Charm++ collections of chares

- Chare Arrays: a distributed array of chares; migratable elements; flexible indexing
- Chare Groups: one chare per processor (non-migratable)
- Chare Nodegroups: one chare per node (non-migratable)
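A minimal sketch of the collection a Cello Block naturally maps to, a 3D chare array (a generic example, not Cello's class):

    /* block.ci:
       module block {
         array [3D] Block { entry Block(); entry void advance(); };
       };
    */
    #include "block.decl.h"

    class Block : public CBase_Block {
     public:
      Block() {
        // thisIndex.x/y/z locate this element in the distributed array
        CkPrintf("Block (%d,%d,%d) created\n",
                 thisIndex.x, thisIndex.y, thisIndex.z);
      }
      void advance() { /* one timestep on this block */ }
    };

    #include "block.def.h"

Creating the array and broadcasting to it, e.g. from the main chare:

    CProxy_Block blocks = CProxy_Block::ckNew(nx, ny, nz);
    blocks.advance();  // delivered to every element, wherever it currently lives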


Cello's implementation using Charm++: Three implementation strategies

[Figure: Hierarchy (H), Tree (T), Patch (P), and Block (B) chares arranged under each strategy]

1. Single chare array
   - efficient: a single access
   - restricted Tree depth (see the sketch below)
2. Composite chare arrays
   - a "tree" of chare arrays
   - less restricted Tree depth
3. Singleton chares
   - no depth restrictions
   - possible performance issues
   - how to generate them...?
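A minimal sketch of why a single chare array restricts Tree depth; the encoding below is an assumption, not Cello's actual index. If each array index packs the path from the root at 3 bits per octree level, a 64-bit index caps the depth at 21 levels:

    #include <cstdint>

    struct TreeIndex {
      uint64_t path;   // 3 bits per level: which child (0-7) was taken
      uint8_t  depth;  // number of valid levels packed into 'path'
    };

    // Index of child c (0-7) of node p; valid only while p.depth < 21,
    // which is the depth restriction named above.
    inline TreeIndex child_of(TreeIndex p, unsigned c) {
      return { p.path | (uint64_t)(c & 7) << (3 * p.depth),
               (uint8_t)(p.depth + 1) };
    }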


Recursively generated parallel data structures: Generating a "software network"

1 Start with a single Seed: conceptually the complete parallel data structure, holding the full processor range
2 grow() spawns remote Seeds: conceptually a partitioned parallel data structure with processor subranges; the Seeds are interlinked (see the sketch below)
3 Recurse down to individual elements: the final Seeds complete the parallel data structure, and the earlier Seeds remain as scaffolding

[Figure: a root seed grows into the final topology, with earlier seeds left as scaffolding]
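A minimal sketch of the recursion; Seed and grow() are named on the slides, but the body, the fanout parameter, and the range arithmetic here are assumptions:

    struct Seed {
      int p_begin, p_end;  // this Seed's processor range [p_begin, p_end)

      void grow(int fanout) {
        int np = p_end - p_begin;
        if (np <= 1) return;  // final Seed: holds actual structure elements
        for (int c = 0; c < fanout; ++c) {
          int b = p_begin + c * np / fanout;        // child subrange start
          int e = p_begin + (c + 1) * np / fanout;  // child subrange end
          // Spawn a remote Seed on processor b covering [b, e); link it to
          // this Seed (parent) and its siblings (neighbors), then recurse:
          // child->grow(fanout).
          (void)b; (void)e;
        }
      }
    };

With fanout f on N processors the recursion depth is about log_f N, consistent with the ≈ O(log N) generation cost on the next slide.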

Recursively generated parallel data structures: Using the software network

- Three types of Seed links
  - parent: reductions
  - neighbor: collaborations
  - child: distributions

[Figure: a Seed with a reduce link to its parent Seed, neighbor links to neighbor Seeds, and broadcast links to its child Seeds]

- Scalable: links per node O(1); generation ≈ O(log N)
- Load balancing: space-filling curves; hierarchical
- Usable in Cello: grids / octrees are "easy"; SeedGrid, SeedTree, etc.
- Usable with MPI: send an encoded seed creation; a link is a process rank + pointer (sketched below)
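A minimal sketch of the rank-plus-pointer link encoding mentioned for MPI use; the field names are assumptions. The pointer is meaningful only on the owning rank, so it can travel inside a message as an opaque value:

    #include <cstdint>

    struct SeedLink {
      int      rank;  // MPI rank that owns the remote Seed
      uint64_t addr;  // remote Seed's address; dereferenced only on 'rank'
    };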

Summary

                    Enzo               Enzo-P / Cello
  Parallelization   MPI                Charm++
  SAMR              patch-based        tree-based
  AMR structure     replicated         optionally distributed
  Remote patches    1544 bytes         16 bytes
  Timestepping      level-adaptive     block-adaptive
  Block sizes       ×1000 variation    constant
  Task scheduling   level-parallel     dependency-driven
  Load balancing    patch migration    space-filling curves
  Data locality     LB conflict        no LB conflict

http://cello-project.org

NSF PHY-1104819, AST-0808184