![Page 1: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/1.jpg)
1
Alex Aiken Stanford
Legion: Programming Heterogeneous, Distributed Parallel Machines
Joint work involving LANL, NVIDIA & Stanford
![Page 2: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/2.jpg)
2
Modern Supercomputers
Heterogeneity Processor kinds Relative performance
Distributed Memory Non-uniform Distinct from processors
Growing disparities FLOPS >> bandwidth bandwidth >> latency
![Page 3: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/3.jpg)
3
Programming System Goals
High Performance We must be fast
Performance Portability
Across many kinds of machines and over many generations
Programmability
Sequential semantics, parallel execution
![Page 4: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/4.jpg)
4
Can We Fulfill These Goals Today? Yes … at great cost:
Task graph for one time step on one node… … of a mini-app
Do you want to schedule that graph? (High Performance)
Do you want to re-schedule that graph for every new
machine? (Performance Portability)
Do you want to be responsible for generating that graph?
(Programmability)
Today: programmer’s responsibility
Tomorrow: programming system’s responsibility
![Page 5: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/5.jpg)
5
The Crux
How do we describe the data? Most programming systems focus on control Minimal facilities for organization/structure of data
Why?
Answer: Solve the aliasing problem Can two references refer to the same data?
Answer: Decouple naming data from layout/location
![Page 6: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/6.jpg)
6
Legion Approach
Capture the structure of program data
Decouple specification from mapping
Asynchronous tasking
Automate data movement parallelism discovery synchronization hiding long latency
![Page 7: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/7.jpg)
7
![Page 8: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/8.jpg)
8
Example: Circuit Simulation
![Page 9: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/9.jpg)
9
Example: Circuit Simulation
task simulate_circuit(Region[Node] N, Region[Wires] W)
{ }
Tasks are the unit of parallel execution.
Logical regions are (typed) collections Logical: no implied layout no implied location
![Page 10: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/10.jpg)
10
Partitioning
![Page 11: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/11.jpg)
11
Partitioning task simulate_circuit(Region[Node] N, Region[Wires] W)
{ [ P, S ] = par??on(ps_map, N)
}
SP
N
![Page 12: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/12.jpg)
12
Partitioning task simulate_circuit(Region[Node] N, Region[Wires] W)
{ Array[Region[Node]] private, shared, ghost; … [ P, S ] = par??on(ps_map, N) private = par??on(private_map, P) shared = par??on(shared_map, S) ghost = par??on(ghost_map, S)
}
N
s1 s3 …
SP
g1 g3 … p1 p3 …
![Page 13: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/13.jpg)
13
Region Trees
SP
p1 p3 … s1 s3 … g1 g3 …
N
![Page 14: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/14.jpg)
14
… g1 g3 … s1 s3 … p3
Region Trees
N
Locality
S
p1
disjoint
Independence
P
N
![Page 15: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/15.jpg)
15
Region Trees
N
Locality
N
SP
p1 p3 … s1 s3 … g1 g3 …
disjoint
Independence
![Page 16: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/16.jpg)
16
Region Trees
N
Locality
N
SP
p1 p3 … s1 s3 … g1 g3 …
Independence Aliasing
![Page 17: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/17.jpg)
17
Region Trees
N
Locality
N
SP
p1 p3 … s1 s3 … g1 g3 …
Independence Aliasing
possibly overlap
![Page 18: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/18.jpg)
18
Legion Tasks
task calc_currents(Piece p) :
task distribute_charge(Piece p) :
subtasks
task simulate_circuit(Region[Node] N, Region[Wires] W) :
{ … calc_currents(piece[0]); calc_currents(piece[1]); distribute_charge(piece[0]); distribute_charge(piece[1]); … }
![Page 19: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/19.jpg)
19
Legion Tasks task simulate_circuit(Region[Node] N, Region[Wires] W) :
{ … calc_currents(piece[0], , , ); calc_currents(piece[1], , , ); distribute_charge(piece[0], , , ); distribute_charge(piece[1], , , ); … }
task calc_currents(Piece p) :
task distribute_charge(Piece p) :
N, W
p.private, p.shared, p.ghost, p.wires
p.private, p.shared, p.ghost, p.wires
A task must declare the regions it will use.
Subtask containment: A subtask can only use (sub)regions accessible to its parent task.
Tasks appear to execute in program order.
p0
p1 s1
s0 g0
g1
p0 s0 g0
p1 s1 g1
![Page 20: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/20.jpg)
20
Interference
If T1#T2 then T1 and T2 can execute in parallel.
Informal definition: Two tasks T1 and T2 are non-interfering T1#T2 if there is no dependence between them on their region arguments.
![Page 21: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/21.jpg)
21
task simulate_circuit(Region[Node] N, Region[Wires] W) : { … calc_currents(piece[0], , , ); calc_currents(piece[1], , , ); distribute_charge(piece[0], , , ); distribute_charge(piece[1], , , ); … }
Execution Model
Interferes?
Tasks are issued in program order.
Interferes? Interferes?
p0
p1 s1
s0 g0
g1
p0 s0 g0
p1 s1 g1
![Page 22: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/22.jpg)
22
task simulate_circuit(Region[Node] N, Region[Wires] W) :
{ … calc_currents(piece[0], , , ); calc_currents(piece[1], , , ); distribute_charge(piece[0], , , ); distribute_charge(piece[1], , , ); … }
Privileges
N, W
p.private, p.shared, p.ghost, p.wires
p.private, p.shared, p.ghost, p.wires
ReadWrite(p.wires), Read(p.private, p.shared, p.ghost)
ReadOnly(p.wires), Reduce(p.private, p.shared, p.ghost)
ReadWrite(N,W)
task calc_currents(Piece p) :
task distribute_charge(Piece p) :
p0
p1 s1
s0 g0
g1
p0 s0 g0
p1 s1 g1
![Page 23: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/23.jpg)
23
Non-Interference Dimensions
Several dimensions of # operator
Entries (rows) Privileges Fields (columns)
Logical regions are a relational data model
Partitioning is selection (σ) Field-slicing is projection (π) Don’t support all relational operators
Node
Node
Node
Node
Node
Node
Node
Node
Node
Node
Voltage Capac. Induct. Charge
![Page 24: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/24.jpg)
24
Legion Summary Logical regions: a relational data model
Support partitioning and slicing Convey locality, independence, aliasing
Implicit task parallelism Task may have arbitrary sub-tasks Tasks declare region usage including privileges and fields
Tasks appear to execute in program order Execute in parallel when non-interference established
Machine independent specification of application
![Page 25: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/25.jpg)
25
![Page 26: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/26.jpg)
26
Legion Runtime System
t0:: r5, r7 t1: r0, r2 t2: r1, r2 t3: r4, r6
Legion Runtime Parallel
Distributed Execution
Tasks : Regions :: Instructions : Registers
Dep. Analysis Distribute Map Execute Resolve
Spec. Complete Commit
A Distributed Hierarchical Out-of-Order Task Processor
T
T T T
T T T T
Parent Task
![Page 27: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/27.jpg)
27
Dependence Analysis
CC[0] CC[1]
task simulate_circuit(Region[Node] N, Region[Wires] W) : ReadWrite(N,W)
{ … calc_currents(piece[0], ); calc_currents(piece[1] ); distribute_charge(piece[0] ); distribute_charge(piece[1] ); … } task calc_currents(Piece p) : ReadWrite(p.wires), Read(p.private, p.shared, p.ghost)
task distribute_charge(Piece p) : ReadOnly(p.wires), Reduce(p.private, p.shared, p.ghost)
N
SP
p0 p1 s0 s1 g0 g1
disjoint
read only
![Page 28: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/28.jpg)
28
Dependence Analysis
DC[0]
task simulate_circuit(Region[Node] N, Region[Wires] W) : ReadWrite(N,W)
{ … calc_currents(piece[0], ); calc_currents(piece[1], ); distribute_charge(piece[0], ); distribute_charge(piece[1], ); … } task calc_currents(Piece p) : ReadWrite(p.wires), Read(p.private, p.shared, p.ghost)
task distribute_charge(Piece p) : ReadOnly(p.wires), Reduce(p.private, p.shared, p.ghost)
N
SP
p0 p1 s0 s1 g0 g1
WAR w/CC[1]
CC[1]
CC[0]
read-after-write of wires w/CC[0]
![Page 29: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/29.jpg)
29
Mapping Interface Programmer selects:
Where tasks run Where regions are placed
Mapping computed dynamically
Decouple correctness from performance
29
t1
t2
t3
t4t5
rc
rw
rw1 rw2
rn
rn1 rn2
$
$
$
$
N U M A
N U M A
FB
D R A M
x86
CUDA
x86
x86
x86 Dep.
Analysis Distribute Map Execute Resolve Spec. Complete Commit
![Page 30: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/30.jpg)
30
Correctness Independent of Mapping N
SP
p0 p1 s0 s1 g0 g1
Node 1 Node 0
CC[1]
CC[0]
DC[0]
copy
task simulate_circuit(Region[Node] N, Region[Wires] W) : ReadWrite(N,W)
{ … calc_currents(piece[0], ); calc_currents(piece[1], ); distribute_charge(piece[0], ); distribute_charge(piece[1], ); … } task calc_currents(Piece p) : ReadWrite(p.wires), Read(p.private, p.shared, p.ghost)
task distribute_charge(Piece p) : ReadOnly(p.wires), Reduce(p.private, p.shared, p.ghost)
![Page 31: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/31.jpg)
31
Distribution
Dep. Analysis Distribute Map Execute Resolve
Spec. Complete Commit
Node 0 Node 1
T1 T2 After tasks are mapped they are distributed to target node
Task execution can generate sub-tasks
Do we need inter-node dependence checks?
Subtask containment: A subtask can only use (sub)regions accessible to its parent task.
t1 t2
![Page 32: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/32.jpg)
32
Independence Theorem
Let t1 be a subtask of T1 and t2 be a subtask of T2. Then
T1#T2 => t1#t2
T1 T2
t1 t2
#
#
![Page 33: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/33.jpg)
33
Independence Theorem
Let t1 be a subtask of T1 and t2 be a subtask of T2. Then
T1#T2 => t1#t2
Proof: Use subtask containment.
Note: Similar property holds in functional languages, but it holds in Legion even though we may imperatively mutate regions.
Observation: It is sufficient to test interference only of sibling tasks.
![Page 34: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/34.jpg)
34
Runtime Summary
A distributed hierarchical out-of-order task processor Analogous to hardware processors
Can exploit parallelism implicitly: Task-, data-, and nested-parallelism
Runtime builds task graph ahead of execution to hide latency and costs of dynamic analysis
Decouples mapping decisions from correctness Enables efficient porting and (auto) tuning
![Page 35: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/35.jpg)
35
![Page 36: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/36.jpg)
36
S3D
Production combustion simulation
Written in ~200K lines of Fortran
Direct numerical simulation using explicit methods
![Page 37: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/37.jpg)
37
S3D Versions Supports many chemical mechanism
DME (30 species) Heptane (52 species)
Fortran + MPI Vectorizes well MPI used for multi-core
“Hybrid” OpenACC Recent work by Cray/Nvidia/DoE
Legion interoperates with MPI
25
Detailed chemical kinetics are expensive
component in the simulation of chemically reacting flows. It isimportant because the fidelity of all subsequent steps of mecha-nism reduction depends on the fidelity of the detailed mechanism.In other words, the comprehensiveness of a reduced mechanismcannot exceed that of the detailed mechanism from which it isdeduced. This is a challenging task because, firstly, it is difficult tobe certain that all possible important species and reactions areidentified and included in the detailed mechanism. Furthermore,the number of reactions and species involved is large, and thedetermination of the rate constants of each of the identified reac-tions, either experimentally or computationally, is not a trivial task.
Lacking a systematic, first-principle procedure to identify allrelevant species and reactions that would render a mechanismcomprehensive, comprehensiveness can be considered based on theability of the mechanism to describe combustion phenomena asextensively as possible. There are two levels of considerations. First,since the nature of the collision dynamics is determined by theidentity of the colliding molecules as well as the frequency andenergetics of the collision, a comprehensive chemical description interms of themacroscopic thermodynamic properties would requireextensive coverage in the range of temperature, pressure, andcomposition of the reacting mixture. Second, in terms of combus-tion phenomena, comprehensiveness would require considerationsof homogeneous and diffusive ignition which cover low-, interme-diate- and high-temperature chemistry, steady burning andextinction which cover high-temperature chemistry, and premixedand nonpremixed flames which cover the relative concentrationsand mixedness of fuel and oxidizer. The global combustionresponses of interest would include the laminar flame speed, igni-tion and extinction strain rates, detonation induction length,detailed thermal and concentration structures of flames and deto-nations, oscillatory and pulsed unsteady effects to potentiallydiscriminate reactions of different time scales, and pollutantchemistry.
A final requirement for comprehensiveness is fuel hierarchy. Forexample, since hydrogen and CO oxidation constitute a part ofmethane oxidation, a methane mechanism must degenerate tothose for hydrogen and CO when all elementary reactions notrelated to them are stripped away. Thus amechanism developed fora fuel must contain descriptions of its intermediates and simplerfuels as its sub-mechanisms.
It is clear that since the size of a mechanism depends on theextent of comprehensiveness, some reduction can be achieved forrestricted comprehensiveness. Perhaps the most obvious restriction
is to fix the pressure to atmospheric because many fundamentaland practical combustion phenomena and processes take placeunder atmospheric pressure. Other restrictions can also beimposed, such as lean combustion, high-temperature flameswithout considering the possible presence of ignition described bylow-temperature chemistry, and homogeneous charge combustion.However, except for well-controlled laboratory-scale experiments,the combustion mode is frequently a mixed one in most complexand practical combustion situations, involving for example bothpremixed and nonpremixed reactants, or both ignition and flames.Consequently it is more conservative to apply unrestrictedcomprehensive mechanisms in simulations of complex flows.
4. Overview of mechanism reduction andfacilitated computation
The availability of a comprehensive detailed reaction mecha-nism does not mean that it can be readily adopted for computa-tional simulation. In fact, except for the smallest of fuels such ashydrogen andmethane, and for such simple combustion systems asthe 1-D laminar flame, detailed mechanisms of the larger fuels aresimply too large for simulation without substantial reduction.Fig. 10 shows the size of more than 20 detailed and moderatelyreduced skeletal mechanisms for hydrocarbon fuels of variousmolecular complexities compiled over the last two decades [15].Several interesting observations can be made here. First, thenumber of species, K, and reactions, I, increase with the size of themolecule, roughly in an exponential trend. Specifically, it is seenthat while typical mechanisms for C1 and C2 species consist of lessthan about a hundred species, those for realistic engine fuels consistof hundreds of species and thousands of reactions. Mechanisms ofsuch sizes are even difficult to apply in 1-D flame simulations. As anextreme example, the size of the compiled detailed mechanism formethyl decanoate [16], a biomass fuel surrogate, consists of 3036species and 8555 reactions. Computation using this mechanism istime consuming even for 0-D simulations.
The second observation from Fig. 10 is that the size of themechanisms tends to grow with time, as new discoveries inchemical kinetics are continuously being made. Furthermore, theemergence of computer-aided automatic mechanism generation[17–20] and computer software for rate parameter evaluation, such
10-5
10-4
10-3
10-2
10-1
0.5 0.6 0.7 0.8 0.9 1.0
Methane/Air, φ=1.0, p=1 atm
GRI-Mech 1.212-Step10-Step4-Step
Auto
-Igni
tion
Del
ay (s
ec)
1000/T (1/K)
Fig. 9. Comparison of predicted ignition delay times of atmospheric, stoichiometricmethane–air mixtures using various reduced mechanisms and the detailed mecha-nism, showing the inadequacy of the four-step class of mechanisms.
101 102 103 104
102
103
104
before 20002000 to 2005after 2005
iso-octane (LLNL)
iso-octane (ENSIC-CNRS)
n-butane (LLNL)
CH4 (Konnov)
neo-pentane (LLNL)
C2H4 (San Diego)
CH4 (Leeds)
MethylDecanoate(LLNL)
C16 (LLNL)
C14 (LLNL)C12 (LLNL)
C10 (LLNL)
USC C1-C4USC C2H4
PRF
n-heptane (LLNL)
skeletal iso-octane (Lu & Law)skeletal n-heptane (Lu & Law)
1,3-ButadieneDME (Curran)C1-C3 (Qin et al)
GRI3.0
Num
ber o
f rea
ctio
ns, I
Number of species, K
GRI1.2
I = 5K
Fig. 10. Size of selected detailed and skeletal mechanisms for hydrocarbon fuels,together with the approximate years when the mechanisms were compiled.
T.F. Lu, C.K. Law / Progress in Energy and Combustion Science 35 (2009) 192–215196
From Lu and Law, PECS, 2009
• Chemical source term evaluation is computationally intensive
• Thousands of elementary reaction steps accumulated to global species reaction rates
• Often the target for model reductions or algorithmic improvements
• How fast can we compute detailed chemical kinetics on accelerators?
5
Planning the science simulation
• Recent 3D simulation on Jaguar was used to extrapolate and plan a target Titan simulation
• Planned simulation will have more grid points and/or larger chemistry
• Will need a month on 12,000 hybrid nodes of Titan
Figure 5: Computational domain and grid to be used for simulations of the CRF HCCI engine.
Figure 6: Reaction and diffusion structures for OH radical during the third stage thermal explosion of a high-pressureDME fueled autoignition process.Recent 3D DNS of auto-ignition with 30-species
DME chemistry (Bansal et al. 2011)
![Page 38: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/38.jpg)
38
Parallelism in S3D
Data is large 3D cartesian grid of cells
Typical per-node subgrid is 483 or 643 cells Nearly all kernels are per-cell Embarrassingly data parallel
Hundreds of tasks Significant task-level parallelism
Except… Computational intensity is low Large working sets per cell (1000s of temporaries) Performance limiter is data, not compute
![Page 39: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/39.jpg)
39
S3D Tasks in Legion
rhs = rhsf(q)
![Page 40: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/40.jpg)
40
S3D Task Parallelism
One call to Right-Hand-Side-Function (RHSF) as seen by the Legion runtime
Called 6 times per time step by Runge-Kutta solver Width == task parallelism H2 mechanism (only 9 species) Heptane (52 species) is significantly wider
Manual task scheduling would be difficult!
![Page 41: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/41.jpg)
41
Mapping for Heptane 483
4 AMD Interlagos
Integer cores for Legion Runtime
8 AMD Interlagos FP
cores for application
NVIDIA Kepler K20
Dynamic Analysis for (rhsf+2) Clean-up/meta tasks
![Page 42: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/42.jpg)
42
Heptane Mapping for 963 Handle larger problem sizes per node
Higher computation-to-communication ratios More power efficient
Not enough room in 6 GB GPU framebuffer OpenACC requires code changes
Legion analysis is independent of problem size Larger tasks -> fewer runtime cores
![Page 43: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/43.jpg)
43
![Page 44: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/44.jpg)
44
Legion S3D DME Performance
1.71X - 2.33X faster between 1024 and 8192 nodes Larger problem sizes have higher efficiency
1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192Nodes
0
50000
100000
150000
200000
250000
Thro
ughp
utPe
rNod
e(P
oint
s/s)
Titan Legion 483
Titan Legion 643
Titan Legion 963
Keeneland Legion 483
Keeneland Legion 643
Keeneland Legion 963
Titan OpenACC 483
Titan OpenACC 643
![Page 45: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/45.jpg)
45
Legion Heptane Performance 1.73X - 2.85X faster between 1024 and 8192 nodes Higher throughput on Keeneland (balanced CPU+GPUs)
1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192Nodes
0
50000
100000
150000
200000
Thro
ughp
utPe
rNod
e(P
oint
s/s)
Titan Legion 483
Titan Legion 643
Titan Legion 963
Keeneland Legion 483
Keeneland Legion 643
Keeneland Legion 963
Titan OpenACC 483
Titan OpenACC 643
![Page 46: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/46.jpg)
46 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192
Nodes
0
10000
20000
30000
40000
50000
60000
70000
80000
Thro
ughp
utPe
rNod
e(P
oint
s/s)
Mixed CPU-GPU 483
Mixed CPU-GPU 643
All-GPU 323
All-GPU 483
Legion PRF Performance 116 species mechanism, >2X as large as heptane Legion uses different mapping approach
![Page 47: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/47.jpg)
47
![Page 48: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/48.jpg)
48
Liszt Legion
local liszt InitLength(e : dragon.edges) var delta = e.head.pos - e.tail.pos e.rest_len = L.len(delta) end
dragon.edges:foreach(InitLength)
Phase Analysis Legion Permissions
edges.head : LEGION_READ edges.tail : LEGION_READ edges.rest_len : LEGION_READ_WRITE
CF AF IL CF AF
ME ME
Legion Task Graph
dragon.edges:map(InitLength) for i = 1,300 do dragon.vertices:foreach(ComputeForces) dragon.vertices:foreach(ApplyForces) dragon.vertices:foreach(MeasureEnergy) end
Seq. Bulk Data-Parallel
![Page 49: Legion: Programming Heterogeneous, Distributed Parallel ...hpc.pnl.gov/conf/wolfhpc/2015/talks/aiken.pdf1 % g 1 % 23 Non-Interference Dimensions Several dimensions of # operator Entries](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f92d4b7bb83294ca31b4b5d/html5/thumbnails/49.jpg)
49
Legion
Legion website: http://legion.stanford.edu
Github repo: http://github.com/stanfordlegion
Questions?