Improved pipelining and domain decomposition in QuickPIC
Chengkun Huang (UCLA/LANL) and members of the FACET collaboration
SciDAC COMPASS all hands meeting 2009
LA-UR 09-06300
Goals and vision for advanced accelerator modeling under COMPASS
Help in designing linear colliders based on staging PWFA or LWFA sections
Develop integrated codes for modeling a staged wakefield “system”.
Develop efficient and high fidelity modeling for optimizing a single LWFA or PWFA stage.
Develop PIC algorithms for advanced architectures
Enable routine modeling of existing and future experiments:
  BELLA
  FACET
  Others
Real time steering of experiments?
Code validation against numerous worldwide experiments
Code verification
QuickPIC Implementation (1D domain)
The driver evolution can be calculated in a 3D moving box, while the plasma response can be solved slice by slice, with ξ being a time-like variable.
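The two-level structure described above can be sketched as a nested loop: an outer loop over big 3D steps for the driver and an inner sweep over slices for the plasma. This is only an illustrative skeleton, not QuickPIC's code. The function names `solve_plasma_slice`, `push_beam`, and `quasi_static_loop` are hypothetical, each slice is reduced to a single number, and the "physics" inside is a dummy.

```python
def solve_plasma_slice(plasma, beam_density):
    """Hypothetical stand-in for the 2D plasma solve on one slice.

    Takes the plasma state handed over from the previous slice and
    returns the updated state plus a dummy 'field' for this slice.
    """
    plasma = plasma + beam_density   # dummy response to the beam charge
    field = -plasma                  # dummy wakefield
    return plasma, field


def push_beam(beam, fields, ds):
    """Hypothetical stand-in for the 3D beam update over one big step ds."""
    return [b + ds * f for b, f in zip(beam, fields)]


def quasi_static_loop(beam, n_steps, ds):
    """beam is a list with one (dummy) value per slice, head first."""
    for _ in range(n_steps):             # outer loop: big 3D steps
        plasma = 0.0                     # fresh plasma enters at the head
        fields = []
        for beam_density in beam:        # inner loop: sweep slice by slice
            plasma, field = solve_plasma_slice(plasma, beam_density)
            fields.append(field)
        beam = push_beam(beam, fields, ds)
    return beam


result = quasi_static_loop([1.0, 1.0, 1.0, 1.0], n_steps=1, ds=0.1)
print(result)
```

The point of the sketch is the data dependency: each slice needs the plasma state from the slice ahead of it, so the inner loop is sequential within one 3D step.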
Exploiting more parallelism: Pipelining
• Pipelining exploits parallelism in a sequential operation stream and can be applied at various levels.
• Modern CPU designs include an instruction-level pipeline, which improves performance by increasing throughput.
• In scientific computation, software-level pipelines are less common because the parallelism in the algorithm is often hidden.
• We have implemented a software-level pipeline in QuickPIC to exploit the parallelism in the quasi-static algorithm.
[Figure labels: moving window; plasma response]

             Instruction pipeline     Software pipeline
Operand      Instruction stream       Plasma slice
Operation    IF, ID, EX, MEM, WB      Plasma/beam update
Stages       5 ~ 31                   1 ~ (# of slices)
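The analogy in the table can be made concrete with a small schedule generator. The sketch below assumes an idealized pipeline in which every stage takes one "tick" per 3D step and stage s can only start a step after stage s-1 has finished its part of that step. The function name `pipeline_schedule` and the 0-based numbering are illustrative choices, not taken from QuickPIC.

```python
def pipeline_schedule(n_stages, n_steps):
    """Return, for each tick, the (stage, step) pairs that are active.

    In this idealized model stage s works on 3D step (tick - s), so the
    pipeline needs n_steps + n_stages - 1 ticks to finish all steps.
    """
    ticks = []
    for tick in range(n_steps + n_stages - 1):
        active = [(s, tick - s) for s in range(n_stages)
                  if 0 <= tick - s < n_steps]
        ticks.append(active)
    return ticks


for tick, active in enumerate(pipeline_schedule(n_stages=4, n_steps=3)):
    print(tick, active)
```

Early and late ticks have fewer active stages (pipeline fill and drain); in between, all stages work at once on different 3D steps.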
Implementation of laser envelope model
Ponderomotive guiding center approximation: big 3D time step
Plasma evolution: Maxwell’s equations in the Lorentz gauge under the quasi-static approximation
ξ = ct - z
This can be pipelined too
Scaling to 100,000+ processors and enabling high resolution capability: Pipelining
[Figure: Schematic of pipelining. Labels: beam sections 1, 2, 3, 4; initial plasma slab; Stage 1, Stage 2, Stage 3, Stage 4, each performing "solve plasma response" then "update beam".]
Implementation in QuickPIC
• Communication overlap with computation
• Particles leaving pipeline stage are buffered
• Overall efficiency as high as 85% (2048 processors in 64 pipeline stages)
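For orientation, the quoted configuration works out to 2048 / 64 = 32 processors per pipeline stage. The sketch below also evaluates a textbook fill/drain bound for an ideal pipeline. That bound is an assumption introduced here for illustration: it ignores communication and load imbalance, it is not how the 85% figure was measured, and the step counts in the loop are arbitrary example values, not numbers from the talk.

```python
def procs_per_stage(total_procs, n_stages):
    """Processors available to each pipeline stage, assuming an even split."""
    if total_procs % n_stages:
        raise ValueError("total_procs must be a multiple of n_stages")
    return total_procs // n_stages


def fill_drain_efficiency(n_stages, n_steps):
    """Fraction of stage-ticks doing useful work in an ideal pipeline.

    With one tick per stage per 3D step, n_steps steps need
    n_steps + n_stages - 1 ticks, so some stage-ticks are idle
    while the pipeline fills and drains.
    """
    return n_steps / (n_steps + n_stages - 1)


print("processors per stage:", procs_per_stage(2048, 64))
for n_steps in (64, 640, 6400):   # illustrative step counts only
    eff = fill_drain_efficiency(64, n_steps)
    print(f"{n_steps:5d} steps -> ideal efficiency {eff:.3f}")
```

In this simplified model the fill/drain penalty shrinks as the run gets longer relative to the number of stages, which is one reason deep pipelines suit long simulations.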
[Figure: Basic quasi-static mode. Computation speedup (axis 0 to 14) versus number of processors used (axis 0 to 150), for non-pipeline and pipeline (4 procs/group) runs.]
Performance in pipeline mode (with 1D decomposition for the beam)
• Strong scaling study: the problem size is fixed and the number of processors is increased by adding pipeline stages.
• In each stage, the number of processors is chosen according to the transverse size of the problem.
• Benchmarks show that pipeline operation scales to at least 1,000 processors with substantial throughput improvement.
[Figure panels: 256 × 256 × 1024 cells; 64 × 64 × 1024 cells]
Feng et al., JCP 228(15), 5340 (2009)
Domain decomposition and pipelining
Increase throughput through more execution units (similar to CPU design).
Pipeline stages are separated in time/space, so pipelining can be viewed as a special case of domain decomposition.
Each pipeline stage can work with an arbitrary domain decomposition (for the beam):
1D decomposition along the propagation direction works well for small problems, but it limits the number of CPUs that can be used.
Domain decomposition in the plasma solver is unaffected (no benefit from using 2D decomposition).
General 2D or 3D decomposition for the beam requires complicated data redistributions, but is good for load balancing.
Matching the domain decomposition for the beam and the plasma is simple and avoids data redistributions.
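The trade-off between 1D and 2D decomposition can be illustrated with a small helper that computes the block of cells each processor owns. This is a sketch under stated assumptions: an even block decomposition, axes ordered (x, y, z) with z as the propagation direction, and hypothetical function names `local_shape` and `max_procs_1d`. The 64 × 64 × 1024 grid is one of the benchmark sizes mentioned earlier; the processor counts are example values.

```python
def local_shape(grid, procs):
    """Cells per processor along each axis for an even block decomposition.

    grid  : (nx, ny, nz) global cell counts
    procs : (px, py, pz) number of partitions along each axis
    """
    for n, p in zip(grid, procs):
        if n % p:
            raise ValueError("grid size must be divisible by partition count")
    return tuple(n // p for n, p in zip(grid, procs))


def max_procs_1d(grid, axis=2, min_cells=1):
    """Upper limit on processors when only one axis is partitioned."""
    return grid[axis] // min_cells


grid = (64, 64, 1024)
print("1D along z, 16 procs:", local_shape(grid, (1, 1, 16)))
print("2D in y and z, 64 procs:", local_shape(grid, (1, 4, 16)))
print("1D processor limit:", max_procs_1d(grid))
```

With only the propagation axis partitioned, the processor count cannot exceed the number of cells along that axis; partitioning a second axis lifts that limit at the cost of extra data movement.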
[Figure: 1D decomposition; 2D decomposition]
Pipeline algorithm verification and scaling
[Figure panels: Verification; Parallel scaling. Curves: without pipelining; with pipelining.]
2048×2048×256 grids, 4 particles/cell, 128 cores/stage, smallest domain 2048×16×2.
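The quoted numbers can be cross-checked with a few lines of arithmetic. The sketch assumes an even block decomposition and that the third grid dimension is the propagation direction split across pipeline stages. The implied stage count and total core count are inferences from the stated sizes, not figures given in the talk.

```python
grid = (2048, 2048, 256)        # global grid from the slide
smallest = (2048, 16, 2)        # smallest per-core domain from the slide
cores_per_stage = 128           # from the slide

# Number of partitions along each axis if every core holds `smallest`.
parts = tuple(g // s for g, s in zip(grid, smallest))

# Partitions in the transverse plane should match the cores in one stage.
transverse_parts = parts[0] * parts[1]

# Inferred, not stated in the source: one stage per partition along axis 3.
implied_stages = parts[2]
implied_total_cores = cores_per_stage * implied_stages

print("partitions per axis:", parts)
print("transverse partitions:", transverse_parts)
print("implied pipeline stages:", implied_stages)
print("implied total cores:", implied_total_cores)
```

Under these assumptions the transverse partition count equals the stated 128 cores per stage, which suggests the beam and plasma share a matching decomposition within each stage.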