improved pipelining and domain decomposition in quickpic chengkun huang (ucla/lanl) and members of...

9
Improved pipelining and domain decomposition in QuickPIC Chengkun Huang (UCLA/LANL) and members of FACET collaboration SciDAC COMPASS all hands meeting 2009 LA-UR 09-06300

Upload: doris-barber

Post on 29-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Improved pipelining and domain decomposition in QuickPIC Chengkun Huang (UCLA/LANL) and members of FACET collaboration SciDAC COMPASS all hands meeting

Improved pipelining and domain decomposition in QuickPIC

Chengkun Huang (UCLA/LANL)and members of FACET

collaboration

SciDAC COMPASS all hands meeting 2009

LA-UR 09-06300

Page 2: Improved pipelining and domain decomposition in QuickPIC Chengkun Huang (UCLA/LANL) and members of FACET collaboration SciDAC COMPASS all hands meeting

Help in designing linear colliders based on staging PWFA or LWFA sections

Develop integrated codes for modeling a staged wakefield “system”.

Develop efficient and high fidelity modeling for optimizing a single LWFA or PWFA stage.

Develop PIC algorithms for advanced architectures

Enable routine modeling of existing and future experiments:

BELLA

FACET

Others

Real time steering of experiments?

Code validation against numerous worldwide experiments

Code Verification

Goals and vision for advancedaccelerator modeling under COMPASS

Page 3: Improved pipelining and domain decomposition in QuickPIC Chengkun Huang (UCLA/LANL) and members of FACET collaboration SciDAC COMPASS all hands meeting

QuickPIC Implementation (1D domain)

The driver evolution can be calculated in a 3D moving box, while the plasma response can be solved for slice by slice with the being a time-like variable.

Page 4: Improved pipelining and domain decomposition in QuickPIC Chengkun Huang (UCLA/LANL) and members of FACET collaboration SciDAC COMPASS all hands meeting

Exploiting more parallelism: Pipelining

• Pipelining technique exploits parallelism in a sequential operation stream and can be adopted in various levels.

• Modern CPU designs include instruction level pipeline to improve performance by increasing the throughput.

• In scientific computation, software level pipeline is less common due to hidden parallelism in the algorithm.

• We have implemented a software level pipeline in QuickPIC to exploit the parallelism in a quasi-static algorithm.

Moving Window

plasma response

Instruction pipeline Software pipeline

Operand Instruction stream Plasma slice

OperationIF, ID, EX, MEM,

WBPlasma/beam

update

Stages 5 ~ 31 1 ~(# of slices)

Page 5: Improved pipelining and domain decomposition in QuickPIC Chengkun Huang (UCLA/LANL) and members of FACET collaboration SciDAC COMPASS all hands meeting

Ponderomotive guiding center approximation: Big 3D time step

Plasma evolution: Maxwell’s equations Lorentz Gauge Quasi-Static Approximation

Implementation of laser envelope model

ct z

This can be pipelined too

Page 6: Improved pipelining and domain decomposition in QuickPIC Chengkun Huang (UCLA/LANL) and members of FACET collaboration SciDAC COMPASS all hands meeting

solve plasma

response

update beam

solve plasma

response

update beam

solve plasma

response

update beam

solve plasma

response

update beam

beam

1 2 3 4Initial plasma slab

Scaling to 100,000+ processors and enabling high resolution capability: Pipelining

Stage 1 Stage 2 Stage 3 Stage 4

Schematic of Pipelining

Implementation in QuickPIC

• Communication overlap with computation• Particles leaving pipeline stage are

buffered• Overall efficiency as high as 85% (2048

processors in 64 pipeline stages)

Page 7: Improved pipelining and domain decomposition in QuickPIC Chengkun Huang (UCLA/LANL) and members of FACET collaboration SciDAC COMPASS all hands meeting

Basic Quasi-Static mode

0

2

4

6

8

10

12

14

0 50 100 150number of processor used

co

mp

uta

tio

n s

peed

up non-pipeline

Pipeline(4procs/group)

Performance in pipeline mode (with 1d decomp. for the beam)

• Fixed problem size, strong scaling study, increase number of processors by increasing pipeline stages

• In each stage, the number of processors is chosen according to the transverse size of the problem.

• Benchmark shows that pipeline operation can be scaled to at least 1,000+ processors with substantial throughput improvement.

256 256 1024 cells 64 64 1024 cells

Feng et al, 228(15), 5340, JCP 2009

Page 8: Improved pipelining and domain decomposition in QuickPIC Chengkun Huang (UCLA/LANL) and members of FACET collaboration SciDAC COMPASS all hands meeting

Domain decomposition and pipelining

Increase throughput through more execution units (similar to CPU design).

Pipeline stages separated in time/space, it can be viewed as a special case of domain decomposition.

Each pipeline stage can work with arbitrary domain decomposition (for the beam) :

1D decomposition along propagation direction works well for small problem, but it limits the amount of CPUs that can be used.

Domain decomposition in plasma solver is unaffected (no benefit for using 2D decomposition).

General 2d decomposition or 3D decomposition for the beam requires complicate data redistributions, but is good for load-balancing.

Matching domain decomposition for the beam and the plasma is simple and avoid data redistributions.

1D decomposition

2D decomposition

Page 9: Improved pipelining and domain decomposition in QuickPIC Chengkun Huang (UCLA/LANL) and members of FACET collaboration SciDAC COMPASS all hands meeting

Verification Parallel scaling

W/O Pipelining

With pipelining 

Pipeline algorithm verification and scaling

2048×2048×256 grids, 4 particles/cell, 128 cores/stage, smallest domain 2048×16×2.