Alma Mater Studiorum · Università di Bologna
Faculty of Mathematical, Physical and Natural Sciences
Degree Course in Physics
Development of algorithms
for an electromagnetic particle in cell code
and implementation on a hybrid
architecture (CPU+GPU)
Master's Thesis in Physics
Supervisor:
Prof. Giorgio Turchetti
Presented by:
Francesco Rossi
Session II
Academic Year 2010-2011
Contents
Acknowledgements 1
Introduction 2
Summary 5
1 The Physics of plasma based laser accelerators 6
1.1 Basic plasma physics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.1 Debye Length and plasma parameter . . . . . . . . . . . . . . . . . 7
1.1.2 Kinetic and fluid plasma descriptions . . . . . . . . . . . . . . . . . 9
1.1.3 Collective motion: plasma electron waves . . . . . . . . . . . . . . 11
1.1.4 The collision frequency . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1.5 Collisionless limit and the plasma parameter Λ . . . . . . . . . . . 12
1.1.6 Electromagnetic waves propagation in a cold plasma . . . . . . . . 14
1.2 Laser plasma acceleration physical concepts and regimes . . . . . . . . . . 16
1.2.1 Electron acceleration: LWFA . . . . . . . . . . . . . . . . . . . . . 16
1.2.2 Electron acceleration: bubble regime . . . . . . . . . . . . . . . . . 18
1.2.3 Ion acceleration: TNSA and RPA . . . . . . . . . . . . . . . . . . . 21
2 The numerical modelization 27
2.1 The particle-in-cell method . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.1 Phase space representation . . . . . . . . . . . . . . . . . . . . . . 27
2.1.2 Passes of an electromagnetic PIC code and numerical parameters
of a laser plasma interaction simulation . . . . . . . . . . . . . . . 30
2.1.3 Interpolation and deposition using shape functions . . . . . . . . . 31
2.1.3.1 Force interpolation . . . . . . . . . . . . . . . . . . . . . . 31
2.1.3.2 Charge and current deposition . . . . . . . . . . . . . . . 32
2.1.3.3 Common shapefunctions and shapefactors . . . . . . . . . 32
2.2 The “standard” second-order PIC: leapfrog and FDTD . . . . . . . . . . . 38
2.2.1 Solving the Maxwell equations numerically using the Yee Lattice . 38
2.2.2 Boris pusher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.2.3 Bringing it all together . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3 Charge conservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.3.1 Spectral Poisson corrector . . . . . . . . . . . . . . . . . . . . . . . 45
2.3.2 Esirkepov Shape functions . . . . . . . . . . . . . . . . . . . . . . . 47
2.4 Envelope model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.4.1 Laser envelope equation derivation . . . . . . . . . . . . . . . . . . 50
2.4.2 Plasma motion and wakefield equations in cylindrical comoving
coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3 INF&RNO 54
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 Numerical Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1 Plasma modelization . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1.1 Fluid plasma model . . . . . . . . . . . . . . . . . . . . . 56
3.2.1.2 PIC plasma model . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1.3 Numerical choices . . . . . . . . . . . . . . . . . . . . . . 56
3.3 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.1 Shared memory (OpenMP) . . . . . . . . . . . . . . . . . . . . . . 60
3.3.2 Message-Passing (MPI) . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.2.1 1D domain decomposition . . . . . . . . . . . . . . . . . . 62
3.3.2.2 2D domain decomposition . . . . . . . . . . . . . . . . . . 64
3.3.3 GPU parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.4 3D Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4 Jasmine: a flexible, hybrid (CPU+GPU), PIC framework 70
4.1 Structure of the framework . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 GPU Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.1 GPU architecture and CUDA programming model . . . . . . . . . 72
4.2.1.1 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.1.2 Parallelism and memory model (hierarchy) . . . . . . . . 74
4.2.1.3 Programming GPUs: CUDA libraries and our multi ar-
chitecture approach . . . . . . . . . . . . . . . . . . . . . 76
4.2.2 Our approach for writing cross architecture code . . . . . . . . . . 78
4.2.3 Current and density deposition algorithm . . . . . . . . . . . . . . 79
4.3 Hybrid parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.1 Inter-node & intra-node parallelization . . . . . . . . . . . . . . . . 80
4.3.2 Inter-node communication . . . . . . . . . . . . . . . . . . . . . . . 83
4.4 Performance benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Conclusions 85
Acknowledgements
For giving me the opportunity to carry out the work in this thesis, I would like to thank
my supervisor, Prof. Giorgio Turchetti, a point of reference throughout my studies.
Particular thanks go to Dr. Carlo Benedetti, who introduced and guided me,
with contagious enthusiasm, through this field of research and gave me the opportunity to
collaborate, for a while, with the LOASIS group at Lawrence Berkeley National Lab, one
of the centers of excellence in the field. I also thank him personally for the great time he made
me spend in Berkeley.
During this period, it has always been a pleasure to work with both the groups in
Bologna and in Berkeley, and I want to give special thanks to all the people I have been
working with, whose advice and support have always been fundamental and
encouraging. It was a pleasure to meet my supervisor in Berkeley, Carl Schroeder, and to
collaborate with Andrea Sgattoni and Pasquale Londrillo, who helped me with incredible
generosity.
It has been an honor and incredibly motivating to work for experts of the likes of Wim P.
Leemans, Eric Esarey and Cameron Geddes. It has been a pleasure to be in contact with
Professor Graziano Servizi, Nicholas Matlis, Daniel Bakker, Lule Yu and Joshua Renner,
who gave me the chance for very fruitful discussions, though not always centered on
the theme of this thesis.
I thank CINECA, and in particular Carlo Cavazzoni and Riccardo Brunino, for showing
great interest in my work and giving me the possibility to benchmark my codes on their
new machine PLX.
Sofia, Emma and Rolando have always been next to me, and the support they have always
given me is invaluable.
I acknowledge support by the INFN, the “Collegio Superiore dell’Università di Bologna”
and by the Department of Energy under the Office of Science contract No. DE-AC02-
05CH11231.
Introduction
The development of laser-plasma accelerators was inspired by the work of Tajima and
Dawson [32, 1979] and was boosted by the rapid development of high-intensity laser
systems, which make it possible to explore interactions that drive electrons up to ultra-relativistic
velocities. The particle beams generated in such accelerators are expected to find
application in a wide range of contexts, including high energy physics, proton therapy for the
treatment of cancer and the generation of intense X-ray radiation.
Plasma-based particle accelerators are of great interest because of their capability to sustain
extremely high acceleration gradients. The accelerating electric field in a conventional,
radio-frequency, linear accelerator (linac) is limited, for structure-breakdown reasons, to
approximately 100 MV/m. An ionized plasma (with electron number density n0), instead,
can sustain electron oscillations that generate electric wakefields that can exceed the
non-relativistic wave-breaking limit:
$$ E_0(\mathrm{V/m}) \simeq 96\,\sqrt{n_0(\mathrm{cm^{-3}})} \qquad (0.1) $$
This limit can be several orders of magnitude greater than the limit in a conventional
linac: for example, n0 = 10^18 cm^-3 yields E0 = 100 GV/m. Wakefields of this order,
generated by the plasma electron waves, are able to accelerate electrons, and static fields
can be used to accelerate protons and ions.
New laser technologies are able to provide high-power (PW scale) and ultra-short (~10 fs) pulses.
These features make lasers ideal acceleration drivers, as they can be used
to excite these plasma electron waves in quasi-resonance conditions.
Experiments have so far demonstrated the acceleration of ultra-short, monochromatic
electron bunches up to 1 GeV ([1]) and ion bunches up to ~100 MeV. Electron acceleration
up to 10 GeV in ~1 m of plasma is currently under development at LBNL by the
LOASIS group ([8],[11]).
The theoretical study of laser-plasma interaction phenomena requires solving systems of
nonlinear partial differential equations which, in general, cannot be solved analytically.
Numerical modeling and computer simulation codes are therefore fundamental tools for
understanding the physics of laser-plasma acceleration and for supporting the design
of experiments. The use of simulations, in particular of so-called particle-in-cell
codes, has made it possible to discover new physical regimes (such as the bubble regime) and has
had, in general, a very significant impact on designing, optimizing and understanding
laser-plasma accelerators.
These simulations are usually very demanding in terms of required CPU
time and memory. In fact, full dimensionality (3D) models are required to describe
the dynamics correctly, and many physical length scales, which can differ by several
orders of magnitude, have to be resolved.
Given the crucial importance of simulations in this field, the effort to make them more
accurate and faster (which in most cases means making them feasible at all) is definitely worthwhile.
These objectives can be achieved via efficient code parallelization and improvements
of the numerical schemes.
Since 2007, our group in Bologna has been developing an MPI-parallel particle-in-cell
code, ALaDyn ([5], an acronym for Acceleration by Laser and Dynamics of charged particles),
featuring high-order integration (Runge-Kutta 4) and derivative schemes, which
allowed the computational requirements to be reduced. It has been used for modeling the INFN
laser-plasma acceleration experiments LILIA (proton acceleration) and PLASMON-X
(electron acceleration).
Furthermore, customized codes for specific acceleration regimes have been developed.
The INF&RNO code framework, developed by Carlo Benedetti at Lawrence Berkeley
National Laboratory, is designed to simulate laser-plasma electron accelerators working
in the underdense plasma regime, which, in certain cases, allows the use of the laser envelope
approximation and of the boosted Lorentz frame.
In this thesis I will discuss the parallelization of the INF&RNO code - which required
finding some original and physically guided solutions to resolve recursive dependencies
in the numerical scheme - and my implementation (in modern C++) of a generic particle-in-cell
code framework, named jasmine, in which I have implemented some of the
schemes of ALaDyn and INF&RNO, capable of running full 3D PIC simulations on
(general-purpose) graphics processing unit (GP)GPU clusters.
The GPGPU architecture, developed in recent years, represents a very good opportunity
for efficient parallelization. It implements parallelism at the chip level and
provides a parallel-efficient memory hierarchy, hundreds of lightweight cores (providing
more floating-point computational power) and more memory bandwidth than a
conventional processor (CPU).
Many of the world's largest high performance computing clusters (4 of the worldwide
top ten at the time of writing) already have most of their computational power in
their GPU nodes, such as the new PLX machine at CINECA, which was used for the jasmine
benchmarks.
Furthermore, in recent years the development of manycore, shared-memory architectures
has been the only way to increase processors' computational power; they are therefore
likely to grow in importance in the future. Parallelizing for these architectures
is not trivial: for a particle-in-cell code, some algorithms (such as density deposition)
have to be rewritten using original approaches, completely different from their serial
implementations.
My work on developing a GPU version of an electromagnetic particle-in-cell code
gives us the possibility to run simulations exploiting the efficiency and power of
such clusters, with great speedups: the GPU code, on a single GPU,
is up to 50x faster than the CPU version on a single CPU core. Furthermore, it makes
our simulations ready for the (likely) upcoming manycore era, in which CPU and
GPU architectures will continue to converge.
At the time of writing this thesis, no published article has claimed the development
of an electromagnetic PIC code that can run on hybrid clusters using the main CPU
memory efficiently, or that can run multi-GPU full-PIC 3D simulations, or that is as easily
adaptable to various geometries and numerical schemes as jasmine.
Summary
In this thesis I will discuss:
Chapter 1 The physical background of laser-plasma interactions, from the analytical
theory, which derives the fluid model and explains linear oscillation modes and
plasma parameters, to the nonlinear laser-plasma acceleration regimes, illustrated
with some simulations run with my code jasmine (chapter 4).
Chapter 2 The numerical methods for the integration of the Maxwell-Vlasov equations:
the particle-in-cell method, the numerical integration of the Maxwell equations in
time, the standard second-order PIC scheme, the charge conservation issues and
the laser envelope approximation.
Chapter 3 My work on parallelizing INF&RNO and adding 3D visualization to it; INF&RNO is an
efficient code, developed at LBNL, designed for modeling laser interactions with underdense
plasmas.
It assumes cylindrical symmetry, allows the simulations to be run in a reference frame
that balances the length scales (the so-called boosted Lorentz frame) and uses
the laser envelope model, integrated implicitly. Given these unique features of
the code, its parallelization required finding some original and physically guided
solutions to resolve recursive dependencies in the numerical scheme (which
would otherwise make parallelization impossible).
Chapter 4 Jasmine, my particle-in-cell code and framework, designed to be as flexible
and reusable as possible, in order to reduce to a minimum the code rewrites needed
to develop different schemes.
It is parallelized for running 1D/2D/3D simulations on CUDA graphics processing
units and on hybrid (CPU+GPU nodes) HPC clusters, achieving very promising
performance results: the (1 GPU)/(1 CPU core) speedup can be of the order of 50x. A
standard, second-order PIC scheme and the INF&RNO scheme were implemented
in the framework. At the time of writing this thesis, no published article has
claimed the development of a full electromagnetic PIC code with such features.
1 The Physics of plasma based laser
accelerators
1.1 Basic plasma physics
A plasma is basically a fully ionized gas. The free charges in a plasma make it very
responsive to electromagnetic perturbations. Plasmas exist in various forms, which can differ
in density, temperature or size by many orders of magnitude; however,
their properties and behaviors scale remarkably well across these differences.
Dynamically, a plasma is a statistically relevant number of charged particles, interacting
with and generating electromagnetic fields. In principle, the dynamics of a plasma is fully
determined by noting that the force acting on each relativistic particle is the Lorentz
force and that the evolution of the electromagnetic fields is governed by the Maxwell equations. In
CGS units ($\mathbf{x}_i$, $\mathbf{p}_i = m_i\gamma_i\mathbf{v}_i$ are the position and momentum of the $i$-th particle):
$$ \dot{\mathbf{x}}_i = \mathbf{v}_i, \qquad \dot{\mathbf{p}}_i = q_i\left(\mathbf{E}(\mathbf{x}_i) + \frac{\mathbf{v}_i\times\mathbf{B}(\mathbf{x}_i)}{c}\right) \qquad (1.1) $$
$$ \nabla\cdot\mathbf{B} = 0, \quad \nabla\cdot\mathbf{E} = 4\pi\rho, \quad \nabla\times\mathbf{B} - \frac{1}{c}\frac{\partial\mathbf{E}}{\partial t} = \frac{4\pi}{c}\,\mathbf{j}, \quad \nabla\times\mathbf{E} + \frac{1}{c}\frac{\partial\mathbf{B}}{\partial t} = 0 \qquad (1.2) $$
In the latter, the sources ρ and j are calculated starting from the particles' phase-space
distribution without performing any spatial average: this guarantees that binary collisions
are included in the model.
This approach is very impractical, both for numerical simulations (the
number of particles involved is not realistically computable) and for analytical theory
(the nonlinear partial differential equations describing the physics are not solvable ana-
lytically).
A cascade of physical considerations and approximations leads to a hierarchy of simplified
models.
The first step can be to neglect the collisional effects in the kinetic, phase space density
plasma description. The validity of the collisionless model is evaluated considering the
Debye length and its relations with other plasma parameters.
The collisionless model of the plasma can be further simplified to a fluid model if the
phase-space distribution can be considered single-valued at each point in space:
for each position in space the velocity is defined uniquely. The fluid model can be
used to describe various plasma oscillation modes, linear and nonlinear, but it cannot
include wavebreaking phenomena, where different particles at the same point in
configuration space have different velocities.
1.1.1 Debye Length and plasma parameter
We consider a hydrogen-like, fully ionized plasma, in which we label the electron and
ion densities ne and ni respectively.
The plasma is near to thermal equilibrium (at temperature T) and, in the unperturbed
state, we consider ne = ni = n0 everywhere.
If we perturb the system by adding a discrete point charge Q(> 0), we see that it will
attract the electrons and repel the ions, making a cloud of net negative charge that
shields the point charge’s electric field and potential. The electrons are prevented from
collapsing onto the point charge (to fully neutralize it) by their thermal energy.
We can estimate the magnitude of this shielding effect. The electrostatic potential affects
the distribution function of the electrons and of the other species through the Boltzmann factor
$e^{-E/kT}$:
$$ f_j(\mathbf{r},\mathbf{v}) \propto \exp\!\left(-\frac{m v^2}{2kT_j}\right)\exp\!\left(-\frac{q_j\Phi}{kT_j}\right) \qquad (1.3) $$
We can then write the density of each species as:
$$ n_j(\mathbf{r}) = \int f_j(\mathbf{r},\mathbf{v})\, d^3v = n_{0j}\,\exp\!\left(-\frac{q_j\Phi}{kT_j}\right) \qquad (1.4) $$
We assume that:
• $q_j\Phi \ll kT_j$ for each species $j$, so we can expand $n_j$ as:
$$ n_j = n_{0j}\left(1 - \frac{q_j\Phi}{kT_j}\right) \qquad (1.5) $$
In CGS, the Poisson equation for the electrostatic potential Φ reads:
$$ \nabla^2\Phi = -4\pi\rho = -4\pi\sum_j q_j n_j - 4\pi Q\,\delta(\mathbf{r}) \qquad (1.6) $$
Substituting $n_j$ (and using the quasi-neutrality of the unperturbed state, $\sum_j q_j n_{0j} = 0$) we get:
$$ -\nabla^2\Phi = -4\pi\sum_j \left[\frac{q_j^2\, n_{0j}}{kT_j}\right]\Phi + 4\pi Q\,\delta(\mathbf{r}) \qquad (1.7) $$
Defining the Debye length as
$$ \lambda_D = \left(4\pi\sum_j \frac{q_j^2\, n_{0j}}{kT_j}\right)^{-1/2} \qquad (1.8) $$
and exploiting the spherical symmetry,
$$ -\frac{1}{r^2}\frac{\partial}{\partial r}\left(r^2\frac{\partial\Phi}{\partial r}\right) = -\frac{\Phi}{\lambda_D^2} + 4\pi Q\,\delta(\mathbf{r}) \qquad (1.9) $$
we can solve the equation for Φ, giving:
$$ \Phi(r) = \frac{Q}{r}\,\exp\!\left(-\frac{r}{\lambda_D}\right) \qquad (1.10) $$
Since $m_i \gg m_e$, it is reasonable to consider the ions as immobile in most cases,
especially on short time scales; this allows the Debye length to be rewritten as:
$$ \lambda_D = \sqrt{\frac{kT}{4\pi n_{0e} e^2}} \qquad (1.11) $$
We can interpret the Debye length as the length scale over which the plasma shields the
electrostatic potential generated by a single point charge.
The number of particles in a Debye sphere is called the plasma parameter Λ, whose
significance is explained in 1.1.5:
$$ \Lambda = \frac{4}{3}\, n\,\lambda_D^3 \;\left(= \frac{4}{3}\left(\frac{kT}{4\pi e^2}\right)^{3/2} n^{-1/2}\right) \qquad (1.12) $$
1.1.2 Kinetic and fluid plasma descriptions
The kinetic model of a plasma describes the state of the particles of the system by means
of a distribution function fj(x,p, t), which is just the density of particles (of species j)
in phase space ($\mathbf{x},\mathbf{p} = \gamma m\mathbf{v}$):
$$ dN_{(\text{particles in } d\mathbf{x}\, d\mathbf{p})} = f_j(\mathbf{x},\mathbf{p},t)\, d\mathbf{x}\, d\mathbf{p} \qquad (1.13) $$
Macroscopic (observable, fluid) quantities are obtained by averaging over momenta:
$$ n_j(\mathbf{x}) = \int f_j(\mathbf{x},\mathbf{p},t)\, d\mathbf{p} \quad \text{particle density} $$
$$ n_j\mathbf{u}_j(\mathbf{x}) = \int \mathbf{v}\, f_j(\mathbf{x},\mathbf{p},t)\, d\mathbf{p} \quad \text{mean velocity} $$
$$ [P_{kl}(\mathbf{x})]_j = m_j\int v_k v_l\, f_j\, d\mathbf{p} \quad \text{mean pressure} \qquad (1.14) $$
$$ \rho_j(\mathbf{x}) = q_j\int f_j(\mathbf{x},\mathbf{p},t)\, d\mathbf{p} \quad \text{charge density}, \qquad \mathbf{j}_j(\mathbf{x}) = q_j\int \mathbf{v}\, f_j(\mathbf{x},\mathbf{p},t)\, d\mathbf{p} \quad \text{current density} \qquad (1.15) $$
On time scales much shorter than the inverse collision frequency, the Boltzmann equation for $f_j$
reduces to the phase-space continuity equation:
$$ \frac{\partial f_j}{\partial t} + \frac{\partial}{\partial\mathbf{x}}\cdot(\dot{\mathbf{x}} f_j) + \frac{\partial}{\partial\mathbf{p}}\cdot(\dot{\mathbf{p}} f_j) = 0 \qquad (1.16) $$
which, once coupled with the equations of motion of the particles (Lorentz force),
$$ \dot{\mathbf{x}} = \mathbf{v}, \qquad \dot{\mathbf{p}} = q_j\left(\mathbf{E}(\mathbf{x}) + \frac{\mathbf{v}\times\mathbf{B}(\mathbf{x})}{c}\right) \qquad (1.17) $$
is named the Vlasov equation:
$$ \frac{\partial f_j}{\partial t} + \mathbf{v}\cdot\frac{\partial f_j}{\partial\mathbf{x}} + q_j\left(\mathbf{E} + \frac{\mathbf{v}\times\mathbf{B}}{c}\right)\cdot\frac{\partial f_j}{\partial\mathbf{p}} = 0 \qquad (1.18) $$
The electromagnetic fields are given by Maxwell equations, which close the system and
make it self-consistent:
$$ \nabla\cdot\mathbf{B} = 0, \quad \nabla\cdot\mathbf{E} = 4\pi\rho, \quad \nabla\times\mathbf{B} - \frac{1}{c}\frac{\partial\mathbf{E}}{\partial t} = \frac{4\pi}{c}\,\mathbf{j}, \quad \nabla\times\mathbf{E} + \frac{1}{c}\frac{\partial\mathbf{B}}{\partial t} = 0 \qquad (1.19) $$
In the non-relativistic limit, we can obtain the n-fluid model of a plasma (each species j is treated
as a separate fluid interacting with the others by means of the EM fields) by taking the
moments of the Vlasov equation. The first two are:
$$ \int d\mathbf{p}\left[\frac{\partial f_j}{\partial t} + \mathbf{v}\cdot\frac{\partial f_j}{\partial\mathbf{x}} + q_j\left(\mathbf{E} + \frac{\mathbf{v}\times\mathbf{B}}{c}\right)\cdot\frac{\partial f_j}{\partial\mathbf{p}}\right] = 0 \qquad (1.20) $$
$$ \int d\mathbf{p}\;\mathbf{p}\left[\frac{\partial f_j}{\partial t} + \mathbf{v}\cdot\frac{\partial f_j}{\partial\mathbf{x}} + q_j\left(\mathbf{E} + \frac{\mathbf{v}\times\mathbf{B}}{c}\right)\cdot\frac{\partial f_j}{\partial\mathbf{p}}\right] = 0 \qquad (1.21) $$
We obtain the spatial continuity equation for the particle density and a fluid equation
that describes the motion of the charged fluids:
$$ \frac{\partial n_j}{\partial t} + \frac{\partial}{\partial\mathbf{x}}\cdot(n_j\mathbf{u}_j) = 0 \qquad (1.22) $$
$$ n_j\left(\frac{\partial\mathbf{u}_j}{\partial t} + \mathbf{u}_j\cdot\frac{\partial\mathbf{u}_j}{\partial\mathbf{x}}\right) = \frac{n_j q_j}{m_j}\left(\mathbf{E} + \frac{\mathbf{u}_j\times\mathbf{B}}{c}\right) - \frac{1}{m_j}\frac{\partial}{\partial\mathbf{x}}\cdot[P_{kl}]_j $$
We have therefore reduced the 6-dimensional Vlasov equation to 3-dimensional fluid
equations.
Taking the moment of next (second) order would lead to an evolution equation for the
pressure tensor P. It is reasonable to truncate the hierarchy at these two equations if
further assumptions are made concerning the heat flux, which provide a thermodynamical
equation of state.
Let k and ω be, respectively, the typical wavenumber and frequency of the system, and
$v_{j,\mathrm{thermal}}$ the thermal velocity of the particles of the j-th species; two simplifying assumptions
can then be made:
• if $\omega/k \ll v_{j,\mathrm{thermal}}$, the heat transfer is so fast that the fluid can be considered
isothermal: $p_j = n_j\theta_j$;
• if $\omega/k \gg v_{j,\mathrm{thermal}}$, the heat flux is negligible, and we can use an adiabatic
equation for the pressure: $p_j/n_j^{\gamma} = \mathrm{constant}$.
1.1.3 Collective motion: plasma electron waves
Using fluid equations we can study a common form of collective motions in plasmas: the
charge and electrostatic field oscillation associated with the motion of the electrons.
We consider a 1D, initially uniform, non-relativistic, neutral plasma (ions with Z=1),
with massive ions that can be considered fixed on the timescale of our process. We consider
only electrostatic fields and neglect thermal effects.
Under these conditions the fluid equations simplify to:
$$ \frac{\partial n_e}{\partial t} + \frac{\partial}{\partial x}(n_e u_e) = 0 \qquad (1.23) $$
$$ \frac{\partial}{\partial t}(n_e u_e) + \frac{\partial}{\partial x}(n_e u_e^2) = -\frac{n_e e E}{m_e} \qquad (1.24) $$
The electrostatic field can be obtained directly from the Poisson equation:
$$ \frac{\partial E}{\partial x} = -4\pi e\,(n_e - n_{0,\mathrm{ions}}) \qquad (1.25) $$
Considering small variations (denoted by a tilde) of the particle density n, of the average
velocity u and of the electrostatic field E, and thus linearizing the equations, we get:
$$ n_e = n_0 + \tilde{n}, \qquad u_e = \tilde{u} \qquad (1.26) $$
$$ \frac{\partial\tilde{E}}{\partial x} = -4\pi e\,\tilde{n}, \qquad \frac{\partial^2\tilde{n}}{\partial t^2} = \frac{n_0 e}{m_e}\,\frac{\partial\tilde{E}}{\partial x} \qquad (1.27) $$
Substituting in the fluid equation we obtain a stationary wave equation:
$$ \left(\frac{\partial^2}{\partial t^2} + \omega_{pe}^2\right)\tilde{n} = 0 \qquad (1.28) $$
$$ \omega_{pe} = \sqrt{\frac{4\pi e^2 n_0}{m_e}} = 5.64\times 10^4\,\left(n_e\,[\mathrm{cm^{-3}}]\right)^{1/2} \qquad (1.29) $$
The last equation describes the oscillation of charge density with angular frequency ωpe,
the plasma electron frequency.
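A minimal C++ sketch (an illustrative example, not thesis code) evaluating ωpe from eq. 1.29 and the corresponding plasma wavelength λp = 2πc/ωpe, the λp used later to set the simulation parameters; the chosen density is an assumption of this example:

```cpp
// Minimal sketch (assumed example): plasma electron frequency and plasma
// wavelength lambda_p = 2*pi*c/omega_pe, CGS units.
#include <cmath>
#include <cstdio>

int main() {
    const double pi = 3.141592653589793;
    const double c  = 2.9979e10;                      // speed of light [cm/s]

    double ne       = 1.0e18;                         // electron density [cm^-3] (assumed)
    double omega_pe = 5.64e4 * std::sqrt(ne);         // eq. (1.29) [rad/s]
    double lambda_p = 2.0 * pi * c / omega_pe;        // [cm]

    std::printf("omega_pe = %.3e rad/s, lambda_p = %.3e cm\n", omega_pe, lambda_p);
    return 0;
}
```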
1.1.4 The collision frequency
In fully-ionized plasma binary particle interactions are mostly due to the Coulomb force.
We can take the order of magnitude of the range of this force as the order of magnitude
of the Debye length.
We consider a charged particle (mass m, charge $q_0$, velocity $v_0$) approaching a target
particle at rest, with mass $M \gg m$ and charge $q$.
The incoming particle cannot get closer to the target than the distance of minimum
approach δ:
$$ \frac{m v_0^2}{2} = \frac{q q_0}{\delta} \;\rightarrow\; \delta = \frac{2 q q_0}{m v_0^2} \qquad (1.30) $$
so that, when a particle "tries" to overlap the δ region, it is deflected at large angles.
For a population of particles of density n and mean velocity $v_0$ heading toward our target
particle, the large-angle scattering rate can be evaluated by computing the flux of particles
passing within a radius δ of the target (taking $q = q_0 = e$):
$$ \nu_c = (\pi\delta^2)\, n v_0 = \pi\,\frac{4 e^4}{m^2 v_0^4}\, n v_0 = \frac{4\pi e^4 n}{m^2 v_0^3} \;\propto\; T^{-3/2} \qquad (1.31) $$
1.1.5 Collisionless limit and the plasma parameter Λ
Recalling (CGS units):
$$ \nu_c = \frac{4\pi e^4 n}{m^2 v_0^3} \qquad (1.32) $$
$$ \lambda_D = \sqrt{\frac{kT}{4\pi n_0 e^2}} \qquad (1.33) $$
$$ \omega_{pe} = \sqrt{\frac{4\pi n_0 e^2}{m_e}} \qquad (1.34) $$
and
$$ v_{th} = \sqrt{\frac{kT}{m_e}} \qquad (1.35) $$
then:
$$ \omega_{pe} = \frac{v_{th}}{\lambda_D} = \frac{2\pi}{T_{pe}} \qquad (1.36) $$
A thermal electron travels about a Debye length in a plasma oscillation period. Just as
Debye length represents the electrostatic correlation length, so the plasma period plays
the role of the electrostatic correlation time.
We consider the ratio between the two frequencies that characterize the problem, $\omega_{pe}$
and $\nu_c$:
$$ \frac{\omega_{pe}}{\nu_c} = \left(\frac{4\pi n e^2}{m}\right)^{1/2}\left(\frac{m^2 v_0^3}{4\pi e^4 n}\right) \qquad (1.37) $$
$$ \frac{\omega_{pe}}{\nu_c} = 4\pi\Lambda \qquad (1.38) $$
It is apparent how Λ (the number of electrons in a Debye sphere) plays a key role
in evaluating the importance of collisions, connecting the collective-motion and
collision timescales.
The collisionless limit, in which the collision timescale is long compared to that of the collective
phenomena, can be expressed as (using the plasma parameter definition 1.12):
$$ \frac{\omega_{pe}}{\nu_c} = \frac{\tau_c}{\tau_{pe}} = 4\pi\Lambda \gg 1 \qquad (1.39) $$
1.1.6 Electromagnetic waves propagation in a cold plasma
From the non-relativistic fluid equations, neglecting pressure terms in a cold plasma, one
gets:
$$ \frac{\partial n_1}{\partial t} + n_0\,\nabla\cdot\mathbf{u}_1 = 0 $$
$$ m_e n_0\,\frac{\partial\mathbf{u}_1}{\partial t} = -e n_0\,\mathbf{E}_1 $$
$$ \nabla\times\mathbf{B}_1 = -\frac{4\pi e}{c}\, n_0\mathbf{u}_1 + \frac{1}{c}\frac{\partial\mathbf{E}_1}{\partial t} $$
$$ \nabla\times\mathbf{E}_1 = -\frac{1}{c}\frac{\partial\mathbf{B}_1}{\partial t} \qquad (1.40) $$
in which the suffix 0 indicates the equilibrium quantities and 1 the small fluctuations,
$n_e = n_0 + n_1$.
Expressing E in its plane-wave Fourier decomposition,
$$ \mathbf{E} = \sum_{\mathbf{k}} \mathbf{E}_{\mathbf{k}}\, \exp\!\left(i(\mathbf{k}\cdot\mathbf{r} - \omega t)\right) \qquad (1.41) $$
considering the first-order component in the series, and substituting into the second of the
previous equations, one gets $\mathbf{u}_1 = \frac{e}{i\omega m_e}\mathbf{E}_1$. Substituting the latter into the third equation
in 1.40, one obtains:
$$ \nabla\times\mathbf{B}_1 = -\frac{i\omega}{c}\left(1 - \frac{\omega_{pe}^2}{\omega^2}\right)\mathbf{E}_1 \qquad (1.42) $$
Furthermore, taking a time derivative and using the fourth equation ($\nabla\times\mathbf{E}_1 = \dots$), one
has:
$$ \frac{\omega^2}{c}\left(1 - \frac{\omega_{pe}^2}{\omega^2}\right)\mathbf{E}_1 = c\,\nabla\times(\nabla\times\mathbf{E}_1) \qquad (1.43) $$
or, using the plane-wave expression for $\mathbf{E}_1$ on the right-hand side:
$$ \mathbf{k}\times(\mathbf{k}\times\mathbf{E}_1) = -\frac{\omega^2}{c^2}\left(1 - \frac{\omega_{pe}^2}{\omega^2}\right)\mathbf{E}_1 \qquad (1.44) $$
Taking, for symmetry reasons, $\mathbf{k} = k\hat{\mathbf{z}}$, the equation can be recast in the form:
$$ \begin{pmatrix} \omega^2 - \omega_{pe}^2 - k^2c^2 & 0 & 0 \\ 0 & \omega^2 - \omega_{pe}^2 - k^2c^2 & 0 \\ 0 & 0 & \omega^2 - \omega_{pe}^2 \end{pmatrix} \begin{pmatrix} E_{1x} \\ E_{1y} \\ E_{1z} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \qquad (1.45) $$
The solutions of this system are the dispersion relations for the linear oscillation modes of the
electric field.
• A first solution (previously found) is for longitudinal electric field oscillations (plasma
electron waves): $E_{1x} = E_{1y} = 0$, $\;\omega^2 = \omega_{pe}^2$ (1.46)
• Another one describes transverse waves: $E_{1z} = 0$, $\;\omega^2 = \omega_{pe}^2 + k^2c^2$ (1.47)
Substituting k from the dispersion relation into the plane-wave expression, one gets:
$$ \mathbf{E} = \mathbf{E}_k\,\exp\!\left(i(kz - \omega t)\right) = \mathbf{E}_k\,\exp\!\left(i\,\frac{\sqrt{\omega^2 - \omega_{pe}^2}}{c}\, z\right)\exp\!\left(-i\omega t\right) \qquad (1.48) $$
If $\omega^2 - \omega_{pe}^2 < 0$, the spatial term becomes a damping term, with a characteristic length λ:
$$ \lambda = \frac{c}{\sqrt{\omega_{pe}^2 - \omega^2}} \qquad (1.49) $$
The plasma skin depth, defined in the $\omega \ll \omega_{pe}$ limit as $\lambda_{\mathrm{skin\,depth}} = \frac{c}{\omega_{pe}} = \left(5.31\times 10^5\; n_e^{-1/2}\right)\,\mathrm{cm}$,
represents the length scale over which the plasma damps electromagnetic waves of angular
frequency ω.
This non-propagation condition for electromagnetic waves, $\omega^2 - \omega_{pe}^2 < 0$, can also be expressed
by introducing the plasma critical density for EM waves of frequency ω:
$$ n_c(\omega) = \left(\frac{m_e}{4\pi e^2}\right)\omega^2 \qquad (1.50) $$
If $n_e > n_c$ the plasma is called "overdense" or "overcritical" and it becomes opaque to
radiation with frequency smaller than ω.
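The following minimal C++ sketch (an assumed example, not thesis code) evaluates the critical density of eq. 1.50 for a given laser wavelength and the skin depth c/ωpe for a given plasma density; the constants and example values are assumptions:

```cpp
// Minimal sketch (assumed example): critical density n_c(omega) and plasma
// skin depth c/omega_pe, CGS units.
#include <cmath>
#include <cstdio>

int main() {
    const double pi = 3.141592653589793;
    const double c  = 2.9979e10;     // [cm/s]
    const double e  = 4.8032e-10;    // [statC]
    const double me = 9.1094e-28;    // [g]

    double lambda_laser = 0.8e-4;                          // 0.8 um in cm (assumed)
    double omega = 2.0 * pi * c / laser := lambda_laser;   // laser angular frequency
    double n_c   = me * omega * omega / (4.0 * pi * e * e); // eq. (1.50) [cm^-3]

    double ne        = 1.0e19;                              // plasma density [cm^-3] (assumed)
    double omega_pe  = std::sqrt(4.0 * pi * ne * e * e / me);
    double skin_depth = c / omega_pe;                       // [cm]

    std::printf("n_c = %.3e cm^-3, underdense: %s, skin depth = %.3e cm\n",
                n_c, (ne < n_c ? "yes" : "no"), skin_depth);
    return 0;
}
```

(The conditional in the last print simply checks the underdense condition ne < nc discussed above.)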
15
1 The Physics of plasma based laser accelerators
1.2 Laser plasma acceleration physical concepts and regimes
This section presents some basic laser-plasma acceleration techniques.
1.2.1 Electron acceleration: LWFA
In the regime named LWFA (Laser WakeField Acceleration), an electron bunch is ac-
celerated while it travels in phase with a plasma wakefield, generated by a laser pulse
driver.
Qualitatively, when an ultra-short laser pulse begins to propagate in a uniform underdense
plasma, the ponderomotive force, by pushing electrons, creates a longitudinal charge separation
that results in a longitudinal electric field, which pulls the electrons back again.
This results in the formation of a plasma electron wave and of the corresponding wakefield.
The quasi-resonance condition:
$$ L_{\mathrm{laser}} \cong \frac{\lambda_{\mathrm{plasma}}(n_{\mathrm{plasma}})}{2} \qquad (1.1) $$
makes the formation of the wake very efficient. In fact, in these conditions, the ponderomotive
force and the wake electric field change sign at the same frequency.
For the laser pulse to propagate and create the wakefield, the plasma must be underdense.
The laser intensity, parametrized by the dimensionless parameter $a = \frac{eA}{mc^2}$, determines the
shape of the wakefield: for $a \ll 1$ the wakefield is linear, for $a \sim 1$ it becomes nonlinear
(and the electrons' quiver motion becomes relativistic), and for higher a a bubble (see
1.2.2 for further conditions), or blow-out, regime can be reached.
The electron bunch can be injected externally or generated from the wakefield itself, by
wavebreaking phenomena occurring in relativistic conditions.
Analytical models manage to describe the wakefield generation phenomenon, in the
linear regime, only if the driver (the laser beam) is assumed to be nonevolving: it is a function
of the comoving coordinate $\xi = z - v_p t$ only ($v_p \cong v_g \le c$, where $v_p$ is the plasma wave phase
velocity and $v_g$ is the laser group velocity).
On the other hand, nonlinear wakefield generation can be treated analytically only in
both the nonevolving-driver and one-dimensional approximations: the driver has to be
assumed to be broad, $k_p r_\perp \gg 1$, where $r_\perp$ is the characteristic radial driver beam dimension.
In this case, starting from the relativistic cold fluid equations and the Poisson equation:
$$ \frac{1}{c}\frac{\partial n}{\partial t} + \nabla\cdot(n\mathbf{u}/\gamma) = 0, \qquad \frac{\partial(\mathbf{u} - \mathbf{a})}{\partial\xi} = \nabla(\mathbf{a} - \Phi), \qquad \frac{\partial^2\Phi}{\partial\xi^2} = k_p^2\left(\frac{n}{n_0} - 1\right) \qquad (1.2) $$
one can get the evolution equation for the electrostatic wakefield potential [12] ($\gamma_p = (1 - (v_p/c)^2)^{-1/2}$):
$$ \frac{1}{k_p^2}\frac{\partial^2\Phi}{\partial\xi^2} = \gamma_p^2\left\{\frac{v_p}{c}\left[1 - \frac{1 + a^2}{\gamma_p^2(1+\Phi)^2}\right]^{-1/2} - 1\right\} \;\underset{(\gamma_p^2\gg 1)}{\simeq}\; \frac{1 + a^2}{2(1+\Phi)^2} - \frac{1}{2} \qquad (1.3) $$
In general, wakefield generation by an evolving laser pulse can be studied only with
numerical simulations. The following simulation results will help illustrate some acceleration
regimes.
The units in all the plots are CGS and the spatial scale of the density plots is the grid
cell index. The longitudinal phase-space plots are in cm and MeV per electron or
proton. The physical parameters are given in terms of the dimensionless laser potential
$a_0 = eA/m_ec^2$, and the density is normalized as $n/n_c$, with $n_c$ as in 1.50. All the jasmine
simulations were run on one or multiple GPUs.
A first sample simulation is a 2D one, run with my code jasmine (chapter 4), which
illustrates the generation of a linear wakefield. The parameters are:
a0 = 0.01, n/nc = 0.001, λlaser = 1 µm, Llaser = λp/2, w0,laser = λp/6, c∆t = 0.035 µm, ∆x = ∆y = λlaser/20, grid 4096x1024, 16 particles per cell (P.P.C.).
The longitudinal electrical current plot shows what the generated linear wakefield
looks like (the laser pulse longitudinal component is on the right).
For higher laser intensities, such as $a_0 = 1.75$, nonlinear wakefields are created and
a wavebreaking effect is observable, resulting in electron self-injection. The
parameters of a jasmine 2D simulation in this regime follow, and the results are shown in figure
1.2.1.
a0 = 1.75, n/nc = 0.004, λlaser = 1 µm, Llaser = λp/2, w0,laser = 2λp, c∆t = 0.029 µm, ∆x = λlaser/30, ∆y = 2∆x, grid 1280x1024, 16 P.P.C.
The parameters of a similar 3D simulation follow, and the results are plotted in figure 1.2.2.
a0 = 4.0, n = 1.38e19 cm−3, λlaser = 0.8 µm, Llaser = 15 fs, w0,laser = 8.2 µm, ∆t = 0.1 fs, ∆x = 0.08 µm, ∆z = ∆y = 0.32 µm, grid 256x224x224, 4 P.P.C.
Figure 1.2.1: Jasmine nonlinear wakefield simulation results. Wakefield generated in relativistic nonlinear conditions; density and longitudinal phase space [x(cm), E(MeV)] plots showing wavebreaking and acceleration. The first plot is at time ct = 51 µm and the others at ct = 553 µm.
Figure 1.2.2: Jasmine 3D nonlinear wakefield generation simulation results. Evolution, at times that are multiples of 12.8 fs, of a 3D wakefield slice, simulated with jasmine.
1.2.2 Electron acceleration: bubble regime
The bubble mechanism is a very robust electron acceleration scheme that can produce
very short, self-injected, monochromatic beams. Given its deep nonlinearity and complexity,
the only method available for accurately modeling this regime is to run full 3D
simulations (or cylindrically symmetric reduced ones, as in INF&RNO, chapter 3).
Qualitatively, an intense laser pulse, shorter than a plasma wavelength in both the longitudinal
and transverse directions, propagates in an underdense plasma and pushes away
all the electrons it sweeps over, creating a robust, low-density cavity (the bubble) just behind
itself. The electrons, accumulating on its borders, slip to the back of the cavity, where
an injection mechanism takes place. The injected electrons are trapped inside the cavity
and are accelerated by the electrostatic field resulting from the charge separation. As the
electrons inside the cavity get accelerated, they move forward inside it, allowing more
electrons to be injected (beam loading).
The conditions required for the bubble regime are ([14]):
• Relativistic or ultra-relativistic electron motion, a > 1.
• Underdense plasma, $\frac{n}{n_c(\lambda_{\mathrm{laser}})} \ll 1$.
• Pulse length shorter than its transverse waist $w_0$, and $k_p w_0 \simeq \sqrt{a}$.
I have prepared some simulations to illustrate this regime further:
• A 2D simulation run with my code jasmine (chapter 4). The physical and numerical parameters are: a0 = 7.5, n/nc = 0.008, λlaser = 0.8 µm, Llaser = w0,laser/√2, w0,laser = λp√a/2π, c∆t = 0.023 µm, ∆x = λlaser/30, ∆y = 2∆x, grid 2048x1024, 20 P.P.C. The simulation results are plotted in figure 1.2.3.
• A similar 3D simulation, whose results are plotted in figure 1.2.4: a0 = 10, n/nc = 0.005, λlaser = 1 µm, Llaser = 8 µm, w0,laser = 3 µm, c∆t = 0.018 µm, ∆x = λlaser/32, grid 768x256x256, 8 P.P.C.
• A 2D (3D in cylindrical symmetry) simulation run with INF&RNO (chapter 3), plotted in figure 1.2.5: a0 = 4, n/nc = 0.0025, λlaser = (2π/kp)√(n/nc), Llaser = 2/kp, w0,laser = 4/kp, ∆z = λp/30, grid 6/∆z, 16 P.P.C.
Figure 1.2.3: 2D bubble simulation. Electron charge density at ct = 294.4 µm. Cavity formation (ct = 23 µm, ct = 117 µm) and particle phase space [x(cm), E(MeV)] at ct = 491 µm. Electric field inside the cavity (bubble) at ct = 117 µm.
Figure 1.2.4: 3D bubble simulation. 3D electron density plot (with the injected bunch in green) and sliced electron density at t = 0.004 cm/c.
Figure 1.2.5: INF&RNO bubble simulation. Electron density and longitudinal electric field at ωpt = 480.
1.2.3 Ion acceleration: TNSA and RPA
In the regime named TNSA (Target Normal Sheath Acceleration), qualitatively, an intense
(the electrons become relativistic very soon), linearly polarized laser pulse is partially
absorbed and reflected by an overdense plasma (initially solid and ionized by the laser),
whose electrons are heated and slightly pushed forward, passing beyond the rear side of the
target. This charge displacement creates a static longitudinal electric field that can pull
plasma ions out of the target. The resulting accelerated ion bunch has an exponential
energy distribution, and the maximum energy reached is of the order of several MeV
([24]).
To better illustrate this physical regime, I have run a 2D simulation with my PIC code
jasmine. The simulation setup is the following: a0 = 10, n/nc = 80, λlaser = 1 µm, Llaser = 25 fs, w0,laser = 3 µm, c∆t = 5·10−7 cm, ∆x = λp/15, ∆y = 2∆x, grid 5120x4096, 64 P.P.C.
The results are shown in figure 1.2.6.
In the regime named RPA (Radiation Pressure Acceleration), particle acceleration
is dominated by the radiation pressure. The results of a simulation in this regime are shown in figure
1.2.7; the setup is similar to the one in 1.2.3, but here a circularly polarized laser was
used: a0 = 20, n/nc = 50, λlaser = 1 µm, Llaser = 8 µm, w0,laser = 8 µm, c∆t = 6.3·10−7 cm, ∆x = λp/15, ∆y = 2∆x, grid 5120x4096, 64 P.P.C.
Figure 1.2.6: TNSA simulation results. Electron density displacement and corresponding static longitudinal electric field accelerating protons, whose phase space [x(cm), E(MeV)] at the end of the simulation is plotted rightmost. The time of the density plots is ct = 0.002048 cm.
Figure 1.2.7: RPA simulation results. Proton density displacement, static longitudinal electric field and proton phase space [x(cm), E(MeV)] at ct = 0.0036288 cm.
2 The numerical modelization
2.1 The particle-in-cell method
2.1.1 Phase space representation
The most straightforward and complete approach to numerically model a system described
by the Maxwell-Vlasov equations consists in computing, at each time, the phase-space
distribution $f_j(\mathbf{x},\mathbf{p},t)$, discretized on a grid. In a full three-dimensional model,
the plasma phase space is six-dimensional. Thus, the number of grid points (and therefore the memory
required for the execution of the simulation) scales as $n^6$, n being the linear dimension
of the discrete grid.
These memory requirements are far beyond current technological limits: for example,
taking a meaningful grid size, say n = 1024, the memory required would be ~10^18 GB.
It is therefore necessary to use a "compressed" representation of the discretized $f_j(\mathbf{x},\mathbf{p},t)$.
A method that uses a very sparse phase-space representation is the so-called particle-in-cell
method.
It decomposes the $f_j$ distribution into the sum of contributions coming from a finite set of $N_{pj}$
computational macro-particles, or quasiparticles. Their trajectories are followed in
phase space in a Lagrangian manner, while the electromagnetic fields are discretized
on a spatial grid with grid spacing ∆x.
The macro-particles are not point-like charges: they are represented by a density function
which is extended in space, so that they can be considered as a smooth cloud of charge,
in order to smooth out the numerical noise. The support of these functions has a size of
the order of the grid cell size. Whereas in configuration space the numerical particles
have a finite extension, in momentum space they are point-like (they have
a definite momentum).
The interaction of the particles with the field grids, which completes the description of the
dynamics, is achieved through interpolation and deposition processes. Since the support of
the quasi-particle density function is compact, these processes involve only a small number
of grid cells, namely the ones overlapping with the particle's finite shape.
The Vlasov equation and the equations of motion read:
$$ \left(\partial_t + \dot{\mathbf{x}}\cdot\partial_{\mathbf{x}} + \dot{\mathbf{p}}\cdot\partial_{\mathbf{p}}\right) f_j(\mathbf{x},\mathbf{p},t) = 0 \qquad (2.1) $$
$$ \dot{\mathbf{x}} = \frac{\mathbf{p}}{\gamma m}, \qquad \dot{\mathbf{p}} = \mathbf{F}(\mathbf{x},\mathbf{p},t) \qquad (2.2) $$
The PIC approach consists in discretizing the phase space density function using a finite,
approximated, sum:
$$ f_j(\mathbf{x},\mathbf{p},t) = f_{0j} \sum_{n=0}^{N_{pj}-1} g\left(\mathbf{x} - \mathbf{x}_n(t)\right)\,\delta\left(\mathbf{p} - \mathbf{p}_n(t)\right) \qquad (2.3) $$
in which $f_0$ is a normalization factor, $\mathbf{x}_n(t)$ is the trajectory of the n-th macro-particle
and $\mathbf{p}_n(t)$ is its momentum.
The function $g(\mathbf{x})$ is the macro-particle shape function.
The shape function is used as a convolution kernel and it is assumed to have δ-like
properties (from which $f_0 = \frac{1}{N_p}$ follows):
$$ \int g(\mathbf{x} - \mathbf{x}_n)\, d\mathbf{x} = 1, \qquad \int \partial_{\mathbf{x}}\, g(\mathbf{x} - \mathbf{x}_n)\, d\mathbf{x} = 0 \qquad (2.4) $$
$g(\mathbf{x})$ describes the macro-particle spatial extension and is useful for reducing
the numerical noise arising from the interpolation and deposition processes, which would
be larger if a δ-function were used instead. The meaning of $g(\mathbf{x})$ is evident considering the
expression for the charge density, which becomes:
$$ \rho(\mathbf{x},t) = \sum_j Q_j \int f_j(\mathbf{x},\mathbf{p},t)\, d\mathbf{p} = \sum_{j,n} q_j\, g(\mathbf{x} - \mathbf{x}_n) $$
whereas the electrical current can be defined as:
$$ \mathbf{j}(\mathbf{x},t) = \sum_{j,n} \mathbf{v}_n\, q_j\, g(\mathbf{x} - \mathbf{x}_n) \qquad (2.5) $$
Rewriting the Vlasov equation 2.1, using this discretized phase-space representation and
equation 2.2, one gets:
$$ \partial_t f = -f_0 \sum_n^{N_p} \left\{\left[\partial_{\mathbf{x}} g(\mathbf{x}-\mathbf{x}_n(t))\cdot\dot{\mathbf{x}}_n(t)\right]\delta(\mathbf{p}-\mathbf{p}_n(t)) + g(\mathbf{x}-\mathbf{x}_n(t))\left[\partial_{\mathbf{p}}\delta(\mathbf{p}-\mathbf{p}_n(t))\cdot\dot{\mathbf{p}}_n(t)\right]\right\} $$
$$ \dot{\mathbf{x}}\cdot\partial_{\mathbf{x}} f = f_0 \sum_n^{N_p} \left\{\frac{\mathbf{p}_n(t)}{\gamma m}\cdot\left[\partial_{\mathbf{x}} g(\mathbf{x}-\mathbf{x}_n(t))\right]\delta(\mathbf{p}-\mathbf{p}_n(t))\right\} $$
$$ \dot{\mathbf{p}}\cdot\partial_{\mathbf{p}} f = f_0 \sum_n^{N_p} \left\{\mathbf{F}(\mathbf{x},\mathbf{p}_n(t),t)\cdot g(\mathbf{x}-\mathbf{x}_n(t))\left[\partial_{\mathbf{p}}\delta(\mathbf{p}-\mathbf{p}_n(t))\right]\right\} $$
$$ f_0 \sum_n^{N_p}\left(-\dot{\mathbf{x}}_n\cdot g'_n\delta_n - \dot{\mathbf{p}}_n g_n\cdot\delta'_n + \frac{\mathbf{p}_n}{\gamma m}\cdot g'_n\delta_n + \mathbf{F}(\mathbf{x},\mathbf{p}_n(t),t)\, g_n\cdot\delta'_n\right) = 0 \qquad (2.6) $$
Integrating in momentum space and using the delta function properties, one has:
$$ \sum_n^{N_p}\left(-\dot{\mathbf{x}}_n + \frac{\mathbf{p}_n}{\gamma m}\right)\partial_{\mathbf{x}} g(\mathbf{x}-\mathbf{x}_n(t)) = 0, \;\; \forall\mathbf{x} \;\rightarrow\; \dot{\mathbf{x}}_n = \frac{\mathbf{p}_n}{\gamma m} \qquad (2.7) $$
Let $\mathbf{F}_n$ be the spatial average of the external force field acting on the n-th macro-particle,
$\mathbf{F}(\mathbf{x},\mathbf{p}_n,t)$, evaluated over the shape function $g(\mathbf{x})$:
$$ \mathbf{F}_n(\mathbf{x},\mathbf{p},t) = \int g(\mathbf{x}-\mathbf{x}_n)\,\mathbf{F}(\mathbf{x},\mathbf{p}_n,t)\, d\mathbf{x} \qquad (2.8) $$
Then, integrating over $\mathbf{x}$ and using the δ-like properties 2.4 of $g(\mathbf{x})$, one gets:
$$ \sum_n^{N_p}\left(-\dot{\mathbf{p}}_n + \mathbf{F}_n\right)\partial_{\mathbf{p}}\delta(\mathbf{p}-\mathbf{p}_n) = 0, \;\; \forall\mathbf{p} \;\rightarrow\; \dot{\mathbf{p}}_n = \mathbf{F}_n \qquad (2.9) $$
The particle-in-cell method, therefore, reduces the computational complexity required
for the evolution of a six-dimensional phase space grid to a system of 2Np (for each
species) equations of motion, coupled with the proper equations (in our case for the e.m.
fields) that close the system giving an expression for the external force field F.
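To make the reduction concrete, here is a small illustrative C++ sketch of the data the PIC method actually evolves per species; this is not the actual jasmine data layout (which is discussed in chapter 4), only an assumed example:

```cpp
// Illustrative sketch (assumed example, not the jasmine layout): the state
// evolved by the PIC method per species is the list of macro-particle
// trajectories (x_n, p_n), plus the species charge/mass and a statistical
// weight fixed by the normalization factor f0.
#include <vector>

struct MacroParticle {
    double x[3];   // position (the particle has a finite shape g(x - x_n))
    double p[3];   // momentum (point-like in momentum space)
};

struct Species {
    double charge;                        // q_j of the physical particles
    double mass;                          // m_j
    double weight;                        // physical particles per macro-particle
    std::vector<MacroParticle> particles; // 2*Np equations of motion per species
};
```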
2.1.2 Passes of an electromagnetic PIC code and numerical parameters of
a laser plasma interaction simulation
Dealing with charged particles, the physical description of the problem is closed by the
Maxwell equations for the electromagnetic fields, which are coupled with the particle
motion in a bidirectional way (by the Lorentz force and by the evaluation of charge and
current densities).
The passes of an integration cycle of an electromagnetic PIC code are the following:
1. Time advancement of the macro-particles' momenta and positions p, x, using the
equations of motion obtained above and the Lorentz force. The fields are interpolated from
the E, B grids.
2. Deposition (spatial averaging onto a discrete grid) of the source quantities needed
in the Maxwell equations, ρ and j.
3. Time advancement of the electromagnetic fields E, B, discretized on spatial grids (see
subsection 2.2.1), using the Maxwell equations and the quantities computed in step 2
as sources.
The critical parameter of a simulation is the grid cell size ∆x. The integration timestep
∆t is related to ∆x by the Courant condition ([6]). This condition is required for the
stability of explicit integration schemes for the Maxwell PDEs, and reads c∆t ≤ C∆x,
where the constant C depends on the set of algorithms used.
Furthermore, the size ∆x must be small enough to resolve, with enough grid points, the
typical length scales of the considered system.
In the case of a system of electromagnetic waves interacting with a plasma, these are:
• $\lambda_{em}$, the wavelength of the electromagnetic waves
• $\lambda_{sd} = c/\omega_{pe}$, the plasma skin depth
The smaller of the two length scales must be resolved with enough grid points. The two
length scales correspond to two mutually exclusive regimes:
• Overcritical regime: $\omega \ll \omega_{pe} \rightarrow \lambda_{sd} \ll \lambda_{em}$; $\lambda_{sd}$ must be resolved, which resolves the
other scale as well.
• Underdense plasma regime: $\lambda_{em} \ll \lambda_{sd}$; $\lambda_{em}$ must be resolved, which resolves the
other scale as well.
The laser envelope approximation (see section 2.4) may come to help in this case,
requiring only the much larger scale $\lambda_{sd}$ to be resolved and thus allowing the use of
much coarser grids.
Another critical parameter of a PIC simulation is the number of macro-particles per cell,
which sample the local phase space. Approximating the phase-space distribution as a finite
decomposition into too small a number of spatially extended macro-particles can cause some
regions of phase space to be represented with insufficient detail and introduces a statistical
noise effect, whose relative amplitude scales approximately as $1/\sqrt{N_{\mathrm{part\ per\ cell}}}$ ([6]).
2.1.3 Interpolation and deposition using shape functions
The spatial averaging needed for interpolation and deposition processes is defined using
the particles’ shape function g.
2.1.3.1 Force interpolation
The average (interpolated) force acting on a particle, with $\mathbf{F}(\mathbf{x},\mathbf{p}_n,t) = q\left(\mathbf{E}(\mathbf{x}) + \frac{\mathbf{p}_n\times\mathbf{B}(\mathbf{x})}{m\gamma c}\right)$, is defined as:
$$ \mathbf{F}_n(\mathbf{x}_n,\mathbf{p}_n,t) = \int g(\mathbf{x}-\mathbf{x}_n)\,\mathbf{F}(\mathbf{x},\mathbf{p}_n,t)\, d\mathbf{x} \qquad (2.10) $$
Indexing the grid cells, with characteristic function $\chi_\imath$, by the multidimensional index
$\imath = (i,j,k)$, it is possible to decompose the above integral average into a finite sum of
single-cell averages:
$$ \mathbf{F}_n(\mathbf{x}_n,\mathbf{p}_n,t) = \sum_{\imath\in G} \int_{X_\imath} g(\mathbf{x}-\mathbf{x}_n)\,\mathbf{F}(\mathbf{x},\mathbf{p}_n,t)\, d\mathbf{x} \qquad (2.11) $$
The force is given by the fields, which are discretized in such a way that they take a
single, constant value per cell, $\mathbf{E}_\imath,\mathbf{B}_\imath$. It is therefore possible to write the cell integrals
as functions of the particle position only:
$$ \mathbf{F}_n(\mathbf{x}_n,\mathbf{p}_n,t) = \sum_{\imath\in G} \mathbf{F}_\imath(\mathbf{p}_n,t) \int_{X_\imath} g(\mathbf{x}-\mathbf{x}_n)\, d\mathbf{x} \qquad (2.12) $$
or, introducing the shape factors (for a particle at position $\mathbf{x}_n$) $S_\imath(\mathbf{x}_n) = \int_{X_\imath} g(\mathbf{x}-\mathbf{x}_n)\, d\mathbf{x}$:
$$ \mathbf{F}_n(\mathbf{x}_n,\mathbf{p}_n,t) = \sum_{\imath\in G} \mathbf{F}_\imath(\mathbf{p}_n,t)\, S_\imath(\mathbf{x}_n) \qquad (2.13) $$
The shape factors satisfy the property $\sum_\imath S_\imath(\mathbf{x}) = 1$.
2.1.3.2 Charge and current deposition
In order to evaluate the electrical charge and current densities, discretized on a grid, it is
necessary to "deposit" the macro-particle charge on the grid nodes. With $\rho(\mathbf{x})$ defined
as $\rho(\mathbf{x}) = \sum_n q\, g(\mathbf{x}-\mathbf{x}_n)$, then:
$$ \rho_\imath = \frac{1}{V_\imath}\int_{\chi_\imath}\rho(\mathbf{x})\, d\mathbf{x} = \frac{1}{V_\imath}\int_{\chi_\imath}\left[\sum_n q\, g(\mathbf{x}-\mathbf{x}_n)\right] d\mathbf{x} = \frac{1}{V_\imath}\sum_n q\left[\int_{\chi_\imath} g(\mathbf{x}-\mathbf{x}_n)\, d\mathbf{x}\right] = \frac{1}{V_\imath}\sum_n q\, S_\imath(\mathbf{x}_n) $$
where $V_\imath = \int_{\chi_\imath} d\mathbf{x}$ is the cell volume.
2.1.3.3 Common shapefunctions and shapefactors
For a regular 3D Cartesian grid, with grid cells of size $\Delta\mathbf{x} = (\Delta x,\Delta y,\Delta z)$ centered at the
points $\mathbf{x}_{\imath=(i,j,k)} = (x_i, y_j, z_k)$, $x_i = x_0 + \Delta x\cdot i$, it is useful to introduce the centered and
normalized shape functions and shape factors $\tilde{g}(\tilde{\mathbf{x}})$ and $\tilde{S}(\tilde{\mathbf{x}}_\imath)$, defined starting from the
cell-centered coordinate system (denoted by ~):
$$ \tilde{\mathbf{x}}_\imath = \left((x-x_i)/\Delta x,\; (y-y_j)/\Delta y,\; (z-z_k)/\Delta z\right) \qquad (2.14) $$
in which the shape factors become:
$$ \tilde{S}(\tilde{\mathbf{x}}_\imath) = S_\imath(\mathbf{x}) = \tilde{S}\left((x-x_i)/\Delta x,\; (y-y_j)/\Delta y,\; (z-z_k)/\Delta z\right) \qquad (2.15) $$
The equation $S_\imath(\mathbf{x}_n) = \int_{X_\imath} g(\mathbf{x}-\mathbf{x}_n)\, d\mathbf{x}$ can be recast using the linear change of variables
$\mathbf{y} = \Delta\mathbf{x}\ast\tilde{\mathbf{y}}_\imath + \mathbf{x}_\imath$:
$$ \tilde{S}(\tilde{\mathbf{x}}_\imath) = S_\imath(\mathbf{x}) = \int_{\chi_\imath} g\left(\mathbf{y} - (\tilde{\mathbf{x}}_\imath\ast\Delta\mathbf{x} + \mathbf{x}_\imath)\right) d\mathbf{y} \qquad (2.16) $$
$$ \tilde{S}(\tilde{\mathbf{x}}_\imath) = V\int_{\tilde{\chi}} g\left(\Delta\mathbf{x}\ast(\tilde{\mathbf{y}}_\imath - \tilde{\mathbf{x}}_\imath)\right) d\tilde{\mathbf{y}}_\imath = \int_{\tilde{\chi}} \tilde{g}(\tilde{\mathbf{y}}_\imath - \tilde{\mathbf{x}}_\imath)\, d\tilde{\mathbf{y}}_\imath \qquad (2.17) $$
where $\tilde{\chi}$ is the box defined by $|\tilde{x}| < \frac{1}{2}$, $|\tilde{y}| < \frac{1}{2}$, $|\tilde{z}| < \frac{1}{2}$ and $\tilde{g}(\tilde{\mathbf{y}}) \equiv V g(\Delta\mathbf{x}\ast\tilde{\mathbf{y}})$.
It is natural for the shape function to be separable into one-dimensional components, i.e.
$\tilde{g}(\tilde{\mathbf{x}}) = \tilde{g}(\tilde{x})\tilde{g}(\tilde{y})\tilde{g}(\tilde{z})$. By simple integration properties, one also has:
$$ \tilde{S}(\tilde{\mathbf{x}}) = \tilde{S}_x(\tilde{x})\,\tilde{S}_y(\tilde{y})\,\tilde{S}_z(\tilde{z}) \qquad (2.18) $$
This last relation and equation 2.17 make it easy to compute the shape factors for any
separable shape function.
[Figures: examples of normalized, one-dimensional shape functions and the corresponding shape factors (the ~ is dropped in the plots).]
These classical shape functions are defined piecewise on intervals of length ∆x (1 in the
normalized coordinate system). By definition, the shape factor functions $\tilde{S}(\tilde{x})$ have the
same property. The intervals α of piecewise definition of $\tilde{S}(\tilde{x}_i)$ are always
of the kind $\alpha = [a_\alpha, a_\alpha + 1]$, and can be identified by their parameter $a \in A$, integer or
half-integer:
$$ a \le \tilde{x}_i \le a + 1 \qquad (2.19) $$
Replacing the definitions in the relation above, and applying the properties of the floor and ceiling
functions ($\lfloor x\rfloor$ and $\lceil x\rceil$), one directly obtains the cell index corresponding to a given piece
of the definition (for performance reasons, it is useful to know it in advance); here $\tilde{x}_0 \equiv (x-x_0)/\Delta x$:
$$ a\,\Delta x \le x - x_0 - i_a\Delta x \le (a+1)\Delta x $$
$$ \lceil (x-x_0)/\Delta x - a - 1\rceil \le i_a \le \lfloor (x-x_0)/\Delta x - a\rfloor $$
$$ \forall a:\; i_a = \lfloor (x-x_0)/\Delta x - a\rfloor = \begin{cases} \lfloor \tilde{x}_0\rfloor - a \equiv i_0 - a & a \;\text{integer}\\[2pt] \lfloor \tilde{x}_0 + \tfrac{1}{2}\rfloor - a' \equiv i'_0 - a' & a = a' - \tfrac{1}{2} \;\text{half-integer}\end{cases} \qquad (2.20) $$
The optimized chain of computation reads (primes refer to the case in which the a are half-integers):
$$ \forall a\in A:\quad i_a = i_0^{(\prime)} - a^{(\prime)} \;\rightarrow\; \tilde{x}_{i_0-a} = \tilde{x}_{i_0} - a^{(\prime)} \;\rightarrow\; \tilde{S}(\tilde{x}_{i_0} - a^{(\prime)}) \equiv \tilde{S}_a(\tilde{x}_{i_0}) \quad \text{(optimized } \tilde{S}_a\text{)} $$
Defining $b\in B$ and $c\in C$ as the analogues, for the y and z directions, of the intervals
$a\in A$, one can finally recompose the full 3D interpolation algorithm for a particle at
position $\mathbf{x} = (x,y,z)$ (dropping the primes):
$$ (i_0, j_0, k_0) = (\lfloor\tilde{x}_0\rfloor, \lfloor\tilde{y}_0\rfloor, \lfloor\tilde{z}_0\rfloor) = \left(\lfloor (x-x_0)/\Delta x\rfloor, \dots, \dots\right) $$
$$ \mathbf{F} = \sum_{(a,b,c)} \tilde{S}^x_a(\tilde{x}_{i_0})\cdot \tilde{S}^y_b(\tilde{y}_{j_0})\cdot \tilde{S}^z_c(\tilde{z}_{k_0})\cdot \mathbf{F}_{i_0+a,\; j_0+b,\; k_0+c} \qquad (2.21) $$
and the deposition algorithm (of the single-particle quantity F):
$$ (i_0, j_0, k_0) = (\lfloor\tilde{x}_0\rfloor, \lfloor\tilde{y}_0\rfloor, \lfloor\tilde{z}_0\rfloor) = \left(\lfloor (x-x_0)/\Delta x\rfloor, \dots, \dots\right) $$
$$ \forall a,b,c:\quad \tilde{S}^x_a(\tilde{x}_{i_0})\cdot \tilde{S}^y_b(\tilde{y}_{j_0})\cdot \tilde{S}^z_c(\tilde{z}_{k_0})\cdot F \;\rightarrow\oplus\rightarrow\; F_{i_0+a,\; j_0+b,\; k_0+c} \qquad (2.22) $$
More generally, considering symmetric shape factors $\tilde{S}(\tilde{x})$ with support $\mathrm{supp}(\tilde{S})$ contained in the
interval $[-l, l]$ ($l = b + \Delta x/2$, $b$ being the half-width of the support of g), the interpolation is computed only on the grid cells for which
$S_i(x) \ne 0$, or, equivalently, $\tilde{S}(\tilde{x}_i) \ne 0$:
$$ \mathrm{supp}(\tilde{S}) = \{-l \le \tilde{x}_i \le l\} $$
$$ -l \le (x - x_i)/\Delta x \le l $$
$$ -l\,\Delta x \le x - x_0 - i\cdot\Delta x \le l\,\Delta x $$
$$ (x - x_0)/\Delta x - l \le i \le (x - x_0)/\Delta x + l $$
$$ \lceil \tilde{x}_0 - l\rceil \le i \le \lfloor \tilde{x}_0 + l\rfloor \qquad (2.23) $$
So the cells interacting with our particle are the ones with i-index in the set I:
$$ i \in I = \{\lfloor \tilde{x}_0 - l\rfloor + 1,\; \lfloor \tilde{x}_0 - l\rfloor + 2,\; \dots,\; \lfloor \tilde{x}_0 + l\rfloor\}, \qquad \#(I) = 2l $$
Furthermore,
• if l is an integer: $i \in I = \{\lfloor \tilde{x}_0\rfloor - l + 1,\; \dots,\; \lfloor \tilde{x}_0\rfloor,\; \dots,\; \lfloor \tilde{x}_0\rfloor + l\}$; e.g. $l = 1$: $i \in I = \{\lfloor \tilde{x}_0\rfloor, \lfloor \tilde{x}_0\rfloor + 1\}$
• if $l = m - \frac{1}{2}$ is a half-integer: $i \in I = \{\lfloor \tilde{x}_0 + \frac{1}{2}\rfloor - m + 1,\; \dots,\; \lfloor \tilde{x}_0 + \frac{1}{2}\rfloor,\; \dots,\; \lfloor \tilde{x}_0 + \frac{1}{2}\rfloor + m - 1\}$; e.g. $l = \frac{3}{2}$: $m = 2$, $i \in I = \{\lfloor \tilde{x}_0 + \frac{1}{2}\rfloor - 1,\; \lfloor \tilde{x}_0 + \frac{1}{2}\rfloor,\; \lfloor \tilde{x}_0 + \frac{1}{2}\rfloor + 1\}$
2.2 The “standard” second-order PIC: leapfrog and FDTD
The standard, second-order PIC scheme uses the leapfrog scheme for advancing the particles'
momenta and positions and the FDTD method for solving the Maxwell equations.
2.2.1 Solving the Maxwell equations numerically using the Yee Lattice
The temporal evolution of the electromagnetic fields, in the presence of an electrical current
j (generated in our case by the plasma particles' velocities), is completely determined,
given proper initial and boundary conditions, by the Maxwell–Faraday and Ampère–Maxwell
laws:
$$ \frac{\partial\mathbf{E}}{\partial t} = +c\,\nabla\times\mathbf{B} - 4\pi\mathbf{j}, \qquad \frac{\partial\mathbf{B}}{\partial t} = -c\,\nabla\times\mathbf{E} \qquad (2.1) $$
These equations form a system of two first order partial differential equations, which can
be integrated numerically in time using finite difference methods. First of all, the fields
are discretized on a finite grid and the spatial differentiation operators are approximated
by finite differences.
The finite-difference approximation $\Delta_h[f](\mathbf{x})$ of a linear differential operator ∆ is a linear
function of the field values at the locations $\mathbf{x} + \mathbf{h}\ast\mathbf{k}$, $\mathbf{k}$ being a vector of signed integers,
$\mathbf{h}$ the discretization step-size vector, and ∗ the component-by-component product:
$$ \Delta_h[f](\mathbf{x}) = \sum_{\mathbf{k}\in\kappa} \mu^\Delta_{\mathbf{k}}\, f(\mathbf{x} + \mathbf{h}\ast\mathbf{k}) \cong \Delta f(\mathbf{x}) \qquad (2.2) $$
Indexing the grid points G with a multi-dimensional index $\imath = (i,j,k,\dots) \in G$, the finite
difference, calculated at the grid point $\imath$, is:
$$ (\Delta_h[f])_\imath = \sum_{\jmath\in G} \mu^\Delta_{\imath-\jmath}\, f_\jmath \equiv \mu^\Delta_{\imath\jmath} f_\jmath, \qquad \mu^\Delta_{\imath\jmath} \equiv \mu^\Delta_{\imath-\jmath} \qquad (2.3) $$
For the curl, one has:
$$ \nabla\times\mathbf{E} = \begin{pmatrix} \partial_y E_z - \partial_z E_y \\ \partial_z E_x - \partial_x E_z \\ \partial_x E_y - \partial_y E_x \end{pmatrix} \;\rightarrow\; (\nabla\times\mathbf{E})_\imath = \begin{pmatrix} \mu^{\partial_y}_{\imath\jmath} E_{z\jmath} - \mu^{\partial_z}_{\imath\jmath} E_{y\jmath} \\ \dots \\ \dots \end{pmatrix} \equiv \mu^{(\nabla\times)}_{\imath\jmath} \mathbf{E}_\jmath \qquad (2.4) $$
This way, the PDE becomes a system of ordinary differential equations involving the
values of the fields defined at the grid point locations $\imath$:
$$ \left(\frac{\partial\mathbf{E}}{\partial t}\right)_\imath = +c\left(\mu^{(\nabla\times)}_{\imath\jmath}\mathbf{B}_\jmath\right) - 4\pi\mathbf{j}_\imath, \qquad \left(\frac{\partial\mathbf{B}}{\partial t}\right)_\imath = -c\left(\mu^{(\nabla\times)}_{\imath\jmath}\mathbf{E}_\jmath\right) \qquad (2.5) $$
It is now possible to integrate the equations iteratively and numerically, using methods
like Runge-Kutta, or by directly expressing the temporal differentiation operators as finite
differences and then solving the resulting system for the values at later times.
Depending on the discretization geometry of the fields and on the physical quantities it is more
important to conserve, various choices can be made for the numerical derivative and
integration schemes:
• The time integration is usually computed explicitly (the appropriate Courant–Friedrichs–Lewy
condition on the integration step must therefore be satisfied for stability), using
Runge-Kutta, forward (Euler) or centered (leapfrog) schemes.
• The spatial derivatives (curls) can be discretized using centered (second-order accurate
in space), higher-order (five-point stencil), upwind (for certain geometries
and purposes) or even compact (see [5]) schemes.
A very popular integration scheme, which is second-order accurate in space and time, is
the Finite-Difference Time-Domain method, or FDTD (see [33]). It manages to discretize
both the time and all of the space derivatives using centered differences of the kind
$\Delta x\cdot f'(x) \simeq f(x + \frac{1}{2}\Delta x) - f(x - \frac{1}{2}\Delta x)$, which are practically more accurate than the standard
$2\Delta x\cdot f'(x) \simeq f(x + \Delta x) - f(x - \Delta x)$, being $f'(x) = \partial_x f(x)$:
• The electric and magnetic fields are defined at staggered time positions ($\mathbf{E}^n$, $\mathbf{B}^{n+1/2}$),
and the time-advance iteration is obtained with a leapfrog step: the PDE system
has a symplectic structure. This integration scheme allows for wave propagation
free of numerical energy dissipation.
• Spatially, the different field components are stored at different grid locations,
on a Yee lattice. The 3D computational domain is split into cubic voxels. The
components of E are stored on the edges (along the corresponding directions), while
the components of B are stored on the face centers of the cube. The numerical derivatives
composing the curl operations can all be expressed using central differences, as can be
seen in the standard Yee-cell figure (from Wikipedia).
For a uniform 3D Cartesian discretization of the fields (with grid spacings dx, dy, dz),
the FDTD scheme can therefore be written explicitly as (the notation used is $F^n_{i,j,k} \equiv F(i\,dx, j\,dy, k\,dz, n\,dt)$):
$$ \frac{E^{n+1}_{x\,i+1/2,j,k} - E^{n}_{x\,i+1/2,j,k}}{dt} = c\,\frac{B^{n+1/2}_{z\,i+1/2,j+1/2,k} - B^{n+1/2}_{z\,i+1/2,j-1/2,k}}{dy} - c\,\frac{B^{n+1/2}_{y\,i+1/2,j,k+1/2} - B^{n+1/2}_{y\,i+1/2,j,k-1/2}}{dz} - 4\pi J^{n+1/2}_{x\,i+1/2,j,k} $$
$$ \frac{E^{n+1}_{y\,i,j+1/2,k} - E^{n}_{y\,i,j+1/2,k}}{dt} = c\,\frac{B^{n+1/2}_{x\,i,j+1/2,k+1/2} - B^{n+1/2}_{x\,i,j+1/2,k-1/2}}{dz} - c\,\frac{B^{n+1/2}_{z\,i+1/2,j+1/2,k} - B^{n+1/2}_{z\,i-1/2,j+1/2,k}}{dx} - 4\pi J^{n+1/2}_{y\,i,j+1/2,k} $$
$$ \frac{E^{n+1}_{z\,i,j,k+1/2} - E^{n}_{z\,i,j,k+1/2}}{dt} = c\,\frac{B^{n+1/2}_{y\,i+1/2,j,k+1/2} - B^{n+1/2}_{y\,i-1/2,j,k+1/2}}{dx} - c\,\frac{B^{n+1/2}_{x\,i,j+1/2,k+1/2} - B^{n+1/2}_{x\,i,j-1/2,k+1/2}}{dy} - 4\pi J^{n+1/2}_{z\,i,j,k+1/2} $$
$$ \frac{B^{n+1/2}_{x\,i,j+1/2,k+1/2} - B^{n-1/2}_{x\,i,j+1/2,k+1/2}}{dt} = -c\,\frac{E^{n}_{z\,i,j+1,k+1/2} - E^{n}_{z\,i,j,k+1/2}}{dy} + c\,\frac{E^{n}_{y\,i,j+1/2,k+1} - E^{n}_{y\,i,j+1/2,k}}{dz} $$
$$ \frac{B^{n+1/2}_{y\,i+1/2,j,k+1/2} - B^{n-1/2}_{y\,i+1/2,j,k+1/2}}{dt} = -c\,\frac{E^{n}_{x\,i+1/2,j,k+1} - E^{n}_{x\,i+1/2,j,k}}{dz} + c\,\frac{E^{n}_{z\,i+1,j,k+1/2} - E^{n}_{z\,i,j,k+1/2}}{dx} $$
$$ \frac{B^{n+1/2}_{z\,i+1/2,j+1/2,k} - B^{n-1/2}_{z\,i+1/2,j+1/2,k}}{dt} = -c\,\frac{E^{n}_{y\,i+1,j+1/2,k} - E^{n}_{y\,i,j+1/2,k}}{dx} + c\,\frac{E^{n}_{x\,i+1/2,j+1,k} - E^{n}_{x\,i+1/2,j,k}}{dy} \qquad (2.6) $$
or, defining the operators:
$$ \nabla^+ f_{ijk} = \left(\frac{f_{i+1,j,k} - f_{i,j,k}}{dx},\; \frac{f_{i,j+1,k} - f_{i,j,k}}{dy},\; \frac{f_{i,j,k+1} - f_{i,j,k}}{dz}\right) $$
$$ \nabla^- f_{ijk} = \left(\frac{f_{i,j,k} - f_{i-1,j,k}}{dx},\; \frac{f_{i,j,k} - f_{i,j-1,k}}{dy},\; \frac{f_{i,j,k} - f_{i,j,k-1}}{dz}\right) \qquad (2.7) $$
as:
$$ \nabla^-\cdot\mathbf{E}^n = 4\pi\rho^n, \qquad \nabla^+\cdot\mathbf{B}^{n+1/2} = 0 $$
$$ \frac{\mathbf{E}^{n+1} - \mathbf{E}^n}{dt} = c\,\nabla^-\times\mathbf{B}^{n+1/2} - 4\pi\mathbf{J}^{n+1/2}, \qquad \frac{\mathbf{B}^{n+1/2} - \mathbf{B}^{n-1/2}}{dt} = -c\,\nabla^+\times\mathbf{E}^n \qquad (2.8) $$
with:
$$ \mathbf{E}^n = \left(E^n_{x\,i+1/2,j,k},\; E^n_{y\,i,j+1/2,k},\; E^n_{z\,i,j,k+1/2}\right) $$
$$ \mathbf{B}^n = \left(B^n_{x\,i,j+1/2,k+1/2},\; B^n_{y\,i+1/2,j,k+1/2},\; B^n_{z\,i+1/2,j+1/2,k}\right) $$
$$ \rho^n = \rho^n_{i,j,k}, \qquad \mathbf{J}^n = \left(J^n_{x\,i+1/2,j,k},\; J^n_{y\,i,j+1/2,k},\; J^n_{z\,i,j,k+1/2}\right) \qquad (2.9) $$
The Courant condition of this numerical scheme is ([6]):
$$ c\cdot dt < 1\left/\sqrt{\left(\frac{1}{dx}\right)^2 + \left(\frac{1}{dy}\right)^2 + \left(\frac{1}{dz}\right)^2}\right. \qquad (2.10) $$
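A minimal serial C++ sketch (an assumed example: single node, no boundary conditions, no MPI/GPU concerns, and not the jasmine implementation) of the field half of the scheme: one leapfrog update of E and B on a Yee lattice following eqs. 2.6–2.8, with each component stored at its own staggered location as in eq. 2.9. The timestep dt is assumed to satisfy the Courant condition 2.10.

```cpp
// Minimal serial sketch (assumed example) of the FDTD/Yee updates, CGS units.
#include <vector>

struct YeeGrid {
    int nx, ny, nz;
    double dx, dy, dz;
    // each component stored at its own staggered location, eq. (2.9)
    std::vector<double> Ex, Ey, Ez, Bx, By, Bz, Jx, Jy, Jz;
    int idx(int i, int j, int k) const { return (k * ny + j) * nx + i; }
};

// E^{n+1} = E^n + c dt (nabla^- x B^{n+1/2}) - 4 pi dt J^{n+1/2}
void advance_E(YeeGrid& g, double c, double dt) {
    const double pi = 3.141592653589793;
    for (int k = 1; k < g.nz - 1; ++k)
    for (int j = 1; j < g.ny - 1; ++j)
    for (int i = 1; i < g.nx - 1; ++i) {
        int n = g.idx(i, j, k);
        g.Ex[n] += c * dt * ((g.Bz[n] - g.Bz[g.idx(i, j - 1, k)]) / g.dy
                           - (g.By[n] - g.By[g.idx(i, j, k - 1)]) / g.dz)
                 - 4.0 * pi * dt * g.Jx[n];
        g.Ey[n] += c * dt * ((g.Bx[n] - g.Bx[g.idx(i, j, k - 1)]) / g.dz
                           - (g.Bz[n] - g.Bz[g.idx(i - 1, j, k)]) / g.dx)
                 - 4.0 * pi * dt * g.Jy[n];
        g.Ez[n] += c * dt * ((g.By[n] - g.By[g.idx(i - 1, j, k)]) / g.dx
                           - (g.Bx[n] - g.Bx[g.idx(i, j - 1, k)]) / g.dy)
                 - 4.0 * pi * dt * g.Jz[n];
    }
}

// B^{n+1/2} = B^{n-1/2} - c dt (nabla^+ x E^n); called with dt/2 for half steps
void advance_B(YeeGrid& g, double c, double dt) {
    for (int k = 0; k < g.nz - 1; ++k)
    for (int j = 0; j < g.ny - 1; ++j)
    for (int i = 0; i < g.nx - 1; ++i) {
        int n = g.idx(i, j, k);
        g.Bx[n] -= c * dt * ((g.Ez[g.idx(i, j + 1, k)] - g.Ez[n]) / g.dy
                           - (g.Ey[g.idx(i, j, k + 1)] - g.Ey[n]) / g.dz);
        g.By[n] -= c * dt * ((g.Ex[g.idx(i, j, k + 1)] - g.Ex[n]) / g.dz
                           - (g.Ez[g.idx(i + 1, j, k)] - g.Ez[n]) / g.dx);
        g.Bz[n] -= c * dt * ((g.Ey[g.idx(i + 1, j, k)] - g.Ey[n]) / g.dx
                           - (g.Ex[g.idx(i, j + 1, k)] - g.Ex[n]) / g.dy);
    }
}
```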
Physics requires the first two equations (the Gauss laws) in 2.8 to hold at every time step.
Analytically, if the electrical charge continuity equation $\frac{\partial\rho}{\partial t} + \nabla\cdot\mathbf{J} = 0$ holds, the EM
field evolution PDEs 2.1 automatically enforce the Gauss law (if the law was satisfied at
the initial time). In fact, applying the divergence to the Ampère–Maxwell law $\frac{\partial\mathbf{E}}{\partial t} = +c\,\nabla\times\mathbf{B} - 4\pi\mathbf{j}$,
and using the Gauss law itself, one obtains the electrical charge continuity
equation: if the latter is satisfied, so is the Gauss law.
In a simple PIC simulation, instead, the J deposition process, which uses finite-sized shape
functions, introduces and accumulates numerical errors due to cell-boundary crossings, and
charge conservation must be enforced in other ways, as shown in section 2.3.
2.2.2 Boris pusher
The quasi-particles' motion is computed using the Boris method, a second-order,
leapfrog-like method that matches perfectly with the previously discussed FDTD scheme
and can be used, in general, to integrate the equations of motion of relativistic particles in
an external electromagnetic field.
For a relativistic particle of mass m, Newton's second law reads:
$$ \mathbf{p} = \gamma m\mathbf{v}, \qquad \mathbf{F} = \frac{d\mathbf{p}}{dt} \qquad (2.11) $$
Introducing $\mathbf{u} = \gamma\mathbf{v}/c = \mathbf{p}/mc$, one has:
$$ \mathbf{F} = m\,\frac{d(\gamma\mathbf{v})}{dt} = mc\,\frac{d\mathbf{u}}{dt} \qquad (2.12) $$
In the electromagnetic case, the acting force is the Lorentz force, $\mathbf{F} = q\left(\mathbf{E} + \frac{\mathbf{v}\times\mathbf{B}}{c}\right)$, in
which the fields are the result of the shape-function interpolation on the grid, at the
particle's position x. One has:
$$ \frac{d\mathbf{u}}{dt} = \frac{q}{mc}\left(\mathbf{E} + \frac{\mathbf{v}\times\mathbf{B}}{c}\right) \qquad (2.13) $$
The so-called leapfrog method can be used to solve numerically a symplectic system of differential equations (with the appropriate initial conditions) of the kind:

 du/dt = F/m,  dx/dt = u/γ   (2.14)

Defining x and u at staggered discrete times, x^n = x(n∆t), u^{n+1/2} = u(n∆t + ∆t/2), the method integrates the equations of motion iteratively according to the replacement rules:

 (u^{n+1/2} − u^{n−1/2})/∆t = F^n/m,  (x^{n+1} − x^n)/∆t = u^{n+1/2}/γ^{n+1/2}   (2.15)
In the non-relativistic case γ = 1 and, if the force can be defined at integer times, F = F(x) → F^n = F(x^n), all the quantities are discretized and centered correctly in time.
In our case, instead, the Lorentz force requires knowing the momentum u and the relativistic factor γ at integer times n:

 (u^{n+1/2} − u^{n−1/2})/∆t = (q/mc) ( E^n + (u^n/γ^n) × B^n )   (2.16)

It is therefore necessary to center u^n and γ^n in time (obtaining their values at times n∆t). The momentum can be centered using a temporal average (which formally maintains the second-order accuracy):

 u^n = (u^{n+1/2} + u^{n−1/2})/2,   (2.17)

which leads to an implicit equation:

 (u^{n+1/2} − u^{n−1/2})/∆t = (q/mc) ( E^n + ((u^{n+1/2} + u^{n−1/2})/(2γ^n)) × B^n ),   (2.18)
whereas, for the centering of γ^n, it is convenient to use the Boris method. Defining u^+ and u^− by the relations:

 u^{n−1/2} = u^− − qE^n∆t/(2mc)   (2.19)
 u^{n+1/2} = u^+ + qE^n∆t/(2mc)   (2.20)

and replacing in the above equation, one has:

 (u^+ − u^−)/∆t = (u^+ + u^−) × qB^n/(2γ^n mc)   (2.21)
 u^+ − u^− = (u^+ + u^−) × qB^n∆t/(2γ^n mc)   (2.22)
The very last equation represents a rotation of the vector u around an axis parallel to B by an angle θ = −2 arctan(qB∆t/(2γmc)). Therefore the relation γ^n = √(1 + |u^−|²) = √(1 + |u^+|²) holds.
Introducing t = qB∆t/(2γ^n mc), one has:

 u^+ − u^− = (u^+ + u^−) × t   (2.23)

But this is a linear system in the unknowns u^+_x, u^+_y, u^+_z:

 u^+_x − u^−_x + t_y (u^+_z + u^−_z) − t_z (u^+_y + u^−_y) = 0
 u^+_y − u^−_y − t_x (u^+_z + u^−_z) + t_z (u^+_x + u^−_x) = 0
 u^+_z − u^−_z + t_x (u^+_y + u^−_y) − t_y (u^+_x + u^−_x) = 0
 (2.24)
having, as solution:
 u′ = u^− + u^− × t
 u^+ = u^− + u′ × 2t/(1 + t²)
 → u^+ = u^− + (u^− + u^− × t) × 2t/(1 + t²)
 (2.25)
The position of the particle is obtained centering it with respect to the values of u, γ:

 x^{n+1} = x^n + v^{n+1/2}∆t = x^n + (u^{n+1/2}/γ^{n+1/2}) c∆t   (2.26)
The scheme is time-reversible and introduces a second-order error in the particle trajec-
tory.
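As a concrete illustration, a minimal, non-optimized sketch of one Boris push (Eqs. 2.19–2.26) is given below, in the dimensionless-momentum convention u = p/(mc) used above; the small Vec3 helper type and the way the interpolated fields are passed in are assumptions of this sketch.

#include <cmath>

struct Vec3 { double x, y, z; };

static Vec3 add(Vec3 a, Vec3 b)     { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 scale(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }
static Vec3 cross(Vec3 a, Vec3 b)   { return {a.y * b.z - a.z * b.y,
                                              a.z * b.x - a.x * b.z,
                                              a.x * b.y - a.y * b.x}; }
static double norm2(Vec3 a) { return a.x * a.x + a.y * a.y + a.z * a.z; }

// One Boris push: u = p/(mc) is advanced from (n-1/2)dt to (n+1/2)dt, x from n*dt to
// (n+1)*dt. E and B are the fields interpolated at the particle position x^n.
void boris_push(Vec3& u, Vec3& x, Vec3 E, Vec3 B, double q, double m, double c, double dt)
{
    const double qmdt2 = q * dt / (2.0 * m * c);

    // First half electric impulse: u^- = u^{n-1/2} + qE dt/(2mc)          (Eq. 2.19)
    Vec3 um = add(u, scale(E, qmdt2));

    // Magnetic rotation with t = qB dt/(2 gamma^n mc)                     (Eqs. 2.23-2.25)
    double gamma_n = std::sqrt(1.0 + norm2(um));
    Vec3 t  = scale(B, qmdt2 / gamma_n);
    Vec3 up = add(um, cross(add(um, cross(um, t)), scale(t, 2.0 / (1.0 + norm2(t)))));

    // Second half electric impulse: u^{n+1/2} = u^+ + qE dt/(2mc)         (Eq. 2.20)
    u = add(up, scale(E, qmdt2));

    // Position update, with v = c u / gamma                               (Eq. 2.26)
    double gamma_np = std::sqrt(1.0 + norm2(u));
    x = add(x, scale(u, c * dt / gamma_np));
}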
2.2.3 Bringing all together
For writing a PIC code using the Boris scheme, one has to match the time indices in the evolution equations derived in the previous subsections.
The chain of computations for the n-th simulation cycle, assuming that the quantities E^n, B^{n−1/2}, x^n, u^{n−1/2} are known, reads:
1. Advance B^{n−1/2} → B^n using E^n.
2. Advance u^{n−1/2} → u^{n+1/2} using E^n and B^n.
3. Advance x^n → x^{n+1} using u^{n+1/2}.
4. Compute J^{n+1/2} using u^{n+1/2} and x^{n+1/2} = (x^n + x^{n+1})/2.
5. Advance B^n → B^{n+1/2} using E^n.
6. Advance E^n → E^{n+1} using B^{n+1/2} and J^{n+1/2}.
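A compact sketch of how these six passes may be arranged in code is given below; every identifier is a placeholder (the helper bodies are left empty), and only the ordering and the half-step splitting of the B advance are the point here.

// One PIC cycle implementing steps 1-6 above; all names are illustrative placeholders.
struct FieldState    { /* E, B and J grids would live here */ };
struct ParticleState { /* particle positions x and momenta u */ };

// Empty stubs standing for the routines described in the previous subsections.
static void advance_B_half_step(FieldState&, double) {}                    // FDTD, Eq. 2.6
static void push_momenta(ParticleState&, const FieldState&, double) {}     // Boris, Eqs. 2.16-2.25
static void push_positions(ParticleState&, double) {}                      // Eq. 2.26
static void deposit_current(const ParticleState&, FieldState&, double) {}  // Sec. 2.3
static void advance_E_full_step(FieldState&, double) {}                    // FDTD, Eq. 2.6

void pic_cycle(FieldState& f, ParticleState& p, double dt)
{
    advance_B_half_step(f, dt);   // 1. B^{n-1/2} -> B^n       (uses E^n)
    push_momenta(p, f, dt);       // 2. u^{n-1/2} -> u^{n+1/2} (uses E^n, B^n)
    push_positions(p, dt);        // 3. x^n -> x^{n+1}         (uses u^{n+1/2})
    deposit_current(p, f, dt);    // 4. J^{n+1/2} from u^{n+1/2} and x^{n+1/2} = (x^n + x^{n+1})/2
    advance_B_half_step(f, dt);   // 5. B^n -> B^{n+1/2}       (uses E^n)
    advance_E_full_step(f, dt);   // 6. E^n -> E^{n+1}         (uses B^{n+1/2}, J^{n+1/2})
}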
2.3 Charge conservation
2.3.1 Spectral Poisson corrector
The Poisson equation for the electric potential,

 ∇²Φ = −4πρ = −(∇ · E),   (2.1)

can be enforced using a spectral method.
At a given time, the charge density is deposited on the ρ grid and the numerical divergence of E is computed. These operations must be done coherently with the rest of the integration scheme (taking the same spatial derivatives, staggering the grids at the correct positions, etc.).
The charge error accumulated in the Gauss law,

 4πρ_err(x) ≡ 4πρ(x) − (∇ · E)(x),   (2.2)

reads, discretized:

 4πρ_err^n_{ijk} = 4πρ^n_{ijk} − (∇^−_num · E^n)_{ijk}   (2.3)

Now, one seeks an auxiliary electrostatic potential Φ which counterbalances the error charge ρ_err:

 ∇²Φ = −4πρ_err   (2.4)

Once Φ is known, the electric field can be corrected easily, by adding:
 E_corrected = E − ∇Φ   (2.5)

In fact, one now has: ∇ · E_corrected = ∇ · E − ∇²Φ = (4πρ − 4πρ_err) + 4πρ_err = 4πρ.
The Poisson equation ∇²Φ = −4πρ_err must be solved numerically. The operator ∇² = ∂xx + ∂yy + ∂zz can be discretized in physical space, using the same finite difference derivative scheme as the one used for computing the EM field derivatives. After choosing the boundary conditions, this discretization gives rise to a banded (tridiagonal in 1D, tridiagonal of tridiagonals in 2D, and tridiagonal of tridiagonals of tridiagonals in 3D) linear system, which can be solved:
• Directly, using an algorithm running in linear time. Unfortunately, the system requires an order of magnitude of extra memory to be stored (this makes it very impractical, if not impossible, for 2D and 3D grids) and the efficient solving algorithm is not easy to parallelize.
• According to the prescribed boundary conditions, one can use a discrete spectral representation of the fields.
As the continuous Fourier transform would turn the differential equation ∇²Φ = −4πρ_err into the algebraic equation k²Φ̂(k) = 4πρ̂_err(k), the discrete Fourier transform (DFT, for periodic boundary conditions), the discrete cosine transform (DCT, for Neumann boundary conditions) and the discrete sine transform (DST, for Dirichlet boundary conditions) reduce the banded linear system to a diagonal one (much easier to solve).
This can be shown immediately in 1D using the DCT; the computation for the 2D and 3D cases is analogous.
Starting from the second-order discretized Poisson equation:

 ∂xx A = b → (A_{i+1} − 2A_i + A_{i−1})/h² = b_i   (2.6)
Expressing both A and b through the 1D inverse discrete transform, the discretized equation becomes:

 A_i = const · Σ_k C_i^k a_k,  b_i = const · Σ_k C_i^k b_k,  C_i^k = cos(iπk/N)

 → Σ_k [ a_k (C_{i+1}^k − 2C_i^k + C_{i−1}^k)/h² − b_k C_i^k ] = 0   (2.7)

Since C_{i+1}^k + C_{i−1}^k = 2cos(πk/N) C_i^k, every term in the sum is proportional to C_i^k; hence, removing the summation, we obtain a diagonal system for the expansion coefficients a_k, from which we can recover the A_i using the direct transform:

 ∀k,  a_k = h² C_i^k b_k / (C_{i+1}^k − 2C_i^k + C_{i−1}^k) = h² b_k / (2(cos(πk/N) − 1))   (2.8)
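The same diagonalization is easy to verify in code. The sketch below solves the 1D discretized Poisson equation with periodic boundary conditions via a naive O(N²) discrete Fourier transform (the DCT/DST cases described above work analogously); it is an illustration of the principle under these simplifying assumptions, not the corrector implemented in the codes discussed later.

#include <complex>
#include <vector>
#include <cmath>

// Solve (A_{i+1} - 2A_i + A_{i-1})/h^2 = b_i with periodic boundary conditions by
// diagonalizing the operator with a DFT. Naive O(N^2) transforms are used for clarity;
// an FFT would be used in practice.
std::vector<double> poisson_periodic_1d(const std::vector<double>& b, double h)
{
    const int N = (int)b.size();
    const double PI = 3.14159265358979323846;
    std::vector<std::complex<double>> bk(N), ak(N);

    // Forward DFT of the source term.
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            bk[k] += b[i] * std::exp(std::complex<double>(0.0, -2.0 * PI * k * i / N));

    // The second-difference operator acting on exp(2*pi*I*k*i/N) is diagonal with
    // eigenvalue (2 cos(2*pi*k/N) - 2)/h^2, so the banded system becomes diagonal.
    for (int k = 1; k < N; ++k)
        ak[k] = bk[k] * h * h / (2.0 * std::cos(2.0 * PI * k / N) - 2.0);
    ak[0] = 0.0;  // the k = 0 (mean) mode is fixed by the boundary/gauge choice

    // Inverse DFT to recover A_i.
    std::vector<double> A(N, 0.0);
    for (int i = 0; i < N; ++i) {
        std::complex<double> s = 0.0;
        for (int k = 0; k < N; ++k)
            s += ak[k] * std::exp(std::complex<double>(0.0, 2.0 * PI * k * i / N));
        A[i] = s.real() / N;
    }
    return A;
}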
2.3.2 Esirkepov Shape functions
The Esirkepov method ([13]) for electric current deposition enforces charge conservation directly inside the deposition algorithm, and it can be applied to quasi-particles of arbitrary shape factors.
As, analytically, the charge continuity equation ∂ρ/∂t + ∇ · J = 0 guarantees the Gauss law to be compatible with the dynamics (determined by the Ampère–Maxwell equation), an analogous equation must hold (with the appropriate discretized operators) for the numerical discrete quantities.
This equation can be obtained using the following properties of the discretized operators defined in 2.7 (∆ being here the discrete Laplacian in central differences):

 ∇^± · ∇^±× = 0
 ∇^± · ∇^∓ = ∆ = (f_{i+1,j,k} − 2f_{i,j,k} + f_{i−1,j,k})/dx² + (f_{i,j+1,k} − 2f_{i,j,k} + f_{i,j−1,k})/dy² + (f_{i,j,k+1} − 2f_{i,j,k} + f_{i,j,k−1})/dz²
 (2.9)
Applying the ∇^−· divergence to the third equation in 2.8, one has:

 (∇^− · E^{n+1} − ∇^− · E^n)/dt = c ∇^− · (∇^− × B^{n+1/2}) − 4π ∇^− · J^{n+1/2}   (2.10)

or, recalling the Gauss law ∇^− · E^n = 4πρ^n:

 (ρ^{n+1} − ρ^n)/dt + ∇^− · J^{n+1/2} = 0   (2.11)
The Gauss law is satisfied at every time step only if the obtained discretized continuity equation 2.11 holds at every time step, explicitly:

 (ρ^{n+1}_{i,j,k} − ρ^n_{i,j,k})/dt + (J^{n+1/2}_{x,i+1/2,j,k} − J^{n+1/2}_{x,i−1/2,j,k})/dx + (J^{n+1/2}_{y,i,j+1/2,k} − J^{n+1/2}_{y,i,j−1/2,k})/dy + (J^{n+1/2}_{z,i,j,k+1/2} − J^{n+1/2}_{z,i,j,k−1/2})/dz = 0
 (2.12)
The charge density, in a PIC, is constructed from the form factors S (the cell-integrated shape functions) of the quasiparticles, indexed by α, with positions x^n_α:

 ρ^n_{i,j,k} = Σ_{α∈particles} q_α S_{i,j,k}(x^n_α),
 S_{i,j,k}(x_α) = S( (x_i − x_α)/dx, (y_j − y_α)/dy, (z_k − z_α)/dz ),
 Σ_{ijk∈grid} S_{i,j,k}(x_α) = 1  ∀α
Considering a single particle of charge q, one therefore has ρ^n_{i,j,k} = qS_{i,j,k}(x^n). During one timestep of its motion, it shifts by (δx, δy, δz) from the position x^n to x^{n+1}. Using the vector W, defined as

 W^x_{ijk} = −(dt/q) (J_{x,i+1,j,k} − J_{x,i,j,k})/dx
 W^y_{ijk} = −(dt/q) (J_{y,i,j+1,k} − J_{y,i,j,k})/dy
 W^z_{ijk} = −(dt/q) (J_{z,i,j,k+1} − J_{z,i,j,k})/dz
 (2.13)

in the discretized continuity equation 2.12, one is led to the equation:

 S_{ijk}(x^{n+1}) − S_{ijk}(x^n) = W^x_{ijk} + W^y_{ijk} + W^z_{ijk}
 → S_{ijk}(x + δx, y + δy, z + δz) − S_{ijk}(x, y, z) = W^x_{ijk} + W^y_{ijk} + W^z_{ijk}
As shown in ([13] ), if the particle shifts by (δx, δy, δz) along a straight line, the W
components are linear combinations of the eight shape factors (ı = (i, j, k)):
 S_ı(x, y, z), S_ı(x+δx, y, z), S_ı(x, y+δy, z), S_ı(x, y, z+δz),
 S_ı(x, y+δy, z+δz), S_ı(x+δx, y, z+δz), S_ı(x+δx, y+δy, z), S_ı(x+δx, y+δy, z+δz),

and they are given by:

 W^x_ı = ( 1/3, 1/3, −1/6, −1/6, −1/3, 1/6, 1/6, 1/3 ) · S_ijk
 W^y_ı = ( 1/3, −1/6, 1/3, −1/6, 1/6, −1/3, 1/6, 1/3 ) · S_ijk
 W^z_ı = ( 1/3, −1/6, −1/6, 1/3, 1/6, 1/6, −1/3, 1/3 ) · S_ijk

 S_ijk = ( S_ı(x, y, z), S_ı(x+δx, y, z), S_ı(x, y+δy, z), S_ı(x, y, z+δz), S_ı(x, y+δy, z+δz), S_ı(x+δx, y, z+δz), S_ı(x+δx, y+δy, z), S_ı(x+δx, y+δy, z+δz) )
 (2.14)
Once W is computed for a particle of arbitrary shape factors S, it is possible to iterate
(in space) the equations 2.13 to calculate the correct contribution to the current J.
The boundary conditions for the integration (the iteration starting and ending points),
are determined by the fact that, outside the particle’s finite-size shape function, the
contribution of the particle to the current is zero.
2.4 Envelope model
The laser envelope model describes the laser pulse propagation in a fully ionized plasma.
It leads to a slow-varying laser envelope equation which is obtained representing the fields
as modulations of fast and slow parts.
The model assumes the ions to be stationary and the collision time to be much greater
than the other timescales.
Both hypotheses are valid for phenomena involving a short laser pulse (≤ 1 ps) propagating in an underdense plasma (ω_p²/ω² ≪ 1): the electrons' thermal speed is small compared to their quiver speed in the laser field, and their collision time is much greater than the laser pulse length.
2.4.1 Laser envelope equation derivation
The wave equation and Poisson equation for the normalized, being e and m are the
charge and the mass of the electron, potentials in the Coloumb gauge:
E = −∇Φ− ∂A∂ct
B = ∇×A
0 = ∇ ·A
→
φ = eΦ
mc2
a = eAmc2
, (2.1)
read:
 (∇² − (1/c²) ∂²/∂t²) a = k_p² (n/n_0) (u/γ) + ∂(∇φ)/∂(ct) = −k_p² J/(n_0 e c) + ∂(∇φ)/∂(ct)
 ∇²φ = k_p² (n − n_i)/n_0
 (2.2)

in which u = γv/c = p/(mc) is the fluid average of the normalized electron momentum, γ = (1 + u²)^{1/2}, k_p = (4πn_0 e²/(mc²))^{1/2}, and J = −e n c u/γ.
Introducing the comoving coordinates ξ = z − ct, τ = t, and neglecting the term ∂(∇φ)/∂(ct) (the fast part of the electrostatic potential, φ_fast ∼ exp(ikξ), is typically small compared to the fast part of the plasma current), the wave equation becomes:

 (∇²_⊥ + (2/c) ∂²/∂ξ∂τ − (1/c²) ∂²/∂τ²) a ≃ k_p² (n/n_0) (u/γ)   (2.3)
Further possible simplifications are:
• For forward-going light waves, the term (1/c²)∂²a/∂τ² can be neglected. On the other hand, this term is important for backward-propagating radiation (i.e. Raman backscattering); in particular, it is necessary when running simulations in a boosted Lorentz frame ([4]).
• The leading-order transverse electron motion is the laser quiver motion: on the right hand side we can approximate u = a.
• Assuming a linearly polarized laser field of frequency ω = ck, propagating in the positive z direction, with a transverse component expressible as the modulation of a fast oscillating factor by a slowly-varying envelope a_slow (|∂_ξ a_slow| ≪ |k a_slow|): a = a_slow(r, ξ, t) exp(ikξ)/2 + c.c.
Applying these simplifications, one has:

 (∇²_⊥ + (2ik/c) ∂/∂τ + (2/c) ∂²/∂ξ∂τ) a_slow ≃ k_p² (n/n_0) a_slow/γ   (2.4)
or,

 (i k_0/k_p + ∂_ξ) ∂_τ a = (1/2) (n/n_0) (a/γ) − (1/2) ∆_⊥ a   (2.5)
2.4.2 Plasma motion and wakefield equations in cylindrical comoving
coordinates
From now on, it is assumed cylindrical symmetry around the axis z (direction of the laser
propagation).
Let us define these non-dimensional, plasma-wavelength-normalized, comoving cylindrical coordinates and variables:

 ξ = k_p (z − ct),  k_p = ω_p/c
 r → k_p r
 τ = ω_p t = k_p c t
 δ = n/n_0  (normalized density)
 (2.6)
The slowly-varying laser envelope a, defined from the normalized vector potential a_⊥ = eA_⊥/(mc²) = (a/2) e^{i(k_0/k_p)ξ} + c.c., is evolved according to the equation:

 (i k_0/k_p + ∂_ξ) ∂_τ a = (1/2) (δ/γ_fluid) a − (1/2) ∆_⊥ a,   (2.7)

in which 2π/k_0 is the laser wavelength, γ_fluid is the relativistic factor associated with the local average of the plasma velocity, and δ is the normalized density.
The laser envelope evolves on timescales much longer than a single laser oscillation. The force acting on a plasma electron can therefore be averaged in time in this scheme, using
the ponderomotive approximation for the force. In this model, the plasma does generate an electromagnetic wakefield, whose spatial and temporal scales are those of the slowly-varying envelope ([4], [12]).
The equations of motion of an electron of momentum p, in the envelope/ponderomotive approximation, read:
 dp/dt = q ( E_wake + v×B_wake/c ) − (mc²/2γ) ∇ | qA/(mc²) |²
 γ = √( 1 + p²/(m²c²) + | qA/(mc²) |² )
 (2.8)
In our normalized cylindrical coordinate system we have:

 γ = √( 1 + |a|²/2 + u_z² + u_r² )
 dξ/dτ = u_z/γ − 1
 dr/dτ = u_r/γ
 du_z/dτ = −∂γ/∂ξ − E_z − (u_r/γ) B_φ
 du_r/dτ = −∂γ/∂r − E_r − (u_z/γ) B_φ
 (2.9)
The relativistic electromagnetic wakefield is described by the electromagnetic fields E_wake, B_wake (the suffix wake is omitted hereafter for brevity):

 ∂E_wake/∂t = c∇×B_wake − 4πJ
 ∂B_wake/∂t = −c∇×E_wake

which in cylindrical coordinates become:

 ∂E_z/∂(ct) = (1/r) ∂(rB_φ)/∂r − (1/r) ∂B_r/∂φ − (4π/c) J_z
 ∂E_r/∂(ct) = (1/r) ∂B_z/∂φ − ∂B_φ/∂z − (4π/c) J_r
 ∂E_φ/∂(ct) = ∂B_r/∂z − ∂B_z/∂r − (4π/c) J_φ
 ⋮
 (2.10)
To maintain the radial symmetry of the plasma distribution (the laser envelope acts only through the radially symmetric ponderomotive force), the wakefield electromagnetic fields are constrained to satisfy E_φ = 0, B_z = 0, B_r = 0 at every point of the domain. This simplifies our system, which becomes:
 ∂E_z/∂(ct) = (1/r) ∂(rB_φ)/∂r − (4π/c) J_z
 ∂E_r/∂(ct) = −∂B_φ/∂z − (4π/c) J_r
 ∂E_φ/∂(ct) ≡ 0
 ∂B_z/∂(ct) = ∂E_φ/∂r ≡ 0
 ∂B_r/∂(ct) = −∂E_φ/∂z ≡ 0
 ∂B_φ/∂(ct) = −∂E_r/∂z + ∂E_z/∂r

which, in the comoving coordinates ξ = k_p(z − ct), become:

 ∂E_z/∂τ = ∂E_z/∂ξ + (1/r) ∂(rB_φ)/∂r − j_z
 ∂E_r/∂τ = ∂(E_r − B_φ)/∂ξ − j_r
 ∂B_φ/∂τ = −∂(E_r − B_φ)/∂ξ + ∂E_z/∂r
 (2.11)
3 INF&RNO
3.1 Motivation
INF&RNO ([5]) is a simulation framework under development at LBNL by Carlo Benedetti. Its purpose is to simulate laser-plasma electron accelerators, in which a short and intense laser pulse interacts with an underdense plasma over distances that can reach lengths of 1 meter.
Such simulations, in full 3D geometry, are technically very challenging, or even impossible, with the hardware available now.
In fact, as seen in subsection 2.1.2, the shortest scale to be resolved is the laser wavelength, which is several orders of magnitude smaller than the plasma skin depth of an underdense plasma (the scale to be resolved to describe the wakefield), which in turn can be several orders of magnitude smaller than the length of the simulation box:

 λ_laser ≪ λ_wakefield ≪ L   (3.1)
To store the state of the whole one-meter propagation distance, therefore, ∼ 20·L/λ_laser points are required in the longitudinal direction. For L = 1 m and λ_laser = 632 nm this leads to an unrealistic 3D grid of ∼ 10^27 total points. To overcome this memory requirement, which makes the simulation not just long but outright impossible at present (which is worse), one has to run it on a moving window: a window in the simulation box which moves following the laser pulse (the most interesting things happen close to it), discarding the unimportant particles that are left far behind the pulse.
Even after that, to follow the propagation of the pulse through a meter of plasma at a resolution of λ, one has to compute, using an explicit integration method, n ∼ L/λ timesteps, of the order of n ∼ 10^8. On a 2011 CPU, a 3D particle-in-cell code takes t ∼ 1 µs to compute a single particle step, and it requires N_p ∼ 10^10 particles to decently sample the moving-window phase space. Therefore, the CPU time needed to run a 1-meter-long simulation is:
 T = n · t · N_p ∼ 3 · 10^12 s ≈ 10^5 years   (3.2)
Even with an efficient parallelization on 10^5 cores, the time required for the simulation is still prohibitive. A millimeter-scale simulation, with one tenth of the particles, run on 100 cores, takes about a month.
Nevertheless, 3D simulations are crucial for understanding the phenomena and for designing LPAs working at these physical scales. Fortunately, for LPA phenomenology, a series of possible approximations exist:
• Boosted Lorentz frame (BLF): if one can neglect the effect of backward propagating waves, the simulation can be run in a Lorentz-boosted frame, moving along the laser propagation direction at relativistic velocity. Here, the unbalance between the maximum and minimum physical scales involved in a simulation can be reduced, leading to a shortening of the simulation length of order γ². In INF&RNO, γ can be taken to be ∼ 10 → γ² ∼ 10². Again, backward propagating waves, such as Raman backscattering, are not resolved by the resolution used in the BLF.
• Reduced models: using a cylindrical symmetry assumption instead of the full Cartesian 3D geometry. A speedup of ∼ 10² can be achieved by this.
• Envelope model: using the envelope and ponderomotive approximations, one does not have, in principle, to resolve all the scales down to λ_laser: the smallest laser-related length becomes L_envelope ∼ λ_sd. This could boost the simulation speed by a factor of (L_envelope/λ_laser)^4, which can be even ∼ 10^4. The problem is that, if the laser becomes strongly depleted, structures smaller than L_envelope arise, and they need to be resolved. Nevertheless, even deep in depletion, keeping the resolution high enough, a speedup of ∼ 10 can still be obtained.
INF&RNO can run the simulations using the fluid or the PIC model (or both, for staged simulations). For the PIC, for now, the last two approximations are available, while for the fluid all of them are. So the CPU years required for INF&RNO simulations are, roughly, 10² y for the PIC and 10⁻² y for the fluid.
The parallelized version of INF&RNO is therefore capable of running these simulations in practically feasible times on modern supercomputers. The peculiarities of the code required finding custom solutions for its parallelization.
3.2 Numerical Scheme
3.2.1 Plasma modelization
The laser is described by the envelope model (see sect. 2.4.1), the wakefield by the usual
Maxwell equations (see sect. 2.4.2), and the plasma can be modeled using a fluid model
or a PIC one.
3.2.1.1 Fluid plasma model
Starting from the relativistic cold-fluid plasma equations (u = γv/c):

 ∂n/∂(ct) + ∇ · (n u/γ) = 0
 ( ∂/∂(ct) + (u/γ) · ∇ ) u = F
 (3.1)
Changing variables to the comoving system (∂/∂t → ∂/∂τ − ∂/∂ξ), and recalling 2.8, one obtains the fluid equations used in INF&RNO-fluid:

 γ_fluid = √( 1 + |a|²/2 + u_z² + u_r² )
 ∂δ/∂τ = ∂δ/∂ξ − ∇ · ( (u/γ_fluid) δ )
 ∂(δu_j)/∂τ = ∂(δu_j)/∂ξ − ∇ · ( β⃗ δ u_j ) + δ [ −( E + (u/γ_fluid)×B ) − (1/(2γ_fluid)) ∇( |a|²/2 ) ],  j = z, r
 (3.2)
3.2.1.2 PIC plasma model
The quasiparticle dynamics follows from the electrons’ equations of motion 2.8.
3.2.1.3 Numerical choices
For the Maxwell wakefield and the quasiparticle motion an explicit Runge-Kutta integrator, of order 2 or 4, is used. The laser envelope is advanced using an implicit Crank–Nicolson method, which offers better stability and accuracy than explicit ones.
The fields are discretized on un-staggered, regular 2D (z, r) grids. The longitudinal ξ derivatives are computed with an upwind scheme, ∂_{ξ;∆ξ} f_{i,j} = (−3f_{i,j} + 4f_{i+1,j} − f_{i+2,j})/(2∆ξ), while the transverse ones with a standard centered scheme. The motivation for using upwind derivatives is to damp some backward-propagating instabilities.
The shape functions used for the macro-particles are the standard linear and quadratic splines.
3.3 Parallelization
As we have seen in the previous sections, an INF&RNO simulation can take several years of CPU time to be completed. Parallelizing it therefore gave us the possibility to run simulations that were otherwise unapproachable for us before.
In general, nowadays, parallelizing a code (splitting the computation among different processing units) is important because processor speeds (frequencies) have ceased to increase, while the number of processing cores per CPU die and the number of computing nodes interconnected in supercomputers still do increase, scaling very well as the years pass by.
The key quality that a parallel code should have is scalability, defined as the capability to halve the computation time by doubling the computational resources (cores). This is so important because the number of available processors in HPC clusters, as just said, grows year by year.
The first requirement for scalability is imposed by Amdahl's Law: if a code contains an (even small) serial part, the speedup given by parallelization reaches a maximum at a certain number of computing processors; for a parallelizable fraction p, the speedup on N processors is S(N) = 1/((1 − p) + p/N) ≤ 1/(1 − p). As can be seen in the following figure, even if the serial part weighs only 5% of the total serial runtime, the maximum speedup that one can reach by parallelizing is only 20x, even using thousands of processors.
This can be easily understood considering that:
• the parallel runtime is the sum of the parallelizable part and the serial part,
• the parallel part's time scales as 1/N_procs, while the serial one remains constant,
• even taking the limit N_procs → ∞ one still gets t_serial and not 0:

 lim_{N_procs→∞} t = lim_{N_procs→∞} ( t_parallelizable/N_procs + t_serial ) = t_serial   (3.1)
In INF&RNO the laser envelope equation is integrated using an implicit Crank–Nicolson method, whose solution algorithm, as usual for implicit methods, is recursive longitudinally and requires solving tridiagonal linear systems transversally (whose parallelization is very inefficient). So the code has parallelization issues along both the transverse and the longitudinal (the most important one to parallelize along) directions.
This recursiveness is in the laser evolution function F (very complicated algebraically), which computes a^{n+1}_{i−1,j} from right to left, using, for each i, the values on the right, which must already be known:

 a^{n+1}_{i−1,j} = F( a^{n+1}_{i,j} − a^n_{i,j}, a^n_{i,j} )   (3.2)
This right-value dependency makes the laser integration part impossible to parallelize longitudinally (transversally, the tridiagonals are even harder to parallelize), as it is impossible to break the dependency chain and split the computation among processing units. Leaving it un-parallelized would have given us a very poor scalability, and the maximum reachable speedup would have been bound to 10x.
Motivated by the linearity of F in 3.2 and by the fact that the envelope varies slowly and smoothly, I was led to try an iterative algorithm to solve equation 3.2.
I guessed that the difference between a^{n+1}_{ij} − a^n_{ij} and its previous value a^n_{ij} − a^{n−1}_{ij} was not too large. Fortunately, this guess turned out to be true: this difference is the finite difference approximation of the second derivative in time of a, (∂²_t[a_{ij}])^n = a^{n+1}_{ij} − 2a^n_{ij} + a^{n−1}_{ij}, which is a small factor in our conditions, non-negligible, as seen in section 2.4.1, only for backward-propagating waves.

Figure 3.3.1: Validation of the algorithm: line-outs before laser depletion, deep in depletion (density), deep in depletion (laser)
Our algorithm, after choosing a not-too-bad initial value for a^{n+1}_{ij} − a^n_{ij} ≈ a^n_{ij} − a^{n−1}_{ij}, finds an initial guess g_0 for a^{n+1}. After that, it iterates F using successive guesses in place of the unknown a^{n+1}_{ij}, to which the succession of guesses converges.
Summarizing, P being the chosen number of iterations:

 g^{n+1}_{i−1,j;0} = F( a^n_{ij} − a^{n−1}_{ij}, a^n_{ij} )   (initial guess)
 g^{n+1}_{i−1,j;k} = F( g^{n+1}_{i,j;k−1} − a^n_{ij}, a^n_{ij} )   (k-th iteration)
 a^{n+1}_{ij} = g^{n+1}_{ij;P}   (final)
 (3.3)
Even with a low resolution and the laser deep into depletion, a low number of iterations, ∼ 10, manages to find a very accurate solution, as seen in the lineouts of figure 3.3.1.
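In code, the structure of this fixed-point iteration might look as follows; the single Crank–Nicolson right-to-left sweep F is represented here only by a placeholder callable, since its actual algebra is not reproduced in this section, and the container layout is an assumption of the sketch.

#include <vector>
#include <complex>
#include <functional>

using Field2D = std::vector<std::vector<std::complex<double>>>;  // a_{i,j}, illustrative layout

// One iterative solve of Eqs. 3.2/3.3: given a^n and a^{n-1}, produce a^{n+1}.
// 'F' stands for the (unspecified here) right-to-left Crank-Nicolson sweep: given a
// guess for the increment (a^{n+1} - a^n) and a^n, it returns an updated a^{n+1}.
Field2D advance_envelope(const Field2D& a_n, const Field2D& a_nm1,
                         const std::function<Field2D(const Field2D&, const Field2D&)>& F,
                         int P /* number of iterations */)
{
    // Initial guess: assume the increment changes little between successive time steps,
    // a^{n+1} - a^n  ~  a^n - a^{n-1}.
    Field2D increment = a_n;
    for (size_t i = 0; i < a_n.size(); ++i)
        for (size_t j = 0; j < a_n[i].size(); ++j)
            increment[i][j] = a_n[i][j] - a_nm1[i][j];

    Field2D guess = F(increment, a_n);               // g_0

    // Successive substitutions: feed the last guess back into F (Eq. 3.3).
    for (int k = 1; k <= P; ++k) {
        for (size_t i = 0; i < a_n.size(); ++i)
            for (size_t j = 0; j < a_n[i].size(); ++j)
                increment[i][j] = guess[i][j] - a_n[i][j];
        guess = F(increment, a_n);                   // g_k
    }
    return guess;                                    // a^{n+1} := g_P
}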
After having made the code worth parallelizing massively (the 10x speedup bound was not too attractive), I had to choose which parallelization approach to use.
I started parallelizing it using OpenMP shared-memory parallelization, keeping all the data on a multicore node and distributing the computation on the shared data among the local cores. After that, I parallelized it using domain splitting with message passing, i.e. splitting the domain and distributing the subdomains among independent networked nodes, each one exchanging with its neighbors the data (resident in or crossing the boundaries) needed for the computation.
            | Shared memory                      | Message passing
 Concept    | Many processors sharing the work   | Split the domain
 Advantages | Fast to implement                  | Scales to huge numbers of cores
 Drawbacks  | Lower efficiency and scalability   | Communication overhead
 Standard   | OpenMP                             | MPI
The message passing parallelization is usually more efficient and of course more scalable than the shared memory one. In fact, the small number of cores per local node sets a bound on the scalability of a shared-memory-parallelized code.
A parallelization limit does exist also for message passing codes: at a certain point, splitting the domain again and again, one arrives at subdomains which are too small. At this point it becomes useful to combine the two parallelization approaches, assigning larger subdomains to multicore nodes.
3.3.1 Shared memory (OpenMP)
I parallelized the fluid and the PIC code using OpenMP, which is a shared-memory, concurrent-threads, task parallelization API.
Basically the work consisted in parallelizing the particle loop and all the loops over the longitudinal grid point index using OpenMP directives, while watching for false-sharing inefficiencies.
For the PIC I had to solve a memory read-after-write collision in the current deposition algorithm, arising when two or more different cores, processing two or more different particles concurrently, have to sum their respective particle's current into the same grid cell:
• They both read the initial cell value at the same time.
• They sum their particle's current to the value they read, which is the same stale value for both processors, since neither has had time to write back to the shared memory.
• They write their sums to the same shared memory location.
• At the end, instead of having the sum of the particle currents, one has only the current of the last particle written.
I have solved this issue in two ways:
• Using atomic operations (with the “#pragma omp atomic” directive), “locking” the memory so that a thread's memory sum operation is not interrupted by another.
• Using a different sum buffer per core (faster, but memory wasting).
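As an illustration of the first solution, a minimal sketch of an OpenMP-parallelized deposition loop protected by atomic updates is given below; the 1D grid and the linear weighting are simplifying assumptions of this sketch, not the code's actual deposition scheme.

#include <vector>

struct Particle { double x; double w; };   // position and quantity to deposit (illustrative)

// Deposit a particle quantity onto a 1D grid with linear (CIC) weights.
// The "#pragma omp atomic" makes each read-modify-write on rho[] indivisible,
// so concurrent threads cannot lose each other's contributions.
void deposit(const std::vector<Particle>& particles, std::vector<double>& rho, double dx)
{
    #pragma omp parallel for
    for (long p = 0; p < (long)particles.size(); ++p) {
        double xi = particles[p].x / dx;
        int    i  = (int)xi;               // left grid point
        double f  = xi - i;                // fractional distance from it

        #pragma omp atomic
        rho[i]     += particles[p].w * (1.0 - f);
        #pragma omp atomic
        rho[i + 1] += particles[p].w * f;
    }
}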
The OpenMP code scaling graph, up to the twelve cores of our machine, is:
3.3.2 Message-Passing (MPI)
Because of the spatial locality of the physical equations describing the system, it is possible to split the computational domain into smaller subdomains, each one evolving almost independently and needing only boundary information, which is gathered from the neighboring domains. Each physical subdomain is computed by a different computational node.
For finite-difference fluid codes, each computing node needs to ask its neighbors for the values required to compute the numerical derivatives on the boundary of its subdomain.
In particle-in-cell codes, additionally, the particles are free to move inside the entire domain, and they interact with the grid cells around them. It is necessary to keep the particles “belonging” to a computational node interacting only with the node-local subdomain of fields: the particles are moved among nodes when they cross their subdomain boundaries.
The width of the particle shape function requires taking an extra ghost grid point on the left. The particle velocity limit, c, allows checking boundary crossings only once every few time steps.
3.3.2.1 1D domain decomposition
I started by decomposing the domain along the longitudinal direction, which is usually the one with the most grid points, Nz. The grids (of dimensions Nz × Nr) and the particle domains are sliced in the z-direction, and each one of the Np;z slices is assigned to a different node.
In our longitudinal upwind derivative scheme, ∂_{ξ;∆ξ} f_{i,j} = (−3f_{i,j} + 4f_{i+1,j} − f_{i+2,j})/(2∆ξ), the values required for computing f′_{i,j} are the two points on the right, f_{i+1,j} and f_{i+2,j}. Before every derivative computation, one has to be sure that the ghost region contains the right data, which must be copied from the right neighbor node, as illustrated in the following figure:
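A minimal sketch of such a ghost-cell exchange with MPI_Sendrecv is given below; the two-column ghost layer and the flattened, column-by-column buffer layout are assumptions of this sketch, not the code's actual data structures. Each rank sends its two leftmost owned columns to its left neighbor (they are that neighbor's ghost data) and receives its own two right ghost columns from its right neighbor.

#include <mpi.h>
#include <vector>

// Exchange the ghost region needed by the upwind derivative (two z-columns on the right).
// 'field' holds n_local owned columns of nr points each, followed by 2 ghost columns.
void exchange_right_ghosts(std::vector<double>& field, int n_local, int nr, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;   // no periodicity assumed
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    // Send my two leftmost owned columns to the left neighbour and, in the same call,
    // receive my two right ghost columns from the right neighbour.
    MPI_Sendrecv(field.data(),                        2 * nr, MPI_DOUBLE, left,  0,
                 field.data() + (size_t)n_local * nr, 2 * nr, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
}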
After validating the parallelized codes (see figure 3.3.2), I measured their parallel performance:
• The fluid code proved to scale very well, as can be seen from its strong scaling graph, describing how the parallel speedup scales with the number of processors used, given a fixed problem size (Nz = 4224). In particular, for this problem size, the maximum speedup reached (still at high efficiency) is 1000x (1 year → a few hours).
• The weak scaling graph for the PIC code is similar:
Figure 3.3.2: Comparison between the result of a parallelized PIC simulation (40 minutes on 120 cores) and the scalar one (38 hours)
3.3.2.2 2D domain decomposition
The amount of information that needs to be exchanged scales with the size of the boundary. The long time needed to transfer information (during which the computation is usually idle) makes it important to minimize, for each subdomain, the ratio of boundary grid points (or crossing particles) to total ones.
In this sense, 1D decomposition is not optimal. In fact, the ratio scales linearly with the number of slices Np;z and eventually reaches 1, when Np;z = Nz.
Furthermore, Nz also represents the limit to parallelization and scalability, since it is impossible to take more than Nz slices while keeping the number of longitudinal points per slice a positive integer.
This ratio is optimal taking the subdomains to be squares, and the parallelization limit can be overcome by decomposing the domain in the radial direction too. So, 2D decomposition allows, in theory, scaling to Nz × Nr processors, instead of only Nz.
Unfortunately, in INF&RNO, the implicit algorithm used to advance the laser envelope equation in time has to solve Nz tridiagonal systems, one for each longitudinal point, each involving all the Nr grid points at that longitudinal coordinate. The most efficient algorithm (running in linear time) to solve such systems is defined recursively, and it is very hard to parallelize efficiently, as it computes numerical successions propagating radially.
In a 2D decomposition, a naive approach to this issue would be to split the succession into chunks (corresponding to the subdomains) and, for each system, let each node wait for the last value of the succession computed by the previous node in the chain. Summarizing, for each of the Nz longitudinal grid points, there is a chain of Np;r wait-compute-send steps. Nevertheless, parallelizing this part is still preferable, given what is stated by Amdahl's Law 3.1.
The best compromise found for solving the tridiagonals concurrently, by different nodes, is using a pipelined solving algorithm: after solving its part of the first system, the first node begins the computation of the second system, and so on. In any case, to achieve full efficiency it is still good to set Nz ≥ Np;z · Np;r = Np.
In practice, the scaling graph for the PIC code using the 2D decomposition shows that it is possible to run on more processors, still at a reasonable efficiency:
3.3.3 GPU parallelization
I have rewritten both the fluid and the PIC code using the jasmine (chapter 4) framework components, enabling them to run on graphics processing units (GPUs, see section 4.2). The details of the PIC code implementation and the benchmarking results are in chapter 4.
Parallelizing the fluid code was pretty straightforward, since all memory accesses in the fluid code are predictable and fixed. The final speedup for the fluid GPU parallelization (measured using an NVIDIA Fermi GPU on the “Dirac” machine at NERSC) is that a single GPU is 70 times faster than a single CPU core.
3.4 3D Visualization
Handling the data generated can be a significant challenge that requires special visualization techniques, as it is necessary to render (physically 3D) density fields, such as the charge density field or the electromagnetic fields' components.
For rendering these 3D density volumes a lot of physically based models exist. Most of them are based on raytracing and many of them are very complex, as they include multi-scattering effects and external illumination. However, the algorithm that produces the clearest-looking renders of our data is the simplest one, the emission model.
It assumes that each point of the 3D field emits light according to the value of the “density” there, and it neglects all scattering, absorption and external lighting effects. It simply consists in integrating the emitted light along marching rays starting from the eye and traversing the field volume.
Practically, one discretizes the eye-to-field integral by sampling the density ρ along each ray and transforming the values into colors using a transfer function T (linear or nonlinear):

 I = Σ_{s = eye}^{volume edge} T[ρ(s)]   (3.1)
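A minimal sketch of this eye-to-volume integration for a single ray is shown below; the density sampler and the transfer function are placeholder callables, and the uniform step ds is an assumption of the sketch.

#include <functional>

struct Vec3 { double x, y, z; };

// Emission-only volume rendering of one ray: march from the eye through the volume,
// accumulating the transfer-function-mapped density (Eq. 3.1). 'rho' samples the field
// and 'T' is the transfer function; both are placeholders here.
double integrate_ray(Vec3 eye, Vec3 dir /* unit vector */, double ray_length, double ds,
                     const std::function<double(Vec3)>& rho,
                     const std::function<double(double)>& T)
{
    double intensity = 0.0;
    for (double s = 0.0; s < ray_length; s += ds) {
        Vec3 p{eye.x + dir.x * s, eye.y + dir.y * s, eye.z + dir.z * s};
        intensity += T(rho(p)) * ds;   // emission model: no absorption or scattering
    }
    return intensity;
}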
For such renderings, the standard available visualization tools (Visit, Paraview, ...) are too memory intensive, as they sample our cylindrically symmetric grids on a full 3D representation. To solve this problem I wrote a visualization program, “jasviz”, from scratch, that samples the fields along rays directly in cylindrical geometry.
The program has a graphical user interface (GUI), generated semi-automatically using a C++ class reflection system. Once the type-specific meta-properties of the objects (which are organized hierarchically as nodes in a scene tree) are defined in the C++ code, a rewrite engine generates the GUI and serialization code (loading/saving XML files containing scene data and parameters). This approach is very programmer friendly, as it is fast and the bugs are very manageable, but the generated GUI may not be too user friendly at first glance:
The program main features are:
• Ability to record movies and still images:
– each scene node is animable separately,
– the user can define all the properties of a node at keyframes, automatically
interpolated linearly.
• Everything is computed on the GPU.
• Absorption/emission/MIP volume rendering models.
• Transfer functions:
– customizable linear gradients
– non-linear, user-customizable Gaussian activation filters
• Point particles rendering (for PIC data):
– clip by energies, color by energy
• Automatic conversion tools INF&RNO data → hdf5 format.
• It renders both full 3D Cartesian grids and 2D cylindrically symmetric ones.
• Volume clipping.
• Text and subtitles rendering.
• Cross platform.
A few sample snapshots from a video created with jasviz are:
4 Jasmine: a flexible, hybrid
(CPU+GPU), PIC framework
Particle-in-cell codes can be used to simulate a wide range of phenomena and, while some concepts of the algorithm remain the same, other parts, depending also on the physical assumptions made and the corresponding numerical solutions (we have seen many examples in the previous sections), can differ considerably among different PIC codes.
Furthermore, technology today offers different supercomputing architectures that may require heavy code modifications in order to map a code onto them. For example, parallelizing a PIC code to run on graphics processing units (GPUs, see section 4.2) requires some attention and a completely different current deposition algorithm (as shown in subsection 4.2.3).
Given these two degrees of freedom, numerical algorithms and architecture optimization, a “universal” PIC code is very unlikely to be realized and, most importantly, to be future-proof.
It is feasible, however, to write (and our attempt has this objective) a framework for particle-in-cell codes which contains as components the core concepts of the algorithm, such as interpolation and deposition algorithms, written in a generic, implementation-agnostic form. This abstract form, allowed by the meta-programming features of the C++ language, may make the code look a bit more complex and less straightforward, but it makes it reusable.
Jasmine is the name we have given to this framework, which includes some example “standard” complete PIC codes: the 1D, 2D and 3D classical second-order PIC (subsect. 2.2.3) and the cylindrically symmetric INF&RNO numerical scheme (subsect. 3.2).
The code main features are:
• Multi-dimensionality, 1D,2D,3D.
• Modularity (sect. 4.1).
• OpenMP CPU intra-node parallelization.
• Fast GPU implementation, providing an average speedup of ~45x (sect 4.2).
• Inter-node hybrid MPI, multi-GPU, parallelization (sect 4.3).
• HDF5 output, diagnostics and 2D/3D visualization of the results.
4.1 Structure of the framework
The most general layer of abstraction, available in jasmine, for a parallel PIC code
consists in:
• the particles interact, via interpolation and deposition, with the cells of a field grid.
• the particles move in a computational domain which can be split into parallel
subdomains.
– The grid points on the boundaries may need to be exchanged.
– The particles crossing a subdomain boundary need to be exchanged to the
corresponding neighbor node.
The functions in this “geometrical” layer should implement these functionalities while remaining generic with respect to particle and field structures and datatypes. On the other hand, the layer is specialized for different geometries (spatial dimensions: 1D, 2D, 3D; Cartesian or cylindrical geometry) and the algorithms are written, with as few differences as possible, for different architectures, CPUs or GPUs.
Next, the equations for the field evolution must be defined. If they can be expressed as linear PDEs, or better as Hamiltonian ones, it is a good thing to give the framework this information: it can automatically generate the code for integrating them in time with various explicit methods, such as Runge-Kutta 2, 4 or leapfrog. It is only necessary to define the field properties (such as dimensionality, number of components and staggering) and to provide the expressions for their temporal derivatives. The spatial derivatives used in the PDEs can also be left as a parameter in this layer.
The particle datatype and evolution equations are also chosen as simulation template parameters. Particles' and fields' equations are merged and mounted on one of the available integration scheme skeletons, such as Runge-Kutta or the classical second-order (2.2.3) scheme.
The GPU (CUDA)/CPU (OpenMP) parallelization abstraction layer is partially provided
by the boson library, described in the second part of section 4.2.1.3.
4.2 GPU Parallelization
Graphics processing units, or GPUs, are a new generation of processors that provide parallel codes with orders-of-magnitude speedups (compared to their scalar implementations for normal CPUs), implementing a new kind of efficient, on-chip parallelism.
Their architectural features and their programming model are explained in subsection 4.2.1, while the approach we have followed in implementing jasmine for GPUs is explained in subsection 4.2.2.
4.2.1 GPU architecture and CUDA programming model
4.2.1.1 Throughput
Modern scalar processors impose limits on (or even prohibit) two critical aspects of a computing device architecture: parallel execution and hierarchical memory organization. These processors are designed to give the illusion that the code execution is sequential and that global memory accesses are truly random, uniform and linear.
The evolution of these sequential processors seems to have come to an end, as their performance now grows very slowly with time.
Meanwhile, new architectures have been developed which, instead of denying or hiding parallelism and the explicit memory hierarchy, embrace them, enabling developers to know and control these features, in order to gain efficiency and performance with respect to scalar processors.
These architectures are named throughput processors, streaming processors or GPUs, and they first became commonly used in a field in which performance is indispensable and parallelism is explicit: real time computer graphics. Because of the commercial interest in these applications, a lot of research effort has been made to improve these architectures, which soon started to surpass normal CPUs in raw computing power.
The first programming model available for these processors was SIMD (single instruction, multiple data), used in most cases to write real time shading codes for rasterization computer graphics. Despite the limitations of this paradigm, these shaders have been used successfully to accelerate general algorithms not related to computer graphics, such as linear algebra or physical simulations.
In 2007, NVIDIA launched the CUDA architecture (Compute Unified Device Architecture): hundreds of computing cores on a single chip, capable of running thousands of threads (with hardware, zero-latency switching), and an efficient, clearly hierarchically organized memory. CUDA has given developers direct access to the native instructions and the memory profiles of the GPUs.
The release of the second generation of CUDA architectures, named Fermi, marked the GPUs' success in general purpose computing. The improved flexibility and computational power of this new generation persuaded high-performance supercomputer manufacturers to introduce hybrid (CPU+GPU) clusters on the market: as of 2011, many of the world's most powerful supercomputers (in terms of computational power) are hybrid ones.
For this kind of architecture, performance derives from parallelism and efficiency comes from the locality of memory transfers, which are explicitly controllable by the programmer. Data access locality is guaranteed using fast on-chip memories, the caches.
The role and use of the cache memories constitute the fundamental difference between the GPU and CPU programming models: while the cache philosophy implemented on CPUs inhibits parallelism, as their coherent cache implementation unavoidably suffers from false-sharing and cache-thrashing issues, the GPU caches offer a good compromise between latency-hiding (and data-locality-keeping) capabilities and parallel efficiency, allowing their use as a fast, local communication channel among neighboring cores.
Without memory locality, memory bandwidth, which is the most critical resource in a modern system, soon becomes a bottleneck.
For what concerns code execution, instead:
• CPUs execute serialized code and give the illusion that the operations are run sequentially: in reality, every operation is split into a number (~30) of stages placed on a computing pipeline. This “superscalarity” allows incrementing the number of operations done per second. The problem with pipelines is that there is always a bottleneck: if one ring of the chain stops, all the others have to wait for it, so every pipeline stage has to be fed with work. That is the reason why CPUs have branch-predicting algorithms implemented on the chip, and why their memory caches must be enormously large, as the latency of a random global memory (RAM) access would stall the whole pipeline. These branch and memory access predicting devices take up almost all the surface of a CPU die, and leave room only for a small number of actual computing cores.
• Conversely, GPUs solve the problem of global memory latencies by keeping thousands of lightweight hardware threads in parallel execution, with a switching time of the order of a clock cycle. This way, the latency of a global memory operation, which blocks a thread in a waiting state, can be hidden by switching to the execution of another one.
Therefore, a GPU takes full advantage of parallelism to remain efficient and to exploit its computing capabilities.
4.2.1.2 Parallelism and memory model (hierarchy)
The CUDA programming paradigm splits the computation into many parallel processes, named threads. The ensemble of all the threads constitutes the grid.
The threads execute the same program, but they have independent registers and share, in groups, some memory caches. CUDA threads are therefore similar to SIMD threads, with the appropriate differences:
• CUDA threads are grouped in blocks. The size of the blocks is uniform over the grid and can be chosen by the programmer. Threads in the same block share a fast, register-like, on-chip memory named shared memory.
• The blocks are deployed and executed on the physical computing cores of the GPU, the streaming multiprocessors (SMs). Each SM concurrently executes all the threads of the assigned blocks, grouping them in warps of 32 threads.
• Each thread has, in addition to the registers' state, an independent instruction pointer: the threads in a block can diverge, but the corresponding warps are serialized (until the code paths reunify) by the SM.

Figure 4.2.1: Hierarchical organization of the threads (source: NVIDIA)
As we have said many times, the key to obtaining performance from a GPU is to exploit its specialized memory spaces/profiles efficiently. The GPU memory hierarchy consists of:
• Registers: low latency, on-chip memory, private to each thread.
• Shared memory: on-chip, low latency memory, shared within a block.
• Constant memory: a small, read-only (after setting the initial value), low latency, on-chip memory, visible to all threads.
• Global memory: off-chip, on-board, high latency RAM, shared among all the threads. Automatically cached:
 – by the texture memory, a spatially efficient read-only cache that also provides 1D, 2D, 3D interpolation functions,
 – on the Fermi architecture only, by a general, configurable two-level cache (the first level is on a single SM and the second one is shared).
• Host memory: the host's main CPU RAM, connected to the GPU board by a bus (usually PCI Express) which has an order of magnitude less bandwidth than the one connecting the GPU processor to its on-board RAM.
• Inter-node communication is possible using standard message passing standards, such as MPI, and it requires passing through the host memory. Some alternatives, such as the APEnet+ project ([3]), allow the GPUs to communicate directly, without passing through the PCI Express bottleneck.
In detail:

                   | Lifetime        | Access                     | Position  | Latency              | Bandwidth (GB/s) | Size
 Registers         | single thread   | read+write                 | on-chip   | 4 clock cycles       | *                | 4k / thread
 Shared memory     | single block    | read+write                 | on-chip   | 4 clock cycles*      | *                | 16k / block
 Texture cache     | global          | read only                  | on-chip   | 4-~400 clock cycles  | ~141             | 16k
 Constant memory   | global          | read only                  | on-chip   | 4-~400 clock cycles  | ~141             | 16k
 Global memory     | global          | read+write                 | off-chip  | ~400 clock cycles    | ~141             | ~1-4 GB
 Local memory      | single thread   | read+write                 | off-chip  | ~400 clock cycles    | ~141             | ~1-4 GB
 Host memory*      | node            | read+write (through PCI-e) | off-board | not directly accessible | ~5            | ~10 GB
4.2.1.3 Programming GPUs: CUDA libraries and our multi architecture approach
The CUDA toolkit, released by NVIDIA, is the most common GPU programming environment. A CUDA program is written using an extended version of the C (and C++) language and is compiled by the NVIDIA compiler nvcc, which manages the separation of GPU and CPU code, compiles and optimizes the former and pipes the latter to a standard compiler (gcc, the Microsoft compiler or icc).
Some ad-hoc language qualifiers are used to specify the type of functions and procedures:
• The “__host__” qualifier, which is the default one, indicates that the function body is a normal CPU function and cannot be called from the GPU.
• The “__device__” qualifier is applied to functions run on the GPU and callable only from within GPU code.
• The “__global__” qualifier indicates a CUDA kernel, an entry point for a CUDA computation. The kernel is executed on the GPU but it can be launched only from host code. The launch consists in the allocation of a GPU thread grid of configurable size. On the host side, the call has the syntactical form “kernel_name<<<grid size in blocks, single block's size>>>(arguments)”.

Figure 4.2.2: CUDA memory spaces (NVIDIA)
Other nvcc compiler keywords qualify the memory space in which variables reside. For example, __shared__, used in GPU code as a prefix qualifier in a variable declaration, allocates the variable in the block-shared memory space.
Furthermore, in GPU code, some variables are defined implicitly by the compiler. Two examples are the “threadIdx” and “blockIdx” read-only variables, which are register-space multidimensional indices that give a thread the possibility to locate itself on the kernel grid: initially this information is the only thing that is not uniform across the grid, and it is therefore used by the threads to know which portion of the problem they are assigned to.
CUDA also contains some primitive atomic operations and synchronization functions, used to synchronize memory accesses, for example:
• __syncthreads() is a barrier that synchronizes the threads in a block,
• __threadfence() is a memory fence that blocks the calling thread until its previous global memory writes are visible to the other threads, preventing some global data inconsistencies.
On the host side, GPU memory can be allocated/deallocated/copied (CPU → GPU or GPU → CPU) using the host functions cudaMalloc/cudaFree/cudaMemcpy.
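A minimal, self-contained example putting these elements together (qualifiers, implicit indices, kernel launch syntax and memory management) might look like the following; it is a generic illustration, not code taken from jasmine.

#include <cuda_runtime.h>
#include <cstdio>

// A trivial kernel: each thread scales one element of the array.
// threadIdx/blockIdx/blockDim are the implicit per-thread coordinates described above.
__global__ void scale(double* v, double factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    double* host = new double[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0;

    double* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(double));                              // allocate on the GPU
    cudaMemcpy(dev, host, n * sizeof(double), cudaMemcpyHostToDevice); // host -> device copy

    int block = 256;                          // threads per block
    int grid  = (n + block - 1) / block;      // blocks in the grid
    scale<<<grid, block>>>(dev, 2.0, n);      // kernel launch

    cudaMemcpy(host, dev, n * sizeof(double), cudaMemcpyDeviceToHost); // device -> host copy
    std::printf("v[0] = %f\n", host[0]);      // expect 2.0

    cudaFree(dev);
    delete[] host;
    return 0;
}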
4.2.2 Our approach for writing cross architecture code
Programming GPUs using these language extensions can require modifying the code in many places. For example, writing a code which can run both on GPUs and on CPUs (as in our case) requires rewriting the memory allocation and copy functions for each of the two computing devices. This rewriting effort, which increases code complexity and maintenance costs, can be avoided using C++ templates as a generic abstraction proxy.
For this purpose, we have designed and implemented a general purpose, lightweight CPU/GPU parallelization library, named boson. It is composed of three main components: an N-dimensional abstract vector class, the field class and the closure launcher.
trary type on both host (CPU) and device (CUDA GPU). It provides:
• The same functionalities on GPUs and CPUs: the choice of the device is done via
a template argument.
• Cross CPU-GPU/CPU-GPU mapping and copy operations.
• Grid iteration functions.
• Compound grid elements can be organized in a soa (structs of array) or aos (array
of structs) manner by specify another template parameter.
Furthermore, as we have seen, the syntax for launching a shared-memory, multi-threaded computation is completely different in the OpenMP (CPU) and CUDA (GPU) frameworks, but the basic concepts are similar. Both programming paradigms can be abstracted using the concept of closure.
A closure is defined as an operation to be computed (possibly in parallel) on a set of elements (a stream): all the computations are identical and the only thing that changes is the element to process, identified by a multidimensional index if the set of elements can be organized on an N-dimensional grid.
Our closure launcher component runs the computation, in parallel, on the device (CPU or GPU) chosen by the application developer. Summarizing, the latter only needs to write the closure code and decide where to launch it, without worrying about the different parallel programming syntaxes, which are hidden by our abstraction layer.
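The idea can be sketched, in a much simplified form, as follows; this is not the actual boson API, only an illustration of how the same user-written closure can be handed to different device-specific launchers (here only the CPU/OpenMP path is written out).

#include <cstddef>

// Simplified illustration of the "closure launcher" idea (not the actual boson API):
// the same closure is executed over an index range; which launcher is used selects
// the target architecture, while the closure code stays unchanged.

// CPU launcher: a plain OpenMP loop over the element indices.
template <class Closure>
void launch_on_cpu(std::size_t n, Closure&& closure)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i)
        closure((std::size_t)i);
}

// A GPU launcher with the same signature would instead launch a CUDA kernel of n
// threads, each calling closure(blockIdx.x * blockDim.x + threadIdx.x); the calling
// code would not change.

int main()
{
    double data[100] = {};
    launch_on_cpu(100, [&](std::size_t i) { data[i] = 2.0 * i; });  // the "closure"
    return 0;
}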
Of course, the library does not provide any magical feature and does not solve the different-architecture/different-parallelization issues by itself: it is only a way to write code more compactly and to avoid many rewrites.
Jasmine was written using this abstraction library and therefore it has been able to
run on GPUs, multicore CPUs and, possibly, also on future similar shared memory
architectures. The main difficulty for GPU porting was developing an ad-hoc parallel
current deposition algorithm, as seen in the next subsection.
4.2.3 Current and density deposition algorithm
As seen in section 3.3.1, memory conflicts (and consequently wrongly computed values) happen when parallel threads naively sum the particles' contributions to the densities in parallel.
While this problem can be solved easily with a low number of threads, as happens in the OpenMP parallelization, new problems arise when scaling to the hundreds or thousands of threads of a GPU. In fact, the two workarounds found (atomic memory operations and the use of multiple buffers) both become very inefficient when the number of threads becomes high: the cost of memory locks increases with the number of threads waiting for them and, of course, multiple buffers are too memory demanding.
Therefore, other ways to proceed must be found. The problem is caused by the sparsity and the randomness of the particles' memory accesses: their location in the grid is not known a priori, and this makes it impossible to know which threads will try to sum into the same grid cell, creating a conflict. On the other hand, if the particles are spatially organized or even sorted, ordered-sum algorithms can be written.
Our approach consists in these steps:
1. keep the particles sorted by their central cell index (or a hash of it), using the efficient parallel sort algorithm coming with the thrust library [17],
2. find the subsets of particles with the same index, which is very easy since they are already sorted,
3. finally, in parallel, compute the sum for each subset (cell), looping over its particles, and write the sum to global memory.
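A sketch of steps 1-3 using the thrust primitives mentioned above is given below; it handles the simplest case of one scalar contribution per particle and uses a segmented reduction (reduce_by_key) for brevity, whereas the actual code performs the per-cell sums in its own kernel and repeats them per shape factor.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Deposit one scalar contribution per particle into its cell, without atomics:
// 1) sort the particle charges by cell index, 2)-3) sum the contributions of particles
// sharing the same cell with a segmented reduction.
void sorted_deposit(thrust::device_vector<int>&    cell_index,   // one entry per particle
                    thrust::device_vector<double>& charge,       // one entry per particle
                    thrust::device_vector<int>&    out_cells,    // unique cell indices (output)
                    thrust::device_vector<double>& out_density)  // summed charge per cell (output)
{
    // Step 1: sort particles (here, just their charges) by cell index.
    thrust::sort_by_key(cell_index.begin(), cell_index.end(), charge.begin());

    // Steps 2-3: contiguous runs of equal keys are reduced into one sum per cell.
    out_cells.resize(cell_index.size());
    out_density.resize(cell_index.size());
    auto ends = thrust::reduce_by_key(cell_index.begin(), cell_index.end(), charge.begin(),
                                      out_cells.begin(), out_density.begin());
    out_cells.resize(ends.first - out_cells.begin());
    out_density.resize(ends.second - out_density.begin());
}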
All these operations can be done in parallel by the GPU threads and none of them requires memory locking or other forms of synchronization.
Since every particle, being finite sized, contributes to the density of a few grid points (the central one and its neighbors), step 3 must be repeated once per shape factor. An alternative is to nest the particle-in-a-cell summation loop inside an outer loop (in the same GPU kernel) over the shape factors. For consistency reasons, for every shape factor iteration, after writing the sum to global memory, a thread fence ([23]) synchronization is needed.
In the case of staggered density grids, as is the case for the current density field J in the FDTD scheme, the particles' central cell indices are different for each component grid (Jx, Jy, Jz). Naively, this would require repeating the particle sort for each grid component. A more elegant and much more efficient solution is to keep the particles sorted not only by their cell index, but also by their subindex, defined as their index in an un-staggered grid having half the grid step. Since the relative offset of the component grids is ±1/2, the subindex unambiguously identifies the central point in all of the Jx, Jy, Jz grids.
4.3 Hybrid parallelization
Large 3D PIC simulations cannot be handled by a single graphics processing unit, since the memory requirements are far too demanding. Parallelizing the code across multiple GPU nodes using domain decomposition and message passing solves this issue and also gives the possibility to exploit all the computational power of hybrid computing clusters.
Such clusters are usually made of networked nodes, each containing two devices: a GPU board and a multicore processor.
4.3.1 Inter-node & intra-node parallelization
In a CPU-only cluster, the classical MPI-only parallelization distributes each subdomain to a single scalar core. This strategy has the advantage of being normally very efficient, except for the cases in which the small subdomains become poorly balanced in computational load (as happens in our simulations: the particle density accumulates in certain zones and leaves others empty).
The use of GPUs introduces a new, intra-node level of parallelism. Here, the subdomains are taken to be larger and can, in theory, be assigned for processing to a GPU, to a CPU core or to an entire CPU (using OpenMP). However, this strategy is not particularly more efficient than using only the GPUs, because of their overwhelming computational power.
A better idea, which also allows solving a memory requirement problem (as explained below), is to assign each subdomain to a physical node, instead of assigning it to a single core or a single node device.
The subdomain's computational work is therefore shared among the node's resources, using shared memory parallelization. In our case the work is shared by the CPU and the GPU, using the OpenMP and CUDA parallelization paradigms, as shown in the following figure:
Because a PIC code spends most of its time pushing particles and depositing their densities (there are many particles for each grid cell), a convenient way to implement this CPU/GPU work sharing is to stream the ‘particle computation’ from one device to the other within a physical node (as seen in the figure below). Since the two devices have separate memories, one of them must be chosen to keep the particle data, which are streamed to the other device through the PCI Express bus, as summarized in the following figure:
This strategy can prove necessary in two situations:
1. if the total GPU memory on the cluster is not large enough to hold the state of all
the particles, they can be kept in the larger host memory and streamed to the
graphics processing unit (and, in the meanwhile, to the CPU cores);
2. if one does not want to waste the computational power of the CPUs on the physical
node, which is smaller than, but not completely negligible compared to, that of the GPU.
In the second situation in particular, the overhead of the CPU/GPU memory copies (which
pass through the slow PCI Express bus) should be kept as small as possible. Therefore, in
order to minimize the transfers, it is convenient to store the particles in the memory
of the fastest device, the GPU.
Furthermore, the overhead can be hidden by putting the three operations (transfer in, com-
putation and transfer out) in a pipeline, using CUDA streams. The pipelining is efficient
because every stage is processed by an independent unit: the Fermi boards found in GPU
clusters have two parallel DMA engines for memory copies.
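
A minimal sketch of such a pipeline, assuming pinned host memory, two device buffers and two CUDA streams (all identifiers and the trivial kernel are illustrative only), could look as follows:

#include <cuda_runtime.h>

__global__ void push_chunk(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;                      // placeholder for the particle push
}

// host_particles must be pinned (cudaHostAlloc) for the copies to overlap;
// dev_buf[0] and dev_buf[1] are device buffers of chunk_size elements each.
void stream_particles(float *host_particles, float *dev_buf[2],
                      int n_chunks, int chunk_size, cudaStream_t stream[2])
{
    for (int i = 0; i < n_chunks; ++i) {
        int s = i % 2;                                       // alternate the two streams
        float *chunk = host_particles + (size_t)i * chunk_size;

        cudaMemcpyAsync(dev_buf[s], chunk, chunk_size * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);  // transfer in
        push_chunk<<<(chunk_size + 255) / 256, 256, 0, stream[s]>>>(dev_buf[s],
                                                                    chunk_size); // compute
        cudaMemcpyAsync(chunk, dev_buf[s], chunk_size * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);  // transfer out
    }
    cudaDeviceSynchronize();                                 // wait for the whole pipeline
}
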
The performance benchmark tests (sect. 4.4) show that this strategy makes it possible to
run larger simulations on memory-limited GPU clusters.
4.3.2 Inter-node communication
As far as the communication among neighboring nodes is concerned, for exchanging field
borders and crossing particles (see sect. 3.3.2), standard MPI techniques have been used in
jasmine.
On normal clusters, the GPUs belonging to different nodes can communicate only by passing
through the node host (CPU) memory. The main bottleneck resides in the bus connecting
these two memories, which, having a bandwidth of 4 GB/s, can be slower than the theoretical
network communication speed.
Therefore, we designed jasmine to let only the data that actually need to be moved across
the network pass through this bus. This means that the GPU has to independently build
the list of particles crossing the subdomain boundaries and copy the field borders to/from
send/receive buffers.
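
As an illustration, the crossing-particle list can be built directly on the GPU with a selection primitive such as thrust::copy_if from the Thrust library [17]; the following sketch filters on a single coordinate and uses hypothetical identifiers, so it should be read as a schematic example rather than jasmine's actual code.

#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Predicate selecting the particles whose x coordinate left the local subdomain.
struct crosses_boundary
{
    float x_lo, x_hi;
    __host__ __device__ bool operator()(float x) const
    {
        return x < x_lo || x >= x_hi;
    }
};

// Build the send buffer on the GPU; only this (small) buffer is then copied
// to host memory and handed to MPI.
thrust::device_vector<float> build_send_buffer(const thrust::device_vector<float> &x,
                                               float x_lo, float x_hi)
{
    thrust::device_vector<float> out(x.size());
    crosses_boundary pred = {x_lo, x_hi};
    size_t n = thrust::copy_if(x.begin(), x.end(), out.begin(), pred) - out.begin();
    out.resize(n);
    return out;
}
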
4.4 Performance benchmarks
Jasmine validation and performance benchmark tests were run on the GPUs of the PLX
machine, at CINECA.
The performance benchmarks consisted in simulating the nonlinear plasma wakefield
generated by a laser pulse, using the standard second-order fully electromagnetic PIC
scheme with staggered grids and particles with quadratic shape functions. The electrons
in the simulation were in the relativistic regime, and the number of (quasi-)particles per
cell was 64 for the 2D version and 8 for the 3D one. All the computation was done in
double precision floating point.
Four parallelization configurations were tested: the single-GPU version, the multiple-GPU
(8 MPI nodes) one, the GPU version using host memory particle streaming (sect. 4.3.1),
and the scalar CPU version. The results are summarized in the following table:
Configuration    Dimensionality   Time per particle (ns)   Speedup vs. 1 CPU core
Single GPU       2D (64 ppc)      22                       52x
MPI Multi GPU    2D (64 ppc)      29                       39x
Streaming        2D (64 ppc)      26.4                     43x
1 CPU core       2D (64 ppc)      1150                     1x
Single GPU       3D (8 ppc)       109                      38x
MPI Multi GPU    3D (8 ppc)       118                      34x
Streaming        3D (8 ppc)       115                      36x
1 CPU core       3D (8 ppc)       4104                     1x
Conclusions
Numerical modeling is a fundamental tool for studying the nonlinear physics of laser-plasma
interactions, whose comprehension is essential for developing new-generation, compact
particle accelerators. Such simulations are computationally very demanding and, in order
to be run, require optimization through both numerical scheme improvements and
parallelization.
It has been shown that the parallelization of the partially implicit INF&RNO codes
(fluid and PIC) is possible and efficient, once the recursive dependency in the implicit
laser wakefield integrator is removed with the iterative method presented here, whose
convergence is motivated by a physical consideration. The code is able to scale up to
thousands of cores for a typically sized problem, providing substantial speedup factors
(about two to three orders of magnitude) compared to the serial version.
This will certainly have an impact on the design of the next generation of laser-plasma
accelerators at LBNL, as the parallelization has made previously unfeasible simulations
possible and affordable in time, while preserving their accuracy.
The features of the jasmine PIC framework have been outlined, and its performance
benchmark results have shown that the GPU architecture can provide great speedups for
PIC simulations, the CUDA-parallelized version being up to 50 times faster (on a single
GPU board, running double precision simulations with quadratic particle shape functions)
than our serial CPU implementation. In particular, the solution found for correctly
implementing the concurrent current deposition algorithm has proven to be very robust.
The code was benchmarked by checking the correct simulation of some analytically treatable
phenomena, and runs in various acceleration regimes were compared with those obtained
with ALaDyn.
Its second-level, multi-GPU, inter-node parallelization makes it possible to exploit the
great computational power of large-scale HPC GPU clusters, providing better scaling and
overall efficiency than traditional CPU clusters. This is particularly important because
GPU clusters will very likely keep growing in computational power faster than conventional
ones.
The GPU memory limitation, which could represent a limit for running large-scale
simulations, has been tackled using a concurrent streaming technique that allows storing
the PIC particles in the larger CPU main memory, from which they are progressively
streamed to the GPU for computation. The overhead of these memory transfers, performed
concurrently with the computation, has been shown to be negligible for 3D simulations.
The components (building blocks) of jasmine have proven to be very reusable and allowed,
for example, rewriting some of the INF&RNO PIC and fluid algorithms for the hybrid
GPU+CPU architecture in a very short time.
At the time of writing this thesis, no published article had claimed the development of
an electromagnetic PIC code that can run on hybrid clusters using the main CPU memory
efficiently, or that can run multi-GPU full-PIC 3D simulations, or that is adaptable to
various geometries and numerical schemes, as jasmine is.
Bibliography
[1] GeV electron beams from a centimeter-scale channel guided laser wakefield acceler-
ator. 14(5):056708, 2007.
[2] Theory and simulation of ion acceleration with circularly polarized laser pulses.
Comptes Rendus Physique, 10(2-3):207–215, 2009.
[3] Roberto Ammendola, Andrea Biagioni, Ottorino Frezza, Francesca Lo Cicero,
Alessandro Lonardo, Pier Stanislao Paolucci, Davide Rossetti, Andrea Salamon,
Gaetano Salina, Francesco Simula, Laura Tosoratto, and Piero Vicini. APEnet+:
high bandwidth 3D torus direct network for petaflops scale commodity clusters,
2011.
[4] C. Benedetti, C. B. Schroeder, E. Esarey, C. G. R. Geddes, and W. P. Leemans.
Efficient Modeling of Laser-Plasma Accelerators with INFERNO. In S. H. Gold &
G. S. Nusinovich, editor, American Institute of Physics Conference Series, volume
1299 of American Institute of Physics Conference Series, pages 250–255, November
2010.
[5] C. Benedetti, A. Sgattoni, G. Turchetti, and P. Londrillo. ALaDyn: A High-
Accuracy PIC Code for the Maxwell-Vlasov Equations. IEEE Transactions on
Plasma Science, 36(4), 2008.
[6] CK Birdsall and AB Langdon. Plasma physics via computer simulation. Taylor &
Francis Group, 2004.
[7] JA Bittencourt. Fundamentals of plasma physics. Springer, 2004.
[8] C. Benedetti, E. Esarey, C. B. Schroeder, and W. Leemans. Modeling a 10 GeV
laser-plasma accelerator with INF&RNO.
[9] Benjamin M. Cowan, David L. Bruhwiler, Estelle Cormier-Michel, Eric Esarey,
Cameron G. R. Geddes, Peter Messmer, and Kevin M. Paul. Characteristics of an
envelope model for laser-plasma accelerator simulation. J. Comput. Phys., 230:61–
86, January 2011.
[10] William Dally. The End of Denial Architecture and the Rise of Throughput Com-
puting. In GPUs: current state and evolution, INFN CNAF meeting.
[11] E. Cormier-Michel, C. G. R. Geddes, E. Esarey, C. B. Schroeder, Cs. Toth, D. L.
Bruhwiler, K. Paul, B. Cowan, and W. P. Leemans. Scaled simulations of a 10 GeV
accelerator.
[12] E. Esarey, C. B. Schroeder, and W. P. Leemans. Physics of laser-driven plasma-based
electron accelerators. Rev. Mod. Phys., 81:1229–1285, Aug 2009.
[13] T. Esirkepov. Exact charge conservation scheme for particle-in-cell simulation with
an arbitrary form-factor. Computer Physics Communications, 135(2):144–153,
2001.
[14] S. Gordienko and A. Pukhov. Scalings for ultrarelativistic laser plasmas and quasi-
monoenergetic electrons. 12(4):043109, 2005.
[15] M. Harris, J. Owens, S. Sengupta, Y. Zhang, and A. Davidson. CUDPP: CUDA
data parallel primitives library, 2007.
[16] R.W. Hockney and J.W. Eastwood. Computer simulation using particles. Institute
of Physics Publishing, 1988.
[17] Jared Hoberock and Nathan Bell. Thrust: a template library for CUDA applications.
[18] W.L. Kruer. The physics of laser plasma interactions. Westview Press, 2003.
[19] Jens Krüger and Rüdiger Westermann. Acceleration Techniques for GPU-based
Volume Rendering. In Proceedings IEEE Visualization 2003, 2003.
[20] W. P. Leemans, B. Nagler, A. J. Gonsalves, Cs. Toth, K. Nakamura, C. G. R. Geddes,
E. Esarey, C. B. Schroeder, and S. M. Hooker. GeV electron beams from a
centimetre-scale accelerator. Nature Physics, 2(10):696–699, 2006.
[21] P. Londrillo, C. Benedetti, A. Sgattoni, and G. Turchetti. Charge preserving high
order PIC schemes. Nuclear Instruments and Methods in Physics Research Section
A: Accelerators, Spectrometers, Detectors and Associated Equipment, 620(1):28–35,
2010. COULOMB09: Ion Acceleration with High Power Lasers: Physics and
Applications.
[22] V. Malka, J. Faure, Y. Glinec, A. Pukhov, and J.-P. Rousseau. Monoenergetic
electron beam optimization in the bubble regime. 12(5):056702, 2005.
[23] NVIDIA Corporation. Compute Unified Device Architecture Programming Guide.
NVIDIA, 2009.
[24] Matteo Passoni, Luca Bertagna, and Alessandro Zani. Target normal sheath accel-
eration: theory, comparison with experiments and future perspectives. New Journal
of Physics, 12(4):045012, 2010.
[25] A P L Robinson, M Zepf, S Kar, R G Evans, and C Bellei. Radiation pressure
acceleration of thin foils with circularly polarized laser pulses. New Journal of
Physics, 10(1):013021, 2008.
[26] S. Sengupta, M. Harris, Y. Zhang, and J.D. Owens. Scan primitives for GPU com-
puting. In Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS sympo-
sium on Graphics hardware, pages 97–106. Eurographics Association Aire-la-Ville,
Switzerland, Switzerland, 2007.
[27] A. Sgattoni, C. Benedetti, P. Londrillo, and G. Turchetti. Simulation of the laser-
plasma acceleration for the PlasmonX project with the PIC code ALaDyn. Radiation
Effects and Defects in Solids, 165(6-10):803–808, 2010.
[28] Igor V. Sokolov. High-order field interpolation in a charge-conserving numerical
scheme for particle-in-cell simulations. Technical Report arXiv:1101.0809, Jan 2011.
[29] P. Sprangle, E. Esarey, and A. Ting. Nonlinear interaction of intense laser pulses in
plasmas. Phys. Rev. A, 41:4463–4469, Apr 1990.
[30] G. Stantchev, W. Dorland, and N. Gumerov. Fast parallel Particle-To-Grid interpo-
lation for plasma PIC simulations on the GPU. Journal of Parallel and Distributed
Computing, 68(10):1339–1349, 2008.
[31] S. Stegmaier, M. Strengert, T. Klein, and T. Ertl. A simple and flexible volume ren-
dering framework for graphics-hardware-based raycasting. International Workshop
on Volume Graphics, 0:187–241, 2005.
[32] T. Tajima and J. M. Dawson. Laser electron accelerator. Phys. Rev. Lett., 43:267–
270, Jul 1979.
[33] Kane S. Yee. Numerical solution of initial boundary value problems involving
Maxwell's equations in isotropic media. IEEE Trans. Antennas and Propagation,
pages 302–307, 1966.