Alma Mater Studiorum · Università di Bologna
Faculty of Mathematical, Physical and Natural Sciences
Degree Course in Physics
Development of algorithms
for an electromagnetic particle in cell code
and implementation on a hybrid
architecture (CPU+GPU)
Master's Thesis in Physics
Supervisor:
Prof. Giorgio Turchetti
Presented by:
Francesco Rossi
Session II
Academic Year 2010-2011
Contents
Acknowledgements 1
Introduction 2
Summary 5
1 The Physics of plasma based laser accelerators 6
1.1 Basic plasma physics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.1 Debye Length and plasma parameter . . . . . . . . . . . . . . . . . 7
1.1.2 Kinetic and fluid plasma descriptions . . . . . . . . . . . . . . . . . 9
1.1.3 Collective motion: plasma electron waves . . . . . . . . . . . . . . 11
1.1.4 The collision frequency . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1.5 Collisionless limit and the plasma parameter Λ . . . . . . . . . . . 12
1.1.6 Electromagnetic waves propagation in a cold plasma . . . . . . . . 14
1.2 Laser plasma acceleration physical concepts and regimes . . . . . . . . . . 16
1.2.1 Electron acceleration: LWFA . . . . . . . . . . . . . . . . . . . . . 16
1.2.2 Electron acceleration: bubble regime . . . . . . . . . . . . . . . . . 18
1.2.3 Ion acceleration: TNSA and RPA . . . . . . . . . . . . . . . . . . . 21
2 The numerical modelization 27
2.1 The particle-in-cell method . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.1 Phase space representation . . . . . . . . . . . . . . . . . . . . . . 27
2.1.2 Passes of an electromagnetic PIC code and numerical parameters
of a laser plasma interaction simulation . . . . . . . . . . . . . . . 30
2.1.3 Interpolation and deposition using shape functions . . . . . . . . . 31
2.1.3.1 Force interpolation . . . . . . . . . . . . . . . . . . . . . . 31
2.1.3.2 Charge and current deposition . . . . . . . . . . . . . . . 32
2.1.3.3 Common shapefunctions and shapefactors . . . . . . . . . 32
2.2 The “standard” second-order PIC: leapfrog and FDTD . . . . . . . . . . . 38
2.2.1 Solving the Maxwell equations numerically using the Yee Lattice . 38
2.2.2 Boris pusher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.2.3 Bringing it all together . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3 Charge conservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.3.1 Spectral Poisson corrector . . . . . . . . . . . . . . . . . . . . . . . 45
2.3.2 Esirkepov Shape functions . . . . . . . . . . . . . . . . . . . . . . . 47
2.4 Envelope model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.4.1 Laser envelope equation derivation . . . . . . . . . . . . . . . . . . 50
2.4.2 Plasma motion and wakefield equations in cylindrical comoving
coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3 INF&RNO 54
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 Numerical Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1 Plasma modelization . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1.1 Fluid plasma model . . . . . . . . . . . . . . . . . . . . . 56
3.2.1.2 PIC plasma model . . . . . . . . . . . . . . . . . . . . . . 56
3.2.1.3 Numerical choices . . . . . . . . . . . . . . . . . . . . . . 56
3.3 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.1 Shared memory (OpenMP) . . . . . . . . . . . . . . . . . . . . . . 60
3.3.2 Message-Passing (MPI) . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.2.1 1D domain decomposition . . . . . . . . . . . . . . . . . . 62
3.3.2.2 2D domain decomposition . . . . . . . . . . . . . . . . . . 64
3.3.3 GPU parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.4 3D Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4 Jasmine: a flexible, hybrid (CPU+GPU), PIC framework 70
4.1 Structure of the framework . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 GPU Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.1 GPU architecture and CUDA programming model . . . . . . . . . 72
4.2.1.1 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.1.2 Parallelism and memory model (hierarchy) . . . . . . . . 74
4.2.1.3 Programming GPUs: CUDA libraries and our multi ar-
chitecture approach . . . . . . . . . . . . . . . . . . . . . 76
4.2.2 Our approach for writing cross architecture code . . . . . . . . . . 78
4.2.3 Current and density deposition algorithm . . . . . . . . . . . . . . 79
4.3 Hybrid parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.1 Inter-node & intra-node parallelization . . . . . . . . . . . . . . . . 80
4.3.2 Inter-node communication . . . . . . . . . . . . . . . . . . . . . . . 83
4.4 Performance benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Conclusions 85
Acknowledgements
For giving me the opportunity to carry out the work in this thesis, I would like to thank
my supervisor, Prof. Giorgio Turchetti, a point of reference throughout my studies.
Particular thanks go to Dr. Carlo Benedetti, who introduced and guided me,
with contagious enthusiasm, through this field of research and gave me the opportunity to
collaborate, for a while, with the LOASIS group at Lawrence Berkeley National Lab, one
of the centers of excellence in the field. I also thank him personally for the great time he made
me spend in Berkeley.
During this period, it has always been a pleasure to work with both the groups in
Bologna and in Berkeley, and I want to give special thanks to all the people I have been
working with, whose advice and support have always been fundamental and
encouraging. It was a pleasure to meet my supervisor in Berkeley, Carl Schroeder, and to
collaborate with Andrea Sgattoni and Pasquale Londrillo, who helped me with incredible
generosity.
It has been an honor and incredibly motivating to work for experts of the likes of Wim P.
Leemans, Eric Esarey and Cameron Geddes. It has been a pleasure to be in contact with
Professor Graziano Servizi, Nicholas Matlis, Daniel Bakker, Lule Yu and Joshua Renner,
who gave me the chance for very fruitful discussions, though not always centered on
the theme of this thesis.
I thank CINECA, and in particular Carlo Cavazzoni and Riccardo Brunino, for showing
great interest in my work and giving me the possibility to benchmark my codes on their
new machine PLX.
Sofia, Emma and Rolando have always been next to me, and the support they have always
given me is invaluable.
I acknowledge support by the INFN, the “Collegio Superiore dell’Università di Bologna”
and by the Department of Energy under the Office of Science contract No. DE-AC02-
05CH11231.
Introduction
The development of laser-plasma accelerators was inspired by the work of Tajima and
Dawson [32, 1979] and was boosted by the rapid development of high-intensity laser
systems, which make it possible to explore interactions that drive electrons up to ultra-relativistic
velocities. The particle beams generated in such accelerators are expected to find
application in a wide range of contexts, including high energy physics, proton therapy for the
treatment of cancer and the generation of intense X-ray radiation.
Plasma-based particle accelerators are of great interest because of their capability to sustain
extremely high acceleration gradients. The accelerating electric field in a conventional,
radio-frequency, linear accelerator (linac) is limited, for structure-breakdown reasons, to
approximately 100 MV/m. An ionized plasma (with electron number density n0), instead,
can sustain electron oscillations that generate electric wakefields that can exceed the
non-relativistic wave-breaking limit:
$$ E_0(\mathrm{V/m}) \simeq 96\,\sqrt{n_0(\mathrm{cm^{-3}})} \qquad (0.1) $$
This limit can be several orders of magnitude greater than the limit in a conventional
linac: for example, n0 = 10^18 cm^-3 yields E0 = 100 GV/m. Wakefields of this order,
generated by the plasma electron waves, are able to accelerate electrons, and static fields
can be used to accelerate protons and ions.
New laser technologies are able to provide high-power (PW scale) and ultra-short (~10 fs) pulses.
These features make lasers ideal acceleration drivers, as they can be used
to excite these plasma electron waves in quasi-resonance conditions.
Experiments have so far demonstrated the acceleration of ultra-short, monochromatic
electron bunches up to 1 GeV ([1]) and ion bunches up to ~100 MeV. Electron acceleration
up to 10 GeV in ~1 m of plasma is currently under development at LBNL by the
LOASIS group ([8],[11]).
The theoretical study of laser-plasma interaction phenomena requires solving systems of
nonlinear partial differential equations which, in general, cannot be solved analytically.
Numerical modeling and computer simulation codes are therefore fundamental tools for
understanding the physics of laser-plasma acceleration and for supporting the design
of experiments. The use of simulations, in particular of so-called particle-in-cell
codes, has made it possible to discover new physical regimes (such as the bubble regime) and has
had, in general, a very significant impact on designing, optimizing and understanding
laser-plasma accelerators.
These simulations are usually very demanding in terms of required CPU
time and memory. In fact, full dimensionality (3D) models are required to describe
the dynamics correctly, and many physical length scales, which can differ by several
orders of magnitude, have to be resolved.
Given the crucial importance of simulations in this field, the effort to make them more
accurate and faster (which in most cases means making them feasible at all) is definitely worthwhile.
These objectives can be achieved via efficient code parallelization and improvements
of the numerical schemes.
Since 2007, our group in Bologna has been developing an MPI-parallel particle-in-cell
code, ALaDyn ([5], an acronym for Acceleration by Laser and Dynamics of charged particles),
featuring high-order integration (Runge-Kutta 4) and derivative schemes, which
allowed the computational requirements to be reduced. It has been used for modeling the INFN
laser-plasma acceleration experiments LILIA (proton acceleration) and PLASMON-X
(electron acceleration).
Furthermore, customized codes for specific acceleration regimes have been developed.
The INF&RNO code framework, developed by Carlo Benedetti at Lawrence Berkeley
National Laboratory, is designed to simulate laser-plasma electron accelerators working
in the underdense plasma regime, which, in certain cases, allows the use of the laser envelope
approximation and of the boosted Lorentz frame.
In this thesis I will discuss the parallelization of the INF&RNO code - which required
finding some original and physically guided solutions to resolve recursive dependencies
in the numerical scheme - and my implementation (in modern C++) of a generic particle-in-cell
code framework, named jasmine, in which I have implemented some of the
schemes of ALaDyn and INF&RNO, capable of running full 3D PIC simulations on
(general-purpose) graphics processing unit (GP)GPU clusters.
The GPGPU architecture, developed in recent years, represents a very good opportunity
for efficient parallelization. It implements parallelism at the chip level and
provides a parallel-efficient memory hierarchy, hundreds of lightweight cores (providing
more floating-point computational power) and more memory bandwidth than a
conventional processor (CPU).
Many of the world's largest high performance computing clusters (4 of the worldwide
top ten at the time of writing) already have most of their computational power in
their GPU nodes, such as the new PLX machine at CINECA, which was used for the jasmine
benchmarks.
Furthermore, in recent years the development of manycore, shared-memory architectures
has been the only way to increase processors' computational power; they are therefore
likely to grow in importance in the future. Parallelizing for these architectures
is not trivial: for a particle-in-cell code, some algorithms (such as density deposition)
have to be rewritten using original approaches, completely different from their serial
implementations.
My work on developing a GPU version of an electromagnetic particle-in-cell code
gives us the possibility to run simulations exploiting the efficiency and power of
such clusters, with great speedups: the GPU code, on a single GPU,
is up to 50x faster than the CPU version on a single CPU core. Furthermore, it makes
our simulations ready for the (likely) upcoming manycore era, in which CPU and
GPU architectures will continue to converge.
At the time of writing this thesis, no published article has claimed the development
of an electromagnetic PIC code that can run on hybrid clusters using the main CPU
memory efficiently, or that can run multi-GPU full-PIC 3D simulations, or that is as easily
adaptable to various geometries and numerical schemes as jasmine.
Summary
In this thesis I will discuss:
Chapter 1 The physical background of laser-plasma interactions, from the analytical
theory, which derives the fluid model and explains linear oscillation modes and
plasma parameters, to the nonlinear laser-plasma acceleration regimes, illustrated
with some simulations run with my code jasmine (chapter 4).
Chapter 2 The numerical methods for the integration of the Maxwell-Vlasov equations:
the particle-in-cell method, the numerical integration of the Maxwell equations in
time, the standard second-order PIC scheme, the charge conservation issues and
the laser envelope approximation.
Chapter 3 My work on parallelizing INF&RNO and adding 3D visualization to it; INF&RNO is an
efficient code, developed at LBNL, designed for modeling laser interactions with underdense
plasmas.
It assumes cylindrical symmetry, allows the simulations to be run in a reference frame
that balances the length scales (the so-called boosted Lorentz frame) and uses
the laser envelope model, integrated implicitly. Given these unique features of
the code, its parallelization required finding some original and physically guided
solutions to resolve recursive dependencies in the numerical scheme (which
would otherwise make parallelization impossible).
Chapter 4 Jasmine, my particle-in-cell code and framework, designed to be as flexible
and reusable as possible, in order to reduce to a minimum the code rewrites needed
to develop different schemes.
It is parallelized for running 1D/2D/3D simulations on CUDA graphics processing
units and on hybrid (CPU+GPU nodes) HPC clusters, achieving very promising
performance results: the (1 GPU)/(1 CPU core) speedup can be of the order of 50x. A
standard, second-order PIC scheme and the INF&RNO scheme were implemented
in the framework. At the time of writing this thesis, no published article has
claimed the development of a full electromagnetic PIC code with such features.
1 The Physics of plasma based laser
accelerators
1.1 Basic plasma physics
A plasma is basically a fully ionized gas. The free charges in a plasma make it very
responsive to electromagnetic perturbations. Plasmas exist in various forms, which can differ
in density, temperature or size by many orders of magnitude; however,
their properties and behaviors scale remarkably well across these differences.
Dynamically, a plasma is a statistically relevant number of charged particles, interacting
with and generating electromagnetic fields. In principle, the dynamics of a plasma is fully
determined by noting that the force acting on each relativistic particle is the Lorentz
force and that the evolution of the electromagnetic fields is governed by the Maxwell equations. In
CGS units ($\mathbf{x}_i$, $\mathbf{p}_i = m_i\gamma_i\mathbf{v}_i$ are the position and momentum of the $i$-th particle):
$$ \dot{\mathbf{x}}_i = \mathbf{v}_i, \qquad \dot{\mathbf{p}}_i = q_i\left(\mathbf{E}(\mathbf{x}_i) + \frac{\mathbf{v}_i\times\mathbf{B}(\mathbf{x}_i)}{c}\right) \qquad (1.1) $$
$$ \nabla\cdot\mathbf{B} = 0, \quad \nabla\cdot\mathbf{E} = 4\pi\rho, \quad \nabla\times\mathbf{B} - \frac{1}{c}\frac{\partial\mathbf{E}}{\partial t} = \frac{4\pi}{c}\,\mathbf{j}, \quad \nabla\times\mathbf{E} + \frac{1}{c}\frac{\partial\mathbf{B}}{\partial t} = 0 \qquad (1.2) $$
In the latter, the sources ρ and j are calculated starting from the particles' phase-space
distribution without performing any spatial average: this guarantees that binary collisions
are included in the model.
This approach is very impractical, both for numerical simulations (the
number of particles involved is not realistically computable) and for analytical theory
(the nonlinear partial differential equations describing the physics are not solvable ana-
lytically).
A cascade of physical considerations and approximations leads to a hierarchy of simplified
models.
The first step can be to neglect the collisional effects in the kinetic, phase space density
plasma description. The validity of the collisionless model is evaluated considering the
Debye length and its relations with other plasma parameters.
The collisionless model of the plasma can be further simplified to a fluid model if the
phase-space distribution can be considered single-valued at each point in space:
for each position in space the velocity is defined uniquely. The fluid model can be
used to describe various plasma oscillation modes, linear and nonlinear, but it cannot
include wavebreaking phenomena, where different particles at the same point in
configuration space have different velocities.
1.1.1 Debye Length and plasma parameter
We consider a hydrogen-like, fully ionized plasma, in which we label the electron and
ion densities ne and ni respectively.
The plasma is near to thermal equilibrium (at temperature T) and, in the unperturbed
state, we consider ne = ni = n0 everywhere.
If we perturb the system by adding a discrete point charge Q(> 0), we see that it will
attract the electrons and repel the ions, making a cloud of net negative charge that
shields the point charge’s electric field and potential. The electrons are prevented from
collapsing onto the point charge (to fully neutralize it) by their thermal energy.
We can estimate the magnitude of this shielding effect. The electrostatic potential affects
the distribution function of the electrons and of the other species through the Boltzmann factor
$e^{-E/kT}$:
$$ f_j(\mathbf{r},\mathbf{v}) \propto \exp\!\left(-\frac{m v^2}{2kT_j}\right)\exp\!\left(-\frac{q_j\Phi}{kT_j}\right) \qquad (1.3) $$
We can then write the density of each species as:
$$ n_j(\mathbf{r}) = \int f_j(\mathbf{r},\mathbf{v})\, d^3v = n_{0j}\,\exp\!\left(-\frac{q_j\Phi}{kT_j}\right) \qquad (1.4) $$
We assume that:
• $q_j\Phi \ll kT_j$ for each species $j$, so we can expand $n_j$ as:
$$ n_j = n_{0j}\left(1 - \frac{q_j\Phi}{kT_j}\right) \qquad (1.5) $$
In CGS, the Poisson equation for the electrostatic potential Φ reads:
$$ \nabla^2\Phi = -4\pi\rho = -4\pi\sum_j q_j n_j - 4\pi Q\,\delta(\mathbf{r}) \qquad (1.6) $$
Substituting $n_j$ (and using the quasi-neutrality of the unperturbed state, $\sum_j q_j n_{0j} = 0$) we get:
$$ -\nabla^2\Phi = -4\pi\sum_j \left[\frac{q_j^2\, n_{0j}}{kT_j}\right]\Phi + 4\pi Q\,\delta(\mathbf{r}) \qquad (1.7) $$
Defining the Debye length as
$$ \lambda_D = \left(4\pi\sum_j \frac{q_j^2\, n_{0j}}{kT_j}\right)^{-1/2} \qquad (1.8) $$
and exploiting the spherical symmetry,
$$ -\frac{1}{r^2}\frac{\partial}{\partial r}\left(r^2\frac{\partial\Phi}{\partial r}\right) = -\frac{\Phi}{\lambda_D^2} + 4\pi Q\,\delta(\mathbf{r}) \qquad (1.9) $$
we can solve the equation for Φ, giving:
$$ \Phi(r) = \frac{Q}{r}\,\exp\!\left(-\frac{r}{\lambda_D}\right) \qquad (1.10) $$
Since $m_i \gg m_e$, it is reasonable to consider the ions as immobile in most cases,
especially on short time scales; this allows the Debye length to be rewritten as:
$$ \lambda_D = \sqrt{\frac{kT}{4\pi n_{0e} e^2}} \qquad (1.11) $$
We can interpret the Debye length as the length scale over which the plasma shields the
electrostatic potential generated by a single point charge.
The number of particles in a Debye sphere is called the plasma parameter Λ, whose
significance is explained in 1.1.5:
$$ \Lambda = \frac{4}{3}\, n\,\lambda_D^3 \;\left(= \frac{4}{3}\left(\frac{kT}{4\pi e^2}\right)^{3/2} n^{-1/2}\right) \qquad (1.12) $$
1.1.2 Kinetic and fluid plasma descriptions
The kinetic model of a plasma describes the state of the particles of the system by means
of a distribution function fj(x,p, t), which is just the density of particles (of species j)
in phase space ($\mathbf{x},\mathbf{p} = \gamma m\mathbf{v}$):
$$ dN_{(\text{particles in } d\mathbf{x}\, d\mathbf{p})} = f_j(\mathbf{x},\mathbf{p},t)\, d\mathbf{x}\, d\mathbf{p} \qquad (1.13) $$
Macroscopic (observable, fluid) quantities are obtained by averaging over momenta:
$$ n_j(\mathbf{x}) = \int f_j(\mathbf{x},\mathbf{p},t)\, d\mathbf{p} \quad \text{particle density} $$
$$ n_j\mathbf{u}_j(\mathbf{x}) = \int \mathbf{v}\, f_j(\mathbf{x},\mathbf{p},t)\, d\mathbf{p} \quad \text{mean velocity} $$
$$ [P_{kl}(\mathbf{x})]_j = m_j\int v_k v_l\, f_j\, d\mathbf{p} \quad \text{mean pressure} \qquad (1.14) $$
$$ \rho_j(\mathbf{x}) = q_j\int f_j(\mathbf{x},\mathbf{p},t)\, d\mathbf{p} \quad \text{charge density}, \qquad \mathbf{j}_j(\mathbf{x}) = q_j\int \mathbf{v}\, f_j(\mathbf{x},\mathbf{p},t)\, d\mathbf{p} \quad \text{current density} \qquad (1.15) $$
On time scales much shorter than the inverse collision frequency, the Boltzmann equation for $f_j$
reduces to the phase-space continuity equation:
$$ \frac{\partial f_j}{\partial t} + \frac{\partial}{\partial\mathbf{x}}\cdot(\dot{\mathbf{x}} f_j) + \frac{\partial}{\partial\mathbf{p}}\cdot(\dot{\mathbf{p}} f_j) = 0 \qquad (1.16) $$
which, once coupled with the equations of motion of the particles (Lorentz force),
$$ \dot{\mathbf{x}} = \mathbf{v}, \qquad \dot{\mathbf{p}} = q_j\left(\mathbf{E}(\mathbf{x}) + \frac{\mathbf{v}\times\mathbf{B}(\mathbf{x})}{c}\right) \qquad (1.17) $$
is named the Vlasov equation:
$$ \frac{\partial f_j}{\partial t} + \mathbf{v}\cdot\frac{\partial f_j}{\partial\mathbf{x}} + q_j\left(\mathbf{E} + \frac{\mathbf{v}\times\mathbf{B}}{c}\right)\cdot\frac{\partial f_j}{\partial\mathbf{p}} = 0 \qquad (1.18) $$
The electromagnetic fields are given by Maxwell equations, which close the system and
make it self-consistent:
$$ \nabla\cdot\mathbf{B} = 0, \quad \nabla\cdot\mathbf{E} = 4\pi\rho, \quad \nabla\times\mathbf{B} - \frac{1}{c}\frac{\partial\mathbf{E}}{\partial t} = \frac{4\pi}{c}\,\mathbf{j}, \quad \nabla\times\mathbf{E} + \frac{1}{c}\frac{\partial\mathbf{B}}{\partial t} = 0 \qquad (1.19) $$
In the non-relativistic limit, we can obtain the n-fluid model of a plasma (each species j is treated
as a separate fluid interacting with the others by means of the EM fields) by taking the
moments of the Vlasov equation. The first two are:
$$ \int d\mathbf{p}\left[\frac{\partial f_j}{\partial t} + \mathbf{v}\cdot\frac{\partial f_j}{\partial\mathbf{x}} + q_j\left(\mathbf{E} + \frac{\mathbf{v}\times\mathbf{B}}{c}\right)\cdot\frac{\partial f_j}{\partial\mathbf{p}}\right] = 0 \qquad (1.20) $$
$$ \int d\mathbf{p}\;\mathbf{p}\left[\frac{\partial f_j}{\partial t} + \mathbf{v}\cdot\frac{\partial f_j}{\partial\mathbf{x}} + q_j\left(\mathbf{E} + \frac{\mathbf{v}\times\mathbf{B}}{c}\right)\cdot\frac{\partial f_j}{\partial\mathbf{p}}\right] = 0 \qquad (1.21) $$
We obtain the spatial continuity equation for the particle density and a fluid equation
that describes the motion of the charged fluids:
$$ \frac{\partial n_j}{\partial t} + \frac{\partial}{\partial\mathbf{x}}\cdot(n_j\mathbf{u}_j) = 0 \qquad (1.22) $$
$$ n_j\left(\frac{\partial\mathbf{u}_j}{\partial t} + \mathbf{u}_j\cdot\frac{\partial\mathbf{u}_j}{\partial\mathbf{x}}\right) = \frac{n_j q_j}{m_j}\left(\mathbf{E} + \frac{\mathbf{u}_j\times\mathbf{B}}{c}\right) - \frac{1}{m_j}\frac{\partial}{\partial\mathbf{x}}\cdot[P_{kl}]_j $$
We have therefore reduced the 6-dimensional Vlasov equation to 3-dimensional fluid
equations.
Taking the moment of next (second) order would lead to an evolution equation for the
pressure tensor P. It is reasonable to truncate the hierarchy at these two equations if
further assumptions are made concerning the heat flux, which provide a thermodynamical
equation of state.
Let k and ω be, respectively, the typical wavenumber and frequency of the system, and
$v_{j,\mathrm{thermal}}$ the thermal velocity of the particles of the j-th species; two simplifying assumptions
can then be made:
• if $\omega/k \ll v_{j,\mathrm{thermal}}$, the heat transfer is so fast that the fluid can be considered
isothermal: $p_j = n_j\theta_j$;
• if $\omega/k \gg v_{j,\mathrm{thermal}}$, the heat flux is negligible, and we can use an adiabatic
equation for the pressure: $p_j/n_j^{\gamma} = \mathrm{constant}$.
1.1.3 Collective motion: plasma electron waves
Using fluid equations we can study a common form of collective motions in plasmas: the
charge and electrostatic field oscillation associated with the motion of the electrons.
We consider a 1D, initially uniform, non-relativistic, neutral plasma (ions with Z=1),
with massive ions that can be considered fixed on the timescale of our process. We consider
only electrostatic fields and neglect thermal effects.
Under these conditions the fluid equations simplify to:
$$ \frac{\partial n_e}{\partial t} + \frac{\partial}{\partial x}(n_e u_e) = 0 \qquad (1.23) $$
$$ \frac{\partial}{\partial t}(n_e u_e) + \frac{\partial}{\partial x}(n_e u_e^2) = -\frac{n_e e E}{m_e} \qquad (1.24) $$
The electrostatic field can be obtained directly from the Poisson equation:
$$ \frac{\partial E}{\partial x} = -4\pi e\,(n_e - n_{0,\mathrm{ions}}) \qquad (1.25) $$
Considering small variations (denoted by a tilde) of the particle density n, of the average
velocity u and of the electrostatic field E, and thus linearizing the equations, we get:
$$ n_e = n_0 + \tilde{n}, \qquad u_e = \tilde{u} \qquad (1.26) $$
$$ \frac{\partial\tilde{E}}{\partial x} = -4\pi e\,\tilde{n}, \qquad \frac{\partial^2\tilde{n}}{\partial t^2} = \frac{n_0 e}{m_e}\,\frac{\partial\tilde{E}}{\partial x} \qquad (1.27) $$
Substituting in the fluid equation we obtain a stationary wave equation:
$$ \left(\frac{\partial^2}{\partial t^2} + \omega_{pe}^2\right)\tilde{n} = 0 \qquad (1.28) $$
$$ \omega_{pe} = \sqrt{\frac{4\pi e^2 n_0}{m_e}} = 5.64\times 10^4\,\left(n_e\,[\mathrm{cm^{-3}}]\right)^{1/2} \qquad (1.29) $$
The last equation describes the oscillation of charge density with angular frequency ωpe,
the plasma electron frequency.
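A minimal C++ sketch (an illustrative example, not thesis code) evaluating ωpe from eq. 1.29 and the corresponding plasma wavelength λp = 2πc/ωpe, the λp used later to set the simulation parameters; the chosen density is an assumption of this example:

```cpp
// Minimal sketch (assumed example): plasma electron frequency and plasma
// wavelength lambda_p = 2*pi*c/omega_pe, CGS units.
#include <cmath>
#include <cstdio>

int main() {
    const double pi = 3.141592653589793;
    const double c  = 2.9979e10;                      // speed of light [cm/s]

    double ne       = 1.0e18;                         // electron density [cm^-3] (assumed)
    double omega_pe = 5.64e4 * std::sqrt(ne);         // eq. (1.29) [rad/s]
    double lambda_p = 2.0 * pi * c / omega_pe;        // [cm]

    std::printf("omega_pe = %.3e rad/s, lambda_p = %.3e cm\n", omega_pe, lambda_p);
    return 0;
}
```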
1.1.4 The collision frequency
In fully-ionized plasma binary particle interactions are mostly due to the Coulomb force.
We can take the order of magnitude of the range of this force as the order of magnitude
of the Debye length.
We consider a charged particle (mass m, charge $q_0$, velocity $v_0$) approaching a target
particle at rest, with mass $M \gg m$ and charge $q$.
The incoming particle cannot get closer to the target than the distance of minimum
approach δ:
$$ \frac{m v_0^2}{2} = \frac{q q_0}{\delta} \;\rightarrow\; \delta = \frac{2 q q_0}{m v_0^2} \qquad (1.30) $$
so that, when a particle "tries" to overlap the δ region, it is deflected at large angles.
For a population of particles of density n and mean velocity $v_0$ heading toward our target
particle, the large-angle scattering rate can be evaluated by computing the flux of particles
passing within a radius δ of the target (taking $q = q_0 = e$):
$$ \nu_c = (\pi\delta^2)\, n v_0 = \pi\,\frac{4 e^4}{m^2 v_0^4}\, n v_0 = \frac{4\pi e^4 n}{m^2 v_0^3} \;\propto\; T^{-3/2} \qquad (1.31) $$
1.1.5 Collisionless limit and the plasma parameter Λ
Recalling (CGS units):
$$ \nu_c = \frac{4\pi e^4 n}{m^2 v_0^3} \qquad (1.32) $$
$$ \lambda_D = \sqrt{\frac{kT}{4\pi n_0 e^2}} \qquad (1.33) $$
$$ \omega_{pe} = \sqrt{\frac{4\pi n_0 e^2}{m_e}} \qquad (1.34) $$
and
$$ v_{th} = \sqrt{\frac{kT}{m_e}} \qquad (1.35) $$
then:
$$ \omega_{pe} = \frac{v_{th}}{\lambda_D} = \frac{2\pi}{T_{pe}} \qquad (1.36) $$
A thermal electron travels about a Debye length in a plasma oscillation period. Just as
Debye length represents the electrostatic correlation length, so the plasma period plays
the role of the electrostatic correlation time.
We consider the ratio between the two frequencies that characterize the problem, $\omega_{pe}$
and $\nu_c$:
$$ \frac{\omega_{pe}}{\nu_c} = \left(\frac{4\pi n e^2}{m}\right)^{1/2}\left(\frac{m^2 v_0^3}{4\pi e^4 n}\right) \qquad (1.37) $$
$$ \frac{\omega_{pe}}{\nu_c} = 4\pi\Lambda \qquad (1.38) $$
It is apparent how Λ (the number of electrons in a Debye sphere) plays a key role
in evaluating the importance of collisions, connecting the collective-motion and
collision timescales.
The collisionless limit, in which the collision timescale is long compared to that of the collective
phenomena, can be expressed as (using the plasma parameter definition 1.12):
$$ \frac{\omega_{pe}}{\nu_c} = \frac{\tau_c}{\tau_{pe}} = 4\pi\Lambda \gg 1 \qquad (1.39) $$
1.1.6 Electromagnetic waves propagation in a cold plasma
From the non-relativistic fluid equations, neglecting pressure terms in a cold plasma, one
gets:
$$ \frac{\partial n_1}{\partial t} + n_0\,\nabla\cdot\mathbf{u}_1 = 0 $$
$$ m_e n_0\,\frac{\partial\mathbf{u}_1}{\partial t} = -e n_0\,\mathbf{E}_1 $$
$$ \nabla\times\mathbf{B}_1 = -\frac{4\pi e}{c}\, n_0\mathbf{u}_1 + \frac{1}{c}\frac{\partial\mathbf{E}_1}{\partial t} $$
$$ \nabla\times\mathbf{E}_1 = -\frac{1}{c}\frac{\partial\mathbf{B}_1}{\partial t} \qquad (1.40) $$
in which the suffix 0 indicates the equilibrium quantities and 1 the small fluctuations,
$n_e = n_0 + n_1$.
Expressing E in its plane-wave Fourier decomposition,
$$ \mathbf{E} = \sum_{\mathbf{k}} \mathbf{E}_{\mathbf{k}}\, \exp\!\left(i(\mathbf{k}\cdot\mathbf{r} - \omega t)\right) \qquad (1.41) $$
considering the first-order component in the series, and substituting into the second of the
previous equations, one gets $\mathbf{u}_1 = \frac{e}{i\omega m_e}\mathbf{E}_1$. Substituting the latter into the third equation
in 1.40, one obtains:
$$ \nabla\times\mathbf{B}_1 = -\frac{i\omega}{c}\left(1 - \frac{\omega_{pe}^2}{\omega^2}\right)\mathbf{E}_1 \qquad (1.42) $$
Furthermore, taking a time derivative and using the fourth equation ($\nabla\times\mathbf{E}_1 = \dots$), one
has:
$$ \frac{\omega^2}{c}\left(1 - \frac{\omega_{pe}^2}{\omega^2}\right)\mathbf{E}_1 = c\,\nabla\times(\nabla\times\mathbf{E}_1) \qquad (1.43) $$
or, using the plane-wave expression for $\mathbf{E}_1$ on the right-hand side:
$$ \mathbf{k}\times(\mathbf{k}\times\mathbf{E}_1) = -\frac{\omega^2}{c^2}\left(1 - \frac{\omega_{pe}^2}{\omega^2}\right)\mathbf{E}_1 \qquad (1.44) $$
Taking, for symmetry reasons, $\mathbf{k} = k\hat{\mathbf{z}}$, the equation can be recast in the form:
$$ \begin{pmatrix} \omega^2 - \omega_{pe}^2 - k^2c^2 & 0 & 0 \\ 0 & \omega^2 - \omega_{pe}^2 - k^2c^2 & 0 \\ 0 & 0 & \omega^2 - \omega_{pe}^2 \end{pmatrix} \begin{pmatrix} E_{1x} \\ E_{1y} \\ E_{1z} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \qquad (1.45) $$
The solutions of this system are the dispersion relations for the linear oscillation modes of the
electric field.
• A first solution (previously found) is for longitudinal electric field oscillations (plasma
electron waves): $E_{1x} = E_{1y} = 0$, $\;\omega^2 = \omega_{pe}^2$ (1.46)
• Another one describes transverse waves: $E_{1z} = 0$, $\;\omega^2 = \omega_{pe}^2 + k^2c^2$ (1.47)
Substituting k from the dispersion relation into the plane-wave expression, one gets:
$$ \mathbf{E} = \mathbf{E}_k\,\exp\!\left(i(kz - \omega t)\right) = \mathbf{E}_k\,\exp\!\left(i\,\frac{\sqrt{\omega^2 - \omega_{pe}^2}}{c}\, z\right)\exp\!\left(-i\omega t\right) \qquad (1.48) $$
If $\omega^2 - \omega_{pe}^2 < 0$, the spatial term becomes a damping term, with a characteristic length λ:
$$ \lambda = \frac{c}{\sqrt{\omega_{pe}^2 - \omega^2}} \qquad (1.49) $$
The plasma skin depth, defined in the $\omega \ll \omega_{pe}$ limit as $\lambda_{\mathrm{skin\,depth}} = \frac{c}{\omega_{pe}} = \left(5.31\times 10^5\; n_e^{-1/2}\right)\,\mathrm{cm}$,
represents the length scale over which the plasma damps electromagnetic waves of angular
frequency ω.
This non-propagation condition for electromagnetic waves, $\omega^2 - \omega_{pe}^2 < 0$, can also be expressed
by introducing the plasma critical density for EM waves of frequency ω:
$$ n_c(\omega) = \left(\frac{m_e}{4\pi e^2}\right)\omega^2 \qquad (1.50) $$
If $n_e > n_c$ the plasma is called "overdense" or "overcritical" and it becomes opaque to
radiation with frequency smaller than ω.
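The following minimal C++ sketch (an assumed example, not thesis code) evaluates the critical density of eq. 1.50 for a given laser wavelength and the skin depth c/ωpe for a given plasma density; the constants and example values are assumptions:

```cpp
// Minimal sketch (assumed example): critical density n_c(omega) and plasma
// skin depth c/omega_pe, CGS units.
#include <cmath>
#include <cstdio>

int main() {
    const double pi = 3.141592653589793;
    const double c  = 2.9979e10;     // [cm/s]
    const double e  = 4.8032e-10;    // [statC]
    const double me = 9.1094e-28;    // [g]

    double lambda_laser = 0.8e-4;                          // 0.8 um in cm (assumed)
    double omega = 2.0 * pi * c / laser := lambda_laser;   // laser angular frequency
    double n_c   = me * omega * omega / (4.0 * pi * e * e); // eq. (1.50) [cm^-3]

    double ne        = 1.0e19;                              // plasma density [cm^-3] (assumed)
    double omega_pe  = std::sqrt(4.0 * pi * ne * e * e / me);
    double skin_depth = c / omega_pe;                       // [cm]

    std::printf("n_c = %.3e cm^-3, underdense: %s, skin depth = %.3e cm\n",
                n_c, (ne < n_c ? "yes" : "no"), skin_depth);
    return 0;
}
```

(The conditional in the last print simply checks the underdense condition ne < nc discussed above.)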
15
1 The Physics of plasma based laser accelerators
1.2 Laser plasma acceleration physical concepts and regimes
This section presents some basic laser-plasma acceleration techniques.
1.2.1 Electron acceleration: LWFA
In the regime named LWFA (Laser WakeField Acceleration), an electron bunch is ac-
celerated while it travels in phase with a plasma wakefield, generated by a laser pulse
driver.
Qualitatively, when an ultra-short laser pulse begins to propagate in a uniform underdense
plasma, the ponderomotive force, by pushing electrons, creates a longitudinal charge separation
that results in a longitudinal electric field, which pulls the electrons back again.
This results in the formation of a plasma electron wave and of the corresponding wakefield.
The quasi-resonance condition:
$$ L_{\mathrm{laser}} \cong \frac{\lambda_{\mathrm{plasma}}(n_{\mathrm{plasma}})}{2} \qquad (1.1) $$
makes the formation of the wake very efficient. In fact, in these conditions, the ponderomotive
force and the wake electric field change sign at the same frequency.
For the laser pulse to propagate and create the wakefield, the plasma must be underdense.
The laser intensity, parametrized by the dimensionless parameter $a = \frac{eA}{mc^2}$, determines the
shape of the wakefield: for $a \ll 1$ the wakefield is linear, for $a \sim 1$ it becomes nonlinear
(and the electrons' quiver motion becomes relativistic), and for higher a a bubble (see
1.2.2 for further conditions), or blow-out, regime can be reached.
The electron bunch can be injected externally or generated from the wakefield itself, by
wavebreaking phenomena occurring in relativistic conditions.
Analytical models manage to describe the wakefield generation phenomenon, in the
linear regime, only if the driver (the laser beam) is assumed to be nonevolving: it is a function
of the comoving coordinate $\xi = z - v_p t$ only ($v_p \cong v_g \le c$, where $v_p$ is the plasma wave phase
velocity and $v_g$ is the laser group velocity).
On the other hand, nonlinear wakefield generation can be treated analytically only in
both the nonevolving-driver and one-dimensional approximations: the driver has to be
assumed to be broad, $k_p r_\perp \gg 1$, where $r_\perp$ is the characteristic radial driver beam dimension.
In this case, starting from the relativistic cold fluid equations and the Poisson equation:
$$ \frac{1}{c}\frac{\partial n}{\partial t} + \nabla\cdot(n\mathbf{u}/\gamma) = 0, \qquad \frac{\partial(\mathbf{u} - \mathbf{a})}{\partial\xi} = \nabla(\mathbf{a} - \Phi), \qquad \frac{\partial^2\Phi}{\partial\xi^2} = k_p^2\left(\frac{n}{n_0} - 1\right) \qquad (1.2) $$
one can get the evolution equation for the electrostatic wakefield potential [12] ($\gamma_p = (1 - (v_p/c)^2)^{-1/2}$):
$$ \frac{1}{k_p^2}\frac{\partial^2\Phi}{\partial\xi^2} = \gamma_p^2\left\{\frac{v_p}{c}\left[1 - \frac{1 + a^2}{\gamma_p^2(1+\Phi)^2}\right]^{-1/2} - 1\right\} \;\underset{(\gamma_p^2\gg 1)}{\simeq}\; \frac{1 + a^2}{2(1+\Phi)^2} - \frac{1}{2} \qquad (1.3) $$
In general, wakefield generation by an evolving laser pulse can be studied only with
numerical simulations. The following simulation results will help illustrate some acceleration
regimes.
The units in all the plots are CGS and the spatial scale of the density plots is the grid
cell index. The longitudinal phase-space plots are in cm and MeV per electron or
proton. The physical parameters are given in terms of the dimensionless laser potential
$a_0 = eA/m_ec^2$, and the density is normalized as $n/n_c$, with $n_c$ as in 1.50. All the jasmine
simulations were run on one or multiple GPUs.
A first sample simulation is a 2D one, run with my code jasmine (chapter 4), which
illustrates the generation of a linear wakefield. The parameters are:
a0 = 0.01, n/nc = 0.001, λlaser = 1 µm, Llaser = λp/2, w0,laser = λp/6, c∆t = 0.035 µm, ∆x = ∆y = λlaser/20, grid 4096x1024, 16 particles per cell (P.P.C.).
The longitudinal electrical current plot shows what the generated linear wakefield
looks like (the laser pulse longitudinal component is on the right).
For higher laser intensities, such as $a_0 = 1.75$, nonlinear wakefields are created and
a wavebreaking effect is observable, resulting in electron self-injection. The
parameters of a jasmine 2D simulation in this regime follow, and the results are shown in figure
1.2.1.
a0 = 1.75, n/nc = 0.004, λlaser = 1 µm, Llaser = λp/2, w0,laser = 2λp, c∆t = 0.029 µm, ∆x = λlaser/30, ∆y = 2∆x, grid 1280x1024, 16 P.P.C.
The parameters of a similar 3D simulation follow, and the results are plotted in figure 1.2.2.
a0 = 4.0, n = 1.38e19 cm−3, λlaser = 0.8 µm, Llaser = 15 fs, w0,laser = 8.2 µm, ∆t = 0.1 fs, ∆x = 0.08 µm, ∆z = ∆y = 0.32 µm, grid 256x224x224, 4 P.P.C.
Figure 1.2.1: Jasmine nonlinear wakefield simulation results. Wakefield generated in relativistic nonlinear conditions; density and longitudinal phase space [x(cm), E(MeV)] plots showing wavebreaking and acceleration. The first plot is at time ct = 51 µm and the others at ct = 553 µm.
Figure 1.2.2: Jasmine 3D nonlinear wakefield generation simulation results. Evolution, at times that are multiples of 12.8 fs, of a 3D wakefield slice, simulated with jasmine.
1.2.2 Electron acceleration: bubble regime
The bubble mechanism is a very robust electron acceleration scheme that can produce
very short, self-injected, monochromatic beams. Given its deep nonlinearity and complexity,
the only method available for accurately modeling this regime is to run full 3D
simulations (or cylindrically symmetric reduced ones, as in INF&RNO, chapter 3).
Qualitatively, an intense laser pulse, shorter than a plasma wavelength in both the longitudinal
and transverse directions, propagates in an underdense plasma and pushes away
all the electrons it sweeps over, creating a robust, low-density cavity (the bubble) just behind
itself. The electrons, accumulating on its borders, slip to the back of the cavity, where
an injection mechanism takes place. The injected electrons are trapped inside the cavity
and are accelerated by the electrostatic field resulting from the charge separation. As the
electrons inside the cavity get accelerated, they move forward inside it, allowing more
electrons to be injected (beam loading).
The conditions required for the bubble regime are ([14]):
• Relativistic or ultra-relativistic electron motion, a > 1.
• Underdense plasma, $\frac{n}{n_c(\lambda_{\mathrm{laser}})} \ll 1$.
• Pulse length shorter than its transverse waist $w_0$, and $k_p w_0 \simeq \sqrt{a}$.
I have prepared some simulations to illustrate this regime further:
• A 2D simulation run with my code jasmine (chapter 4). The physical and numerical parameters are: a0 = 7.5, n/nc = 0.008, λlaser = 0.8 µm, Llaser = w0,laser/√2, w0,laser = λp√a/2π, c∆t = 0.023 µm, ∆x = λlaser/30, ∆y = 2∆x, grid 2048x1024, 20 P.P.C. The simulation results are plotted in figure 1.2.3.
• A similar 3D simulation, whose results are plotted in figure 1.2.4: a0 = 10, n/nc = 0.005, λlaser = 1 µm, Llaser = 8 µm, w0,laser = 3 µm, c∆t = 0.018 µm, ∆x = λlaser/32, grid 768x256x256, 8 P.P.C.
• A 2D (3D in cylindrical symmetry) simulation run with INF&RNO (chapter 3), plotted in figure 1.2.5: a0 = 4, n/nc = 0.0025, λlaser = (2π/kp)√(n/nc), Llaser = 2/kp, w0,laser = 4/kp, ∆z = λp/30, grid 6/∆z, 16 P.P.C.
Figure 1.2.3: 2D bubble simulation. Electron charge density at ct = 294.4 µm. Cavity formation (ct = 23 µm, ct = 117 µm) and particle phase space [x(cm), E(MeV)] at ct = 491 µm. Electric field inside the cavity (bubble) at ct = 117 µm.
Figure 1.2.4: 3D bubble simulation. 3D electron density plot (with the injected bunch in green) and sliced electron density at t = 0.004 cm/c.
Figure 1.2.5: INF&RNO bubble simulation. Electron density and longitudinal electric field at ωpt = 480.
1.2.3 Ion acceleration: TNSA and RPA
In the regime named TNSA (Target Normal Sheath Acceleration), qualitatively, an intense
(the electrons become relativistic very soon), linearly polarized laser pulse is partially
absorbed and reflected by an overdense plasma (initially solid and ionized by the laser),
whose electrons are heated and slightly pushed forward, passing beyond the rear side of the
target. This charge displacement creates a static longitudinal electric field that can pull
plasma ions out of the target. The resulting accelerated ion bunch has an exponential
energy distribution, and the maximum energy reached is of the order of several MeV
([24]).
To better illustrate this physical regime, I have run a 2D simulation with my PIC code
jasmine. The simulation setup is the following: a0 = 10, n/nc = 80, λlaser = 1 µm, Llaser = 25 fs, w0,laser = 3 µm, c∆t = 5·10−7 cm, ∆x = λp/15, ∆y = 2∆x, grid 5120x4096, 64 P.P.C.
The results are shown in figure 1.2.6.
In the regime named RPA (Radiation Pressure Acceleration), particle acceleration
is dominated by the radiation pressure. The results of a simulation in this regime are shown in figure
1.2.7; the setup is similar to the one in 1.2.3, but here a circularly polarized laser was
used: a0 = 20, n/nc = 50, λlaser = 1 µm, Llaser = 8 µm, w0,laser = 8 µm, c∆t = 6.3·10−7 cm, ∆x = λp/15, ∆y = 2∆x, grid 5120x4096, 64 P.P.C.
Figure 1.2.6: TNSA simulation results. Electron density displacement and corresponding static longitudinal electric field accelerating protons, whose phase space [x(cm), E(MeV)] at the end of the simulation is plotted rightmost. The time of the density plots is ct = 0.002048 cm.
Figure 1.2.7: RPA simulation results. Proton density displacement, static longitudinal electric field and proton phase space [x(cm), E(MeV)] at ct = 0.0036288 cm.
2 The numerical modelization
2.1 The particle-in-cell method
2.1.1 Phase space representation
The most straightforward and complete approach to numerically model a system described
by the Maxwell-Vlasov equations consists in computing, at each time, the phase-space
distribution $f_j(\mathbf{x},\mathbf{p},t)$, discretized on a grid. In a full three-dimensional model,
the plasma phase space is six-dimensional. Thus, the number of grid points (and therefore the memory
required for the execution of the simulation) scales as $n^6$, n being the linear dimension
of the discrete grid.
These memory requirements are far beyond current technological limits: for example,
taking a meaningful grid size, say n = 1024, the memory required would be ~10^18 GB.
It is therefore necessary to use a "compressed" representation of the discretized $f_j(\mathbf{x},\mathbf{p},t)$.
A method that uses a very sparse phase-space representation is the so-called particle-in-cell
method.
It decomposes the $f_j$ distribution into the sum of contributions coming from a finite set of $N_{pj}$
computational macro-particles, or quasiparticles. Their trajectories are followed in
phase space in a Lagrangian manner, while the electromagnetic fields are discretized
on a spatial grid with grid spacing ∆x.
The macro-particles are not point-like charges: they are represented by a density function
which is extended in space, so that they can be considered as a smooth cloud of charge,
in order to smooth out the numerical noise. The support of these functions has a size of
the order of the grid cell size. Whereas in configuration space the numerical particles
have a finite extension, in momentum space they are point-like (they have
a definite momentum).
The interaction of the particles with the field grids, which completes the description of the
dynamics, is achieved through interpolation and deposition processes. Since the support of
the quasi-particle density function is compact, these processes involve only a small number
of grid cells, namely the ones overlapping with the particle's finite shape.
The Vlasov equation and the equations of motion read:
$$ \left(\partial_t + \dot{\mathbf{x}}\cdot\partial_{\mathbf{x}} + \dot{\mathbf{p}}\cdot\partial_{\mathbf{p}}\right) f_j(\mathbf{x},\mathbf{p},t) = 0 \qquad (2.1) $$
$$ \dot{\mathbf{x}} = \frac{\mathbf{p}}{\gamma m}, \qquad \dot{\mathbf{p}} = \mathbf{F}(\mathbf{x},\mathbf{p},t) \qquad (2.2) $$
The PIC approach consists in discretizing the phase space density function using a finite,
approximated, sum:
$$ f_j(\mathbf{x},\mathbf{p},t) = f_{0j} \sum_{n=0}^{N_{pj}-1} g\left(\mathbf{x} - \mathbf{x}_n(t)\right)\,\delta\left(\mathbf{p} - \mathbf{p}_n(t)\right) \qquad (2.3) $$
in which $f_0$ is a normalization factor, $\mathbf{x}_n(t)$ is the trajectory of the n-th macro-particle
and $\mathbf{p}_n(t)$ is its momentum.
The function $g(\mathbf{x})$ is the macro-particle shape function.
The shape function is used as a convolution kernel and it is assumed to have δ-like
properties (from which $f_0 = \frac{1}{N_p}$ follows):
$$ \int g(\mathbf{x} - \mathbf{x}_n)\, d\mathbf{x} = 1, \qquad \int \partial_{\mathbf{x}}\, g(\mathbf{x} - \mathbf{x}_n)\, d\mathbf{x} = 0 \qquad (2.4) $$
$g(\mathbf{x})$ describes the macro-particle spatial extension and is useful for reducing
the numerical noise arising from the interpolation and deposition processes, which would
be larger if a δ-function were used instead. The meaning of $g(\mathbf{x})$ is evident considering the
expression for the charge density, which becomes:
$$ \rho(\mathbf{x},t) = \sum_j Q_j \int f_j(\mathbf{x},\mathbf{p},t)\, d\mathbf{p} = \sum_{j,n} q_j\, g(\mathbf{x} - \mathbf{x}_n) $$
whereas the electrical current can be defined as:
$$ \mathbf{j}(\mathbf{x},t) = \sum_{j,n} \mathbf{v}_n\, q_j\, g(\mathbf{x} - \mathbf{x}_n) \qquad (2.5) $$
Rewriting the Vlasov equation 2.1, using this discretized phase-space representation and
equation 2.2, one gets:
$$ \partial_t f = -f_0 \sum_n^{N_p} \left\{\left[\partial_{\mathbf{x}} g(\mathbf{x}-\mathbf{x}_n(t))\cdot\dot{\mathbf{x}}_n(t)\right]\delta(\mathbf{p}-\mathbf{p}_n(t)) + g(\mathbf{x}-\mathbf{x}_n(t))\left[\partial_{\mathbf{p}}\delta(\mathbf{p}-\mathbf{p}_n(t))\cdot\dot{\mathbf{p}}_n(t)\right]\right\} $$
$$ \dot{\mathbf{x}}\cdot\partial_{\mathbf{x}} f = f_0 \sum_n^{N_p} \left\{\frac{\mathbf{p}_n(t)}{\gamma m}\cdot\left[\partial_{\mathbf{x}} g(\mathbf{x}-\mathbf{x}_n(t))\right]\delta(\mathbf{p}-\mathbf{p}_n(t))\right\} $$
$$ \dot{\mathbf{p}}\cdot\partial_{\mathbf{p}} f = f_0 \sum_n^{N_p} \left\{\mathbf{F}(\mathbf{x},\mathbf{p}_n(t),t)\cdot g(\mathbf{x}-\mathbf{x}_n(t))\left[\partial_{\mathbf{p}}\delta(\mathbf{p}-\mathbf{p}_n(t))\right]\right\} $$
$$ f_0 \sum_n^{N_p}\left(-\dot{\mathbf{x}}_n\cdot g'_n\delta_n - \dot{\mathbf{p}}_n g_n\cdot\delta'_n + \frac{\mathbf{p}_n}{\gamma m}\cdot g'_n\delta_n + \mathbf{F}(\mathbf{x},\mathbf{p}_n(t),t)\, g_n\cdot\delta'_n\right) = 0 \qquad (2.6) $$
Integrating in momentum space and using the delta function properties, one has:
$$ \sum_n^{N_p}\left(-\dot{\mathbf{x}}_n + \frac{\mathbf{p}_n}{\gamma m}\right)\partial_{\mathbf{x}} g(\mathbf{x}-\mathbf{x}_n(t)) = 0, \;\; \forall\mathbf{x} \;\rightarrow\; \dot{\mathbf{x}}_n = \frac{\mathbf{p}_n}{\gamma m} \qquad (2.7) $$
Let $\mathbf{F}_n$ be the spatial average of the external force field acting on the n-th macro-particle,
$\mathbf{F}(\mathbf{x},\mathbf{p}_n,t)$, evaluated over the shape function $g(\mathbf{x})$:
$$ \mathbf{F}_n(\mathbf{x},\mathbf{p},t) = \int g(\mathbf{x}-\mathbf{x}_n)\,\mathbf{F}(\mathbf{x},\mathbf{p}_n,t)\, d\mathbf{x} \qquad (2.8) $$
Then, integrating over $\mathbf{x}$ and using the δ-like properties 2.4 of $g(\mathbf{x})$, one gets:
$$ \sum_n^{N_p}\left(-\dot{\mathbf{p}}_n + \mathbf{F}_n\right)\partial_{\mathbf{p}}\delta(\mathbf{p}-\mathbf{p}_n) = 0, \;\; \forall\mathbf{p} \;\rightarrow\; \dot{\mathbf{p}}_n = \mathbf{F}_n \qquad (2.9) $$
The particle-in-cell method, therefore, reduces the computational complexity required
for the evolution of a six-dimensional phase space grid to a system of 2Np (for each
species) equations of motion, coupled with the proper equations (in our case for the e.m.
fields) that close the system giving an expression for the external force field F.
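To make the reduction concrete, here is a small illustrative C++ sketch of the data the PIC method actually evolves per species; this is not the actual jasmine data layout (which is discussed in chapter 4), only an assumed example:

```cpp
// Illustrative sketch (assumed example, not the jasmine layout): the state
// evolved by the PIC method per species is the list of macro-particle
// trajectories (x_n, p_n), plus the species charge/mass and a statistical
// weight fixed by the normalization factor f0.
#include <vector>

struct MacroParticle {
    double x[3];   // position (the particle has a finite shape g(x - x_n))
    double p[3];   // momentum (point-like in momentum space)
};

struct Species {
    double charge;                        // q_j of the physical particles
    double mass;                          // m_j
    double weight;                        // physical particles per macro-particle
    std::vector<MacroParticle> particles; // 2*Np equations of motion per species
};
```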
2.1.2 Passes of an electromagnetic PIC code and numerical parameters of
a laser plasma interaction simulation
Dealing with charged particles, the physical description of the problem is closed by the
Maxwell equations for the electromagnetic fields, which are coupled with the particle
motion in a bidirectional way (by the Lorentz force and by the evaluation of charge and
current densities).
The passes of an integration cycle of an electromagnetic PIC code are the following:
1. Time advancement of the macro-particles' momenta and positions p, x, using the
equations of motion obtained above and the Lorentz force. The fields are interpolated from
the E, B grids.
2. Deposition (spatial averaging onto a discrete grid) of the source quantities needed
in the Maxwell equations, ρ and j.
3. Time advancement of the electromagnetic fields E, B, discretized on spatial grids (see
subsection 2.2.1), using the Maxwell equations and the quantities computed in step 2
as sources.
The critical parameter of a simulation is the grid cell size ∆x. The integration timestep
∆t is related to ∆x by the Courant condition ([6]). This condition is required for the
stability of explicit integration schemes for the Maxwell PDEs, and reads c∆t ≤ C∆x,
where the constant C depends on the set of algorithms used.
Furthermore, the size ∆x must be small enough to resolve, with enough grid points, the
typical length scales of the considered system.
In the case of a system of electromagnetic waves interacting with a plasma, these are:
• $\lambda_{em}$, the wavelength of the electromagnetic waves
• $\lambda_{sd} = c/\omega_{pe}$, the plasma skin depth
The smaller of the two length scales must be resolved with enough grid points. The two
length scales correspond to two mutually exclusive regimes:
• Overcritical regime: $\omega \ll \omega_{pe} \rightarrow \lambda_{sd} \ll \lambda_{em}$; $\lambda_{sd}$ must be resolved, which resolves the
other scale as well.
• Underdense plasma regime: $\lambda_{em} \ll \lambda_{sd}$; $\lambda_{em}$ must be resolved, which resolves the
other scale as well.
The laser envelope approximation (see section 2.4) may come to help in this case,
requiring only the much larger scale $\lambda_{sd}$ to be resolved and thus allowing the use of
much coarser grids.
Another critical parameter of a PIC simulation is the number of macro-particles per cell,
which sample the local phase space. Approximating the phase-space distribution as a finite
decomposition into too small a number of spatially extended macro-particles can cause some
regions of phase space to be represented with insufficient detail and introduces a statistical
noise effect, whose relative amplitude scales approximately as $1/\sqrt{N_{\mathrm{part\ per\ cell}}}$ ([6]).
2.1.3 Interpolation and deposition using shape functions
The spatial averaging needed for interpolation and deposition processes is defined using
the particles’ shape function g.
2.1.3.1 Force interpolation
The average (interpolated) force acting on a particle, with $\mathbf{F}(\mathbf{x},\mathbf{p}_n,t) = q\left(\mathbf{E}(\mathbf{x}) + \frac{\mathbf{p}_n\times\mathbf{B}(\mathbf{x})}{m\gamma c}\right)$, is defined as:
$$ \mathbf{F}_n(\mathbf{x}_n,\mathbf{p}_n,t) = \int g(\mathbf{x}-\mathbf{x}_n)\,\mathbf{F}(\mathbf{x},\mathbf{p}_n,t)\, d\mathbf{x} \qquad (2.10) $$
Indexing the grid cells, with characteristic function $\chi_\imath$, by the multidimensional index
$\imath = (i,j,k)$, it is possible to decompose the above integral average into a finite sum of
single-cell averages:
$$ \mathbf{F}_n(\mathbf{x}_n,\mathbf{p}_n,t) = \sum_{\imath\in G} \int_{X_\imath} g(\mathbf{x}-\mathbf{x}_n)\,\mathbf{F}(\mathbf{x},\mathbf{p}_n,t)\, d\mathbf{x} \qquad (2.11) $$
The force is given by the fields, which are discretized in such a way that they take a
single, constant value per cell, $\mathbf{E}_\imath,\mathbf{B}_\imath$. It is therefore possible to write the cell integrals
as functions of the particle position only:
$$ \mathbf{F}_n(\mathbf{x}_n,\mathbf{p}_n,t) = \sum_{\imath\in G} \mathbf{F}_\imath(\mathbf{p}_n,t) \int_{X_\imath} g(\mathbf{x}-\mathbf{x}_n)\, d\mathbf{x} \qquad (2.12) $$
or, introducing the shape factors (for a particle at position $\mathbf{x}_n$) $S_\imath(\mathbf{x}_n) = \int_{X_\imath} g(\mathbf{x}-\mathbf{x}_n)\, d\mathbf{x}$:
$$ \mathbf{F}_n(\mathbf{x}_n,\mathbf{p}_n,t) = \sum_{\imath\in G} \mathbf{F}_\imath(\mathbf{p}_n,t)\, S_\imath(\mathbf{x}_n) \qquad (2.13) $$
The shape factors satisfy the property $\sum_\imath S_\imath(\mathbf{x}) = 1$.
2.1.3.2 Charge and current deposition
In order to evaluate the electrical charge and current densities, discretized on a grid, it is
necessary to "deposit" the macro-particle charge on the grid nodes. With $\rho(\mathbf{x})$ defined
as $\rho(\mathbf{x}) = \sum_n q\, g(\mathbf{x}-\mathbf{x}_n)$, then:
$$ \rho_\imath = \frac{1}{V_\imath}\int_{\chi_\imath}\rho(\mathbf{x})\, d\mathbf{x} = \frac{1}{V_\imath}\int_{\chi_\imath}\left[\sum_n q\, g(\mathbf{x}-\mathbf{x}_n)\right] d\mathbf{x} = \frac{1}{V_\imath}\sum_n q\left[\int_{\chi_\imath} g(\mathbf{x}-\mathbf{x}_n)\, d\mathbf{x}\right] = \frac{1}{V_\imath}\sum_n q\, S_\imath(\mathbf{x}_n) $$
where $V_\imath = \int_{\chi_\imath} d\mathbf{x}$ is the cell volume.
2.1.3.3 Common shapefunctions and shapefactors
For a regular 3D Cartesian grid, with grid cells of size $\Delta\mathbf{x} = (\Delta x,\Delta y,\Delta z)$ centered at the
points $\mathbf{x}_{\imath=(i,j,k)} = (x_i, y_j, z_k)$, $x_i = x_0 + \Delta x\cdot i$, it is useful to introduce the centered and
normalized shape functions and shape factors $\tilde{g}(\tilde{\mathbf{x}})$ and $\tilde{S}(\tilde{\mathbf{x}}_\imath)$, defined starting from the
cell-centered coordinate system (denoted by ~):
$$ \tilde{\mathbf{x}}_\imath = \left((x-x_i)/\Delta x,\; (y-y_j)/\Delta y,\; (z-z_k)/\Delta z\right) \qquad (2.14) $$
in which the shape factors become:
$$ \tilde{S}(\tilde{\mathbf{x}}_\imath) = S_\imath(\mathbf{x}) = \tilde{S}\left((x-x_i)/\Delta x,\; (y-y_j)/\Delta y,\; (z-z_k)/\Delta z\right) \qquad (2.15) $$
The equation $S_\imath(\mathbf{x}_n) = \int_{X_\imath} g(\mathbf{x}-\mathbf{x}_n)\, d\mathbf{x}$ can be recast using the linear change of variables
$\mathbf{y} = \Delta\mathbf{x}\ast\tilde{\mathbf{y}}_\imath + \mathbf{x}_\imath$:
$$ \tilde{S}(\tilde{\mathbf{x}}_\imath) = S_\imath(\mathbf{x}) = \int_{\chi_\imath} g\left(\mathbf{y} - (\tilde{\mathbf{x}}_\imath\ast\Delta\mathbf{x} + \mathbf{x}_\imath)\right) d\mathbf{y} \qquad (2.16) $$
$$ \tilde{S}(\tilde{\mathbf{x}}_\imath) = V\int_{\tilde{\chi}} g\left(\Delta\mathbf{x}\ast(\tilde{\mathbf{y}}_\imath - \tilde{\mathbf{x}}_\imath)\right) d\tilde{\mathbf{y}}_\imath = \int_{\tilde{\chi}} \tilde{g}(\tilde{\mathbf{y}}_\imath - \tilde{\mathbf{x}}_\imath)\, d\tilde{\mathbf{y}}_\imath \qquad (2.17) $$
where $\tilde{\chi}$ is the box defined by $|\tilde{x}| < \frac{1}{2}$, $|\tilde{y}| < \frac{1}{2}$, $|\tilde{z}| < \frac{1}{2}$ and $\tilde{g}(\tilde{\mathbf{y}}) \equiv V g(\Delta\mathbf{x}\ast\tilde{\mathbf{y}})$.
It is natural for the shape function to be separable into one-dimensional components, i.e.
$\tilde{g}(\tilde{\mathbf{x}}) = \tilde{g}(\tilde{x})\tilde{g}(\tilde{y})\tilde{g}(\tilde{z})$. By simple integration properties, one also has:
$$ \tilde{S}(\tilde{\mathbf{x}}) = \tilde{S}_x(\tilde{x})\,\tilde{S}_y(\tilde{y})\,\tilde{S}_z(\tilde{z}) \qquad (2.18) $$
This last relation and equation 2.17 make it easy to compute the shape factors for any
separable shape function.
[Figures: examples of normalized, one-dimensional shape functions and the corresponding shape factors (the ~ is dropped in the plots).]
These classical shape functions are defined piecewise on intervals of length ∆x (1 in the
normalized coordinate system). By definition, the shape factor functions $\tilde{S}(\tilde{x})$ have the
same property. The intervals α of piecewise definition of $\tilde{S}(\tilde{x}_i)$ are always
of the kind $\alpha = [a_\alpha, a_\alpha + 1]$, and can be identified by their parameter $a \in A$, integer or
half-integer:
$$ a \le \tilde{x}_i \le a + 1 \qquad (2.19) $$
Replacing the definitions in the relation above, and applying the properties of the floor and ceiling
functions ($\lfloor x\rfloor$ and $\lceil x\rceil$), one directly obtains the cell index corresponding to a given piece
of the definition (for performance reasons, it is useful to know it in advance); here $\tilde{x}_0 \equiv (x-x_0)/\Delta x$:
$$ a\,\Delta x \le x - x_0 - i_a\Delta x \le (a+1)\Delta x $$
$$ \lceil (x-x_0)/\Delta x - a - 1\rceil \le i_a \le \lfloor (x-x_0)/\Delta x - a\rfloor $$
$$ \forall a:\; i_a = \lfloor (x-x_0)/\Delta x - a\rfloor = \begin{cases} \lfloor \tilde{x}_0\rfloor - a \equiv i_0 - a & a \;\text{integer}\\[2pt] \lfloor \tilde{x}_0 + \tfrac{1}{2}\rfloor - a' \equiv i'_0 - a' & a = a' - \tfrac{1}{2} \;\text{half-integer}\end{cases} \qquad (2.20) $$
The optimized chain of computation reads (primes refer to the case in which the a are half-integers):
$$ \forall a\in A:\quad i_a = i_0^{(\prime)} - a^{(\prime)} \;\rightarrow\; \tilde{x}_{i_0-a} = \tilde{x}_{i_0} - a^{(\prime)} \;\rightarrow\; \tilde{S}(\tilde{x}_{i_0} - a^{(\prime)}) \equiv \tilde{S}_a(\tilde{x}_{i_0}) \quad \text{(optimized } \tilde{S}_a\text{)} $$
Defining $b\in B$ and $c\in C$ as the analogues, for the y and z directions, of the intervals
$a\in A$, one can finally recompose the full 3D interpolation algorithm for a particle at
position $\mathbf{x} = (x,y,z)$ (dropping the primes):
$$ (i_0, j_0, k_0) = (\lfloor\tilde{x}_0\rfloor, \lfloor\tilde{y}_0\rfloor, \lfloor\tilde{z}_0\rfloor) = \left(\lfloor (x-x_0)/\Delta x\rfloor, \dots, \dots\right) $$
$$ \mathbf{F} = \sum_{(a,b,c)} \tilde{S}^x_a(\tilde{x}_{i_0})\cdot \tilde{S}^y_b(\tilde{y}_{j_0})\cdot \tilde{S}^z_c(\tilde{z}_{k_0})\cdot \mathbf{F}_{i_0+a,\; j_0+b,\; k_0+c} \qquad (2.21) $$
and the deposition algorithm (of the single-particle quantity F):
$$ (i_0, j_0, k_0) = (\lfloor\tilde{x}_0\rfloor, \lfloor\tilde{y}_0\rfloor, \lfloor\tilde{z}_0\rfloor) = \left(\lfloor (x-x_0)/\Delta x\rfloor, \dots, \dots\right) $$
$$ \forall a,b,c:\quad \tilde{S}^x_a(\tilde{x}_{i_0})\cdot \tilde{S}^y_b(\tilde{y}_{j_0})\cdot \tilde{S}^z_c(\tilde{z}_{k_0})\cdot F \;\rightarrow\oplus\rightarrow\; F_{i_0+a,\; j_0+b,\; k_0+c} \qquad (2.22) $$
More generally, considering symmetric shape factors $\tilde{S}(\tilde{x})$ with support $\mathrm{supp}(\tilde{S})$ contained in the
interval $[-l, l]$ ($l = b + \Delta x/2$, $b$ being the half-width of the support of g), the interpolation is computed only on the grid cells for which
$S_i(x) \ne 0$, or, equivalently, $\tilde{S}(\tilde{x}_i) \ne 0$:
$$ \mathrm{supp}(\tilde{S}) = \{-l \le \tilde{x}_i \le l\} $$
$$ -l \le (x - x_i)/\Delta x \le l $$
$$ -l\,\Delta x \le x - x_0 - i\cdot\Delta x \le l\,\Delta x $$
$$ (x - x_0)/\Delta x - l \le i \le (x - x_0)/\Delta x + l $$
$$ \lceil \tilde{x}_0 - l\rceil \le i \le \lfloor \tilde{x}_0 + l\rfloor \qquad (2.23) $$
So the cells interacting with our particle are the ones with i-index in the set I:
$$ i \in I = \{\lfloor \tilde{x}_0 - l\rfloor + 1,\; \lfloor \tilde{x}_0 - l\rfloor + 2,\; \dots,\; \lfloor \tilde{x}_0 + l\rfloor\}, \qquad \#(I) = 2l $$
Furthermore,
• if l is an integer: $i \in I = \{\lfloor \tilde{x}_0\rfloor - l + 1,\; \dots,\; \lfloor \tilde{x}_0\rfloor,\; \dots,\; \lfloor \tilde{x}_0\rfloor + l\}$; e.g. $l = 1$: $i \in I = \{\lfloor \tilde{x}_0\rfloor, \lfloor \tilde{x}_0\rfloor + 1\}$
• if $l = m - \frac{1}{2}$ is a half-integer: $i \in I = \{\lfloor \tilde{x}_0 + \frac{1}{2}\rfloor - m + 1,\; \dots,\; \lfloor \tilde{x}_0 + \frac{1}{2}\rfloor,\; \dots,\; \lfloor \tilde{x}_0 + \frac{1}{2}\rfloor + m - 1\}$; e.g. $l = \frac{3}{2}$: $m = 2$, $i \in I = \{\lfloor \tilde{x}_0 + \frac{1}{2}\rfloor - 1,\; \lfloor \tilde{x}_0 + \frac{1}{2}\rfloor,\; \lfloor \tilde{x}_0 + \frac{1}{2}\rfloor + 1\}$
2.2 The “standard” second-order PIC: leapfrog and FDTD
The standard, second-order PIC scheme uses the leapfrog scheme for advancing the particles'
momenta and positions and the FDTD method for solving the Maxwell equations.
2.2.1 Solving the Maxwell equations numerically using the Yee Lattice
The temporal evolution of the electromagnetic fields, in the presence of an electrical current
j (generated in our case by the plasma particles' velocities), is completely determined,
given proper initial and boundary conditions, by the Maxwell–Faraday and Ampère–Maxwell
laws:
$$ \frac{\partial\mathbf{E}}{\partial t} = +c\,\nabla\times\mathbf{B} - 4\pi\mathbf{j}, \qquad \frac{\partial\mathbf{B}}{\partial t} = -c\,\nabla\times\mathbf{E} \qquad (2.1) $$
These equations form a system of two first order partial differential equations, which can
be integrated numerically in time using finite difference methods. First of all, the fields
are discretized on a finite grid and the spatial differentiation operators are approximated
by finite differences.
The finite-difference approximation $\Delta_h[f](\mathbf{x})$ of a linear differential operator ∆ is a linear
function of the field values at the locations $\mathbf{x} + \mathbf{h}\ast\mathbf{k}$, $\mathbf{k}$ being a vector of signed integers,
$\mathbf{h}$ the discretization step-size vector, and ∗ the component-by-component product:
$$ \Delta_h[f](\mathbf{x}) = \sum_{\mathbf{k}\in\kappa} \mu^\Delta_{\mathbf{k}}\, f(\mathbf{x} + \mathbf{h}\ast\mathbf{k}) \cong \Delta f(\mathbf{x}) \qquad (2.2) $$
Indexing the grid points G with a multi-dimensional index $\imath = (i,j,k,\dots) \in G$, the finite
difference, calculated at the grid point $\imath$, is:
$$ (\Delta_h[f])_\imath = \sum_{\jmath\in G} \mu^\Delta_{\imath-\jmath}\, f_\jmath \equiv \mu^\Delta_{\imath\jmath} f_\jmath, \qquad \mu^\Delta_{\imath\jmath} \equiv \mu^\Delta_{\imath-\jmath} \qquad (2.3) $$
For the curl, one has:
$$ \nabla\times\mathbf{E} = \begin{pmatrix} \partial_y E_z - \partial_z E_y \\ \partial_z E_x - \partial_x E_z \\ \partial_x E_y - \partial_y E_x \end{pmatrix} \;\rightarrow\; (\nabla\times\mathbf{E})_\imath = \begin{pmatrix} \mu^{\partial_y}_{\imath\jmath} E_{z\jmath} - \mu^{\partial_z}_{\imath\jmath} E_{y\jmath} \\ \dots \\ \dots \end{pmatrix} \equiv \mu^{(\nabla\times)}_{\imath\jmath} \mathbf{E}_\jmath \qquad (2.4) $$
This way, the PDE becomes a system of ordinary differential equations involving the
values of the fields defined at the grid point locations $\imath$:
$$ \left(\frac{\partial\mathbf{E}}{\partial t}\right)_\imath = +c\left(\mu^{(\nabla\times)}_{\imath\jmath}\mathbf{B}_\jmath\right) - 4\pi\mathbf{j}_\imath, \qquad \left(\frac{\partial\mathbf{B}}{\partial t}\right)_\imath = -c\left(\mu^{(\nabla\times)}_{\imath\jmath}\mathbf{E}_\jmath\right) \qquad (2.5) $$
It is now possible to integrate the equations iteratively and numerically, using methods
like Runge-Kutta, or by directly expressing the temporal differentiation operators as finite
differences and then solving the resulting system for the values at later times.
Depending on the discretization geometry of the fields and on the physical quantities it is more
important to conserve, various choices can be made for the numerical derivative and
integration schemes:
• The time integration is usually computed explicitly (the appropriate Courant–Friedrichs–Lewy
condition on the integration step must therefore be satisfied for stability), using
Runge-Kutta, forward (Euler) or centered (leapfrog) schemes.
• The spatial derivatives (curls) can be discretized using centered (second-order accurate
in space), higher-order (five-point stencil), upwind (for certain geometries
and purposes) or even compact (see [5]) schemes.
A very popular integration scheme, which is second-order accurate in space and time, is
the Finite-Difference Time-Domain method, or FDTD (see [33]). It manages to discretize
both the time and all of the space derivatives using centered differences of the kind
$\Delta x\cdot f'(x) \simeq f(x + \frac{1}{2}\Delta x) - f(x - \frac{1}{2}\Delta x)$, which are practically more accurate than the standard
$2\Delta x\cdot f'(x) \simeq f(x + \Delta x) - f(x - \Delta x)$, being $f'(x) = \partial_x f(x)$:
• The electric and magnetic fields are defined at staggered time positions ($\mathbf{E}^n$, $\mathbf{B}^{n+1/2}$),
and the time-advance iteration is obtained with a leapfrog step: the PDE system
has a symplectic structure. This integration scheme allows for wave propagation
free of numerical energy dissipation.
• Spatially, the different field components are stored at different grid locations,
on a Yee lattice. The 3D computational domain is split into cubic voxels. The
components of E are stored on the edges (along the corresponding directions), while
the components of B are stored on the face centers of the cube. The numerical derivatives
composing the curl operations can all be expressed using central differences, as can be
seen in the standard Yee-cell figure (from Wikipedia).
For a uniform 3D Cartesian discretization of the fields (with grid spacings dx, dy, dz),
the FDTD scheme can therefore be written explicitly as (the notation used is $F^n_{i,j,k} \equiv F(i\,dx, j\,dy, k\,dz, n\,dt)$):
$$ \frac{E^{n+1}_{x\,i+1/2,j,k} - E^{n}_{x\,i+1/2,j,k}}{dt} = c\,\frac{B^{n+1/2}_{z\,i+1/2,j+1/2,k} - B^{n+1/2}_{z\,i+1/2,j-1/2,k}}{dy} - c\,\frac{B^{n+1/2}_{y\,i+1/2,j,k+1/2} - B^{n+1/2}_{y\,i+1/2,j,k-1/2}}{dz} - 4\pi J^{n+1/2}_{x\,i+1/2,j,k} $$
$$ \frac{E^{n+1}_{y\,i,j+1/2,k} - E^{n}_{y\,i,j+1/2,k}}{dt} = c\,\frac{B^{n+1/2}_{x\,i,j+1/2,k+1/2} - B^{n+1/2}_{x\,i,j+1/2,k-1/2}}{dz} - c\,\frac{B^{n+1/2}_{z\,i+1/2,j+1/2,k} - B^{n+1/2}_{z\,i-1/2,j+1/2,k}}{dx} - 4\pi J^{n+1/2}_{y\,i,j+1/2,k} $$
$$ \frac{E^{n+1}_{z\,i,j,k+1/2} - E^{n}_{z\,i,j,k+1/2}}{dt} = c\,\frac{B^{n+1/2}_{y\,i+1/2,j,k+1/2} - B^{n+1/2}_{y\,i-1/2,j,k+1/2}}{dx} - c\,\frac{B^{n+1/2}_{x\,i,j+1/2,k+1/2} - B^{n+1/2}_{x\,i,j-1/2,k+1/2}}{dy} - 4\pi J^{n+1/2}_{z\,i,j,k+1/2} $$
$$ \frac{B^{n+1/2}_{x\,i,j+1/2,k+1/2} - B^{n-1/2}_{x\,i,j+1/2,k+1/2}}{dt} = -c\,\frac{E^{n}_{z\,i,j+1,k+1/2} - E^{n}_{z\,i,j,k+1/2}}{dy} + c\,\frac{E^{n}_{y\,i,j+1/2,k+1} - E^{n}_{y\,i,j+1/2,k}}{dz} $$
$$ \frac{B^{n+1/2}_{y\,i+1/2,j,k+1/2} - B^{n-1/2}_{y\,i+1/2,j,k+1/2}}{dt} = -c\,\frac{E^{n}_{x\,i+1/2,j,k+1} - E^{n}_{x\,i+1/2,j,k}}{dz} + c\,\frac{E^{n}_{z\,i+1,j,k+1/2} - E^{n}_{z\,i,j,k+1/2}}{dx} $$
$$ \frac{B^{n+1/2}_{z\,i+1/2,j+1/2,k} - B^{n-1/2}_{z\,i+1/2,j+1/2,k}}{dt} = -c\,\frac{E^{n}_{y\,i+1,j+1/2,k} - E^{n}_{y\,i,j+1/2,k}}{dx} + c\,\frac{E^{n}_{x\,i+1/2,j+1,k} - E^{n}_{x\,i+1/2,j,k}}{dy} \qquad (2.6) $$
or, defining the operators:
$$ \nabla^+ f_{ijk} = \left(\frac{f_{i+1,j,k} - f_{i,j,k}}{dx},\; \frac{f_{i,j+1,k} - f_{i,j,k}}{dy},\; \frac{f_{i,j,k+1} - f_{i,j,k}}{dz}\right) $$
$$ \nabla^- f_{ijk} = \left(\frac{f_{i,j,k} - f_{i-1,j,k}}{dx},\; \frac{f_{i,j,k} - f_{i,j-1,k}}{dy},\; \frac{f_{i,j,k} - f_{i,j,k-1}}{dz}\right) \qquad (2.7) $$
as:
$$ \nabla^-\cdot\mathbf{E}^n = 4\pi\rho^n, \qquad \nabla^+\cdot\mathbf{B}^{n+1/2} = 0 $$
$$ \frac{\mathbf{E}^{n+1} - \mathbf{E}^n}{dt} = c\,\nabla^-\times\mathbf{B}^{n+1/2} - 4\pi\mathbf{J}^{n+1/2}, \qquad \frac{\mathbf{B}^{n+1/2} - \mathbf{B}^{n-1/2}}{dt} = -c\,\nabla^+\times\mathbf{E}^n \qquad (2.8) $$
with:
$$ \mathbf{E}^n = \left(E^n_{x\,i+1/2,j,k},\; E^n_{y\,i,j+1/2,k},\; E^n_{z\,i,j,k+1/2}\right) $$
$$ \mathbf{B}^n = \left(B^n_{x\,i,j+1/2,k+1/2},\; B^n_{y\,i+1/2,j,k+1/2},\; B^n_{z\,i+1/2,j+1/2,k}\right) $$
$$ \rho^n = \rho^n_{i,j,k}, \qquad \mathbf{J}^n = \left(J^n_{x\,i+1/2,j,k},\; J^n_{y\,i,j+1/2,k},\; J^n_{z\,i,j,k+1/2}\right) \qquad (2.9) $$
The Courant condition of this numerical scheme is ([6]):
$$ c\cdot dt < 1\left/\sqrt{\left(\frac{1}{dx}\right)^2 + \left(\frac{1}{dy}\right)^2 + \left(\frac{1}{dz}\right)^2}\right. \qquad (2.10) $$
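A minimal serial C++ sketch (an assumed example: single node, no boundary conditions, no MPI/GPU concerns, and not the jasmine implementation) of the field half of the scheme: one leapfrog update of E and B on a Yee lattice following eqs. 2.6–2.8, with each component stored at its own staggered location as in eq. 2.9. The timestep dt is assumed to satisfy the Courant condition 2.10.

```cpp
// Minimal serial sketch (assumed example) of the FDTD/Yee updates, CGS units.
#include <vector>

struct YeeGrid {
    int nx, ny, nz;
    double dx, dy, dz;
    // each component stored at its own staggered location, eq. (2.9)
    std::vector<double> Ex, Ey, Ez, Bx, By, Bz, Jx, Jy, Jz;
    int idx(int i, int j, int k) const { return (k * ny + j) * nx + i; }
};

// E^{n+1} = E^n + c dt (nabla^- x B^{n+1/2}) - 4 pi dt J^{n+1/2}
void advance_E(YeeGrid& g, double c, double dt) {
    const double pi = 3.141592653589793;
    for (int k = 1; k < g.nz - 1; ++k)
    for (int j = 1; j < g.ny - 1; ++j)
    for (int i = 1; i < g.nx - 1; ++i) {
        int n = g.idx(i, j, k);
        g.Ex[n] += c * dt * ((g.Bz[n] - g.Bz[g.idx(i, j - 1, k)]) / g.dy
                           - (g.By[n] - g.By[g.idx(i, j, k - 1)]) / g.dz)
                 - 4.0 * pi * dt * g.Jx[n];
        g.Ey[n] += c * dt * ((g.Bx[n] - g.Bx[g.idx(i, j, k - 1)]) / g.dz
                           - (g.Bz[n] - g.Bz[g.idx(i - 1, j, k)]) / g.dx)
                 - 4.0 * pi * dt * g.Jy[n];
        g.Ez[n] += c * dt * ((g.By[n] - g.By[g.idx(i - 1, j, k)]) / g.dx
                           - (g.Bx[n] - g.Bx[g.idx(i, j - 1, k)]) / g.dy)
                 - 4.0 * pi * dt * g.Jz[n];
    }
}

// B^{n+1/2} = B^{n-1/2} - c dt (nabla^+ x E^n); called with dt/2 for half steps
void advance_B(YeeGrid& g, double c, double dt) {
    for (int k = 0; k < g.nz - 1; ++k)
    for (int j = 0; j < g.ny - 1; ++j)
    for (int i = 0; i < g.nx - 1; ++i) {
        int n = g.idx(i, j, k);
        g.Bx[n] -= c * dt * ((g.Ez[g.idx(i, j + 1, k)] - g.Ez[n]) / g.dy
                           - (g.Ey[g.idx(i, j, k + 1)] - g.Ey[n]) / g.dz);
        g.By[n] -= c * dt * ((g.Ex[g.idx(i, j, k + 1)] - g.Ex[n]) / g.dz
                           - (g.Ez[g.idx(i + 1, j, k)] - g.Ez[n]) / g.dx);
        g.Bz[n] -= c * dt * ((g.Ey[g.idx(i + 1, j, k)] - g.Ey[n]) / g.dx
                           - (g.Ex[g.idx(i, j + 1, k)] - g.Ex[n]) / g.dy);
    }
}
```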
Physics requires the first two equations (the Gauss laws) in 2.8 to hold at every time step.
Analytically, if the electrical charge continuity equation $\frac{\partial\rho}{\partial t} + \nabla\cdot\mathbf{J} = 0$ holds, the EM
field evolution PDEs 2.1 automatically enforce the Gauss law (if the law was satisfied at
the initial time). In fact, applying the divergence to the Ampère–Maxwell law $\frac{\partial\mathbf{E}}{\partial t} = +c\,\nabla\times\mathbf{B} - 4\pi\mathbf{j}$,
and using the Gauss law itself, one obtains the electrical charge continuity
equation: if the latter is satisfied, so is the Gauss law.
In a simple PIC simulation, instead, the J deposition process, which uses finite-sized shape
functions, introduces and accumulates numerical errors due to cell-boundary crossings, and
charge conservation must be enforced in other ways, as shown in section 2.3.
2.2.2 Boris pusher
The quasi-particles' motion is computed using the Boris method, a second-order,
leapfrog-like method that matches perfectly with the previously discussed FDTD scheme
and can be used, in general, to integrate the equations of motion of relativistic particles in
an external electromagnetic field.
For a relativistic particle of mass m, Newton's second law reads:
$$ \mathbf{p} = \gamma m\mathbf{v}, \qquad \mathbf{F} = \frac{d\mathbf{p}}{dt} \qquad (2.11) $$
Introducing $\mathbf{u} = \gamma\mathbf{v}/c = \mathbf{p}/mc$, one has:
$$ \mathbf{F} = m\,\frac{d(\gamma\mathbf{v})}{dt} = mc\,\frac{d\mathbf{u}}{dt} \qquad (2.12) $$
In the electromagnetic case, the acting force is the Lorentz force, $\mathbf{F} = q\left(\mathbf{E} + \frac{\mathbf{v}\times\mathbf{B}}{c}\right)$, in
which the fields are the result of the shape-function interpolation on the grid, at the
particle's position x. One has:
$$ \frac{d\mathbf{u}}{dt} = \frac{q}{mc}\left(\mathbf{E} + \frac{\mathbf{v}\times\mathbf{B}}{c}\right) \qquad (2.13) $$
The so-called leapfrog method can be used to solve numerically a symplectic system of differential equations (with the appropriate initial conditions) of the kind:

 du/dt = F/m,  dx/dt = u/γ   (2.14)

Defining x and u at staggered discrete times, x^n = x(n∆t), u^{n+1/2} = u(n∆t + ∆t/2), the method integrates the equations of motion iteratively according to the replacement rules:

 (u^{n+1/2} − u^{n−1/2})/∆t = F^n/m,  (x^{n+1} − x^n)/∆t = u^{n+1/2}/γ^{n+1/2}   (2.15)
In the non-relativistic case γ = 1 and, if the force can be defined at integer times, F = F(x) → F^n = F(x^n), all the quantities are discretized and centered correctly in time.
In our case, instead, the Lorentz force requires knowing the momentum u and the relativistic factor γ at integer times n:

 (u^{n+1/2} − u^{n−1/2})/∆t = (q/mc) ( E^n + (u^n/γ^n) × B^n )   (2.16)

It is therefore necessary to center u^n and γ^n in time (obtaining their values at times n∆t). The momentum can be centered using a temporal average (which formally maintains the second-order accuracy):

 u^n = (u^{n+1/2} + u^{n−1/2})/2,   (2.17)

which leads to an implicit equation:

 (u^{n+1/2} − u^{n−1/2})/∆t = (q/mc) ( E^n + ((u^{n+1/2} + u^{n−1/2})/(2γ^n)) × B^n ),   (2.18)
whereas, for the centering of γ^n, it is convenient to use the Boris method. Defining u^+ and u^− by the relations:

 u^{n−1/2} = u^− − qE^n∆t/(2mc)   (2.19)
 u^{n+1/2} = u^+ + qE^n∆t/(2mc)   (2.20)

and replacing in the above equation, one has:

 (u^+ − u^−)/∆t = (u^+ + u^−) × qB^n/(2γ^n mc)   (2.21)
 u^+ − u^− = (u^+ + u^−) × qB^n∆t/(2γ^n mc)   (2.22)
The very last equation represents a rotation of the vector u around an axis parallel to B by an angle θ = −2 arctan(qB∆t/(2γmc)). Therefore the relation γ^n = √(1 + |u^−|²) = √(1 + |u^+|²) holds.
Introducing t = qB∆t/(2γ^n mc), one has:

 u^+ − u^− = (u^+ + u^−) × t   (2.23)

But this is a linear system in the unknowns u^+_x, u^+_y, u^+_z:

 u^+_x − u^−_x + t_y (u^+_z + u^−_z) − t_z (u^+_y + u^−_y) = 0
 u^+_y − u^−_y − t_x (u^+_z + u^−_z) + t_z (u^+_x + u^−_x) = 0
 u^+_z − u^−_z + t_x (u^+_y + u^−_y) − t_y (u^+_x + u^−_x) = 0
 (2.24)
having, as solution:
 u′ = u^− + u^− × t
 u^+ = u^− + u′ × 2t/(1 + t²)
 → u^+ = u^− + (u^− + u^− × t) × 2t/(1 + t²)
 (2.25)
The position of the particle is obtained centering it with respect to the values of u, γ:

 x^{n+1} = x^n + v^{n+1/2}∆t = x^n + (u^{n+1/2}/γ^{n+1/2}) c∆t   (2.26)
The scheme is time-reversible and introduces a second-order error in the particle trajec-
tory.
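As a concrete illustration, a minimal, non-optimized sketch of one Boris push (Eqs. 2.19–2.26) is given below, in the dimensionless-momentum convention u = p/(mc) used above; the small Vec3 helper type and the way the interpolated fields are passed in are assumptions of this sketch.

#include <cmath>

struct Vec3 { double x, y, z; };

static Vec3 add(Vec3 a, Vec3 b)     { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static Vec3 scale(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }
static Vec3 cross(Vec3 a, Vec3 b)   { return {a.y * b.z - a.z * b.y,
                                              a.z * b.x - a.x * b.z,
                                              a.x * b.y - a.y * b.x}; }
static double norm2(Vec3 a) { return a.x * a.x + a.y * a.y + a.z * a.z; }

// One Boris push: u = p/(mc) is advanced from (n-1/2)dt to (n+1/2)dt, x from n*dt to
// (n+1)*dt. E and B are the fields interpolated at the particle position x^n.
void boris_push(Vec3& u, Vec3& x, Vec3 E, Vec3 B, double q, double m, double c, double dt)
{
    const double qmdt2 = q * dt / (2.0 * m * c);

    // First half electric impulse: u^- = u^{n-1/2} + qE dt/(2mc)          (Eq. 2.19)
    Vec3 um = add(u, scale(E, qmdt2));

    // Magnetic rotation with t = qB dt/(2 gamma^n mc)                     (Eqs. 2.23-2.25)
    double gamma_n = std::sqrt(1.0 + norm2(um));
    Vec3 t  = scale(B, qmdt2 / gamma_n);
    Vec3 up = add(um, cross(add(um, cross(um, t)), scale(t, 2.0 / (1.0 + norm2(t)))));

    // Second half electric impulse: u^{n+1/2} = u^+ + qE dt/(2mc)         (Eq. 2.20)
    u = add(up, scale(E, qmdt2));

    // Position update, with v = c u / gamma                               (Eq. 2.26)
    double gamma_np = std::sqrt(1.0 + norm2(u));
    x = add(x, scale(u, c * dt / gamma_np));
}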
2.2.3 Bringing all together
For writing a PIC code using the Boris scheme, one has to match the time indices in the evolution equations derived in the previous subsections.
The chain of computations for the n-th simulation cycle, assuming that the quantities E^n, B^{n−1/2}, x^n, u^{n−1/2} are known, reads:
1. Advance B^{n−1/2} → B^n using E^n.
2. Advance u^{n−1/2} → u^{n+1/2} using E^n and B^n.
3. Advance x^n → x^{n+1} using u^{n+1/2}.
4. Compute J^{n+1/2} using u^{n+1/2} and x^{n+1/2} = (x^n + x^{n+1})/2.
5. Advance B^n → B^{n+1/2} using E^n.
6. Advance E^n → E^{n+1} using B^{n+1/2} and J^{n+1/2}.
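A compact sketch of how these six passes may be arranged in code is given below; every identifier is a placeholder (the helper bodies are left empty), and only the ordering and the half-step splitting of the B advance are the point here.

// One PIC cycle implementing steps 1-6 above; all names are illustrative placeholders.
struct FieldState    { /* E, B and J grids would live here */ };
struct ParticleState { /* particle positions x and momenta u */ };

// Empty stubs standing for the routines described in the previous subsections.
static void advance_B_half_step(FieldState&, double) {}                    // FDTD, Eq. 2.6
static void push_momenta(ParticleState&, const FieldState&, double) {}     // Boris, Eqs. 2.16-2.25
static void push_positions(ParticleState&, double) {}                      // Eq. 2.26
static void deposit_current(const ParticleState&, FieldState&, double) {}  // Sec. 2.3
static void advance_E_full_step(FieldState&, double) {}                    // FDTD, Eq. 2.6

void pic_cycle(FieldState& f, ParticleState& p, double dt)
{
    advance_B_half_step(f, dt);   // 1. B^{n-1/2} -> B^n       (uses E^n)
    push_momenta(p, f, dt);       // 2. u^{n-1/2} -> u^{n+1/2} (uses E^n, B^n)
    push_positions(p, dt);        // 3. x^n -> x^{n+1}         (uses u^{n+1/2})
    deposit_current(p, f, dt);    // 4. J^{n+1/2} from u^{n+1/2} and x^{n+1/2} = (x^n + x^{n+1})/2
    advance_B_half_step(f, dt);   // 5. B^n -> B^{n+1/2}       (uses E^n)
    advance_E_full_step(f, dt);   // 6. E^n -> E^{n+1}         (uses B^{n+1/2}, J^{n+1/2})
}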
2.3 Charge conservation
2.3.1 Spectral Poisson corrector
The Poisson equation for the electric potential,

 ∇²Φ = −4πρ = −(∇ · E),   (2.1)

can be enforced using a spectral method.
At a given time, the charge density is deposited on the ρ grid and the numerical divergence of E is computed. These operations must be done coherently with the rest of the integration scheme (taking the same spatial derivatives, staggering the grids at the correct positions, etc.).
The charge error accumulated in the Gauss law,

 4πρ_err(x) ≡ 4πρ(x) − (∇ · E)(x),   (2.2)

reads, discretized:

 4πρ_err^n_{ijk} = 4πρ^n_{ijk} − (∇^−_num · E^n)_{ijk}   (2.3)

Now, one seeks an auxiliary electrostatic potential Φ which counterbalances the error charge ρ_err:

 ∇²Φ = −4πρ_err   (2.4)

Once Φ is known, the electric field can be corrected easily, by adding:
 E_corrected = E − ∇Φ   (2.5)

In fact, one now has: ∇ · E_corrected = ∇ · E − ∇²Φ = (4πρ − 4πρ_err) + 4πρ_err = 4πρ.
The Poisson equation ∇²Φ = −4πρ_err must be solved numerically. The operator ∇² = ∂xx + ∂yy + ∂zz can be discretized in physical space, using the same finite difference derivative scheme as the one used for computing the EM field derivatives. After choosing the boundary conditions, this discretization gives rise to a banded (tridiagonal in 1D, tridiagonal of tridiagonals in 2D, and tridiagonal of tridiagonals of tridiagonals in 3D) linear system, which can be solved:
• Directly, using an algorithm running in linear time. Unfortunately, the system requires an order of magnitude of extra memory to be stored (this makes it very impractical, if not impossible, for 2D and 3D grids) and the efficient solving algorithm is not easy to parallelize.
• According to the prescribed boundary conditions, one can use a discrete spectral representation of the fields.
As the continuous Fourier transform would turn the differential equation ∇²Φ = −4πρ_err into the algebraic equation k²Φ̂(k) = 4πρ̂_err(k), the discrete Fourier transform (DFT, for periodic boundary conditions), the discrete cosine transform (DCT, for Neumann boundary conditions) and the discrete sine transform (DST, for Dirichlet boundary conditions) reduce the banded linear system to a diagonal one (much easier to solve).
This can be shown immediately in 1D using the DCT; the computation for the 2D and 3D cases is analogous.
Starting from the second-order discretized Poisson equation:

 ∂xx A = b → (A_{i+1} − 2A_i + A_{i−1})/h² = b_i   (2.6)
Expressing both A and b through the 1D inverse discrete transform, the discretized equation becomes:

 A_i = const · Σ_k C_i^k a_k,  b_i = const · Σ_k C_i^k b_k,  C_i^k = cos(iπk/N)

 → Σ_k [ a_k (C_{i+1}^k − 2C_i^k + C_{i−1}^k)/h² − b_k C_i^k ] = 0   (2.7)

Since C_{i+1}^k + C_{i−1}^k = 2cos(πk/N) C_i^k, every term in the sum is proportional to C_i^k; hence, removing the summation, we obtain a diagonal system for the expansion coefficients a_k, from which we can recover the A_i using the direct transform:

 ∀k,  a_k = h² C_i^k b_k / (C_{i+1}^k − 2C_i^k + C_{i−1}^k) = h² b_k / (2(cos(πk/N) − 1))   (2.8)
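The same diagonalization is easy to verify in code. The sketch below solves the 1D discretized Poisson equation with periodic boundary conditions via a naive O(N²) discrete Fourier transform (the DCT/DST cases described above work analogously); it is an illustration of the principle under these simplifying assumptions, not the corrector implemented in the codes discussed later.

#include <complex>
#include <vector>
#include <cmath>

// Solve (A_{i+1} - 2A_i + A_{i-1})/h^2 = b_i with periodic boundary conditions by
// diagonalizing the operator with a DFT. Naive O(N^2) transforms are used for clarity;
// an FFT would be used in practice.
std::vector<double> poisson_periodic_1d(const std::vector<double>& b, double h)
{
    const int N = (int)b.size();
    const double PI = 3.14159265358979323846;
    std::vector<std::complex<double>> bk(N), ak(N);

    // Forward DFT of the source term.
    for (int k = 0; k < N; ++k)
        for (int i = 0; i < N; ++i)
            bk[k] += b[i] * std::exp(std::complex<double>(0.0, -2.0 * PI * k * i / N));

    // The second-difference operator acting on exp(2*pi*I*k*i/N) is diagonal with
    // eigenvalue (2 cos(2*pi*k/N) - 2)/h^2, so the banded system becomes diagonal.
    for (int k = 1; k < N; ++k)
        ak[k] = bk[k] * h * h / (2.0 * std::cos(2.0 * PI * k / N) - 2.0);
    ak[0] = 0.0;  // the k = 0 (mean) mode is fixed by the boundary/gauge choice

    // Inverse DFT to recover A_i.
    std::vector<double> A(N, 0.0);
    for (int i = 0; i < N; ++i) {
        std::complex<double> s = 0.0;
        for (int k = 0; k < N; ++k)
            s += ak[k] * std::exp(std::complex<double>(0.0, 2.0 * PI * k * i / N));
        A[i] = s.real() / N;
    }
    return A;
}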
2.3.2 Esirkepov Shape functions
The Esirkepov method ([13]) for electric current deposition enforces charge conservation directly inside the deposition algorithm, and it can be applied to quasi-particles of arbitrary shape factors.
As, analytically, the charge continuity equation ∂ρ/∂t + ∇ · J = 0 guarantees the Gauss law to be compatible with the dynamics (determined by the Ampère–Maxwell equation), an analogous equation must hold (with the appropriate discretized operators) for the numerical discrete quantities.
This equation can be obtained using the following properties of the discretized operators defined in 2.7 (∆ being here the discrete Laplacian in central differences):

 ∇^± · ∇^±× = 0
 ∇^± · ∇^∓ = ∆ = (f_{i+1,j,k} − 2f_{i,j,k} + f_{i−1,j,k})/dx² + (f_{i,j+1,k} − 2f_{i,j,k} + f_{i,j−1,k})/dy² + (f_{i,j,k+1} − 2f_{i,j,k} + f_{i,j,k−1})/dz²
 (2.9)
Applying the ∇^−· divergence to the third equation in 2.8, one has:

 (∇^− · E^{n+1} − ∇^− · E^n)/dt = c ∇^− · (∇^− × B^{n+1/2}) − 4π ∇^− · J^{n+1/2}   (2.10)

or, recalling the Gauss law ∇^− · E^n = 4πρ^n:

 (ρ^{n+1} − ρ^n)/dt + ∇^− · J^{n+1/2} = 0   (2.11)
The Gauss law is satisfied at every time step only if the obtained discretized continuity equation 2.11 holds at every time step, explicitly:

 (ρ^{n+1}_{i,j,k} − ρ^n_{i,j,k})/dt + (J^{n+1/2}_{x,i+1/2,j,k} − J^{n+1/2}_{x,i−1/2,j,k})/dx + (J^{n+1/2}_{y,i,j+1/2,k} − J^{n+1/2}_{y,i,j−1/2,k})/dy + (J^{n+1/2}_{z,i,j,k+1/2} − J^{n+1/2}_{z,i,j,k−1/2})/dz = 0
 (2.12)
The charge density, in a PIC, is constructed from the form factors S (the cell-integrated shape functions) of the quasiparticles, indexed by α, with positions x^n_α:

 ρ^n_{i,j,k} = Σ_{α∈particles} q_α S_{i,j,k}(x^n_α),
 S_{i,j,k}(x_α) = S( (x_i − x_α)/dx, (y_j − y_α)/dy, (z_k − z_α)/dz ),
 Σ_{ijk∈grid} S_{i,j,k}(x_α) = 1  ∀α
Considering a single particle of charge q, one therefore has ρ^n_{i,j,k} = qS_{i,j,k}(x^n). During one timestep of its motion, it shifts by (δx, δy, δz) from the position x^n to x^{n+1}. Using the vector W, defined as

 W^x_{ijk} = −(dt/q) (J_{x,i+1,j,k} − J_{x,i,j,k})/dx
 W^y_{ijk} = −(dt/q) (J_{y,i,j+1,k} − J_{y,i,j,k})/dy
 W^z_{ijk} = −(dt/q) (J_{z,i,j,k+1} − J_{z,i,j,k})/dz
 (2.13)

in the discretized continuity equation 2.12, one is led to the equation:

 S_{ijk}(x^{n+1}) − S_{ijk}(x^n) = W^x_{ijk} + W^y_{ijk} + W^z_{ijk}
 → S_{ijk}(x + δx, y + δy, z + δz) − S_{ijk}(x, y, z) = W^x_{ijk} + W^y_{ijk} + W^z_{ijk}
As shown in ([13] ), if the particle shifts by (δx, δy, δz) along a straight line, the W
components are linear combinations of the eight shape factors (ı = (i, j, k)):
 S_ı(x, y, z), S_ı(x+δx, y, z), S_ı(x, y+δy, z), S_ı(x, y, z+δz),
 S_ı(x, y+δy, z+δz), S_ı(x+δx, y, z+δz), S_ı(x+δx, y+δy, z), S_ı(x+δx, y+δy, z+δz),

and they are given by:

 W^x_ı = ( 1/3, 1/3, −1/6, −1/6, −1/3, 1/6, 1/6, 1/3 ) · S_ijk
 W^y_ı = ( 1/3, −1/6, 1/3, −1/6, 1/6, −1/3, 1/6, 1/3 ) · S_ijk
 W^z_ı = ( 1/3, −1/6, −1/6, 1/3, 1/6, 1/6, −1/3, 1/3 ) · S_ijk

 S_ijk = ( S_ı(x, y, z), S_ı(x+δx, y, z), S_ı(x, y+δy, z), S_ı(x, y, z+δz), S_ı(x, y+δy, z+δz), S_ı(x+δx, y, z+δz), S_ı(x+δx, y+δy, z), S_ı(x+δx, y+δy, z+δz) )
 (2.14)
Once W is computed for a particle of arbitrary shape factors S, it is possible to iterate
(in space) the equations 2.13 to calculate the correct contribution to the current J.
The boundary conditions for the integration (the iteration starting and ending points),
are determined by the fact that, outside the particle’s finite-size shape function, the
contribution of the particle to the current is zero.
2.4 Envelope model
The laser envelope model describes the laser pulse propagation in a fully ionized plasma.
It leads to a slow-varying laser envelope equation which is obtained representing the fields
as modulations of fast and slow parts.
The model assumes the ions to be stationary and the collision time to be much greater
than the other timescales.
Both hypotheses are valid for phenomena involving a short laser pulse (≤ 1 ps) propagating in an underdense plasma (ω_p²/ω² ≪ 1): the electrons' thermal speed is small compared to their quiver speed in the laser field, and their collision time is much greater than the laser pulse length.
2.4.1 Laser envelope equation derivation
The wave equation and Poisson equation for the normalized, being e and m are the
charge and the mass of the electron, potentials in the Coloumb gauge:
E = −∇Φ− ∂A∂ct
B = ∇×A
0 = ∇ ·A
→
φ = eΦ
mc2
a = eAmc2
, (2.1)
read:
 (∇² − (1/c²) ∂²/∂t²) a = k_p² (n/n_0) (u/γ) + ∂(∇φ)/∂(ct) = −k_p² J/(n_0 e c) + ∂(∇φ)/∂(ct)
 ∇²φ = k_p² (n − n_i)/n_0
 (2.2)

in which u = γv/c = p/(mc) is the fluid average of the normalized electron momentum, γ = (1 + u²)^{1/2}, k_p = (4πn_0 e²/(mc²))^{1/2}, and J = −e n c u/γ.
Introducing the comoving coordinates ξ = z − ct, τ = t, and neglecting the term ∂(∇φ)/∂(ct) (the fast part of the electrostatic potential, φ_fast ∼ exp(ikξ), is typically small compared to the fast part of the plasma current), the wave equation becomes:

 (∇²_⊥ + (2/c) ∂²/∂ξ∂τ − (1/c²) ∂²/∂τ²) a ≃ k_p² (n/n_0) (u/γ)   (2.3)
Further possible simplifications are:
• For forward-going light waves, the term (1/c²)∂²a/∂τ² can be neglected. On the other hand, this term is important for backward-propagating radiation (i.e. Raman backscattering); in particular, it is necessary when running simulations in a boosted Lorentz frame ([4]).
• The leading-order transverse electron motion is the laser quiver motion: on the right hand side we can approximate u = a.
• Assuming a linearly polarized laser field of frequency ω = ck, propagating in the positive z direction, with a transverse component expressible as the modulation of a fast oscillating factor by a slowly-varying envelope a_slow (|∂_ξ a_slow| ≪ |k a_slow|): a = a_slow(r, ξ, t) exp(ikξ)/2 + c.c.
Applying these simplifications, one has:

 (∇²_⊥ + (2ik/c) ∂/∂τ + (2/c) ∂²/∂ξ∂τ) a_slow ≃ k_p² (n/n_0) a_slow/γ   (2.4)
or,

 (i k_0/k_p + ∂_ξ) ∂_τ a = (1/2) (n/n_0) (a/γ) − (1/2) ∆_⊥ a   (2.5)
2.4.2 Plasma motion and wakefield equations in cylindrical comoving
coordinates
From now on, it is assumed cylindrical symmetry around the axis z (direction of the laser
propagation).
Let us define these non-dimensional, plasma-wavelength-normalized, comoving cylindrical coordinates and variables:

 ξ = k_p (z − ct),  k_p = ω_p/c
 r → k_p r
 τ = ω_p t = k_p c t
 δ = n/n_0  (normalized density)
 (2.6)
The slowly-varying laser envelope a, defined from the normalized vector potential a_⊥ = eA_⊥/(mc²) = (a/2) e^{i(k_0/k_p)ξ} + c.c., is evolved according to the equation:

 (i k_0/k_p + ∂_ξ) ∂_τ a = (1/2) (δ/γ_fluid) a − (1/2) ∆_⊥ a,   (2.7)

in which 2π/k_0 is the laser wavelength, γ_fluid is the relativistic factor associated with the local average of the plasma velocity, and δ is the normalized density.
The laser envelope evolves on timescales much longer than a single laser oscillation. The force acting on a plasma electron can therefore be averaged in time in this scheme, using
the ponderomotive approximation for the force. In this model, the plasma does generate an electromagnetic wakefield, whose spatial and temporal scales are those of the slowly-varying envelope ([4], [12]).
The equations of motion of an electron of momentum p, in the envelope/ponderomotive approximation, read:
 dp/dt = q ( E_wake + v×B_wake/c ) − (mc²/2γ) ∇ | qA/(mc²) |²
 γ = √( 1 + p²/(m²c²) + | qA/(mc²) |² )
 (2.8)
In our normalized cylindrical coordinate system we have:

 γ = √( 1 + |a|²/2 + u_z² + u_r² )
 dξ/dτ = u_z/γ − 1
 dr/dτ = u_r/γ
 du_z/dτ = −∂γ/∂ξ − E_z − (u_r/γ) B_φ
 du_r/dτ = −∂γ/∂r − E_r − (u_z/γ) B_φ
 (2.9)
The relativistic electromagnetic wakefield is described by the electromagnetic fields E_wake, B_wake (the suffix wake is omitted hereafter for brevity):

 ∂E_wake/∂t = c∇×B_wake − 4πJ
 ∂B_wake/∂t = −c∇×E_wake

which in cylindrical coordinates become:

 ∂E_z/∂(ct) = (1/r) ∂(rB_φ)/∂r − (1/r) ∂B_r/∂φ − (4π/c) J_z
 ∂E_r/∂(ct) = (1/r) ∂B_z/∂φ − ∂B_φ/∂z − (4π/c) J_r
 ∂E_φ/∂(ct) = ∂B_r/∂z − ∂B_z/∂r − (4π/c) J_φ
 ⋮
 (2.10)
To maintain the radial symmetry of the plasma distribution (the laser envelope acts only through the radially symmetric ponderomotive force), the wakefield electromagnetic fields are constrained to satisfy E_φ = 0, B_z = 0, B_r = 0 at every point of the domain. This simplifies our system, which becomes:
 ∂E_z/∂(ct) = (1/r) ∂(rB_φ)/∂r − (4π/c) J_z
 ∂E_r/∂(ct) = −∂B_φ/∂z − (4π/c) J_r
 ∂E_φ/∂(ct) ≡ 0
 ∂B_z/∂(ct) = ∂E_φ/∂r ≡ 0
 ∂B_r/∂(ct) = −∂E_φ/∂z ≡ 0
 ∂B_φ/∂(ct) = −∂E_r/∂z + ∂E_z/∂r

which, in the comoving coordinates ξ = k_p(z − ct), become:

 ∂E_z/∂τ = ∂E_z/∂ξ + (1/r) ∂(rB_φ)/∂r − j_z
 ∂E_r/∂τ = ∂(E_r − B_φ)/∂ξ − j_r
 ∂B_φ/∂τ = −∂(E_r − B_φ)/∂ξ + ∂E_z/∂r
 (2.11)
3 INF&RNO
3.1 Motivation
INF&RNO ([5]) is a simulation framework under development at LBNL by Carlo Benedetti. Its purpose is to simulate laser-plasma electron accelerators, in which a short and intense laser pulse interacts with an underdense plasma over distances that can reach lengths of 1 meter.
Such simulations, in full 3D geometry, are technically very challenging, or even impossible, with the hardware available now.
In fact, as seen in subsection 2.1.2, the shortest scale to be resolved is the laser wavelength, which is several orders of magnitude smaller than the plasma skin depth of an underdense plasma (the scale to be resolved to describe the wakefield), which in turn can be several orders of magnitude smaller than the length of the simulation box:

 λ_laser ≪ λ_wakefield ≪ L   (3.1)
To store the state of the whole one-meter propagation distance, therefore, ∼ 20·L/λ_laser points are required in the longitudinal direction. For L = 1 m and λ_laser = 632 nm this leads to an unrealistic 3D grid of ∼ 10^27 total points. To overcome this memory requirement, which makes the simulation not just long but outright impossible at present (which is worse), one has to run it on a moving window: a window in the simulation box which moves following the laser pulse (the most interesting things happen close to it), discarding the unimportant particles that are left far behind the pulse.
Even after that, to follow the propagation of the pulse through a meter of plasma at a resolution of λ, one has to compute, using an explicit integration method, n ∼ L/λ timesteps, of the order of n ∼ 10^8. On a 2011 CPU, a 3D particle-in-cell code takes t ∼ 1 µs to compute a single particle step, and it requires N_p ∼ 10^10 particles to decently sample the moving-window phase space. Therefore, the CPU time needed to run a 1-meter-long simulation is:
 T = n · t · N_p ∼ 3 · 10^12 s ≈ 10^5 years   (3.2)
Even with an efficient parallelization on 10^5 cores, the time required for the simulation is still prohibitive. A millimeter-scale simulation, with one tenth of the particles, run on 100 cores, takes about a month.
Nevertheless, 3D simulations are crucial for understanding the phenomena and for designing LPAs working at these physical scales. Fortunately, for LPA phenomenology, a series of possible approximations exist:
• Boosted Lorentz frame (BLF): if one can neglect the effect of backward propagating waves, the simulation can be run in a Lorentz-boosted frame, moving along the laser propagation direction at relativistic velocity. Here, the unbalance between the maximum and minimum physical scales involved in a simulation can be reduced, leading to a shortening of the simulation length of order γ². In INF&RNO, γ can be taken to be ∼ 10 → γ² ∼ 10². Again, backward propagating waves, such as Raman backscattering, are not resolved by the resolution used in the BLF.
• Reduced models: using a cylindrical symmetry assumption instead of the full Cartesian 3D geometry. A speedup of ∼ 10² can be achieved by this.
• Envelope model: using the envelope and ponderomotive approximations, one does not have, in principle, to resolve all the scales down to λ_laser: the smallest laser-related length becomes L_envelope ∼ λ_sd. This could boost the simulation speed by a factor of (L_envelope/λ_laser)^4, which can be even ∼ 10^4. The problem is that, if the laser becomes strongly depleted, structures smaller than L_envelope arise, and they need to be resolved. Nevertheless, even deep in depletion, keeping the resolution high enough, a speedup of ∼ 10 can still be obtained.
INF&RNO can run the simulations using the fluid or the PIC model (or both, for staged simulations). For the PIC, for now, the last two approximations are available, while for the fluid all of them are. So the CPU years required for INF&RNO simulations are, roughly, 10² y for the PIC and 10⁻² y for the fluid.
The parallelized version of INF&RNO is therefore capable of running these simulations in practically feasible times on modern supercomputers. The peculiarities of the code required finding custom solutions for its parallelization.
3.2 Numerical Scheme
3.2.1 Plasma modelization
The laser is described by the envelope model (see sect. 2.4.1), the wakefield by the usual
Maxwell equations (see sect. 2.4.2), and the plasma can be modeled using a fluid model
or a PIC one.
3.2.1.1 Fluid plasma model
Starting from the relativistic cold-fluid plasma equations (u = γv/c):

 ∂n/∂(ct) + ∇ · (n u/γ) = 0
 ( ∂/∂(ct) + (u/γ) · ∇ ) u = F
 (3.1)
Changing variables to the comoving system (∂/∂t → ∂/∂τ − ∂/∂ξ), and recalling 2.8, one obtains the fluid equations used in INF&RNO-fluid:

 γ_fluid = √( 1 + |a|²/2 + u_z² + u_r² )
 ∂δ/∂τ = ∂δ/∂ξ − ∇ · ( (u/γ_fluid) δ )
 ∂(δu_j)/∂τ = ∂(δu_j)/∂ξ − ∇ · ( β⃗ δ u_j ) + δ [ −( E + (u/γ_fluid)×B ) − (1/(2γ_fluid)) ∇( |a|²/2 ) ],  j = z, r
 (3.2)
3.2.1.2 PIC plasma model
The quasiparticle dynamics follows from the electrons’ equations of motion 2.8.
3.2.1.3 Numerical choices
For the Maxwell wakefield and the quasiparticle motion an explicit Runge-Kutta integrator, of order 2 or 4, is used. The laser envelope is advanced using an implicit Crank–Nicolson method, which offers better stability and accuracy than explicit ones.
The fields are discretized on un-staggered, regular 2D (z, r) grids. The longitudinal ξ derivatives are computed with an upwind scheme, ∂_{ξ;∆ξ} f_{i,j} = (−3f_{i,j} + 4f_{i+1,j} − f_{i+2,j})/(2∆ξ), while the transverse ones with a standard centered scheme. The motivation for using upwind derivatives is to damp some backward-propagating instabilities.
The shape functions used for the macro-particles are the standard linear and quadratic splines.
3.3 Parallelization
As we have seen in the previous sections, an INF&RNO simulation can take several years of CPU time to be completed. Parallelizing it therefore gave us the possibility to run simulations that were otherwise unapproachable for us before.
In general, nowadays, parallelizing a code (splitting the computation among different processing units) is important because processor speeds (frequencies) have ceased to increase, while the number of processing cores per CPU die and the number of computing nodes interconnected in supercomputers still do increase, scaling very well as the years pass by.
The key quality that a parallel code should have is scalability, defined as the capability to halve the computation time by doubling the computational resources (cores). This is so important because the number of available processors in HPC clusters, as just said, grows year by year.
The first requirement for scalability is imposed by Amdahl's Law: if a code contains an (even small) serial part, the speedup given by parallelization reaches a maximum at a certain number of computing processors; for a parallelizable fraction p, the speedup on N processors is S(N) = 1/((1 − p) + p/N) ≤ 1/(1 − p). As can be seen in the following figure, even if the serial part weighs only 5% of the total serial runtime, the maximum speedup that one can reach by parallelizing is only 20x, even using thousands of processors.
This can be easily understood considering that:
• the parallel runtime is the sum of the parallelizable part and the serial part,
• the parallel part's time scales as 1/N_procs, while the serial one remains constant,
• even taking the limit N_procs → ∞ one still gets t_serial and not 0:

 lim_{N_procs→∞} t = lim_{N_procs→∞} ( t_parallelizable/N_procs + t_serial ) = t_serial   (3.1)
In INF&RNO the laser envelope equation is integrated using an implicit Crank–Nicolson method, whose solution algorithm, as usual for implicit methods, is recursive longitudinally and requires solving tridiagonal linear systems transversally (whose parallelization is very inefficient). So the code has parallelization issues along both the transverse and the longitudinal (the most important one to parallelize along) directions.
This recursiveness is in the laser evolution function F (very complicated algebraically), which computes a^{n+1}_{i−1,j} from right to left, using, for each i, the values on the right, which must already be known:

 a^{n+1}_{i−1,j} = F( a^{n+1}_{i,j} − a^n_{i,j}, a^n_{i,j} )   (3.2)
This right-value dependency makes the laser integration part impossible to parallelize longitudinally (transversally, the tridiagonals are even harder to parallelize), as it is impossible to break the dependency chain and split the computation among processing units. Leaving it un-parallelized would have given us a very poor scalability, and the maximum reachable speedup would have been bound to 10x.
Motivated by the linearity of F in 3.2 and by the fact that the envelope varies slowly and smoothly, I was led to try an iterative algorithm to solve equation 3.2.
I guessed that the difference between a^{n+1}_{ij} − a^n_{ij} and its previous value a^n_{ij} − a^{n−1}_{ij} was not too large. Fortunately, this guess turned out to be true: this difference is the finite difference approximation of the second derivative in time of a, (∂²_t[a_{ij}])^n = a^{n+1}_{ij} − 2a^n_{ij} + a^{n−1}_{ij}, which is a small factor in our conditions, non-negligible, as seen in section 2.4.1, only for backward-propagating waves.

Figure 3.3.1: Validation of the algorithm: line-outs before laser depletion, deep in depletion (density), deep in depletion (laser)
Our algorithm, after choosing a not-too-bad initial value for a^{n+1}_{ij} − a^n_{ij} ≈ a^n_{ij} − a^{n−1}_{ij}, finds an initial guess g_0 for a^{n+1}. After that, it iterates F using successive guesses in place of the unknown a^{n+1}_{ij}, to which the succession of guesses converges.
Summarizing, P being the chosen number of iterations:

 g^{n+1}_{i−1,j;0} = F( a^n_{ij} − a^{n−1}_{ij}, a^n_{ij} )   (initial guess)
 g^{n+1}_{i−1,j;k} = F( g^{n+1}_{i,j;k−1} − a^n_{ij}, a^n_{ij} )   (k-th iteration)
 a^{n+1}_{ij} = g^{n+1}_{ij;P}   (final)
 (3.3)
Even with a low resolution and the laser deep into depletion, a low number of iterations, ∼ 10, manages to find a very accurate solution, as seen in the lineouts of figure 3.3.1.
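In code, the structure of this fixed-point iteration might look as follows; the single Crank–Nicolson right-to-left sweep F is represented here only by a placeholder callable, since its actual algebra is not reproduced in this section, and the container layout is an assumption of the sketch.

#include <vector>
#include <complex>
#include <functional>

using Field2D = std::vector<std::vector<std::complex<double>>>;  // a_{i,j}, illustrative layout

// One iterative solve of Eqs. 3.2/3.3: given a^n and a^{n-1}, produce a^{n+1}.
// 'F' stands for the (unspecified here) right-to-left Crank-Nicolson sweep: given a
// guess for the increment (a^{n+1} - a^n) and a^n, it returns an updated a^{n+1}.
Field2D advance_envelope(const Field2D& a_n, const Field2D& a_nm1,
                         const std::function<Field2D(const Field2D&, const Field2D&)>& F,
                         int P /* number of iterations */)
{
    // Initial guess: assume the increment changes little between successive time steps,
    // a^{n+1} - a^n  ~  a^n - a^{n-1}.
    Field2D increment = a_n;
    for (size_t i = 0; i < a_n.size(); ++i)
        for (size_t j = 0; j < a_n[i].size(); ++j)
            increment[i][j] = a_n[i][j] - a_nm1[i][j];

    Field2D guess = F(increment, a_n);               // g_0

    // Successive substitutions: feed the last guess back into F (Eq. 3.3).
    for (int k = 1; k <= P; ++k) {
        for (size_t i = 0; i < a_n.size(); ++i)
            for (size_t j = 0; j < a_n[i].size(); ++j)
                increment[i][j] = guess[i][j] - a_n[i][j];
        guess = F(increment, a_n);                   // g_k
    }
    return guess;                                    // a^{n+1} := g_P
}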
After having made the code worth parallelizing massively (the 10x speedup bound was not too attractive), I had to choose which parallelization approach to use.
I started parallelizing it using OpenMP shared-memory parallelization, keeping all the data on a multicore node and distributing the computation on the shared data among the local cores. After that, I parallelized it using domain splitting with message passing, i.e. splitting the domain and distributing the subdomains among independent networked nodes, each one exchanging with its neighbors the data (resident in or crossing the boundaries) needed for the computation.
            | Shared memory                      | Message passing
 Concept    | Many processors sharing the work   | Split the domain
 Advantages | Fast to implement                  | Scales to huge numbers of cores
 Drawbacks  | Lower efficiency and scalability   | Communication overhead
 Standard   | OpenMP                             | MPI
The message passing parallelization is usually more efficient and of course more scalable than the shared memory one. In fact, the small number of cores per local node sets a bound on the scalability of a shared-memory-parallelized code.
A parallelization limit does exist also for message passing codes: at a certain point, splitting the domain again and again, one arrives at subdomains which are too small. At this point it becomes useful to combine the two parallelization approaches, assigning larger subdomains to multicore nodes.
3.3.1 Shared memory (OpenMP)
I parallelized the fluid and the PIC code using OpenMP, which is a shared-memory, concurrent-threads, task parallelization API.
Basically the work consisted in parallelizing the particle loop and all the loops over the longitudinal grid point index using OpenMP directives, while watching for false-sharing inefficiencies.
For the PIC I had to solve a memory read-after-write collision in the current deposition algorithm, arising when two or more different cores, processing two or more different particles concurrently, have to sum their respective particle's current into the same grid cell:
• They both read the initial cell value at the same time.
• They sum their particle's current to the value they read, which is the same stale value for both processors, since neither has had time to write back to the shared memory.
• They write their sums to the same shared memory location.
• At the end, instead of having the sum of the particle currents, one has only the current of the last particle written.
I have solved this issue in two ways:
• Using atomic operations (with the “#pragma omp atomic” directive), “locking” the memory so that a thread's memory sum operation is not interrupted by another.
• Using a different sum buffer per core (faster, but memory wasting).
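As an illustration of the first solution, a minimal sketch of an OpenMP-parallelized deposition loop protected by atomic updates is given below; the 1D grid and the linear weighting are simplifying assumptions of this sketch, not the code's actual deposition scheme.

#include <vector>

struct Particle { double x; double w; };   // position and quantity to deposit (illustrative)

// Deposit a particle quantity onto a 1D grid with linear (CIC) weights.
// The "#pragma omp atomic" makes each read-modify-write on rho[] indivisible,
// so concurrent threads cannot lose each other's contributions.
void deposit(const std::vector<Particle>& particles, std::vector<double>& rho, double dx)
{
    #pragma omp parallel for
    for (long p = 0; p < (long)particles.size(); ++p) {
        double xi = particles[p].x / dx;
        int    i  = (int)xi;               // left grid point
        double f  = xi - i;                // fractional distance from it

        #pragma omp atomic
        rho[i]     += particles[p].w * (1.0 - f);
        #pragma omp atomic
        rho[i + 1] += particles[p].w * f;
    }
}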
The OpenMP code scaling graph, up to the twelve cores of our machine, is:
3.3.2 Message-Passing (MPI)
Because of the spatial locality of the physical equations describing the system, it is possible to split the computational domain into smaller subdomains, each one evolving almost independently and needing only boundary information, which is gathered from the neighboring domains. Each physical subdomain is computed by a different computational node.
For finite-difference fluid codes, each computing node needs to ask its neighbors for the values required to compute the numerical derivatives on the boundary of its subdomain.
In particle-in-cell codes, additionally, the particles are free to move inside the entire domain, and they interact with the grid cells around them. It is necessary to keep the particles “belonging” to a computational node interacting only with the node-local subdomain of fields: the particles are moved among nodes when they cross their subdomain boundaries.
The width of the particle shape function requires taking an extra ghost grid point on the left. The particle velocity limit, c, allows checking boundary crossings only once every few time steps.
3.3.2.1 1D domain decomposition
I started by decomposing the domain along the longitudinal direction, which is usually the one with the most grid points, Nz. The grids (of dimensions Nz × Nr) and the particle domains are sliced in the z-direction, and each one of the Np;z slices is assigned to a different node.
In our longitudinal upwind derivative scheme, ∂_{ξ;∆ξ} f_{i,j} = (−3f_{i,j} + 4f_{i+1,j} − f_{i+2,j})/(2∆ξ), the values required for computing f′_{i,j} are the two points on the right, f_{i+1,j} and f_{i+2,j}. Before every derivative computation, one has to be sure that the ghost region contains the right data, which must be copied from the right neighbor node, as illustrated in the following figure:
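A minimal sketch of such a ghost-cell exchange with MPI_Sendrecv is given below; the two-column ghost layer and the flattened, column-by-column buffer layout are assumptions of this sketch, not the code's actual data structures. Each rank sends its two leftmost owned columns to its left neighbor (they are that neighbor's ghost data) and receives its own two right ghost columns from its right neighbor.

#include <mpi.h>
#include <vector>

// Exchange the ghost region needed by the upwind derivative (two z-columns on the right).
// 'field' holds n_local owned columns of nr points each, followed by 2 ghost columns.
void exchange_right_ghosts(std::vector<double>& field, int n_local, int nr, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;   // no periodicity assumed
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    // Send my two leftmost owned columns to the left neighbour and, in the same call,
    // receive my two right ghost columns from the right neighbour.
    MPI_Sendrecv(field.data(),                        2 * nr, MPI_DOUBLE, left,  0,
                 field.data() + (size_t)n_local * nr, 2 * nr, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);
}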
After validating the parallelized codes (see figure 3.3.2), I measured their parallel performance:
• The fluid code proved to scale very well, as can be seen from its strong scaling graph, describing how the parallel speedup scales with the number of processors used, given a fixed problem size (Nz = 4224). In particular, for this problem size, the maximum speedup reached (still at high efficiency) is 1000x (1 year → a few hours).
• The weak scaling graph for the PIC code is similar:
Figure 3.3.2: Comparison between the result of a parallelized PIC simulation (40 minutes on 120 cores) and the scalar one (38 hours)
3.3.2.2 2D domain decomposition
The amount of information that needs to be exchanged scales with the size of the boundary. The long time needed to transfer information (during which the computation is usually idle) makes it important to minimize, for each subdomain, the ratio of boundary grid points (or crossing particles) to total ones.
In this sense, 1D decomposition is not optimal. In fact, the ratio scales linearly with the number of slices Np;z and eventually reaches 1, when Np;z = Nz.
Furthermore, Nz also represents the limit to parallelization and scalability, since it is impossible to take more than Nz slices while keeping the number of longitudinal points per slice a positive integer.
This ratio is optimal taking the subdomains to be squares, and the parallelization limit can be overcome by decomposing the domain in the radial direction too. So, 2D decomposition allows, in theory, scaling to Nz × Nr processors, instead of only Nz.
Unfortunately, in INF&RNO, the implicit algorithm used to advance the laser envelope equation in time has to solve Nz tridiagonal systems, one for each longitudinal point, each involving all the Nr grid points at that longitudinal coordinate. The most efficient algorithm (running in linear time) to solve such systems is defined recursively, and it is very hard to parallelize efficiently, as it computes numerical successions propagating radially.
In a 2D decomposition, a naive approach to this issue would be to split the succession into chunks (corresponding to the subdomains) and, for each system, let each node wait for the last value of the succession computed by the previous node in the chain. Summarizing, for each of the Nz longitudinal grid points, there is a chain of Np;r wait-compute-send steps. Nevertheless, parallelizing this part is still preferable, given what is stated by Amdahl's Law 3.1.
The best compromise found for solving the tridiagonals concurrently, by different nodes, is using a pipelined solving algorithm: after solving its part of the first system, the first node begins the computation of the second system, and so on. In any case, to achieve full efficiency it is still good to set Nz ≥ Np;z · Np;r = Np.
In practice, the scaling graph for the PIC code using the 2D decomposition shows that it is possible to run on more processors, still at a reasonable efficiency:
3.3.3 GPU parallelization
I have rewritten both the fluid and the PIC code using the jasmine (chapter 4) framework components, enabling them to run on graphics processing units (GPUs, see section 4.2). The details of the PIC code implementation and the benchmarking results are in chapter 4.
Parallelizing the fluid code was pretty straightforward, since all memory accesses in the fluid code are predictable and fixed. The final speedup for the fluid GPU parallelization (measured using an NVIDIA Fermi GPU on the “Dirac” machine at NERSC) is that a single GPU is 70 times faster than a single CPU core.
3.4 3D Visualization
Handling the data generated can be a significant challenge that requires special visualization techniques, as it is necessary to render (physically 3D) density fields, such as the charge density field or the electromagnetic fields' components.
For rendering these 3D density volumes a lot of physically based models exist. Most of them are based on raytracing and many of them are very complex, as they include multi-scattering effects and external illumination. However, the algorithm that produces the clearest-looking renders of our data is the simplest one, the emission model.
It assumes that each point of the 3D field emits light according to the value of the “density” there, and it neglects all scattering, absorption and external lighting effects. It simply consists in integrating the emitted light along marching rays starting from the eye and traversing the field volume.
Practically, one discretizes the eye-to-field integral by sampling the density ρ along each ray and transforming the values into colors using a transfer function T (linear or nonlinear):

 I = Σ_{s = eye}^{volume edge} T[ρ(s)]   (3.1)
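A minimal sketch of this eye-to-volume integration for a single ray is shown below; the density sampler and the transfer function are placeholder callables, and the uniform step ds is an assumption of the sketch.

#include <functional>

struct Vec3 { double x, y, z; };

// Emission-only volume rendering of one ray: march from the eye through the volume,
// accumulating the transfer-function-mapped density (Eq. 3.1). 'rho' samples the field
// and 'T' is the transfer function; both are placeholders here.
double integrate_ray(Vec3 eye, Vec3 dir /* unit vector */, double ray_length, double ds,
                     const std::function<double(Vec3)>& rho,
                     const std::function<double(double)>& T)
{
    double intensity = 0.0;
    for (double s = 0.0; s < ray_length; s += ds) {
        Vec3 p{eye.x + dir.x * s, eye.y + dir.y * s, eye.z + dir.z * s};
        intensity += T(rho(p)) * ds;   // emission model: no absorption or scattering
    }
    return intensity;
}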
For such renderings, the standard available visualization tools (Visit, Paraview, ...) are too memory intensive, as they sample our cylindrically symmetric grids on a full 3D representation. To solve this problem I wrote a visualization program, “jasviz”, from scratch, that samples the fields along rays directly in cylindrical geometry.
The program has a graphical user interface (GUI), generated semi-automatically using a C++ class reflection system. Once the type-specific meta-properties of the objects (which are organized hierarchically as nodes in a scene tree) are defined in the C++ code, a rewrite engine generates the GUI and serialization code (loading/saving XML files containing scene data and parameters). This approach is very programmer friendly, as it is fast and the bugs are very manageable, but the generated GUI may not be too user friendly at first glance:
The program main features are:
• Ability to record movies and still images:
– each scene node is animable separately,
– the user can define all the properties of a node at keyframes, automatically
interpolated linearly.
• Everything is computed on the GPU.
• Absorption/emission/MIP volume rendering models.
• Transfer functions:
– customizable linear gradients
– non-linear, user-customizable Gaussian activation filters
• Point particles rendering (for PIC data):
– clip by energies, color by energy
• Automatic conversion tools INF&RNO data → hdf5 format.
• It renders both full 3D Cartesian grids and 2D cylindrically symmetric ones.
• Volume clipping.
• Text and subtitles rendering.
• Cross platform.
A few sample snapshots from a video created with jasviz are:
4 Jasmine: a flexible, hybrid
(CPU+GPU), PIC framework
Particle-in-cell codes can be used to simulate a wide range of phenomena and, while some concepts of the algorithm remain the same, other parts, depending also on the physical assumptions made and the corresponding numerical solutions (we have seen many examples in the previous sections), can differ considerably among different PIC codes.
Furthermore, technology today offers different supercomputing architectures that may require heavy code modifications in order to map a code onto them. For example, parallelizing a PIC code to run on graphics processing units (GPUs, see section 4.2) requires some attention and a completely different current deposition algorithm (as shown in subsection 4.2.3).
Given these two degrees of freedom, numerical algorithms and architecture optimization, a “universal” PIC code is very unlikely to be realized and, most importantly, to be future-proof.
It is feasible, however, to write (and our attempt has this objective) a framework for particle-in-cell codes which contains as components the core concepts of the algorithm, such as interpolation and deposition algorithms, written in a generic, implementation-agnostic form. This abstract form, allowed by the meta-programming features of the C++ language, may make the code look a bit more complex and less straightforward, but it makes it reusable.
Jasmine is the name we have given to this framework, which includes some example “standard” complete PIC codes: the 1D, 2D and 3D classical second-order PIC (subsect. 2.2.3) and the cylindrically symmetric INF&RNO numerical scheme (subsect. 3.2).
The code main features are:
• Multi-dimensionality, 1D,2D,3D.
• Modularity (sect. 4.1).
• OpenMP CPU intra-node parallelization.
• Fast GPU implementation, providing an average speedup of ~45x (sect 4.2).
• Inter-node hybrid MPI, multi-GPU, parallelization (sect 4.3).
• HDF5 output, diagnostics and 2D/3D visualization of the results.
4.1 Structure of the framework
The most general layer of abstraction, available in jasmine, for a parallel PIC code
consists in:
• the particles interact, via interpolation and deposition, with the cells of a field grid.
• the particles move in a computational domain which can be split into parallel
subdomains.
– The grid points on the boundaries may need to be exchanged.
– The particles crossing a subdomain boundary need to be exchanged to the
corresponding neighbor node.
The functions in this “geometrical” layer should implement these functionalities while remaining generic with respect to particle and field structures and datatypes. On the other hand, the layer is specialized for different geometries (spatial dimensions: 1D, 2D, 3D; Cartesian or cylindrical geometry) and the algorithms are written, with as few differences as possible, for different architectures, CPUs or GPUs.
Next, the equations for the field evolution must be defined. If they can be expressed as linear PDEs, or better as Hamiltonian ones, it is a good thing to give the framework this information: it can automatically generate the code for integrating them in time with various explicit methods, such as Runge-Kutta 2, 4 or leapfrog. It is only necessary to define the field properties (such as dimensionality, number of components and staggering) and to provide the expressions for their temporal derivatives. The spatial derivatives used in the PDEs can also be left as a parameter in this layer.
The particle datatype and evolution equations are also chosen as simulation template parameters. Particles' and fields' equations are merged and mounted on one of the available integration scheme skeletons, such as Runge-Kutta or the classical second-order (2.2.3) scheme.
The GPU (CUDA)/CPU (OpenMP) parallelization abstraction layer is partially provided
by the boson library, described in the second part of section 4.2.1.3.
4.2 GPU Parallelization
Graphics processing units, or GPUs, are a new generation of processors that provide parallel codes with orders-of-magnitude speedups (compared to their scalar implementations for normal CPUs), implementing a new kind of efficient, on-chip parallelism.
Their architectural features and their programming model are explained in subsection 4.2.1, while the approach we have followed in implementing jasmine for GPUs is explained in subsection 4.2.2.
4.2.1 GPU architecture and CUDA programming model
4.2.1.1 Throughput
Modern scalar processors impose limits on (or even prohibit) two critical aspects of a computing device architecture: parallel execution and hierarchical memory organization. These processors are designed to give the illusion that the code execution is sequential and that global memory accesses are truly random, uniform and linear.
The evolution of these sequential processors seems to have come to an end, as their performance now grows very slowly with time.
Meanwhile, new architectures have been developed which, instead of denying or hiding parallelism and the explicit memory hierarchy, embrace them, enabling developers to know and control these features, in order to gain efficiency and performance with respect to scalar processors.
These architectures are named throughput processors, streaming processors or GPUs, and they first became commonly used in a field in which performance is indispensable and parallelism is explicit: real time computer graphics. Because of the commercial interest in these applications, a lot of research effort has been made to improve these architectures, which soon started to surpass normal CPUs in raw computing power.
The first programming model available for these processors was SIMD (single instruction, multiple data), used in most cases to write real time shading codes for rasterization computer graphics. Despite the limitations of this paradigm, these shaders have been used successfully to accelerate general algorithms not related to computer graphics, such as linear algebra or physical simulations.
In 2007, NVIDIA launched the CUDA architecture (Compute Unified Device Architecture): hundreds of computing cores on a single chip, capable of running thousands of threads (with hardware, zero-latency switching), and an efficient, clearly hierarchically organized memory. CUDA has given developers direct access to the native instructions and the memory profiles of the GPUs.
The release of the second generation of CUDA architectures, named Fermi, marked the GPUs' success in general purpose computing. The improved flexibility and computational power of this new generation persuaded high-performance supercomputer manufacturers to introduce hybrid (CPU+GPU) clusters on the market: as of 2011, many of the world's most powerful supercomputers (in terms of computational power) are hybrid ones.
For this kind of architecture, performance derives from parallelism and efficiency comes from the locality of memory transfers, which are explicitly controllable by the programmer. Data access locality is guaranteed using fast on-chip memories, the caches.
The role and use of the cache memories constitute the fundamental difference between the GPU and CPU programming models: while the cache philosophy implemented on CPUs inhibits parallelism, as their coherent cache implementation unavoidably suffers from false-sharing and cache-thrashing issues, the GPU caches offer a good compromise between latency-hiding (and data-locality-keeping) capabilities and parallel efficiency, allowing their use as a fast, local communication channel among neighboring cores.
Without memory locality, memory bandwidth, which is the most critical resource in a modern system, soon becomes a bottleneck.
For what concerns code execution, instead:
• CPUs execute serialized code and give the illusion that the operations are run sequentially: in reality, every operation is split into a number (~30) of stages placed on a computing pipeline. This “superscalarity” allows incrementing the number of operations done per second. The problem with pipelines is that there is always a bottleneck: if one ring of the chain stops, all the others have to wait for it, so every pipeline stage has to be fed with work. That is the reason why CPUs have branch-predicting algorithms implemented on the chip, and why their memory caches must be enormously large, as the latency of a random global memory (RAM) access would stall the whole pipeline. These branch and memory access predicting devices take up almost all the surface of a CPU die, and leave room only for a small number of actual computing cores.
• Conversely, GPUs solve the problem of global memory latencies by keeping thousands of lightweight hardware threads in parallel execution, with a switching time of the order of a clock cycle. This way, the latency of a global memory operation, which blocks a thread in a waiting state, can be hidden by switching to the execution of another one.
Therefore, a GPU takes full advantage of parallelism to remain efficient and to exploit its computing capabilities.
4.2.1.2 Parallelism and memory model (hierarchy)
The CUDA programming paradigm splits the computation into many parallel processes, named threads. The ensemble of all the threads constitutes the grid.
The threads execute the same program, but they have independent registers and share, in groups, some memory caches. CUDA threads are therefore similar to SIMD threads, with the appropriate differences:
• CUDA threads are grouped in blocks. The size of the blocks is uniform over the grid and can be chosen by the programmer. Threads in the same block share a fast, register-like, on-chip memory named shared memory.
• The blocks are deployed and executed on the physical computing cores of the GPU, the streaming multiprocessors (SMs). Each SM concurrently executes all the threads of the assigned blocks, grouping them in warps of 32 threads.
• Each thread has, in addition to the registers' state, an independent instruction pointer: the threads in a block can diverge, but the corresponding warps are serialized (until the code paths reunify) by the SM.

Figure 4.2.1: Hierarchical organization of the threads (source: NVIDIA)
As we have said many times, the key to obtaining performance from a GPU is to exploit its specialized memory spaces/profiles efficiently. The GPU memory hierarchy consists of:
• Registers: low latency, on-chip memory, private to each thread.
• Shared memory: on-chip, low latency memory, shared within a block.
• Constant memory: a small, read-only (after setting the initial value), low latency, on-chip memory, visible to all threads.
• Global memory: off-chip, on-board, high latency RAM, shared among all the threads. Automatically cached:
 – by the texture memory, a spatially efficient read-only cache that also provides 1D, 2D, 3D interpolation functions,
 – on the Fermi architecture only, by a general, configurable two-level cache (the first level is on a single SM and the second one is shared).
• Host memory: the host's main CPU RAM, connected to the GPU board by a bus (usually PCI Express) which has an order of magnitude less bandwidth than the one connecting the GPU processor to its on-board RAM.
• Inter-node communication is possible using standard message passing standards, such as MPI, and it requires passing through the host memory. Some alternatives, such as the APEnet+ project ([3]), allow the GPUs to communicate directly, without passing through the PCI Express bottleneck.
In detail:

                   | Lifetime        | Access                     | Position  | Latency              | Bandwidth (GB/s) | Size
 Registers         | single thread   | read+write                 | on-chip   | 4 clock cycles       | *                | 4k / thread
 Shared memory     | single block    | read+write                 | on-chip   | 4 clock cycles*      | *                | 16k / block
 Texture cache     | global          | read only                  | on-chip   | 4-~400 clock cycles  | ~141             | 16k
 Constant memory   | global          | read only                  | on-chip   | 4-~400 clock cycles  | ~141             | 16k
 Global memory     | global          | read+write                 | off-chip  | ~400 clock cycles    | ~141             | ~1-4 GB
 Local memory      | single thread   | read+write                 | off-chip  | ~400 clock cycles    | ~141             | ~1-4 GB
 Host memory*      | node            | read+write (through PCI-e) | off-board | not directly accessible | ~5            | ~10 GB
4.2.1.3 Programming GPUs: CUDA libraries and our multi architecture approach
The CUDA toolkit, released by NVIDIA, is the most common GPU programming environment. A CUDA program is written using an extended version of the C (and C++) language and is compiled by the NVIDIA compiler nvcc, which manages the separation of GPU and CPU code, compiles and optimizes the former and pipes the latter to a standard compiler (gcc, the Microsoft compiler or icc).
Some ad-hoc language qualifiers are used to specify the type of functions and procedures:
• The “__host__” qualifier, which is the default one, indicates that the function body is a normal CPU function and cannot be called from the GPU.
• The “__device__” qualifier is applied to functions run on the GPU and callable only from within GPU code.
• The “__global__” qualifier indicates a CUDA kernel, an entry point for a CUDA computation. The kernel is executed on the GPU but it can be launched only from host code. The launch consists in the allocation of a GPU thread grid of configurable size. On the host side, the call has the syntactical form “kernel_name<<<grid size in blocks, single block's size>>>(arguments)”.

Figure 4.2.2: CUDA memory spaces (NVIDIA)
Other nvcc compiler keywords qualify the memory space in which variables reside. For example, __shared__, used in GPU code as a prefix qualifier in a variable declaration, allocates the variable in the block-shared memory space.
Furthermore, in GPU code, some variables are defined implicitly by the compiler. Two examples are the “threadIdx” and “blockIdx” read-only variables, which are register-space multidimensional indices that give a thread the possibility to locate itself on the kernel grid: initially this information is the only thing that is not uniform across the grid, and it is therefore used by the threads to know which portion of the problem they are assigned to.
CUDA also contains some primitive atomic operations and synchronization functions, used to synchronize memory accesses, for example:
• __syncthreads() is a barrier that synchronizes the threads in a block,
• __threadfence() is a memory fence that blocks the calling thread until its previous global memory writes are visible to the other threads, preventing some global data inconsistencies.
On the host side, GPU memory can be allocated/deallocated/copied (CPU → GPU or GPU → CPU) using the host functions cudaMalloc/cudaFree/cudaMemcpy.
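A minimal, self-contained example putting these elements together (qualifiers, implicit indices, kernel launch syntax and memory management) might look like the following; it is a generic illustration, not code taken from jasmine.

#include <cuda_runtime.h>
#include <cstdio>

// A trivial kernel: each thread scales one element of the array.
// threadIdx/blockIdx/blockDim are the implicit per-thread coordinates described above.
__global__ void scale(double* v, double factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    double* host = new double[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0;

    double* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(double));                              // allocate on the GPU
    cudaMemcpy(dev, host, n * sizeof(double), cudaMemcpyHostToDevice); // host -> device copy

    int block = 256;                          // threads per block
    int grid  = (n + block - 1) / block;      // blocks in the grid
    scale<<<grid, block>>>(dev, 2.0, n);      // kernel launch

    cudaMemcpy(host, dev, n * sizeof(double), cudaMemcpyDeviceToHost); // device -> host copy
    std::printf("v[0] = %f\n", host[0]);      // expect 2.0

    cudaFree(dev);
    delete[] host;
    return 0;
}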
4.2.2 Our approach for writing cross architecture code
Programming GPUs using these language extensions can require modifying the code in many places. For example, writing a code which can run both on GPUs and on CPUs (as in our case) requires rewriting the memory allocation and copy functions for each of the two computing devices. This rewriting effort, which increases code complexity and maintenance costs, can be avoided using C++ templates as a generic abstraction proxy.
For this purpose, we have designed and implemented a general purpose, lightweight CPU/GPU parallelization library, named boson. It is composed of three main components: an N-dimensional abstract vector class, the field class and the closure launcher.
trary type on both host (CPU) and device (CUDA GPU). It provides:
• The same functionalities on GPUs and CPUs: the choice of the device is done via
a template argument.
• Cross CPU-GPU/CPU-GPU mapping and copy operations.
• Grid iteration functions.
• Compound grid elements can be organized in a soa (structs of array) or aos (array
of structs) manner by specify another template parameter.
Furthermore, as we have seen, the syntax for launching a shared-memory, multi-threaded computation is completely different in the OpenMP (CPU) and CUDA (GPU) frameworks, but the basic concepts are similar. Both programming paradigms can be abstracted using the concept of closure.
A closure is defined as an operation to be computed (possibly in parallel) on a set of elements (a stream): all the computations are identical and the only thing that changes is the element to process, identified by a multidimensional index if the set of elements can be organized on an N-dimensional grid.
Our closure launcher component runs the computation, in parallel, on the device (CPU or GPU) chosen by the application developer. Summarizing, the latter only needs to write the closure code and decide where to launch it, without worrying about the different parallel programming syntaxes, which are hidden by our abstraction layer.
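The idea can be sketched, in a much simplified form, as follows; this is not the actual boson API, only an illustration of how the same user-written closure can be handed to different device-specific launchers (here only the CPU/OpenMP path is written out).

#include <cstddef>

// Simplified illustration of the "closure launcher" idea (not the actual boson API):
// the same closure is executed over an index range; which launcher is used selects
// the target architecture, while the closure code stays unchanged.

// CPU launcher: a plain OpenMP loop over the element indices.
template <class Closure>
void launch_on_cpu(std::size_t n, Closure&& closure)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i)
        closure((std::size_t)i);
}

// A GPU launcher with the same signature would instead launch a CUDA kernel of n
// threads, each calling closure(blockIdx.x * blockDim.x + threadIdx.x); the calling
// code would not change.

int main()
{
    double data[100] = {};
    launch_on_cpu(100, [&](std::size_t i) { data[i] = 2.0 * i; });  // the "closure"
    return 0;
}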
Of course, the library does not provide any magical feature and does not solve the different-architecture/different-parallelization issues by itself: it is only a way to write code more compactly and to avoid many rewrites.
Jasmine was written using this abstraction library and therefore it has been able to
run on GPUs, multicore CPUs and, possibly, also on future similar shared memory
architectures. The main difficulty for GPU porting was developing an ad-hoc parallel
current deposition algorithm, as seen in the next subsection.
4.2.3 Current and density deposition algorithm
As seen in section 3.3.1, memory conflicts (and consequently wrongly computed values) happen when parallel threads naively sum the particles' contributions to the densities in parallel.
While this problem can be solved easily with a low number of threads, as happens in the OpenMP parallelization, new problems arise when scaling to the hundreds or thousands of threads of a GPU. In fact, the two workarounds found (atomic memory operations and the use of multiple buffers) both become very inefficient when the number of threads becomes high: the cost of memory locks increases with the number of threads waiting for them and, of course, multiple buffers are too memory demanding.
Therefore, other ways to proceed must be found. The problem is caused by the sparsity and the randomness of the particles' memory accesses: their location in the grid is not known a priori, and this makes it impossible to know which threads will try to sum into the same grid cell, creating a conflict. On the other hand, if the particles are spatially organized or even sorted, ordered-sum algorithms can be written.
Our approach consists in these steps:
1. keep the particles sorted by their central cell index (or a hash of it), using the efficient parallel sort algorithm coming with the thrust library [17],
2. find the subsets of particles with the same index, which is very easy since they are already sorted,
3. finally, in parallel, compute the sum for each subset (cell), looping over its particles, and write the sum to global memory.
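A sketch of steps 1-3 using the thrust primitives mentioned above is given below; it handles the simplest case of one scalar contribution per particle and uses a segmented reduction (reduce_by_key) for brevity, whereas the actual code performs the per-cell sums in its own kernel and repeats them per shape factor.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Deposit one scalar contribution per particle into its cell, without atomics:
// 1) sort the particle charges by cell index, 2)-3) sum the contributions of particles
// sharing the same cell with a segmented reduction.
void sorted_deposit(thrust::device_vector<int>&    cell_index,   // one entry per particle
                    thrust::device_vector<double>& charge,       // one entry per particle
                    thrust::device_vector<int>&    out_cells,    // unique cell indices (output)
                    thrust::device_vector<double>& out_density)  // summed charge per cell (output)
{
    // Step 1: sort particles (here, just their charges) by cell index.
    thrust::sort_by_key(cell_index.begin(), cell_index.end(), charge.begin());

    // Steps 2-3: contiguous runs of equal keys are reduced into one sum per cell.
    out_cells.resize(cell_index.size());
    out_density.resize(cell_index.size());
    auto ends = thrust::reduce_by_key(cell_index.begin(), cell_index.end(), charge.begin(),
                                      out_cells.begin(), out_density.begin());
    out_cells.resize(ends.first - out_cells.begin());
    out_density.resize(ends.second - out_density.begin());
}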
All these operations can be done in parallel by the GPU threads and none of them requires memory locking or other forms of synchronization.
Since every particle, being finite sized, contributes to the density of a few grid points (the central one and its neighbors), step 3 must be repeated once per shape factor. An alternative is to nest the particle-in-a-cell summation loop inside an outer loop (in the same GPU kernel) over the shape factors. For consistency reasons, for every shape factor iteration, after writing the sum to global memory, a thread fence ([23]) synchronization is needed.
In the case of staggered density grids, as is the case for the current density field J in the FDTD scheme, the particles' central cell indices are different for each component grid (Jx, Jy, Jz). Naively, this would require repeating the particle sort for each grid component. A more elegant and much more efficient solution is to keep the particles sorted not only by their cell index, but also by their subindex, defined as their index in an un-staggered grid having half the grid step. Since the relative offset of the component grids is ±1/2, the subindex unambiguously identifies the central point in all of the Jx, Jy, Jz grids.
4.3 Hybrid parallelization
Large 3D PIC simulations cannot be handled by a single graphics processing unit, since the memory requirements are far too demanding. Parallelizing the code across multiple GPU nodes using domain decomposition and message passing solves this issue and also gives the possibility to exploit all the computational power of hybrid computing clusters.
Such clusters are usually made of networked nodes, each containing two devices: a GPU board and a multicore processor.
4.3.1 Inter-node & intra-node parallelization
In a CPU-only cluster, the classical MPI-only parallelization distributes each subdomain to a single scalar core. This strategy has the advantage of being normally very efficient, except for the cases in which the small subdomains become poorly balanced in computational load (as happens in our simulations: the particle density accumulates in certain zones and leaves others empty).
The use of GPUs introduces a new, intra-node level of parallelism. Here, the subdomains are taken to be larger and can, in theory, be assigned for processing to a GPU, to a CPU core or to an entire CPU (using OpenMP). However, this strategy is not particularly more efficient than using only the GPUs, because of their overwhelming computational power.
A better idea, which also allows solving a memory requirement problem (as explained below), is to assign each subdomain to a physical node, instead of assigning it to a single core or a single node device.
The subdomain's computational work is therefore shared among the node's resources, using shared memory parallelization. In our case the work is shared by the CPU and the GPU, using the OpenMP and CUDA parallelization paradigms, as shown in the following figure:
Because a PIC code spends most of its time pushing particles and depositing their densities (there are many particles for each grid cell), a convenient way to implement this CPU/GPU work sharing is to stream the ‘particle computation’ from one device to the other within a physical node (as seen in the figure below). Since the two devices have separate memories, one of them must be chosen to keep the particle data, which are streamed to the other device through the PCI Express bus, as summarized in the following figure:
This strategy can prove necessary in two situations:
1. if the total GPU memory on the cluster is not large enough to hold the state of all
the particles, they can be kept in the larger host memory and streamed to the
graphics processing unit (and, in the meanwhile, to the CPU cores);
2. if one does not want to waste the computational power of the CPUs on the physical
node, which is smaller than, but not completely negligible compared to, that of the GPU.
In the second situation in particular, the overhead of the CPU/GPU memory copies (which
pass through the slow PCI Express bus) should be kept as small as possible. Therefore, in
order to minimize the transfers, it is convenient to store the particles in the memory
of the fastest device, the GPU.
Furthermore, the overhead can be hidden by putting the three operations (transfer in, com-
putation and transfer out) in a pipeline, using CUDA streams. The pipelining is efficient
because every stage is processed by an independent unit: the Fermi boards found in GPU
clusters have two parallel DMA engines for memory copies.
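
A minimal sketch of such a pipeline, assuming pinned host memory, two device buffers and two CUDA streams (all identifiers and the trivial kernel are illustrative only), could look as follows:

#include <cuda_runtime.h>

__global__ void push_chunk(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;                      // placeholder for the particle push
}

// host_particles must be pinned (cudaHostAlloc) for the copies to overlap;
// dev_buf[0] and dev_buf[1] are device buffers of chunk_size elements each.
void stream_particles(float *host_particles, float *dev_buf[2],
                      int n_chunks, int chunk_size, cudaStream_t stream[2])
{
    for (int i = 0; i < n_chunks; ++i) {
        int s = i % 2;                                       // alternate the two streams
        float *chunk = host_particles + (size_t)i * chunk_size;

        cudaMemcpyAsync(dev_buf[s], chunk, chunk_size * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);  // transfer in
        push_chunk<<<(chunk_size + 255) / 256, 256, 0, stream[s]>>>(dev_buf[s],
                                                                    chunk_size); // compute
        cudaMemcpyAsync(chunk, dev_buf[s], chunk_size * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);  // transfer out
    }
    cudaDeviceSynchronize();                                 // wait for the whole pipeline
}
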
The performance benchmark tests (sect. 4.4) show that this strategy makes it possible to
run larger simulations on memory-limited GPU clusters.
4.3.2 Inter-node communication
As far as the communication among neighboring nodes is concerned, for exchanging field
borders and crossing particles (see sect. 3.3.2), standard MPI techniques have been used in
jasmine.
On normal clusters, the GPUs belonging to different nodes can communicate only by passing
through the node host (CPU) memory. The main bottleneck resides in the bus connecting
these two memories, which, having a bandwidth of 4 GB/s, can be slower than the theoretical
network communication speed.
Therefore, we designed jasmine to let only the data that actually need to be moved across
the network pass through this bus. This means that the GPU has to independently build
the list of particles crossing the subdomain boundaries and copy the field borders to/from
send/receive buffers.
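
As an illustration, the crossing-particle list can be built directly on the GPU with a selection primitive such as thrust::copy_if from the Thrust library [17]; the following sketch filters on a single coordinate and uses hypothetical identifiers, so it should be read as a schematic example rather than jasmine's actual code.

#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Predicate selecting the particles whose x coordinate left the local subdomain.
struct crosses_boundary
{
    float x_lo, x_hi;
    __host__ __device__ bool operator()(float x) const
    {
        return x < x_lo || x >= x_hi;
    }
};

// Build the send buffer on the GPU; only this (small) buffer is then copied
// to host memory and handed to MPI.
thrust::device_vector<float> build_send_buffer(const thrust::device_vector<float> &x,
                                               float x_lo, float x_hi)
{
    thrust::device_vector<float> out(x.size());
    crosses_boundary pred = {x_lo, x_hi};
    size_t n = thrust::copy_if(x.begin(), x.end(), out.begin(), pred) - out.begin();
    out.resize(n);
    return out;
}
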
4.4 Performance benchmarks
Jasmine validation and performance benchmark tests were run on the GPUs of the PLX
machine, at CINECA.
The performance benchmarks consisted in simulating the nonlinear plasma wakefield
generated by a laser pulse, using the standard second-order fully electromagnetic PIC
scheme with staggered grids and particles with quadratic shape functions. The electrons
in the simulation were in the relativistic regime, and the number of (quasi-)particles per
cell was 64 for the 2D version and 8 for the 3D one. All the computation was done in
double precision floating point.
Four parallelization configurations were tested: the single-GPU version, the multiple-GPU
(8 MPI nodes) one, the GPU version using host memory particle streaming (sect. 4.3.1),
and the scalar CPU version. The results are summarized in the following table:
Configuration    Dimensionality   Time per particle (ns)   Speedup vs. 1 CPU core
Single GPU       2D (64 ppc)      22                       52x
MPI Multi GPU    2D (64 ppc)      29                       39x
Streaming        2D (64 ppc)      26.4                     43x
1 CPU core       2D (64 ppc)      1150                     1x
Single GPU       3D (8 ppc)       109                      38x
MPI Multi GPU    3D (8 ppc)       118                      34x
Streaming        3D (8 ppc)       115                      36x
1 CPU core       3D (8 ppc)       4104                     1x
Conclusions
Numerical modeling is a fundamental tool for studying the nonlinear physics of laser-plasma
interactions, whose comprehension is essential for developing new-generation, compact
particle accelerators. Such simulations are computationally very demanding and, in order
to be run, require optimization through both numerical scheme improvements and
parallelization.
It has been shown that the parallelization of the partially implicit INF&RNO codes
(fluid and PIC) is possible and efficient, once the recursive dependency in the implicit
laser wakefield integrator is removed with the iterative method presented here, whose
convergence is motivated by a physical consideration. The code is able to scale up to
thousands of cores for a typically sized problem, providing substantial speedup factors
(about two to three orders of magnitude) compared to the serial version.
This will certainly have an impact on the design of the next generation of laser-plasma
accelerators at LBNL, as the parallelization has made previously unfeasible simulations
possible and affordable in time, while preserving their accuracy.
The features of the jasmine PIC framework have been outlined, and its performance
benchmark results have shown that the GPU architecture can provide great speedups for
PIC simulations, the CUDA-parallelized version being up to 50 times faster (on a single
GPU board, running double precision simulations with quadratic particle shape functions)
than our serial CPU implementation. In particular, the solution found for correctly
implementing the concurrent current deposition algorithm has proven to be very robust.
The code was benchmarked by checking the correct simulation of some analytically treatable
phenomena, and runs in various acceleration regimes were compared with those obtained
with ALaDyn.
Its second-level, multi-GPU, inter-node parallelization makes it possible to exploit the
great computational power of large-scale HPC GPU clusters, providing better scaling and
overall efficiency than traditional CPU clusters. This is particularly important because
GPU clusters will very likely keep growing in computational power faster than conventional
ones.
The GPU memory limitation, which could represent a limit for running large-scale
simulations, has been tackled using a concurrent streaming technique that allows storing
the PIC particles in the larger CPU main memory, from which they are progressively
streamed to the GPU for computation. The overhead of these memory transfers, performed
concurrently with the computation, has been shown to be negligible for 3D simulations.
The components (building blocks) of jasmine have proven to be very reusable and allowed,
for example, rewriting some of the INF&RNO PIC and fluid algorithms for the hybrid
GPU+CPU architecture in a very short time.
At the time of writing this thesis, no published article had claimed the development of
an electromagnetic PIC code that can run on hybrid clusters using the main CPU memory
efficiently, or that can run multi-GPU full-PIC 3D simulations, or that is adaptable to
various geometries and numerical schemes, as jasmine is.
Bibliography
[1] GeV electron beams from a centimeter-scale channel guided laser wakefield acceler-
ator. 14(5):056708, 2007.
[2] Theory and simulation of ion acceleration with circularly polarized laser pulses.
Comptes Rendus Physique, 10(2-3):207–215, 2009.
[3] Roberto Ammendola, Andrea Biagioni, Ottorino Frezza, Francesca Lo Cicero,
Alessandro Lonardo, Pier Stanislao Paolucci, Davide Rossetti, Andrea Salamon,
Gaetano Salina, Francesco Simula, Laura Tosoratto, and Piero Vicini. APEnet+:
high bandwidth 3D torus direct network for petaflops scale commodity clusters,
2011.
[4] C. Benedetti, C. B. Schroeder, E. Esarey, C. G. R. Geddes, and W. P. Leemans.
Efficient Modeling of Laser-Plasma Accelerators with INFERNO. In S. H. Gold &
G. S. Nusinovich, editor, American Institute of Physics Conference Series, volume
1299 of American Institute of Physics Conference Series, pages 250–255, November
2010.
[5] C. Benedetti, A. Sgattoni, G. Turchetti, and P. Londrillo. ALaDyn: A High-
Accuracy PIC Code for the Maxwell-Vlasov Equations. IEEE Transactions on
Plasma Science, 36(4), 2008.
[6] CK Birdsall and AB Langdon. Plasma physics via computer simulation. Taylor &
Francis Group, 2004.
[7] JA Bittencourt. Fundamentals of plasma physics. Springer, 2004.
[8] C. Benedetti, E. Esarey, C. B. Schroeder, and W. Leemans. Modeling a 10 GeV
laser-plasma accelerator with INF&RNO.
[9] Benjamin M. Cowan, David L. Bruhwiler, Estelle Cormier-Michel, Eric Esarey,
Cameron G. R. Geddes, Peter Messmer, and Kevin M. Paul. Characteristics of an
envelope model for laser-plasma accelerator simulation. J. Comput. Phys., 230:61–
86, January 2011.
[10] William Dally. The End of Denial Architecture and the Rise of Throughput Com-
puting. In GPUs: current state and evolution, INFN CNAF meeting.
[11] E. Cormier-Michel, C. G. R. Geddes, E. Esarey, C. B. Schroeder, Cs. Toth, D. L.
Bruhwiler, K. Paul, B. Cowan, and W. P. Leemans. Scaled simulations of a 10 GeV
accelerator.
[12] E. Esarey, C. B. Schroeder, and W. P. Leemans. Physics of laser-driven plasma-based
electron accelerators. Rev. Mod. Phys., 81:1229–1285, Aug 2009.
[13] T. Esirkepov. Exact charge conservation scheme for particle-in-cell simulation with
an arbitrary form-factor. Computer Physics Communications, 135(2):144–153,
2001.
[14] S. Gordienko and A. Pukhov. Scalings for ultrarelativistic laser plasmas and quasi-
monoenergetic electrons. 12(4):043109, 2005.
[15] M. Harris, J. Owens, S. Sengupta, Y. Zhang, and A. Davidson. CUDPP: CUDA
data parallel primitives library, 2007.
[16] R.W. Hockney and J.W. Eastwood. Computer simulation using particles. Institute
of Physics Publishing, 1988.
[17] Jared Hoberock and Nathan Bell. Thrust: a template library for CUDA applications.
[18] W.L. Kruer. The physics of laser plasma interactions. Westview Press, 2003.
[19] Jens Krüger and Rüdiger Westermann. Acceleration Techniques for GPU-based
Volume Rendering. In Proceedings IEEE Visualization 2003, 2003.
[20] W. P. Leemans, B. Nagler, A. J. Gonsalves, Cs. Toth, K. Nakamura, C. G. R. Geddes,
E. Esarey, C. B. Schroeder, and S. M. Hooker. GeV electron beams from a
centimetre-scale accelerator. Nature Physics, 2(10):696–699, 2006.
[21] P. Londrillo, C. Benedetti, A. Sgattoni, and G. Turchetti. Charge preserving high
order PIC schemes. Nuclear Instruments and Methods in Physics Research Section
A: Accelerators, Spectrometers, Detectors and Associated Equipment, 620(1):28–35,
2010. COULOMB09: Ion Acceleration with High Power Lasers: Physics and
Applications.
[22] V. Malka, J. Faure, Y. Glinec, A. Pukhov, and J.-P. Rousseau. Monoenergetic
electron beam optimization in the bubble regime. 12(5):056702, 2005.
[23] NVIDIA Corporation. Compute Unified Device Architecture Programming Guide.
NVIDIA, 2009.
[24] Matteo Passoni, Luca Bertagna, and Alessandro Zani. Target normal sheath accel-
eration: theory, comparison with experiments and future perspectives. New Journal
of Physics, 12(4):045012, 2010.
[25] A P L Robinson, M Zepf, S Kar, R G Evans, and C Bellei. Radiation pressure
acceleration of thin foils with circularly polarized laser pulses. New Journal of
Physics, 10(1):013021, 2008.
[26] S. Sengupta, M. Harris, Y. Zhang, and J.D. Owens. Scan primitives for GPU com-
puting. In Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS sympo-
sium on Graphics hardware, pages 97–106. Eurographics Association Aire-la-Ville,
Switzerland, Switzerland, 2007.
[27] A. Sgattoni, C. Benedetti, P. Londrillo, and G. Turchetti. Simulation of the laser-
plasma acceleration for the PlasmonX project with the PIC code ALaDyn. Radiation
Effects and Defects in Solids, 165(6-10):803–808, 2010.
[28] Igor V. Sokolov. High-order field interpolation in a charge-conserving numerical
scheme for particle-in-cell simulations. Technical Report arXiv:1101.0809, Jan 2011.
[29] P. Sprangle, E. Esarey, and A. Ting. Nonlinear interaction of intense laser pulses in
plasmas. Phys. Rev. A, 41:4463–4469, Apr 1990.
[30] G. Stantchev, W. Dorland, and N. Gumerov. Fast parallel Particle-To-Grid interpo-
lation for plasma PIC simulations on the GPU. Journal of Parallel and Distributed
Computing, 68(10):1339–1349, 2008.
[31] S. Stegmaier, M. Strengert, T. Klein, and T. Ertl. A simple and flexible volume ren-
dering framework for graphics-hardware-based raycasting. International Workshop
on Volume Graphics, 0:187–241, 2005.
[32] T. Tajima and J. M. Dawson. Laser electron accelerator. Phys. Rev. Lett., 43:267–
270, Jul 1979.
[33] Kane S. Yee. Numerical solution of initial boundary value problems involving
Maxwell's equations in isotropic media. IEEE Trans. Antennas and Propagation,
pages 302–307, 1966.