
Final Report
High Performance Computing PhD seminar

Winter semester 2016/17

Peter Leitner

March 7, 2017

Contents

1 Dirac's equation in (1+1)-D
2 Staggered Grid and Perfectly Matched Boundary Method
3 Discretization, initial and boundary conditions
4 Numerical solution
5 Code optimization
  5.1 Loop-merging
  5.2 Python-Fortran interface
  5.3 Code parallelization with OpenMP

1 Dirac's equation in (1+1)-D

The Dirac equation, presented here in natural units where ℏ = c = 1, can in (1+1)-D¹ be compactly written as

i\,\partial_t \psi(x,t) = \hat{H}\,\psi(x,t)    (1)

where the following Hamiltonian

\hat{H} = m\sigma_z - i\partial_x\sigma_x - V(x,t)\,\mathbb{1}
        = \begin{pmatrix} m - V(x,t) & -i\partial_x \\ -i\partial_x & -m - V(x,t) \end{pmatrix}    (2)

governing the temporal and spatial evolution of a fermion field ψ(x, t) = (u(x, t), v(x, t))ᵀ (spin-1/2 particles such as, e.g., electrons) in a simple potential V(x, t), has been used.

¹This notation refers to the fact that in Special Relativity space and time are treated on an equal footing.


The Hermitian and unitary Pauli matrices entering the Hamiltonian are

\sigma_x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\quad\text{and}\quad
\sigma_z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}.    (3)

In (1+1)-D only the two spinor components u(x, t) and v(x, t) make up the fermion field ψ(x, t), whereas in the full (1+3)-D treatment four equations for the four-component wave function ψ(r, t) have to be solved.
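As a quick illustration of the structure of Eqs. (2) and (3), the following minimal numpy sketch (not part of the original code; all names and values are illustrative) builds the Hamiltonian for a single plane-wave mode exp(ikx), for which the derivative −i∂x acts as multiplication by k, and checks the relativistic dispersion relation E = ±√(m² + k²):

    import numpy as np

    # Pauli matrices entering the Hamiltonian, Eq. (3)
    sigma_x = np.array([[0, 1], [1, 0]], dtype=complex)
    sigma_z = np.array([[1, 0], [0, -1]], dtype=complex)

    def dirac_hamiltonian(k, m, V):
        # (1+1)-D Dirac Hamiltonian, Eq. (2), for a plane-wave mode exp(ikx):
        # the operator -i d/dx reduces to multiplication by k.
        return m * sigma_z + k * sigma_x - V * np.eye(2)

    # Free-particle check (V = 0): the eigenvalues reproduce the relativistic
    # dispersion relation E = +/- sqrt(m^2 + k^2) in natural units.
    m, k = 1.0, 0.5
    E = np.linalg.eigvalsh(dirac_hamiltonian(k, m, 0.0))
    print(E)                       # approx. [-1.118, 1.118]
    print(np.sqrt(m**2 + k**2))    # 1.118...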

The Dirac equation plays a fundamental role in the theory of elementary particles and in high-energy physics. It is a relativistic wave equation describing the dynamics of fermion fields. The motivation of the Dirac equation is a fundamental aspect of Quantum Field Theory and beyond the scope of this report, which focuses on the performance optimization of the discretized coupled PDEs.

2 Staggered Grid and Perfectly Matched Boundary Method

The attempt to solve Dirac's equation on a regular Cartesian grid has been shown to lead to the so-called Fermion-doubling problem. If the equations are discretized on a space- and time-staggered grid, such that the spinor u(x, t) "lives" on integer-indexed spatial nodes and on half-integer-indexed temporal nodes and v(x, t) vice versa, using a leap-frog scheme, Fermion doublers are avoided while, in addition, the dispersion relation of the continuum problem is preserved. The Fermion-doubling problem is a subtle complication arising whenever fermionic systems are treated numerically on a lattice. It is covered extensively in the literature and shall not be discussed in any detail here.

A further complication arises from the use of hard boundary conditions: an initial wave pulse that travels along the x-axis gets unphysically reflected. One way to avoid this is the implementation of a perfectly matched layer (PML) at the boundaries of the spatial domain. By an analytic continuation of the spatial coordinate, x → x̃(x) = x + i f(x) with df/dx = σ(x)/ω and thus f(x) = (1/ω) ∫^x dx′ σ(x′), the spinors decay exponentially at the boundary, as a wave solution decays with the imaginary part of x̃,

A\,\exp[i(kx - \omega t)] \;\to\; A\,\exp[i(k\,\mathrm{Re}\,\tilde{x} + ik\,\mathrm{Im}\,\tilde{x} - \omega t)] \propto \exp(-k\,\mathrm{Im}\,\tilde{x}).

Taking into consideration that the analytic continuation also affects the partial derivative,

\frac{\partial}{\partial x} \;\to\; \frac{\partial}{\partial \tilde{x}} = \frac{1}{1 + i\sigma(x)/\omega}\,\frac{\partial}{\partial x},

and upon the definition of auxiliary fields for the spinors, ψ_u(x, ω) := (σ/ω)(m − V) u(x, ω) in Fourier space and likewise ψ_v(x, ω) for the spinor v(x, ω), the following four partial differential equations (finally transformed back from frequency to temporal space) are obtained:

\partial_t u(x,t) = \left[\frac{m - V(x,t)}{i} - \sigma(x)\right] u(x,t) + \psi_u(x,t) - \frac{\partial}{\partial x} v(x,t)    (4)

\partial_t \psi_u(x,t) = -i\,\sigma(x)\,\bigl(m - V(x,t)\bigr)\,u(x,t)    (5)

\partial_t v(x,t) = \left[\frac{m + V(x,t)}{-i} - \sigma(x)\right] v(x,t) + \psi_v(x,t) - \frac{\partial}{\partial x} u(x,t)    (6)

\partial_t \psi_v(x,t) = i\,\sigma(x)\,\bigl(m + V(x,t)\bigr)\,v(x,t)    (7)

3 Discretization, initial and boundary conditions

Using symmetric difference quotients for the derivatives, shifting the spatial indices j − 1/2 → j − 1, j + 1/2 → j and the temporal indices n − 1/2 → n, n + 1/2 → n + 1 in order to put the equations on staggered grids, and explicitly solving for the spinors and auxiliary fields at the following time step n + 1, the following four discretized equations are obtained:

u_j^{n+1} = \frac{1}{1 - \left(\frac{m_j^n - V_j^n}{i} - \sigma_j\right)\frac{\Delta t}{2}}
            \left\{ u_j^n \left[1 + \left(\frac{m_j^n - V_j^n}{i} - \sigma_j\right)\frac{\Delta t}{2}\right]
            + \psi_{u,j}^n \Delta t - \frac{v_j^n - v_{j-1}^n}{\Delta x}\,\Delta t \right\},    (8)

\psi_{u,j}^{n+1} = \psi_{u,j}^n - i\,\Delta t\,\sigma_j\,\bigl(m_j^n - V_j^n\bigr)\,\frac{u_j^{n+1} + u_j^n}{2},    (9)

v_j^{n+1} = \frac{1}{1 - \left(\frac{m_j^n + V_j^n}{-i} - \sigma_j\right)\frac{\Delta t}{2}}
            \left\{ v_j^n \left[1 + \left(\frac{m_j^n + V_j^n}{-i} - \sigma_j\right)\frac{\Delta t}{2}\right]
            + \psi_{v,j}^n \Delta t - \frac{u_{j+1}^{n+1} - u_j^{n+1}}{\Delta x}\,\Delta t \right\},    (10)

\psi_{v,j}^{n+1} = \psi_{v,j}^n + i\,\sigma_j\,\Delta t\,\bigl(m_j^n + V_j^n\bigr)\,\frac{v_j^n + v_j^{n+1}}{2}.    (11)

As an initial condition, a Gaussian is chosen for the spinor u(x, t = 0):

u(x, 0) = (2\alpha\pi)^{-1/4} \exp\!\bigl(ik_0 (x - x_0)\bigr) \exp\!\left(-\frac{(x - x_0)^2}{4\alpha}\right)    (12)

and the second spinor v(x, 0) is computed self-consistently from the first, based on the Dirac equation. In addition to the PML boundary, where a step function for σ(x) was chosen at the left end and an exponential function at the right end of the grid for comparison, periodic Dirichlet boundary conditions are used:

u(0, t) = u(x_{\max}, t) = v(0, t) = v(x_{\max}, t).    (13)

For the potential function a step barrier has been considered for this simple test case. It leads to a partial reflection and a partial transmission of the spinors.
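A minimal numpy sketch of the initial condition in Eq. (12); the parameter values are illustrative placeholders, not taken from the original code:

    import numpy as np

    def gaussian_wave_packet(x, x0, k0, alpha):
        # Gaussian initial spinor component u(x, 0), Eq. (12)
        return (2 * alpha * np.pi) ** (-0.25) \
            * np.exp(1j * k0 * (x - x0)) \
            * np.exp(-(x - x0) ** 2 / (4 * alpha))

    # Grid from Sec. 3: Nx = 512 + 2 nodes with dx = 0.1
    x = np.arange(512 + 2) * 0.1
    u0 = gaussian_wave_packet(x, x0=10.0, k0=2.0, alpha=1.0)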


Figure 1: Solution for an initial wave packet propagating to the right and crossing a potential barrier (axes: x in a.u. versus t in a.u.). The wave packet becomes partly reflected at the potential barrier. At the right end, where an exponential-decay boundary is implemented, the wave packet is found to be perfectly absorbed. On the left, where a step function is implemented, a tiny fraction of the outgoing wave packet (a reflection from the potential barrier) is reflected and re-enters the computational domain.

In order to study physically reasonable scenarios, a sensible potential function, appropriate coordinates, and a grid size and dimension adapted to the experimental setup would have to be considered.

For all of the scenarios computed here, a spatial grid 0 ≤ j ≤ Nx = 512 + 2, j ∈ ℤ, with step size ∆x = 0.1 has been chosen, such that the number of node points in the spatial direction can be conveniently subdivided into integer-sized chunks when distributing the workload on 1, 2, 4, or 8 threads. The temporal grid is varied in the following, starting with Nt = 50 time steps and a time increment of ∆t = 0.1 and continuing with Nt → Nt · 10 while ∆t → ∆t/10, etc., such that all scenarios yield the same result, yet on differently resolved grids.
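As a small sketch of this refinement scheme (variable names are illustrative, not from the original code), each successive temporal grid covers the same total simulated time:

    # Spatial grid as described above: Nx = 512 + 2 nodes with dx = 0.1
    Nx, dx = 512 + 2, 0.1

    # Temporal grids: Nt is increased by factors of 10 while dt is reduced
    # by the same factor, so every run spans the same physical time interval.
    Nt0, dt0 = 50, 0.1
    grids = [(Nt0 * 10**k, dt0 / 10**k) for k in range(4)]
    # grids == [(50, 0.1), (500, 0.01), (5000, 0.001), (50000, 0.0001)]
    for Nt, dt in grids:
        assert abs(Nt * dt - Nt0 * dt0) < 1e-9   # identical simulated time span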

4 Numerical solution

The first implementation prior to this seminar was a pure Python code, solving the four coupled quadratures for the two spinors and their auxiliary fields by four independent temporal loops, executed in such an order that all necessary information for each loop is readily available when queried.

The original Python code runs in reasonable time for this simple test case, but it would run impractically long for realistic physical scenarios on large grids, and even more so if a generalization to (1+3)-D were implemented. It is therefore advantageous to consider a code optimization already at this point, be it by switching to a compiled code version or by parallelizing the code.

Figure 2: Snapshot of propagating spinors just crossing the potential barrier.

5 Code optimization

In the course of the seminar, several versions with increasing runtime optimization have been developed:

• Version 1.0 refers to the original, non-optimized Python code, where the equations have been coded one-to-one according to the discretized form shown above.

• Resolving the dependencies reveals that, while the auxiliary fields have to be computed after the corresponding spinor fields are finished for a specific time step, two of the four for-loops are redundant; they were removed in Version 1.1.

• A significant speedup, as shown below, is achieved by transferring the numerical quadratures from Python to compiled Fortran code, Version 2.0.

• Finally, the code is parallelized with OpenMP, Version 2.1.

All benchmarks shown in the following have been performed on the Mephisto server 143.50.47.128 at the Mathematics institute at the University of Graz. It has the following specification: 2.7 GHz CPU, 12.28 MB cache size, 6 cores (12 with hyper-threading).


Figure 3: Computation times for varying grid sizes before (solid line) and after (dotted line) loop merging.

  Grid size   Before loop-merging (s)   After loop-merging (s)
  50          1.72881793976             1.76668596268
  500         17.8594789505             18.3766989708
  5000        186.608217955             180.192077875
  50 000      1770.54088783             1797.06790781

Table 1: Runtime vs. number of grid points before and after loop-merging.

5.1 Loop-merging

While no considerable runtime difference is gained from this, Fig. 3, loop merging has been observed to make the code more numerically stable when increasing the grid size and accordingly reducing the step size.

5.2 Python-Fortran interface

The code has been rewritten such that a Python wrapper calls a Fortran subroutine computing the time-consuming quadratures, which after completion returns the results to the wrapper for post-processing of the data and/or plotting. The program control and the setting of initial and boundary conditions remain easily adaptable via the Python wrapper. In order to call the Fortran code from Python, it has to be compiled first using the tool f2py:

    f2py -c --fcompiler=gnu95 --f90flags='-fopenmp' -lgomp \
        -m solve_DiracEqn_on_the_lattice solve_DiracEqn_on_the_lattice.f95
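Here -c compiles the Fortran source, -m sets the name of the resulting Python extension module, --fcompiler=gnu95 selects gfortran, and --f90flags='-fopenmp' together with -lgomp enables OpenMP and links the GNU OpenMP runtime.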


Figure 4: Runtime of pure Python code vs. Python-Fortran wrapper code.

  No. of grid points   Original Python code (s)   Python-Fortran wrapper (s)
  50                   1.72881793976              0.00372505187988
  500                  17.8594789505              0.0350270271301
  5000                 186.608217955              0.349123954773
  50 000               1770.54088783              3.17428088188

Table 2: Runtime of the Python code vs. the Python-Fortran wrapper code.

The shared object file solve_DiracEqn_on_the_lattice.so, which is created in the course of compilation, can be accessed from Python as if it were a Python module:

    import solve_DiracEqn_on_the_lattice

    u, v = solve_DiracEqn_on_the_lattice.mainloop(u00, v00, inp.dx, dt, m, V,
                                                  sigma, inp.Nx, Nt)

5.3 Code parallelization with OpenMP

The final optimization involves distributing the workload on several threads on a single machine using OpenMP. For this purpose the spatial grid is subdivided into num_threads sectors of equal chunk_size = (Nx - 2)/num_threads, so that the workload per row, i.e. per time step, is equally distributed among the threads, see Fig. 5.

For this reason the spatial grid size of Nx = 512 + 2, minus the two boundary grid points, has been chosen to be a multiple of 1, 2, 4, 8, 16, ... to ensure an equal workload balance.
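A small sketch of this workload subdivision (variable names are illustrative, not from the original code):

    # The Nx - 2 interior nodes are split into equally sized chunks, one
    # chunk per thread, which the static OpenMP schedule below relies on.
    Nx = 512 + 2
    for num_threads in (1, 2, 4, 8):
        chunk_size = (Nx - 2) // num_threads
        assert chunk_size * num_threads == Nx - 2   # even workload balance
        print(num_threads, chunk_size)              # 1 512, 2 256, 4 128, 8 64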


Figure 5: Subdivision of the (1+1)-D grid into columns of equal chunk size.

In the following, the Fortran source code with the OpenMP implementation is listed.

    subroutine quad(u00, v00, dx, dt, m, Pot, sigma, number_of_threads, Nx, Nt, u, v)
      ! quadrature for the (1+1)-D Dirac equation
      use omp_lib
      implicit none

      integer, intent(in) :: Nx, Nt
      integer, intent(in) :: number_of_threads
      double complex, dimension(Nx,Nt), intent(in)  :: u00, v00
      double complex, dimension(Nx,Nt), intent(out) :: u, v
      double complex, dimension(Nx,Nt) :: factor1, factor2
      double complex, dimension(Nx,Nt) :: psi_u, psi_v
      double precision, dimension(Nx,Nt), intent(in) :: m, Pot
      double precision, dimension(Nx), intent(in) :: sigma
      double precision, intent(in) :: dx, dt
      double complex, parameter :: i = (0.0d0, 1.0d0)
      integer :: n, j, threadnr

      factor1 = complex(0.0d0, 0.0d0)
      factor2 = complex(0.0d0, 0.0d0)
      psi_u   = complex(0.0d0, 0.0d0)
      psi_v   = complex(0.0d0, 0.0d0)
      u = u00; v = v00

      do n = 1, Nt-1
         call omp_set_num_threads(number_of_threads)

         ! first sweep: update spinor u and auxiliary field psi_u, Eqs. (8)-(9)
         !$omp parallel private(j) shared(u, v, psi_u, psi_v, factor1, factor2)
         !#!$omp do ordered schedule(static, Nx/number_of_threads)
         !$omp do schedule(static, Nx/number_of_threads)
         do j = 1, Nx
            !#!$omp ordered
            ! threadnr = omp_get_thread_num()
            ! print *, 'thread', threadnr, ':', j
            factor1(j,n) = (m(j,n) - Pot(j,n))/i - sigma(j)
            u(j,n+1) = 1.0d0/((1.0d0,0.0d0)             &
                       - factor1(j,n)*0.5d0*dt)         &
                       *(u(j,n)*((1.0d0,0.0d0)          &
                       + factor1(j,n)*0.5d0*dt)         &
                       + psi_u(j,n)*dt                  &
                       - (v(j,n) - v(j-1,n))*dt/dx)
            psi_u(j,n+1) = psi_u(j,n)                   &
                       - i*dt*sigma(j)                  &
                       *(m(j,n) - Pot(j,n))             &
                       *0.5d0*(u(j,n+1) + u(j,n))
            !#!$omp end ordered
         end do
         !$omp end do
         !$omp end parallel

         ! second sweep: update spinor v and auxiliary field psi_v, Eqs. (10)-(11)
         !$omp parallel private(j) shared(u, v, psi_u, psi_v, factor1, factor2)
         !#!$omp do ordered schedule(static, Nx/number_of_threads)
         !$omp do schedule(static, Nx/number_of_threads)
         do j = 1, Nx
            !#!$omp ordered
            ! threadnr = omp_get_thread_num()
            ! print *, 'thread', threadnr, ':', j
            factor2(j,n) = (m(j,n) + Pot(j,n))/(-i) - sigma(j)
            v(j,n+1) = (1.0d0,0.0d0)/((1.0d0,0.0d0)     &
                       - factor2(j,n)*0.5d0*dt)         &
                       *(v(j,n)*((1.0d0,0.0d0)          &
                       + factor2(j,n)*0.5d0*dt)         &
                       + psi_v(j,n)*dt                  &
                       - (u(j+1,n+1) - u(j,n+1))*dt/dx)
            psi_v(j,n+1) = psi_v(j,n) + i*sigma(j)*dt   &
                       *(m(j,n) + Pot(j,n))             &
                       *(v(j,n) + v(j,n+1))*0.5d0
            !#!$omp end ordered
         end do
         !$omp end do
         !$omp end parallel
      end do
    end subroutine quad

Figure 6: Dependencies resolved for the computation of spinor u and auxiliary field ψ_u in the first loop.

The first attempt using an ordered schedule does not result in a speedup and is commented out. Since within each loop no field values computed in the same loop at neighbouring node points are required, no ordered do-loops have to be enforced, see Fig. 6. The calculation of spinor v only needs spinor u at the node to the right, which however has already been calculated before the second loop is launched, Fig. 7.

The following tables list the sequential and parallel execution times, for the full code and for the quadrature only, as well as the corresponding speedup (using p cores)

S_p = \frac{T_s}{T_p}    (14)

and parallel efficiency

E_p = \frac{S_p}{p} = \frac{T_s}{p\,T_p}.    (15)
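A trivial helper implementing Eqs. (14) and (15), here applied to one row of Table 4 below as a check (sketch only, not part of the original code):

    def speedup_and_efficiency(T_s, T_p, p):
        # Speedup S_p = T_s / T_p, Eq. (14), and parallel efficiency
        # E_p = S_p / p, Eq. (15)
        S_p = T_s / T_p
        return S_p, S_p / p

    # Quadrature timings for grid size 50 000 on 8 cores (cf. Table 4):
    print(speedup_and_efficiency(3.896, 1.836, 8))   # approx. (2.122, 0.265)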

In the case of this program's simple structure, starting with sequential Python code, calling parallel Fortran and returning the result to sequential Python again, the sequential and parallel runtimes are given by

T_s = t_1 - t_0 + (t_2 - t_1)\big|_{\text{on one core}} + t_3 - t_2,

T_p = t_1 - t_0 + (t_2 - t_1)\big|_{\text{on } p \text{ cores}} + t_3 - t_2,

where t_0 and t_3 mark the start and the end of the program and t_1 and t_2 are the time stamps immediately before and after the call into the Fortran quadrature.

Figure 7: Dependencies resolved for the computation of spinor v and auxiliary field ψ_v in the second loop.

Note that for each set of parameters 10 runs have been performed, but only the one with the lowest parallel execution time has been considered and the others have been discarded. The sequential runtime has been re-evaluated for each run; in principle, however, the minimum value (for each grid size) could have been taken as the reference value. Table 3 refers to the full code (sequential plus parallel part), while Table 4 refers to the parallelizable part, i.e. the actual quadrature in the Fortran code.
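The selection of the fastest out of several runs could look like the following sketch (illustrative only; the original benchmarking script is not reproduced in this report):

    import time

    def best_of(n_runs, fn, *args, **kwargs):
        # Call fn n_runs times and keep the smallest wall-clock time,
        # mirroring the "best of 10 runs" procedure described above.
        best = float("inf")
        for _ in range(n_runs):
            t_start = time.time()
            fn(*args, **kwargs)
            best = min(best, time.time() - t_start)
        return best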

The speedup and the parallel efficiency for the full program are illustrated in Fig. 8 and for the parallelizable part in Fig. 9.

Finally, the first version 1.0 of the code prior to optimization is compared to the fastest code version 2.1, executed on 8 cores, in Fig. 10. Apparently, the optimized code runs faster by a factor of ∼ 1000, independently of the grid size.


  Grid size   No. cores   Seq. runtime (s)   Par. runtime (s)   Speedup   Par. efficiency
  50          1           0.022              0.021              1.021     1.021
  50          2           0.022              0.020              1.071     0.535
  50          4           0.022              0.020              1.103     0.276
  50          8           0.022              0.019              1.114     0.139
  500         1           0.172              0.170              1.012     1.012
  500         2           0.163              0.149              1.093     0.547
  500         4           0.163              0.144              1.131     0.283
  500         8           0.163              0.143              1.142     0.143
  5000        1           2.136              2.136              1.000     1.000
  5000        2           2.564              2.432              1.054     0.527
  5000        4           2.054              1.878              1.094     0.273
  5000        8           2.517              2.334              1.079     0.135
  50 000      1           23.427             23.187             1.010     1.010
  50 000      2           23.152             21.634             1.070     0.535
  50 000      4           23.952             22.031             1.087     0.272
  50 000      8           22.378             20.318             1.101     0.138

Table 3: Runtimes, speedup and parallel efficiency for the full program.

  Grid size   No. cores   Seq. runtime (s)   Par. runtime (s)   Speedup   Par. efficiency
  50          1           0.004              0.004              1.124     1.124
  50          2           0.004              0.003              1.542     0.771
  50          4           0.004              0.002              1.996     0.499
  50          8           0.004              0.002              2.224     0.278
  500         1           0.042              0.040              1.053     1.053
  500         2           0.040              0.026              1.543     0.772
  500         4           0.039              0.020              1.931     0.483
  500         8           0.039              0.019              2.085     0.261
  5000        1           0.391              0.390              1.002     1.002
  5000        2           0.364              0.233              1.564     0.782
  5000        4           0.374              0.188              1.935     0.484
  5000        8           0.364              0.180              2.016     0.252
  50 000      1           3.864              3.624              1.066     1.066
  50 000      2           3.873              2.355              1.644     0.822
  50 000      4           3.897              1.977              1.972     0.493
  50 000      8           3.896              1.836              2.122     0.265

Table 4: Runtimes, speedup and parallel efficiency for the parallelizable part of the program.


Figure 8: Speedup S_p = T_s/T_p (with the linear speedup shown for reference) and parallel efficiency E_p = S_p/p of the full program versus the number of cores (1 to 8), for varying grid sizes Nt = 50, 500, 5000, 50 000.

Figure 9: Speedup S_p = T_s/T_p (with the linear speedup shown for reference) and parallel efficiency E_p = S_p/p of the parallelizable part, i.e. the quadrature written in Fortran, versus the number of cores (1 to 8), for varying grid sizes Nt = 50, 500, 5000, 50 000.

Figure 10: Comparison of the code version 1.0 prior to optimization and version 2.1 executed on 8 cores.
