
Development of a Navier-Stokes solver for Direct Numerical

Simulations of Isotropic Turbulence using 2DECOMP

library

Nuno Miguel Viana Rodrigues

Thesis to obtain the Master of Science Degree in

Mechanical Engineering

Supervisors: Prof. Carlos Frederico Neves Bettencourt da Silva

Dr. Pedro Manuel da Silva Cardoso Isidro Valente

Examination Committee

Chairperson: Prof. Viriato Sérgio de Almeida Semião

Supervisor: Prof. Carlos Frederico Neves Bettencourt da Silva

Member of the Committee: Prof. Luís Rego da Cunha Eça

July 2015


“Cheshire Cat: You may have noticed that I’m not all there myself.”

In “Alice in Wonderland”, by Lewis Carroll.


Acknowledgements

I would like to thank my Family for the enormous amount of patience and sacrifice,

for all the arguments won and lost, and for pressing me onwards towards the completion of

this work and my course.

I would like to thank Dr. Carlos Silva, Dr. Reis and Dr. Valente for their contributions, named in the order in which I interacted with them in the course of this Thesis, as well as thank them for their time, patience and support. Each of them helped me greatly, by providing me with a workplace and a goal, by teaching me the basics of MPI and, in Dr. Valente's case, by pushing me forward towards better results.

I would like to thank my Father, who never saw the completion of this work and of my course, for his love, his patience, his influence on my choice of career and his support until his passing. I thank my Mother for her insistence on the completion of the course and on progress towards a higher education. I thank my Brother for being there when I needed him. I dedicate this work to my family, both those present and those who have left, while I took this voyage towards goals not yet set.

I also take a moment to thank the LASEF team for their friendship over the course of the years I have shared with them.


Abstract

This Thesis deals with the implementation of the Pseudo-Spectral Method for computationally solving the Navier-Stokes equations, using the 2DECOMP parallel library with a pencil memory arrangement, which provides the capacity for distributed parallel calculation. The goals of this thesis are to enable massive simulations, ensure solver portability, optimize the solver, ensure its scalability, and perform trial runs on established international clusters. The 2DECOMP library is portable in itself, and great care was taken to ensure that the MPI solver remained equally portable from platform to platform. This culminated in a set of successful tests run, through PRACE, on the MareNostrum III supercomputer. Optimization steps were evaluated several times across the various intermediate versions of the code using the VAMPIR profiling tools, resulting in near-ideal behaviour when increasing both the mesh size and the number of processors involved, up to the point where the communication effort began to degrade the solver's performance.


Resumo

This Thesis deals with the implementation of the Pseudo-Spectral Method for computationally solving the Navier-Stokes equations, using the 2DECOMP parallel library for a pencil memory arrangement, providing a distributed parallel calculation capability. The objectives of this thesis are to enable massive numerical simulations, ensure the portability of the solver, ensure its scalability, and run tests on international clusters. The 2DECOMP library is, by itself, portable, and great care was taken to ensure that the MPI version would also be portable from platform to platform. This work culminated in a set of tests, through PRACE, on the MareNostrum III supercomputer. Optimization steps were evaluated repeatedly throughout the various intermediate versions of this code using VAMPIR, a profiling tool, which resulted in near-ideal behaviour as the mesh and the number of processors increased, up to the point where the communication cost begins to affect the code's performance.


Table of Contents

Acknowledgements
Abstract
Resumo
Table of Contents
List of Figures
List of Tables
List of Symbols
List of Abbreviations

Chapter 1 – Introduction
    1.1 Numerical Simulation of Flows
        1.1.1 Navier-Stokes Equations
        1.1.2 Turbulence
        1.1.3 The Energy Cascade
        1.1.4 The Kolmogorov Micro Scale
        1.1.5 The Computational Cost
        1.1.6 Parallelization Schemes

Chapter 2 – Background
    2.1 Pseudo-Spectral Method
        2.1.1 Discrete Fourier Transform
        2.1.2 The Navier-Stokes Equations in the Fourier Space
        2.1.3 Spectral Space Navier-Stokes Representation
        2.1.4 Numerical Algorithm
    2.2 Simulation of Homogeneous Isotropic Turbulence
        2.2.1 Decaying Homogeneous Isotropic Turbulence
        2.2.2 Forced Homogeneous Isotropic Turbulence
        2.2.3 Forcing Method for Homogeneous Isotropic Turbulence
    2.3 Domain Decomposition; Pencil Pattern
    2.4 2decomp

Chapter 3 – Numerical Developments
    3.1 Parallelization Scheme
        3.1.1 Global Summing
        3.1.2 Maxima/Minima
        3.1.3 Local Summing
        3.1.4 Global Access at Local Level
    3.2 Fast Fourier Transforms
        3.2.1 Direct FFT Wrapper
        3.2.2 Inverse FFT Wrapper
        3.2.3 Real to Complex and Complex to Real
    3.3 Truncation and Hermitian Redundancy
    3.4 I/O
    3.5 Randomization
    3.6 Statistics
    3.7 Converter

Chapter 4 – Results
    4.1 Tests and Speed Up
        4.1.1 Development of X-to-X Version
        4.1.2 Development of X-to-Z Version
        4.1.3 Pre-Allocation Optimization
        4.1.4 Final Temporal Results
    4.2 Large Scale DNS Testing

Chapter 5 – Conclusions and Further Work
    5.1 Main Results and Project Considerations
    5.2 Future Work

References
    Bibliographic References


List of Figures

Figure 1 – Turbulent Flow as depicted by Da Vinci (1452-1519)
Figure 2 – Eddy sizes within the Energy Cascade
Figure 3 – Description of the range of scales in a turbulent flow
Figure 4 – Slab Arrangement
Figure 5 – Pencil Arrangement
Figure 6 – Slab Parallelization Scheme Examples
Figure 7 – Pencil Parallelization Scheme Examples
Figure 8 – Fourier Transform Domains (Direct and Inverse)
Figure 9 – Velocity and Wave number Vectors in Spectral Space
Figure 10 – Direct and Inverse FFT Algorithm Flowchart
Figure 11 – Parallelization Scheme for Physical and Spectral Space
Figure 12 – Global Scheme versus Pencil Scheme
Figure 13 – Global Summing Algorithm Flowchart
Figure 14 – Minima/Maxima Algorithm Flowchart
Figure 15 – Local Summing Algorithm Flowchart
Figure 16 – Global Access at Local level Algorithm Flowchart
Figure 17 – 2decomp Library FFT X to Z (Standard) Implementation
Figure 18 – Direct FFT Wrapper Algorithm Flowchart
Figure 19 – Inverse FFT Wrapper Algorithm Flowchart
Figure 20 – Complex/Real Subroutine(s) Scheme
Figure 21 – Truncation Global Scheme versus Pencil Scheme
Figure 22 – Hermitian Redundancy Algorithm Depiction
Figure 23 – Global Plate Operation Visualization
Figure 24 – Converter MPI NATIVE to FORTRAN90 NATIVE Algorithm Flowchart
Figure 25 – Temporal Results (mesh $128^3$) in Galego (X to X)
Figure 26 – VAMPIR Results Visualization, excluding non-parallelizable time expenditure
Figure 27 – Temporal Results
Figure 28 – Adimensional Temporal Results
Figure 29 – Temporal Results
Figure 30 – Adimensional Temporal Results
Figure 31 – Temporal Results
Figure 32 – Adimensional Temporal Results
Figure 33 – Iteration Time Final Results
Figure 34 – Iteration Time Final Results


List of Tables

Table 1 – Fourier Transform Properties
Table 2 – Temporal Results (mesh $128^3$) in Galego
Table 3 – Temporal results for FFTI ($128^3$)
Table 4 – FFTI Temporal usage breakdown
Table 5 – Temporal results for FFTD ($128^3$)
Table 6 – FFTD Temporal usage breakdown
Table 7 – Total Temporal expenditure
Table 8 – Time per Cycle in workstation
Table 9 – Time per Cycle in workstation
Table 10 – Total Time Expenditure
Table 11 – Time per Cycle, $64^3$, 4 cores
Table 12 – Relative Time Expenditure
Table 13 – Time (seconds per Cycle)
Table 14 – Adimensional Time per Cycle
Table 15 – Time (seconds per Cycle)
Table 16 – Adimensional Time per Cycle
Table 17 – Time (seconds per Cycle)
Table 18 – Adimensional Time per Cycle


List of Symbols

Greek Symbols:

$\alpha_k$ – Runge-Kutta coefficients
$\beta_k$ – Runge-Kutta coefficients
$\Delta t$ – Time step
$\Delta x$ – Spatial variation
$\varepsilon$ – Dissipation rate of turbulent kinetic energy
$\eta$ – Kolmogorov scale
$\lambda$ – Taylor scale; also wavelength
$\nu$ – Kinematic viscosity
$\nu^{[s]}$ – Viscosity (as used in the viscous term $L^{[s]}$)
$\rho$ – Density

Roman Symbols:

$d$ – Eddy diameter
$\vec{e}_1(k)$ – Orthogonal vector to $\vec{k}$
$\vec{e}_2(k)$ – Orthogonal vector to $\vec{k}$
$f$ – Generic function
$\Im$ – Direct Fourier Transform operator
$\Im^{-1}$ – Inverse Fourier Transform operator
$g$ – Generic function
$k$ – Wavenumber
$\vec{k}$ – Wave number vector
$K_{force}$ – Energy input at forcing
$k_0$ – Wave number for energy injection in forcing
$L$ – Length
$l_0$ – Macroscale length
$L_c$ – Characteristic length scale
$L^{[s]}[\vec{u}]$ – Viscous term in the Navier-Stokes equations
$n$ – Number of wave numbers in a given direction
$N$ – Total number of computational points in the calculation grid
$n_{procs}$ – Total number of processing cores
$N_x$ – Number of points in the calculation grid along the x-direction, also $N_1$
$N_y$ – Number of points in the calculation grid along the y-direction, also $N_2$
$N_z$ – Number of points in the calculation grid along the z-direction, also $N_3$
$N(\vec{u})$ – Non-linear and pressure term in the Navier-Stokes equations
$P$ – Power input
$P_0$, $P_1$, $P_2$, $P_3$ – Processors 0 to 3 in a multi-processor set-up
$\nabla p$ – Pressure gradient
$\mathrm{Re}$ – Reynolds number
$\mathrm{Re}_L$ – Reynolds number at the computational box of length $L$
$\mathrm{Re}_\eta$ – Reynolds number at the Kolmogorov scale
$T_{run}$ – Total run time for a simulation
$\vec{u}$ – Velocity vector
$u_i$ – Velocity in direction $i$
$u_\eta$ – Kolmogorov scale velocity
$u_\lambda$ – Taylor scale velocity
$u_0$ – Macroscale velocity
$U_c$ – Characteristic velocity scale
$U_L$ – Velocity at the box of length $L$
$\partial\vec{u}/\partial t$ – Temporal derivative of the velocity vector
$\nabla u$ – Velocity gradient
$\nabla^2 u$ – Laplacian of the velocity
$\hat{u}(k)$ – Transform of a function of $x$ into a function of $k$
$u(x)$ – Function of $x$


List of Abbreviations

CFD Computational Fluid Dynamics

DFT Discrete Fourier Transform

DNS Direct Numerical Simulation

FDM Finite Difference Method

FEM Finite Element Method

FVM Finite Volume Method

FFT Fast Fourier Transform

HIT Homogeneous Isotropic Turbulence

LES Large Eddy Simulation

MPI Message Passing Interface

PDE Partial Differential Equation

PSM Pseudo-Spectral Method

RAM Random Access Memory


Chapter 1

Chapter outline:

Chapter 1 deals with the theoretical fundamentals that underpin the developed work. It also provides insight into why this work was undertaken and what its intended goal is. In addition, it provides a short explanation of the parallelization scheme used and a comparison with other options.


Introduction

Fluid flows are of engineering significance due to the variety of applications that depend on understanding their behaviour. The subject has been extensively studied both physically and mathematically. The particular dynamics of a fluid flow can be described by the Navier-Stokes equations, presented in the next section. In turbulent flows, there are no analytical solutions to these equations. Computational Fluid Dynamics provides us with engineering-grade answers to how a given flow behaves in a particular situation, with several methods used to solve the Navier-Stokes equations or simplified forms of them. The most commonly used methods in commercial software are the Finite Volume Method (FVM), the Finite Difference Method (FDM) and the Finite Element Method (FEM), the last of which uses a similar approach but has a different mathematical foundation. These methods all benefit from being applicable to a range of different flow geometries, but require extensive care when designing the computational mesh domain in order to obtain results that successfully model the flow field.

This thesis deals with the application of the Pseudo-Spectral Method (PSM), which is characterised by its high accuracy. Its main disadvantages are the need for a regularly distributed mesh, which prevents its use in complicated flow geometries, and, like all the methods previously mentioned, a high physical memory requirement in order to be capable of modelling the smaller scales of motion present in any fluid flow. As such, this work is aimed at extending the capacities of a pre-existing algorithm towards massive calculations to be used in reference simulations. The solver is designed to use the Message Passing Interface standard, and takes advantage of the 2DECOMP library [1]. The 2DECOMP library uses a pencil arrangement for each processing core and provides full communication capabilities for inter-core and inter-node operations.


1.1 Numerical Simulation of Flows

1.1.1. Navier-Stokes Equations

The Navier-Stokes equations for incompressible fluids may be written as follows.

For mass continuity:

$$\nabla \cdot \vec{u} = 0 \qquad (1.1)$$

For momentum conservation:

$$\frac{\partial \vec{u}}{\partial t} + \left(\vec{u} \cdot \nabla\right)\vec{u} = -\frac{1}{\rho}\nabla p + \nu\,\nabla^2 \vec{u} \qquad (1.2)$$

The solution of these equations describes a flow's velocity field, defined at every point of its domain for a given time. This enables the calculation of several other properties, such as mass gradients, temperature gradients and pressure gradients. However, these equations cannot currently be solved in general form, and require simplification in order to obtain solutions for the velocity fields under study. The general form of the Navier-Stokes equations still lacks a solution for all ranges of applicability, and considerable effort goes into using the equations for engineering projects. A turbulent flow is chaotic in nature due to its non-linear convective interactions, and exhibits rapid variations of local pressure and flow velocity both spatially and temporally.

This thesis is aimed at the numerical simulation of Turbulence using a computational method suitable for the calculation of all of the flow's properties in a simplified computational domain.

1.1.2. Turbulence

Turbulence is a phenomenon which Peter Bradshaw describes as:

“Turbulence is a three-dimensional movement dependent on the time in which the vortex stretching makes it so that fluctuations in the velocity field extend into all wavelengths, between a minimum set by viscous forces and a maximum set by the boundary conditions of the flow. It is the usual state of fluids except when at low Reynolds numbers.”

According to Tsinober [10], this quote is accurate in describing the Turbulence phenomenon, but is not very helpful when first attempting to understand the events taking place. At low Reynolds numbers, a flow behaves in a laminar fashion, with the fluid viscosity dominating over the fluid's inertial behaviour. As the Reynolds number increases, the inertia present in the fluid begins to take over and the flow becomes more disorganised, acquiring a turbulent nature. Eventually the flow is entirely dominated by Turbulence, which greatly alters the flow's properties at a local level. Globally, one may see large temporary movement structures in the flow, but these are not predictable. Leonardo da Vinci had already observed and attempted to study Turbulence, but, much as today, the phenomenon is still under heavy study and requires additional modelling to compute when using the Navier-Stokes equations.

Figure 1 – Turbulent Flow as depicted by Da Vinci (1452-1519)


1.1.3. The Energy Cascade

In order to understand Turbulence one must first take into account the energy transfer processes that occur within the flow. The first of these is the Energy Cascade concept, which attempts to explain the energy transfer from large, macroscopic flow scales to microscopic flow scales, until the eventual dissipation via viscosity, where a conversion into internal energy occurs.

Turbulent motions span a wide range of scales, ranging from a macro scale at which the energy is supplied, to a micro scale at which this energy is dissipated by viscosity [ref turbulence book]. The interaction among eddies of various scales passes energy sequentially from the larger eddies to the smaller ones. This process is called the Turbulent Energy Cascade, depicted in Figure 2.

Figure 2 – Eddy sizes within the Energy Cascade

If Turbulence is statistically in equilibrium, then the rate of energy transfer from one scale to the next must be the same for all scales, so that no group of eddies sharing the same scale sees its total energy level increase or decrease over time. It then follows that the rate at which energy is supplied at the largest scales is equal to that dissipated at the smallest scales.


Let us imagine that the flow under study is contained within a cubic box of length $L$; the range of scales is then as shown in Figure 3:

Figure 3 – Description of the range of scales in a turbulent flow

Each of the scales represented is part of the Energy Cascade mechanism, and energy injected at the large scales travels continuously towards the smaller scales, down to the molecular scales. Turbulent flows contain instantaneously generated vortices, observable in practice, which transmit energy from the macroscopic scales to the molecular scale. These vortices, according to their relative size and order of magnitude, are a form of energy dissipation and mixing in the moving fluid, and vortices at each length scale behave differently, transferring energy into the next generation of smaller vortices. Large-scale and Integral-scale vortices are anisotropic and dissipate little to no energy, while also containing the highest amount of energy in a fluid flow. Smaller vortices are isotropic and contain very low amounts of energy. Both Large-scale and Integral-scale vortices have an apparently simple structure, but are in fact extremely complex and interact with all other scales at all points in their flow, which allows for the dissipation of turbulent kinetic energy down towards the small vortices and eventually to the molecular scale.

Scales depicted in Figure 3, from largest to smallest: Large Scale, Integral Scale, Taylor Microscale, Kolmogorov Scale, Molecular Scale.


1.1.4. The Kolmogorov Micro Scale

The Reynolds number at the Kolmogorov scale is $\mathrm{Re}_\eta = u_\eta \eta / \nu = 1$, which is consistent with energy dissipation by molecular viscosity. If $\mathrm{Re}_\eta = 1$, one may use the first similarity hypothesis, as enunciated by Kolmogorov [7], [8]:

First similarity hypothesis:

In every turbulent flow at sufficiently high Reynolds numbers, the statistics of small scale motions have a universal form that is uniquely determined by the kinematic viscosity, $\nu$, and the dissipation rate of turbulent kinetic energy, $\varepsilon$.

By performing a dimensional analysis, the following unique length, velocity and time scales are obtained in Eq. (1.3):

$$\eta = \left(\frac{\nu^3}{\varepsilon}\right)^{1/4} \qquad u_\eta = \left(\nu\varepsilon\right)^{1/4} \qquad \tau_\eta = \left(\frac{\nu}{\varepsilon}\right)^{1/2} \qquad (1.3)$$

By taking into account the concept of the Energy Cascade, the dissipation can be estimated from the large scales of the flow, by taking:

$$\varepsilon \sim \frac{u_0^3}{l_0} \qquad (1.4)$$

where $u_0$ and $l_0$ are the velocity and length scales of the largest eddies.

Taking all of the previous into account, the relationships between the largest scales and the smallest scales may be derived, resulting in:

$$\frac{\eta}{l_0} = \mathrm{Re}^{-3/4} \qquad \frac{u_\eta}{u_0} = \mathrm{Re}^{-1/4} \qquad \frac{\tau_\eta}{\tau_0} = \mathrm{Re}^{-1/2} \qquad (1.5)$$
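For completeness, the first of these ratios follows directly from combining Eqs. (1.3) and (1.4):

$$\frac{\eta}{l_0} = \frac{\left(\nu^3/\varepsilon\right)^{1/4}}{l_0} \sim \frac{\left(\nu^3 l_0 / u_0^3\right)^{1/4}}{l_0} = \left(\frac{\nu}{u_0 l_0}\right)^{3/4} = \mathrm{Re}^{-3/4}$$

with the velocity and time-scale ratios following from the same substitution.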


1.1.5. The Computational Cost.

Direct Numerical Simulation (DNS) has the capability to solve the Navier-Stokes equations directly, without any simplification. However, in order to capture the smallest scales of Turbulence, a fine mesh is required, with a spacing of the order of the Kolmogorov scale, which leads to a growing need for memory in order to model more minute and detailed flow properties. If this requirement is not met, the flow energy is not properly dissipated by the smallest scales.

The approach taken yields full three-dimensional, precise fields of the dependent variables, often unreachable via experimentation, at the expense of a growing need for memory as the computational mesh size increases to capture these smaller scales. Assuming a cube-shaped computational domain of length $L$, with $L$ also being the characteristic length of the largest scales present in the simulated flow, the number of points $N$ required is:

$$N \sim \left(\frac{L}{\eta}\right)^3 \qquad (1.6)$$

Therefore, for $N$:

$$N \sim \left(\frac{L}{\eta}\right)^3 \sim \left(\frac{U_L L}{\nu}\right)^{9/4} \sim \mathrm{Re}_L^{9/4} \qquad (1.7)$$

As the simulation's Reynolds number increases, an increasing number of points in the mesh is required, which translates directly into increased memory usage.

For stability, the maximum allowed time step can be found by following the Courant-Friedrichs-Lewy (CFL) condition:

$$\frac{U_L \Delta t}{\Delta x} \sim 1 \;\leftrightarrow\; \Delta t = \frac{\eta}{U_L} \qquad (1.8)$$


To reach a fully developed turbulent state, the time required is proportional to the time scales of the largest eddies,

$$T_{run} \sim \frac{L}{U_L} \qquad (1.9)$$

Therefore, the number of time steps is of the order of:

$$N_{timesteps} \sim \frac{T_{run}}{\Delta t} \sim \frac{L}{\eta} \sim \mathrm{Re}_L^{3/4} \qquad (1.10)$$

This implies that this kind of simulation is restricted to low Reynolds numbers, as the number of nodes increases with $\mathrm{Re}_L^{9/4}$, as indicated by Eq. (1.7), while the number of time steps increases with $\mathrm{Re}_L^{3/4}$, as in Eq. (1.10), with the total number of node-steps therefore increasing as $\mathrm{Re}_L^3$.
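As a rough worked example of these scalings (with illustrative numbers, not results from this thesis), a target Reynolds number of $\mathrm{Re}_L = 10^4$ implies

$$N \sim \left(10^4\right)^{9/4} = 10^9 \ \text{grid points}, \qquad N_{timesteps} \sim \left(10^4\right)^{3/4} = 10^3 \ \text{time steps},$$

which is far beyond the memory of a single workstation and motivates the parallelization schemes discussed next.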


1.1.6. Parallelization Schemes

Due to the growing memory requirements as the Reynolds number of the flow increases, common workstations rapidly become overwhelmed by both the memory demands and the calculation procedures. As such, when aiming for exceedingly large simulations, there is a need to move up to large clusters, which host thousands of cores and have large memory banks. In order to take advantage of these, specific programming tools are required; for this thesis, the Message Passing Interface (MPI) is used. MPI is designed to be used with large clusters, and consists of breaking up the larger memory into smaller portions capable of being handled by individual nodes and cores with their own physical memory banks. This is termed parallelization, and for the purposes of this thesis, two parallelization schemes were initially considered:

a) Slab Scheme

Figure 4 – Slab Arrangement

The slab parallelization scheme is the simpler of the two schemes considered during this thesis. Each processor has access to a physical memory bank which, for a three-dimensional field, is operated on in two dimensions prior to a global exchange that allows the third dimension to be operated on. Only one global communication is required, although the message is of extremely large size. The slab scheme thus has the advantage of, for a three-dimensional field, requiring only one communication. Its main disadvantage is that it requires more memory to be available on a given calculation processor than the pencil scheme, requiring at least $N^3 \times (1/n_{procs})$ memory, with $N$ being the selected mesh size for the calculations; since a slab cannot be thinner than one plane, at most $N$ processors can be used, placing a limit on the possible calculations which is linked to the physical memory available to the node to which the processor belongs.


b) Pencil Scheme

The pencil scheme consists of having each processor hold a one-dimensional piece of the three-dimensional field under study. It uses less memory per processor, but requires more processors in order to be fully efficient. Each processor requires less memory than in a slab-memory arrangement, but two communications are required in order to fully calculate the field.

Figure 5 – Pencil Arrangement

The disadvantage of this approach is that the communication effort that takes place will eventually begin to degrade code performance, depending on the localized space distribution and its relationship with the global dimensions. Given the goal of having as large a simulation as possible, the pencil scheme was selected, based on the 2DECOMP library, which provides a framework on which to build the remainder of the code.

A further benefit of a pencil-memory arrangement is the option of selecting which arrangement to use for the final calculation involving the memory fields. This option was explored in the work developed for this thesis with two variations: one using an x-dimension pencil arrangement back to an x-dimension pencil arrangement (X to X), which was intended, even at a higher communication cost, to capitalize on the calculation procedures, and one using an x-dimension pencil to a z-dimension pencil arrangement (X to Z), which instead capitalized on the reduced-communications model.


Memory decomposition plays a key role in the calculation procedures and communications. Initially, a slab decomposition pattern was considered since, when compared to the pencil decomposition, it allows two 1-dimensional FFT procedures to be done in local space, requiring only one communication step followed by the third and last FFT operation. This results in a rotated local view of the memory pattern, subject to the per-processor memory limit already discussed above.

Figure 6 – Slab Parallelization Scheme Examples



A pencil memory decomposition, presented in Figure 7, sidesteps this limitation by further restricting the local memory space to only what is required for a 1-dimensional FFT to be performed, but requires two global sets of communications, with two intermediate FFT operations prior to output.

Both approaches, however, produce, if one assumes the minimum-communications model, a different data distribution at output from the data distribution at input, with the slab decomposition producing a differently orientated slab, and the pencil decomposition producing a differently orientated pencil.

Further, more detailed comparisons of these decompositions may be found in the 2DECOMP library's Overview section [1].

Figure 7 – Pencil Parallelization Scheme Examples
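To make the pencil arrangement concrete, the following is a minimal sketch of how a field is decomposed and transposed with the 2DECOMP library, following the conventions in its public documentation [1]; the $128^3$ grid and the 4 x 4 processor grid are illustrative choices, not the settings used in this work:

    program pencil_demo
       use mpi
       use decomp_2d
       implicit none
       integer, parameter :: nx = 128, ny = 128, nz = 128
       integer :: ierr
       real(mytype), allocatable :: ux(:,:,:), uy(:,:,:), uz(:,:,:)

       call MPI_INIT(ierr)
       ! Build a 4 x 4 grid of processes (run with 16 MPI ranks);
       ! each rank owns one pencil of the global field
       call decomp_2d_init(nx, ny, nz, 4, 4)

       ! Local pencils: full extent in one direction, partial in the other two
       allocate(ux(xstart(1):xend(1), xstart(2):xend(2), xstart(3):xend(3)))
       allocate(uy(ystart(1):yend(1), ystart(2):yend(2), ystart(3):yend(3)))
       allocate(uz(zstart(1):zend(1), zstart(2):zend(2), zstart(3):zend(3)))
       ux = 0.0_mytype

       ! The two global communications mentioned in the text:
       call transpose_x_to_y(ux, uy)   ! x-pencils -> y-pencils
       call transpose_y_to_z(uy, uz)   ! y-pencils -> z-pencils

       call decomp_2d_finalize
       call MPI_FINALIZE(ierr)
    end program pencil_demo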



Chapter 2

Chapter outline:

Chapter 2 introduces the computational tools used in this work and is intended as a bridge between the theoretical introduction presented before and the numerical formulation used in the Navier-Stokes solver.


Background

The present work was created as an evolution of the previous spectral code that has been used in several publications by the research team [3], [4], [5]. The subject of this thesis was to develop a code that could eventually be comparable with the best existing codes in the world in terms of performance, and to that end it was decided to use the 2DECOMP library, available at http://www.2decomp.org/.

At the time of the creation of the MPI solver version, the original code existed as an OpenMP version, which remains more developed than the present version.

2.1 Pseudo-Spectral Method

Pseudo-Spectral Methods (PSM) originated in the 1970s and are a class of numerical methods used in applied mathematics and scientific computing for the solution of partial differential equations (PDE). The PSM is used extensively in computational fluid dynamics (CFD) for Turbulence simulation and is related to the Spectral methods, where the “pseudo” in the nomenclature refers to the treatment of the non-linear term of a PDE. Spectral solutions to time-dependent PDE are formulated in the frequency/wave number domain, and solutions are obtained in terms of Fourier coefficients. In the PSM, the PDE are solved pointwise in Fourier space, but the spatial derivatives are calculated using orthogonal functions (e.g. Fourier integrals). They are evaluated using matrix multiplications, FFTs or convolutions.


2.1.1 Discrete Fourier Transform:

For a function $u(x)$ in the Physical domain, it is possible to execute a mathematical operation that converts it into $\hat{u}(k)$ in the Spectral space (or Fourier space). This operation is called the Direct Fourier Transform, denoted by $\Im$:

$$\hat{u}(k) = \Im\{u(x)\} = \frac{1}{2\pi}\int_{-\infty}^{+\infty} u(x)\, e^{-ikx}\, dx \qquad (2.1)$$

The wave number $k$ is computed from the wavelength $\lambda$ as follows:

$$k = \frac{2\pi}{\lambda} \qquad (2.2)$$

The reverse operation is also possible, the Inverse Fourier Transform, which returns the transformed function in the Spectral space to the Physical space; the operation is described as:

$$u(x) = \Im^{-1}\{\hat{u}(k)\} = \int_{-\infty}^{+\infty} \hat{u}(k)\, e^{ikx}\, dk \qquad (2.3)$$

The comparison between a function's appearance in the Physical domain and its Fourier Transform in the Spectral space is presented in Figure 8. The transformed function has only wave numbers and amplitudes, and when transformed via the Inverse DFT it returns the same real function. This property allows operations affecting the real space to take place, in apparently simpler form, in the frequency domain.

Figure 8 – Fourier Transform Domains (Direct and Inverse)


The main advantage of this method, as previously mentioned, is the simplicity of most terms when they are in the spectral domain, even if they are apparently complicated in the physical domain. To illustrate this, Table 1 lists the correspondence between several mathematical operations in the physical and Fourier spaces.

Property            Physical Space                 Spectral Space                    Notes
Linearity           $af + bg$                      $a\hat{f} + b\hat{g}$
Derivative          $\partial f/\partial x_i$      $i k_i \hat{f}$
Laplace Operator    $\nabla^2 f$                   $-k^2 \hat{f}$
Divergence          $\partial u_i/\partial x_i$    $i k_i \hat{u}_i$
Curl                $\nabla \times u$              $i \vec{k} \times \hat{u}(k,t)$
Product             $f\,g$                         $\hat{f} * \hat{g}\,(k,t)$        $*$ – Convolution

Table 1 – Fourier Transform Properties

The table above lists the properties of the DFT used in the MPI DNS solver, showing the simplicity of the operations in the Spectral space, with the single exception of the convolution operation.

To perform a Spectral space convolution, it is more time-efficient to return the Spectral function $G = \hat{f} * \hat{g}\,(k,t)$ to its $f(x,t)\,g(x,t)$ form in the Physical space, calculate the pointwise product as normal, and only then perform a direct DFT operation in order to obtain the convolution result:

$$\hat{f} * \hat{g}\,(k,t) = \Im\left\{\, \Im^{-1}\{\hat{f}\} \cdot \Im^{-1}\{\hat{g}\} \,\right\} \qquad (2.4)$$

This defines the class of pseudo-spectral methods.
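A minimal sketch of this round trip using the 2DECOMP FFT interface is shown below; it assumes the library's documented real-to-complex transform API [1], and the field names and grid sizes are illustrative only (here the two physical-space factors are simply filled with random data):

    program conv_demo
       use mpi
       use decomp_2d
       use decomp_2d_fft
       implicit none
       integer, parameter :: nx = 64, ny = 64, nz = 64
       integer :: ierr
       integer, dimension(3) :: fft_start, fft_end, fft_size
       real(mytype), allocatable :: f(:,:,:), g(:,:,:), prod(:,:,:)
       complex(mytype), allocatable :: conv_h(:,:,:)

       call MPI_INIT(ierr)
       call decomp_2d_init(nx, ny, nz, 2, 2)   ! run with 4 MPI ranks
       call decomp_2d_fft_init                 ! physical space in x-pencils
       call decomp_2d_fft_get_size(fft_start, fft_end, fft_size)

       allocate(f(xstart(1):xend(1), xstart(2):xend(2), xstart(3):xend(3)))
       allocate(g(xstart(1):xend(1), xstart(2):xend(2), xstart(3):xend(3)))
       allocate(prod(xstart(1):xend(1), xstart(2):xend(2), xstart(3):xend(3)))
       allocate(conv_h(fft_start(1):fft_end(1), fft_start(2):fft_end(2), &
                       fft_start(3):fft_end(3)))

       ! f and g stand in for fields already brought back to physical space
       call random_number(f)
       call random_number(g)

       ! Pointwise product in physical space, then one forward transform:
       ! by the convolution theorem, conv_h holds the spectral convolution
       prod = f * g
       call decomp_2d_fft_3d(prod, conv_h)     ! real-to-complex transform

       call decomp_2d_fft_finalize
       call decomp_2d_finalize
       call MPI_FINALIZE(ierr)
    end program conv_demo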


2.1.2 The Navier-Stokes Equations in the Fourier Space

The Navier-Stokes equations can be readily written in the Fourier space. Using tensor notation, if we apply a Fourier Transform to the continuity equation, we get:

$$\Im\left\{\frac{\partial u_i}{\partial x_i}\right\} = i k_i \hat{u}_i \qquad (2.5)$$

and the previous may be written as follows:

$$k_i \hat{u}_i = 0 \qquad (2.6)$$

For the momentum conservation equation, applying the Fourier Transform to each term:

$$\Im\left\{\frac{\partial u_i}{\partial t}\right\} = \frac{\partial \hat{u}_i}{\partial t} \qquad (2.7)$$

$$\Im\left\{\nu^{[s]}\frac{\partial^2 u_i}{\partial x_j^2}\right\} = -\nu^{[s]} k^2 \hat{u}_i \qquad (2.8)$$

$$\Im\left\{-\frac{1}{\rho}\frac{\partial p}{\partial x_i}\right\} = -\frac{1}{\rho}\, i k_i \hat{p} \qquad (2.9)$$

$$\Im\left\{\frac{\partial}{\partial x_j}\left(u_i u_j\right)\right\} = \hat{G}_i \qquad (2.10)$$

Thus, the Spectral form of the Navier-Stokes momentum equation is:

$$\frac{\partial \hat{u}_i}{\partial t} + \nu^{[s]} k^2 \hat{u}_i = -\frac{1}{\rho}\, i k_i \hat{p} - \hat{G}_i \qquad (2.11)$$

By multiplying the above equation by $i k_i$, an operation equivalent to applying the divergence in the Physical space, we obtain:

$$\frac{1}{\rho}\, k^2 \hat{p} = i k_i \hat{G}_i \qquad (2.12)$$

From the previous, we may obtain the pressure:

$$\hat{p} = \rho\, \frac{i k_i \hat{G}_i}{k^2} \qquad (2.13)$$

Substituting Eq. (2.13) back into Eq. (2.11), the final form of the Navier-Stokes momentum equation in the Spectral space is obtained:

$$\frac{\partial \hat{u}_j}{\partial t} + \nu^{[s]} k^2 \hat{u}_j = -\left(\delta_{jk} - \frac{k_j k_k}{k^2}\right)\hat{G}_k \qquad (2.14)$$

Having deduced the Navier-Stokes equations in the Spectral space, a note is warranted about the difficulty of each term. While the majority of the terms are straightforward to calculate, the term $\hat{G}$ is a convolution and, as such, would be expensive to calculate while in the Spectral space. Therefore, as the terms are calculated, the term $\hat{G}$ is subjected to an inverse FFT operation, calculated as a pointwise product in the physical space, and then subjected to a direct FFT for the Spectral space calculation.

This is the only step in the entirety of the code that requires the real field to be used, in order to reduce computational operation time. Doing so, however, produces errors, as the products of the Fourier transforms yield ghost terms. These are called aliasing errors, and are solved via truncation of the corresponding wave numbers. For the truncation, the 2/3 rule was used, which removes the aliasing errors by discarding all Fourier coefficients for which:

$$k > \frac{2}{3}\, k_{max} \qquad (2.15)$$

This allows the highest wave numbers to be removed while retaining the accuracy of the PSM, which is vital in the present work.
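As an illustration, a truncation pass over a local block of Fourier coefficients could look like the following sketch (the subroutine and argument names are hypothetical, not taken from the solver):

    ! Sketch: 2/3-rule truncation of a local block of Fourier coefficients.
    ! All names here are illustrative; kx, ky, kz hold the wave numbers
    ! associated with the local (pencil) portion of the spectral field.
    subroutine dealias(uh, kx, ky, kz, kmax)
       implicit none
       complex, intent(inout) :: uh(:,:,:)      ! local Fourier coefficients
       real,    intent(in)    :: kx(:), ky(:), kz(:)
       real,    intent(in)    :: kmax           ! highest resolved wave number
       integer :: i, j, k

       do k = 1, size(uh, 3)
          do j = 1, size(uh, 2)
             do i = 1, size(uh, 1)
                ! discard coefficients beyond two thirds of k_max (Eq. 2.15)
                if (sqrt(kx(i)**2 + ky(j)**2 + kz(k)**2) > (2.0/3.0)*kmax) then
                   uh(i, j, k) = (0.0, 0.0)
                end if
             end do
          end do
       end do
    end subroutine dealias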


2.1.3 Spectral Space Navier-Stokes representation

In the Spectral domain, the Navier-Stokes equations acquire the following configuration (see Figure 9). The wave number vectors $\vec{k}$ are three-dimensional vectors originating at the coordinate centre, and the complex velocity $\hat{u}$ is located on a plane perpendicular to the vector $\vec{k}$ to which it corresponds. Each wave number vector corresponds to a single set of amplitude and angle in the plane normal to its direction. This is called the complex velocity, and has the same implicit meaning as the conversion between a real-domain function and its representation in the Spectral/Fourier space.

The complex velocity $\hat{u}$ is, in itself, a complex number which, when constrained to the plane, requires only a modulus and an angle.

Figure 9 – Velocity and Wave number Vectors in Spectral Space


2.1.4 Numerical Algorithm:

The MPI DNS solver developed here, like its predecessors, uses a fully explicit temporal advancement scheme (3rd-order Runge-Kutta) for the Navier-Stokes equations.

The equations to be solved may be written as:

$$\frac{\partial \vec{u}}{\partial t} = N(\vec{u}) + L^{[s]}[\vec{u}] \quad \text{and} \quad \nabla \cdot \vec{u} = 0 \qquad (2.16)$$

in which the terms $N(\vec{u})$ and $L^{[s]}[\vec{u}]$ are the convective (plus pressure) term and the viscous term:

$$N(\vec{u}) = \vec{u} \times \vec{\omega} - \frac{1}{\rho}\nabla p \qquad (2.17)$$

$$L^{[s]}[\vec{u}] = \nu^{[s]}\, \nabla^2 \vec{u} \qquad (2.18)$$

where $\vec{\omega} = \nabla \times \vec{u}$ is the vorticity. The 3rd-order Runge-Kutta time-stepping scheme computes, at each time step, the new velocity field $u^k$ at each sub-step from the previous sub-steps $u^{k-1}$ and $u^{k-2}$, allowing us to write:

$$\frac{u^k - u^{k-1}}{\Delta t} = \alpha_k \left\{ N(u^{k-1}) + L^{[s]}[u^{k-1}] \right\} + \beta_k \left\{ N(u^{k-2}) + L^{[s]}[u^{k-2}] \right\} \qquad (2.19)$$

still subject to:

$$\nabla \cdot u^k = 0 \qquad (2.20)$$

with the coefficients $\alpha_k$ and $\beta_k$, according to Williamson:

$$\alpha_1 = \frac{8}{15}, \;\; \beta_1 = 0; \qquad \alpha_2 = \frac{5}{12}, \;\; \beta_2 = -\frac{17}{60}; \qquad \alpha_3 = \frac{3}{4}, \;\; \beta_3 = -\frac{5}{12} \qquad (2.21)$$
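The update of Eq. (2.19) maps directly onto a low-storage loop over the three sub-steps. The following is a minimal sketch of one sub-step (the routine and argument names are hypothetical, not the solver's own):

    ! Sketch: one sub-step of the 3rd-order Runge-Kutta scheme of Eq. (2.19),
    ! using Williamson's coefficients from Eq. (2.21). "rhs" holds the freshly
    ! evaluated N(u) + L[u]; "rhs_old" holds the same term from the previous
    ! sub-step and is overwritten for use in the next one.
    subroutine rk3_substep(u, rhs, rhs_old, dt, step)
       implicit none
       complex, intent(inout) :: u(:,:,:)       ! one spectral velocity component
       complex, intent(in)    :: rhs(:,:,:)
       complex, intent(inout) :: rhs_old(:,:,:)
       real,    intent(in)    :: dt
       integer, intent(in)    :: step           ! sub-step index, 1 to 3
       real, parameter :: alpha(3) = (/ 8.0/15.0, 5.0/12.0, 3.0/4.0 /)
       real, parameter :: beta(3)  = (/ 0.0, -17.0/60.0, -5.0/12.0 /)

       u = u + dt * (alpha(step)*rhs + beta(step)*rhs_old)
       rhs_old = rhs
    end subroutine rk3_substep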


2.2 Simulation of Homogeneous Isotropic Turbulence

With the previous groundwork stated, one now turns to the simulation of turbulent flows. Given that DNS is the chosen tool, with LES also being usable in conjunction with the PSM, this effectively allows us to capture the instantaneous and chaotic essence of the Turbulence phenomenon. For reference, this thesis is designed with Homogeneous Isotropic Turbulence (HIT) in mind. A turbulent flow is isotropic if two conditions are met: rotation and buoyancy are not important, and may therefore be neglected; and there is no mean flow.

Rotation and buoyancy forces tend to suppress vertical motions in the fluid and create anisotropy between the vertical and horizontal directions. The presence of a mean flow with a particular orientation may also introduce anisotropy in the turbulent velocity and pressure fields. Further, a flow is homogeneous if there are no spatial gradients in any averaged quantity; this is equivalent to assuming that the statistics of the turbulent flow are not a function of space. The present work is aimed at performing HIT DNS simulations of two related kinds: decaying HIT and forced HIT. For decaying HIT simulations, an initial condition is set and the turbulent flow is allowed to decay via dissipation of the turbulent kinetic energy. For forced HIT simulations, energy is added to the flow, resulting in a statistically steady flow. Following the mechanisms of the energy cascade, energy is added to the low-wave-number components of the velocity field. Comparing the two kinds of HIT DNS simulation, the advantage of forced HIT is that stationary, long time series are obtained and may be analysed, at the cost of polluting the large-scale (low-frequency) motions of the flow, which become different from decaying (natural) turbulence, although the small-scale (high-frequency) statistics are unaffected, due to the Energy Cascade concept. Decaying HIT, which has the advantage of leaving the large scales unpolluted, does not allow one to obtain long time series, as the flow tends towards dissipating the entirety of the turbulent energy present in the initial velocity field. For forced HIT, as mentioned before, the low-frequency statistics of the velocity are affected by the forcing, while the high-frequency statistics are not appreciably affected. Details of the forcing method may vary, but the underlying concept remains: the forcing mimics the energy transfer to the inertial sub-range from the larger energy-containing scales, and the small scales are accurately solved. As such, forced turbulence has been used extensively in the study of the physics pertaining to the small scales of turbulence.


2.2.1 Decaying Homogeneous Isotropic Turbulence

To simulate decaying HIT, a Homogeneous Isotropic velocity field is generated as an

initial condition, and the flow evolves by dissipation of turbulent kinetic energy, requiring no

forcing.

2.2.2 Forced Homogeneous Isotropic Turbulence

A forced Homogeneous Isotropic Turbulence simulation consists of the following. When a turbulent flow is statistically invariant under rotation about arbitrary axes and, in consequence, statistically invariant under translations, it is deemed homogeneous and isotropic. While this corresponds to an idealized type of turbulent flow, one may approximate it in practical experimentation, such as in a wind tunnel.

Now, if one takes a cubic domain with periodic boundary conditions in all three directions in space, one may simulate homogeneous isotropic turbulence numerically, allowing the use of efficient and fast numerical schemes, such as the PSM, as is the case in this particular section. Of note is the following: the turbulent kinetic energy, present in any flow with a turbulence component, needs to be sustained by forcing in order to avoid the dependence of the small scales on the decaying large scales, where the forcing is applied. This is done to keep the turbulent kinetic energy steady. The process consists of artificially adding energy to the low-wave-number components of the velocity field. If this is not performed, one instead obtains a decaying turbulence simulation, which, while also accommodated by the solver, is a much simpler use case.

For statistically stationary isotropic turbulence, with forcing as mentioned, the characteristic length and velocity scales are $L_c = k_0^{-1}$ and $U_c = (P/k_0)^{1/3}$, where $k_0$ is the forcing wave number and $P$ the power input; thus, the corresponding Reynolds number becomes:

$$\mathrm{Re} = \frac{P^{1/3}\, k_0^{-4/3}}{\nu} \qquad (2.22)$$


2.2.3 Forcing Method for Homogeneous Isotropic Turbulence

The forcing mimics the energy transfer from the larger energy-containing scales to the inertial sub-range, while the small scales are accurately solved, as mentioned already. As such, forced turbulence has been used extensively in the study of the physics pertaining to the small scales of turbulence. Forcing, in this solver, is done in the spectral space and requires additional parameters. The first is the power input parameter $P$, defined by the user on program start, which controls the energy of the flow, $K_{force}$, to be kept constant through the simulation. At a stationary stage, the rate of change of the integrated turbulent kinetic energy, $dK_{force}/dt$, is zero, since the dissipation rate $\varepsilon$ matches the input power $P$.

The other inputs are the wave numbers $k$ at which the energy is injected for the simulation to progress. The forcing is, as mentioned, implemented in the spectral space, by means of a forcing term $\hat{f}$, a 3-dimensional vector.

The Navier-Stokes equation in the spectral space can now be written as:

$$\left(\frac{\partial}{\partial t} + \nu k^2\right)\hat{u}_j(\mathbf{k},t) = \left(\delta_{js} - \frac{k_j k_s}{k^2}\right)\hat{G}_s(\mathbf{k},t) + \hat{f}_j(\mathbf{k},t) \qquad (2.23)$$

where $\hat{G}$ denotes the Fourier transform of the non-linear term,

with the forcing term $\hat{f}$ defined as:

$$\hat{f}(\mathbf{k},t) = A_{random}(k,t)\,e_1(\mathbf{k}) + B_{random}(k,t)\,e_2(\mathbf{k}) \qquad (2.24)$$

where the terms $A_{random}(k,t)$ and $B_{random}(k,t)$ are complex random numbers and $e_1(\mathbf{k})$ and $e_2(\mathbf{k})$ are unit vectors. The random force is additionally chosen to be divergence-free, which yields the following condition:

$$\mathbf{k}\cdot\hat{f}(\mathbf{k},t) = 0 \qquad (2.25)$$

This implies that the force lies in the same plane as the velocity field, and leads to the requirement that the vectors $e_1(\mathbf{k})$ and $e_2(\mathbf{k})$ be orthogonal to each other and to $\mathbf{k}$.


The vectors $e_1(\mathbf{k})$ and $e_2(\mathbf{k})$ are defined as follows:

$$e_1(\mathbf{k}) = (e_{1x}, e_{1y}, e_{1z}) = \left(\frac{k_y}{\sqrt{k_x^2+k_y^2}},\; -\frac{k_x}{\sqrt{k_x^2+k_y^2}},\; 0\right) \qquad (2.26)$$

$$e_2(\mathbf{k}) = (e_{2x}, e_{2y}, e_{2z}) = \left(\frac{k_x k_z}{k\sqrt{k_x^2+k_y^2}},\; \frac{k_y k_z}{k\sqrt{k_x^2+k_y^2}},\; -\frac{\sqrt{k_x^2+k_y^2}}{k}\right) \qquad (2.27)$$

The random numbers $A_{random}$ and $B_{random}$ are given by:

$$A_{random} = \left(\frac{F(k)}{2\pi k^2}\right)^{1/2} e^{i\theta_1}\, g_A(\phi) \qquad (2.28)$$

$$B_{random} = \left(\frac{F(k)}{2\pi k^2}\right)^{1/2} e^{i\theta_2}\, g_B(\phi) \qquad (2.29)$$

with $F(k)$ being the prescribed force spectrum, and $g_A$ and $g_B$ two real-valued functions related by $g_A + g_B = 1$. The values $\theta_1$ and $\theta_2$ are random angles with $\theta_1, \theta_2 \in [0, 2\pi]$, and $\phi \in [0, \pi]$ is a random number as well, generated at each wave number and discrete time level.

In order to produce isotropic forcing, $g_A$ and $g_B$ are defined as:

$$g_A = \sin^2\phi \qquad (2.30)$$

$$g_B = \cos^2\phi \qquad (2.31)$$


To cancel any correlation between the force and the velocity field, $\theta_1$ must satisfy the following:

$$\tan\theta_1 = \frac{g_A\,\mathrm{Re}\{\xi_1\} + g_B\left(\sin\varphi\,\mathrm{Im}\{\xi_2\} + \cos\varphi\,\mathrm{Re}\{\xi_2\}\right)}{g_B\left(\sin\varphi\,\mathrm{Re}\{\xi_2\} - \cos\varphi\,\mathrm{Im}\{\xi_2\}\right) - g_A\,\mathrm{Im}\{\xi_1\}} \qquad (2.32)$$

with $\xi_1 = \hat{u}\cdot e_1$ and $\xi_2 = \hat{u}\cdot e_2$, and $\varphi$ being an angle on $[0, 2\pi]$ defined by:

$$\varphi = \theta_2 - \theta_1 \qquad (2.33)$$

The force spectrum has the shape:

$$F(k) = A\, e^{-(k-k_0)^2/c} \qquad (2.34)$$

which yields a force concentrated at the wave number $k_0$, with the degree of concentration defined by $c$. The force is limited to be active in the wave number range $k \in [k_a, k_b]$. To match the power input to the value $P$, $A$ must satisfy the following:

$$A = \frac{P}{\Delta t \displaystyle\int_{k_a}^{k_b} e^{-(k-k_0)^2/c}\, dk} \qquad (2.35)$$

The randomness of the scheme in time decorrelates the velocity field from the forcing, avoiding the enhancement of any particular time scale. Since the forcing is independent of the velocity field, simulations may be started from a zero velocity field, with turbulence generated by the forcing itself, ensuring that the final solution is independent of the initial conditions. A condensed sketch of the force construction follows.
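The following is a minimal sketch (not the solver's actual routine) of how the force vector of Eqs. (2.24)-(2.31) can be assembled at a single wave number; all names are illustrative, and θ₁ is drawn uniformly here for brevity, whereas the full method sets it from Eq. (2.32).

   subroutine build_force(kx, ky, kz, f_spec, fhat)
      implicit none
      real(8), intent(in)     :: kx, ky, kz   ! wave-number components
      real(8), intent(in)     :: f_spec       ! prescribed spectrum F(k), Eq. (2.34)
      complex(8), intent(out) :: fhat(3)      ! random force vector, Eq. (2.24)
      real(8), parameter :: twopi = 8.0d0*atan(1.0d0)
      real(8)    :: k, kh, e1(3), e2(3), th1, th2, phi, gA, gB, r(3)
      complex(8) :: Arand, Brand
      k  = sqrt(kx*kx + ky*ky + kz*kz)
      kh = sqrt(kx*kx + ky*ky)       ! assumes kh /= 0 (kx = ky = 0 modes treated apart)
      ! unit vectors orthogonal to k and to each other, Eqs. (2.26)-(2.27)
      e1 = [ ky/kh, -kx/kh, 0.0d0 ]
      e2 = [ kx*kz/(k*kh), ky*kz/(k*kh), -kh/k ]
      call random_number(r)
      th1 = twopi*r(1)          ! theta_1 (random here; Eq. (2.32) in the full method)
      th2 = twopi*r(2)          ! theta_2 in [0,2*pi]
      phi = 0.5d0*twopi*r(3)    ! phi in [0,pi]
      gA  = sin(phi)**2         ! Eq. (2.30)
      gB  = cos(phi)**2         ! Eq. (2.31)
      ! complex amplitudes, Eqs. (2.28)-(2.29)
      Arand = sqrt(f_spec/(twopi*k*k)) * gA * exp(cmplx(0.0d0, th1, 8))
      Brand = sqrt(f_spec/(twopi*k*k)) * gB * exp(cmplx(0.0d0, th2, 8))
      ! f = A*e1 + B*e2 is divergence-free, since e1,e2 are normal to k (Eq. 2.25)
      fhat = Arand*e1 + Brand*e2
   end subroutine build_force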


2.3 Domain Decomposition; Pencil Pattern

The memory pattern used for the parallelization of the code was the pencil decomposition, for the reasons mentioned earlier. This involves a minimum of two global transposition steps for each FFT-style operation, namely from the real to the complex domain, and from the complex to the real domain for the inverse operation. The memory pattern is described by an object-like structure, which holds and calculates all of the information pertaining to the local and global data layout, with the procedures being automated by the 2Decomp Library. A separate object is used, although not strictly required, for the Physical and Spectral spaces; this was done with further development in mind, given that any further work done on the code must keep in mind that the Physical/real space has a radically different structure and organisation from the Spectral/Fourier space. Further objects are defined with different, but compatible, global dimensions for calculations, such as for the isotropic turbulence initialization routines; further work will be necessary for jet initializations.

Due to the Hermitian redundancy routine, each pencil must have matching dimensions for this version of the solver to work. This has the added benefit of ensuring, or attempting to ensure, an even memory load on all participating cores, coupled with an even processing load whenever possible, with some minor exceptions such as certain read/write-to-file operations. The 2Decomp Library supports this option automatically, if requested at compilation time. This approach also simplifies the algorithm for the transposition of a processor-shared plate, which is made simpler if all participating pencils have matching dimensions. Further, due to the previously mentioned routine, at least two participating cores are needed in any given direction, since the routine relies on the possibility of mirroring the entire domain, which involves a set of communications outside the library, implemented in the functional code.

The communications performed rely on pre-allocation of the sending/receiving buffers, coupled with a minimum set of blocking send/receive operations, thus ensuring that, while the messages may be large, space is reserved and no run time is spent allocating these blocks, with only the nodes'/cores' own communication times being relevant. A minimal initialization sketch is given below.
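As an illustration of this pattern, the following minimal sketch (with illustrative mesh and processor-grid sizes, not those used in the runs) shows how a pencil-decomposed field is set up and moved between orientations with 2Decomp's global transpositions:

   program pencil_demo
      use decomp_2d
      use MPI
      implicit none
      integer :: ierr
      real(mytype), allocatable :: ux(:,:,:), uy(:,:,:), uz(:,:,:)
      call MPI_INIT(ierr)
      call decomp_2d_init(128, 128, 128, 4, 4)   ! global mesh, p_row x p_col grid
      ! each rank allocates only its local pencils; sizes come from the library
      allocate(ux(xsize(1), xsize(2), xsize(3)))
      allocate(uy(ysize(1), ysize(2), ysize(3)))
      allocate(uz(zsize(1), zsize(2), zsize(3)))
      ux = 0.0_mytype
      ! two global transpositions take an x-pencil field into z-pencils
      call transpose_x_to_y(ux, uy)
      call transpose_y_to_z(uy, uz)
      call decomp_2d_finalize
      call MPI_FINALIZE(ierr)
   end program pencil_demo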


2.4 2decomp

As mentioned several times along the document, the spectral code was developed atop the 2Decomp Library, available at http://www.2decomp.org/, which is based around the concept of pencil memory distribution. The library deals with the vast majority of the parallelization effort and is generic for any application that shares its fundamental basis. The Library itself is not aimed exclusively at FFT operations; instead, it is built around the memory pattern and the subsequent necessary operations, such as global transpositions, halo-cell behaviour, and reading and writing to file, and is reported as being compatible with the architecture of several computational centres, although deviations may be expected. According to its creator, the 2Decomp Library is a FORTRAN framework for large-scale parallel applications designed around a 3-dimensional structured mesh.

At its core, the Library consists of a 2-dimensional pencil decomposition for distributed-memory calculations. It is, as a design feature, scalable and efficient, with an interface aimed at implementation and usage on supercomputers. The majority of the communication programming is left to the Library itself, and it includes an FFT interface able to be used with several different FFT engines. The FFT module is also, given its basis in the 2Decomp library, scalable, but requires specific information, such as the output being of a different shape than the starting domain. The output of the forward transform is of complex type, as an FFT output consists of pairs of components (real and imaginary parts) rather than a single real variable; in this work it is later converted to a real-type array holding the same data. The Library is designed with portability in mind, which further discouraged any attempt at modifying the Library when applying it to the MPI version, in order to allow for fast portability from computational centre to computational centre. A minimal sketch of the FFT interface follows.
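A minimal sketch, using the library's standard x-to-z configuration; the array names are illustrative:

   use decomp_2d
   use decomp_2d_fft
   real(mytype),    allocatable :: in_phys(:,:,:)
   complex(mytype), allocatable :: out_spec(:,:,:)
   integer, dimension(3) :: fft_start, fft_end, fft_size

   call decomp_2d_fft_init(PHYSICAL_IN_X)
   ! the complex output carries (Nx/2+1) points in x; query its local extents
   call decomp_2d_fft_get_size(fft_start, fft_end, fft_size)
   allocate(in_phys(xsize(1), xsize(2), xsize(3)))
   allocate(out_spec(fft_start(1):fft_end(1), fft_start(2):fft_end(2), &
                     fft_start(3):fft_end(3)))
   call decomp_2d_fft_3d(in_phys, out_spec)   ! forward, real to complex
   call decomp_2d_fft_3d(out_spec, in_phys)   ! inverse, complex to real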

Beyond the parallelization scheme, the Library also includes functionality to allow for the interpolation procedures necessary for calculating, say, a velocity derivative by means of finite differences. This functionality, called Halo-Cell, consists of a further set of operations that allow cores holding globally adjacent memory blocks to share the relevant boundary information, based on the user's needs. This functionality is reported as having been optimized, thus justifying its use; nevertheless, being communication intensive (in a different fashion from global transpositions), some caution is warranted, as each core must send/receive at least eight messages of largely different sizes. The Halo-Cell functionality is used in enforcing the Hermitian redundancy algorithm, which requires information from adjacent memory blocks present in participating processors.


For applications requiring periodical boundary conditions, the Halo Cell functionality

also allows for, when the memory is not immediately present in the core, to request

information from the opposite side of the domain to which the periodicity is related to. The

exception occurs when the memory is entirely present in the core, thereby voiding a need to

perform communications. For this work, only a first level memory access was required, but

the library supports a larger level of halo functionalities, to allow for higher order

interpolations to be performed on a global basis.
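A minimal sketch, assuming a level-1 halo on an x-pencil field; u and u_halo are illustrative names, and the halo copy is allocated by the library call itself:

   use decomp_2d
   real(mytype), allocatable :: u(:,:,:), u_halo(:,:,:)
   ! expand the local pencil with one layer of neighbouring-pencil data
   ! on each shared face; deeper halos use a larger level argument
   call update_halo(u, u_halo, 1)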

On the topic of pencil orientations, the Library supports x-pencil (real) input to z-pencil (complex) output for an FFT, as well as the reverse configuration, from z-pencil (real) input to x-pencil (complex) output. However, in the latter configuration the FFT operations must start in the z-direction, which causes the auxiliary extra variables to be stored at the end of the z-pencil, but horizontally. This causes the x-dimension of the output domain to be halved at the start, with an auxiliary layer, and this configuration was not used, since the majority of this code rests on the standard organisation, in which the halved dimension is carried by the z-pencils.

The Library also supports I/O operations, done either with per-call file opening and closing, or with the file opened beforehand and kept open. For the latter case, all the relevant in-memory data is transcribed directly to disk by each processing core, with the information requiring opening and conversion at a later time. The saved file does not carry any information pertaining to the core arrangement used, and may be opened with a different core arrangement at any time, allowing flexibility for program runs at later stages if a given cluster has a different core and node availability. Some functionality for performing portable file saves is provided, as well as a TECPLOT routine for writing slabs, but in this code these were left for external tools that read the memory stored during execution time at discrete intervals.

This library is used successfully in Incompact3D (https://code.google.com/p/incompact3d/). The successful use of the library, and its release for academic and research purposes, prompted its selection for this Thesis, which had the intent of providing a completely distributed parallel PSM calculation engine for research purposes, coupled with the possibility of later expansion and refinement.


Chapter 3

Chapter outline:

Chapter 3 deals exclusively with the work done in the implementation of the principles mentioned in Chapters 1 and 2, and consists of a collection of the algorithms used in the set of routines developed during implementation.


Numerical Developments

In order to develop the MPI version of the spectral code currently in use, a deep study had to be performed of the logical processes used in the parent code, in order to plan and develop a functional MPI variant for use on computational clusters.

Coupling that goal with the requirement that all participating subroutines be parallelized case by case and verified against the Serial or OpenMP versions of the same solver, completely parallelizing the entirety of the solver code was deemed beyond the scope of this Thesis.

Instead, the pressing need was to first develop the capacity to calculate FFTs with proper real-kind outputs and verify these against functional variants in other codes.

At this stage, in order to induce a minimum of changes to the way most of the memory was accessed, the choice was made to use, for the spectral space, an x-pencil-oriented memory pattern, which demanded two extra communications for proper memory transposition and global spectral-space dimensions.

However, such an approach requires double the communication effort, as seen in the flowcharts for the Inverse and Direct FFT operations presented on the next page. As seen in the flowcharts listed in Figure 10, a minimum of four communications is needed to transpose the memory: two communications required for the FFT processes, and two more to rotate the memory back into an x-oriented memory pencil.

Realizing that such communication effort would eventually require optimization, the decision was taken to re-create all participating subroutines so that the vast majority would be orientation independent, paving the way for further optimization steps.

No pre-allocation was done at this stage; that would be a separate step, if taken.


Figure 10 – Direct and Inverse FFT Algorithm Flowcharts (Direct and Inverse, X-to-X and X-to-Z procedures: sequences of 1D FFTs in each direction, interleaved with the global transpositions between x-, y- and z-pencils, taking a Physical input to a Spectral output and vice versa).


Not all subroutines were readily set up for a posterior communication-optimization step, such as the Truncation and Hermitian redundancy routine presented later in this Chapter, due to specifics of pencil orientation and minimal communication requirements.

Following the correct functioning of the initial x-pencil to x-pencil MPI solver, an effort was made to profile the majority of the code using the VAMPIR tracing tool, which is standard for MPI, in order to find possible optimization regions in the code. In doing so, it was confirmed that the communication effort made the Direct and Inverse FFTs the most time-consuming routines, prompting the first optimization procedure: the reduction of communication while retaining code stability.

The last and most important step, which enabled faster processing times, was the removal of almost all allocation procedures from the regular calculation cycles and communication buffers, with all variables of three- and two-dimensional nature being pre-allocated and their sizes remaining unchanged for the vast majority of the calculation cycles.

While buffered communications were not optimized, the pre-allocation steps taken should be sufficient to ensure proper message handling, although this point is addressed more in-depth in the Further Work section.


3.1 Parallelization Scheme

In the course of the development of the MPI DNS solver, several different routines required translation to local memory access and operation. The following section introduces the schemes used, ranging from the simplest to those that demand extensive local information correlated with global localization.

These schemes affect mostly the statistical routines and are used extensively in those cases. Due to the pencil memory distribution, some routines require knowledge of where a local array sits within the global memory. Accessing these arrays is sometimes dependent on both the local position and the global information, while in other cases these are minor details or not required at all. Regardless of the memory disposition, the schemes used satisfy the requirements of the original code in terms of address access wherever that is relevant.

The parallelization scheme chosen is similar to the 2decomp Library default, with the Physical domain distributed along x-pencils and the Spectral space distributed along z-pencils. The z-pencils hold the complex output of the FFT, but in a real-kind array of near-identical size, resulting in the same memory usage for the Physical-space and Spectral-space information, as depicted in Figure 11.

Figure 11 – Parallelization Scheme for Physical and Spectral Space (Physical space: global dimensions Nx × Ny × Nz, real kind, x-pencil orientation; Spectral space: global dimensions Nx × Ny × Nz, real kind, z-pencil orientation).


However, in the initial stages of development, and due to the functioning of the base code, a further step was taken towards a completely distributed parallel MPI version, by making key one-dimensional arrays match, in size and shape, the global pencil arrangement; Figure 12 shows a visualization of what was intended. This disposition enables the majority of routines to operate without global information, by ensuring that arrays affecting specified entries in the field are generated with the same shapes as the corresponding z-pencils, allowing the developer to synchronize the local arrays at any point.

The prime example of this logic is the initialization of the wave-number arrays, which contain information regarding the local energy and wave number at a global level. Parallelizing these arrays allows for a direct, index-independent correlation between the local portion of the global memory and the wave numbers, a capacity exploited, for example, in the truncation of the higher wave numbers in the corresponding subroutine.

Figure 12 – Global Scheme versus Pencil Scheme (serial implementation of the global spectral space, dimensions Nx × Ny × Nz, real kind, versus the distributed z-pencil arrangement).


3.1.1 Global Summing

Global summing is dealt with efficiently in the MPI standard. A local sum is performed, by inspecting the entire local portion of the array entry by entry and accumulating into an auxiliary local variable, after which the local auxiliary variable is globally reduced using an MPI Reduce or Allreduce operation. Some routines have been further simplified by a local sum coupled with a global MPI all-reduce operation using the MPI_SUM command.

Figure 13 – Global Summing Algorithm Flowchart (local summation of the input local field, followed by MPI_ALLREDUCE for the global sum).

3.1.2 Maxima/Minima

To find a global minimum or maximum, the simplest route is for each processing core to find its local minimum or maximum, and then reduce these using a global reducing operation.

Figure 14 – Minima/Maxima Algorithm Flowchart (local MIN/MAX procedure on the input local field, followed by MPI_ALLREDUCE for the global MIN/MAX).

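A minimal sketch of both patterns of Figures 13 and 14 (field and the scalar temporaries are illustrative names; double precision assumed):

   use MPI
   real(8) :: local_sum, global_sum, local_max, global_max
   integer :: ierr
   local_sum = sum(field)      ! local portion of the distributed array only
   local_max = maxval(field)
   call MPI_ALLREDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                      MPI_SUM, MPI_COMM_WORLD, ierr)
   call MPI_ALLREDUCE(local_max, global_max, 1, MPI_DOUBLE_PRECISION, &
                      MPI_MAX, MPI_COMM_WORLD, ierr)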


3.1.3 Local Summing

In some routines, local information is relevant at a global level, but the global information may not be locally present to perform a summation or comparison. In these cases, the information in the local array under scrutiny must be accessed individually, with an externally defined object, used as input, describing the global position of the local information. The prime examples are statistical routines operating in the spectral space, where information pertaining to a specific global slab must be summed differently from the remainder of the domain.

Given that only a few cores hold information from this slab, and that each core divides the memory into pencils, the entire slab is not present in any single core's memory. As such, the globally defined object is what allows the particular treatment of the cores containing the slab.

Figure 15 – Local Summing Algorithm Flowchart (the local field is cycled in k-j-i order and each entry compared against the global-to-local object information to select between two summation procedures, with a final MPI_ALLREDUCE producing the output value).


3.1.4 Global Access at Local level

In specific cases, only certain entries of the local arrays, defined on a global basis, are accessed, and the manner of access may or may not be simple to parallelize. The prime example is when the accessing index lies outside the bounds of the local information and, by extension, when the point of reference for this index is also outside the bounds of the local space. In this case, conditional accessing correlated with global information is required. The conditional sections keep each processor from performing unnecessary checks in the inner loops, thereby removing unnecessary operations from the processor.

Figure 16 – Global Access at Local Level Algorithm Flowchart (as in Figure 15, with an additional global condition gating the information access before the summation procedure and the final MPI_ALLREDUCE).
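A minimal sketch of this pattern, for the example of summing one global slab (j = j_global) of a z-pencil field; field, j_global and the sums are illustrative names, while zstart, zend and zsize are the global pencil bounds kept by 2DECOMP:

   local_sum = 0.0d0
   ! the global condition is tested once, outside the inner loops
   if (j_global >= zstart(2) .and. j_global <= zend(2)) then
      j = j_global - zstart(2) + 1          ! global index -> local index
      do k = 1, zsize(3)
         do i = 1, zsize(1)
            local_sum = local_sum + field(i, j, k)
         end do
      end do
   end if
   call MPI_ALLREDUCE(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                      MPI_SUM, MPI_COMM_WORLD, ierr)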


3.2 Fast Fourier Transforms

For the MPI version of the PSM DNS code, wrappers had to be developed for the Direct and Inverse Fourier transform routines present in the 2Decomp Library, in order to match the output variable type, thereby requiring a minimum of modifications to the parent code. The chosen FFT engine was FFTW, the Library being flexible in regards to engine selection. The logic used in designing the wrappers was to employ pre-allocated auxiliary variables, with type and dimensions matching the original spectral-space output, to obtain faster computational times by removing the need for run-time allocation. Memory considerations also played a role, allowing the pre-allocated memory array to be reused, at the cost of coding complexity. Since these dimensions accommodate auxiliary variables, the wrappers then convert the complex output to a real-type output variable and remove the auxiliary-variable space from the memory pattern, returning a memory block of dimensions identical to the real global dimensions. This step was taken to facilitate further calculations, which retain only the relevant data and nothing else, a departure from the original code where, due to in-house requirements, additional space was present and complicated code development.

Figure 17 – 2decomp Library FFT X to Z (Standard) Implementation (Physical space: global dimensions Nx × Ny × Nz, real kind, x-pencil orientation; Spectral space: global dimensions (Nx/2+1) × Ny × Nz, complex kind, z-pencil orientation, including the auxiliary variables for the FFT calculation; Direct and Inverse transforms map between the two).


3.2.1 Direct FFT Wrapper

The Direct FFT wrapper was developed to take advantage of the wrappers already in existence in the 2Decomp Library, although, in order to provoke a minimal amount of changes in the remainder of the code, a variable type change is required.

Figure 18 – Direct FFT Wrapper Algorithm Flowchart (the physical, real-kind x-pencil input passes through the library's 1D FFTs and global transpositions into a complex spectral z-pencil output, which is transcribed into a real-kind array and normalized).


Figure 18 describes the algorithm used, where an auxiliary variable stores 2Decomp's FFT wrapper output, which is of complex kind with variable precision. As the majority of the code operates in the spectral space using the logic of a real-kind FFT output, a further operation is required: the information in the complex structure is transcribed to a real-kind variable, used in the remainder of the code. This output variable has dimensions differing from the complex output, in order to simplify the remainder of the code's structure. A normalization procedure is then performed. A condensed sketch of this wrapper logic follows.
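A condensed sketch of that logic (indexing simplified; aux_c, spec_r, phys_r and the global sizes are illustrative names for the pre-allocated buffers), assuming the complex output is unpacked into pairs of real entries:

   call decomp_2d_fft_3d(phys_r, aux_c)       ! forward real-to-complex FFT
   do k = 1, size(spec_r, 3)
      do j = 1, size(spec_r, 2)
         do i = 1, size(spec_r, 1)/2
            spec_r(2*i-1, j, k) = real(aux_c(i, j, k))    ! real part
            spec_r(2*i,   j, k) = aimag(aux_c(i, j, k))   ! imaginary part
         end do
      end do
   end do
   ! normalization by the number of mesh points
   spec_r = spec_r / real(nx_global*ny_global*nz_global, mytype)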


3.2.2 Inverse FFT Wrapper

Due to the Direct wrapper's requirement of returning a real-kind output, a matching Inverse operation is required in order to make use of the potential of 2Decomp's calculation procedures. The flowchart presented in Figure 19 introduces the algorithm used.

Figure 19 – Inverse FFT Wrapper Algorithm Flowchart (the real-kind spectral z-pencil input is transcribed into the pre-allocated complex buffer, inverse-transformed through the library's 1D FFTs and global transpositions, and returned as a real-kind physical x-pencil output).

The algorithm consists of using the previously allocated memory block of complex kind to store the input for the operation, with the real input being transcribed into its correct locations. The inverse FFT operation is then performed using this auxiliary complex-kind memory block, and the result is returned to the user.


3.2.3 Real to Complex and Complex to Real

As mentioned in the previous sections regarding the wrappers, there are two routines which perform the conversion of the 2decomp FFT output, a complex-type array, into a similarly dimensioned real-type array storing the same information. However, 2decomp's FFT output, in order to match existing FFT outputs in serial programs, has dimensions of (N1/2+1) × N2 × N3, as previously stated, with the N1 dimension being of complex kind, each entry consisting of two numbers in a structured pair.

In terms of memory usage, this implies that 2decomp's FFT output has an effective (N1+2) × N2 × N3 shape, which is the size of the Spectral Output/Input auxiliary memory variable. In order to use the 2decomp Library capabilities, during development of the optimized version, these extra variables, not being necessary past the FFT calculation, are not translated into the final real-kind spectral-space variable, in addition to the data being rendered into an N1 × N2 × N3 global shape, with the two components of each complex pair separated into two sequential memory blocks.

If identically sized memory pencils are enforced at compilation, the FFT global shape in 2decomp ensures that the last complex z-pencil, containing these auxiliary variables, is the single pencil with a different size (+1 complex pair per line). Outside of this, the necessary even number of nodes in the x-direction ensures that departing and arriving memory blocks have matching dimensions. These routines are direction independent.

Figure 20 – Complex/Real Subroutine(s) Scheme (spectral space of global dimensions (Nx/2+1) × Ny × Nz, complex kind, converted to and from a spectral space of global dimensions Nx × Ny × Nz, real kind, via the C-to-R and R-to-C routines).


3.3 Truncation and Hermitian Redundancy

As mentioned in section 2.2, there exists a routine that enforces the Hermitian redundancy in the spectral space. This routine's implementation is by no means trivial, with several steps taken prior to the Hermitian redundancy itself, such as the truncation of the higher wave numbers at their global locations, using the 2/3 rule.

The truncation set of instructions is required to remove aliasing errors from the calculation procedures: the presence of the higher wave numbers may produce a cumulative numerical error which, given time, may eventually cause a divergent solution. The truncation instructions are largely parallelizable with simple instructions, given that whether a specific entry should or should not be truncated depends mainly on the wave number's own numerical value; since the wave-number arrays were previously parallelized, a set of IF instructions ensures proper mesh treatment at a local level, while correctly correlating with global locations. A minimal sketch of this local test is given after Figure 21.

Figure 21 – Truncation Global Scheme versus Pencil Scheme
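The local test mentioned above can be sketched as follows (field, k_cut and the local wave-number arrays kx, ky, kz are illustrative names; for the 2/3 rule, k_cut is 2/3 of the maximum resolved wave number):

   do k = 1, zsize(3)
      do j = 1, zsize(2)
         do i = 1, zsize(1)
            ! purely local test: the wave-number arrays share the pencil shape
            if (kx(i)**2 + ky(j)**2 + kz(k)**2 > k_cut**2) then
               field(i, j, k) = 0.0_mytype
            end if
         end do
      end do
   end do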

The Hermitian redundancy, on the other hand, depends strongly on the starting pencil orientation, and while its operation set is conceptually simple, its implementation, coupled with the need for pre-allocation and communication, makes its parallelization a non-trivial task. Two versions of this routine were created and tested: one relying on the spectral space being oriented along the x-dimension, and one oriented along the z-dimension. In either case, a complete mapping and inter-core querying and broadcasting effort is required for the localized information to be shared, so that communication partners are correctly set up.

This communication and coordination step is simply the production of a communications table at the master core, coupled with a global broadcast of this table to all participating cores. Two versions were required, for the x-pencil and z-pencil memory orientations.

Returning to the Hermitian redundancy routine, the algorithm relies on copying the local part of an otherwise global plate, offsetting some of its regions in different manners, mirroring each offset region (again differently depending on the region), and relaying the mirrored and offset regions to the destination cores, where a conditional replacement is performed. This is visualized in Figure 22 and explained in Figure 23.

To do so, the Halo-Cell set of instructions present in 2Decomp is used to allow an inter-core memory-sharing operation, where each core expands its memory using adjacent-pencil information. The degree of memory expansion can be controlled by the programmer, and in this case it is the minimum required to perform the offset operation. As before, this requires the memory to be pre-allocated at a given dimension; the first time this operation is performed, the block is reshaped into a shape that remains until its eventual de-allocation, corresponding to the indexation necessary for proper operation. The memory block to be expanded, given the pressing need to optimize memory, has an auxiliary nature and is reused repeatedly, as is any communication memory pattern. This was done by design, and prevents data corruption of the functional memory fields corresponding to the velocity fields. Only once the final plate is correctly assembled is it placed back into the original data fields, safeguarding the calculation procedures at the expense of a higher memory usage; the initial information is thus present at all times and can be accessed should the need arise during posterior code development.

The algorithm is better explained using the following visual depictions, with Figure 22 representing the domain to be operated on and the participating pencils, and Figure 23 representing the operations done on that memory plate, with the grey spaces being set to zero.


Figure 22 – Hermitian Redundancy Algorithm Depiction (spectral space of global dimensions Nx × Ny × Nz, real kind, and the auxiliary base plate of global dimensions 2 × Ny × Nz; the global operations on the auxiliary plate are: 1. break up the plate into sections; 2. perform sectional mirroring; 3. perform sectional offsets; 4. rebuild the plate from the sections; 5. enforce the Hermitian condition on the plate).


Figure 23 – Global Plate Operation Visualization

This routine is critically important to the correct functioning of the code, as it ensures, if working properly, that the Hermitian redundancy is truly present and enforced. As such, it was extensively tested in both the x-pencil and z-pencil variants, with only the latter, the one depicted here, being used in the final version.

It is written in a different fashion from the rest of the code, with an emphasis on ensuring that all participating variables, mostly 3-dimensional in nature, are treated using FORTRAN's own internal array-access procedures, rather than the externally defined conditional access used in the remainder of the code. This routine requires that all participating pencils have the same size and shape in every processor holding the memory plate on which these operations are performed.

(Panels of Figure 23, global operations on the plate: 1. section breakup; 2. section mirroring; 3. section offset; 4. section rebuild.)


3.4 I/O

Given the code complexity and the simulation requirement for a large number of calculations, the velocity fields must be saved at specific points during runtime, for study on smaller-capacity workstations and for post-processing. The specific details of opening, reading and writing files with the MPI standard are available in a wide range of dedicated literature, with further operations being provided by the Library. For the purposes of this code, to ensure less time is spent during saving, the files are opened outside the main cycle and kept open, being written to when needed using the faster version of data writing present in the 2decomp library. This functionality has a couple of points that require addressing.

Firstly, a displacement variable is required, updated on a local basis, implying that it must be already allocated and of the correct type. Initially, this variable simply measures the byte size of a given number. Because parts of the global variable field reside in different processors, the displacement variable is then updated, accruing the displacement corresponding to the start location of the localized field with respect to the global field. The displacement is measured length-wise, regardless of the nature of the field in question, and given the parallelization scheme, it may be forced to skip values periodically. The details of this operation are internal to the 2decomp Library, with the user only being required to create and size the displacement variable correctly; a starting value of zero is used, with follow-up updates. This set of instructions ensures that when the writing operation is executed, the data is saved with the proper size, in what pertains to byte length, for its individual memory block. The same variable is used for reading from the file, having the same values and nature. A minimal sketch follows.
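A minimal sketch of the open-once/write-many pattern, using the library's MPI-IO routine (the file and field names are illustrative):

   use decomp_2d
   use decomp_2d_io
   use MPI
   integer :: fh, ierr
   integer(kind=MPI_OFFSET_KIND) :: disp
   call MPI_FILE_OPEN(MPI_COMM_WORLD, 'fields.dat', &
        MPI_MODE_CREATE + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ierr)
   disp = 0_MPI_OFFSET_KIND
   ! each call writes one distributed 3D field and advances disp locally
   call decomp_2d_write_var(fh, disp, 3, uz)   ! 3 = field stored in z-pencils
   call MPI_FILE_CLOSE(fh, ierr)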

The second point to be addressed is that the saving and reading operations use MPI-native representations, corresponding to a transcription of a processor's memory directly onto disk. Evidently, this information cannot be read easily by non-MPI tools. The advantage of this kind of operation is simply reading and writing speed, as the information on file is directly transcribed to and from RAM addresses, making for less time expenditure during the operation. These details about saving and writing files apply to the majority of the operations used in the code, where calculation and computational speed is more relevant than the actual manner in which data is left on disk, although this requires that post-processing tools be able to read the MPI file and convert it into a FORTRAN90-native standard, or standardize it prior to post-processing with a converter program. Due to the inability of a workstation to load the entire global variable data into its RAM, both a spectral-space and a physical-space version of the velocity field are stored, requiring an inverse FFT operation shortly before file writing. The physical-space file is not meant to be read during this code's execution, since the code functions mainly in the spectral space, where the calculations are performed, but it is useful for post-processing tool development. There was also the need to save planes from the working fields during cluster run-time, for which the 2Decomp Library includes dedicated functionality. These fields are not meant to be read outside visualization efforts, and their storage is done in MPI-native format, much like the majority of the file-saving procedures available in 2decomp. The reading of these files is also left to a post-processing converter tool, which was developed shortly after the successful trial runs of the MPI solver in MareNostrum.

3.5 Randomization

Due to the forcing method selected, a randomization method was required that took advantage of the parallel nature of the solver developed so far. Computers use a pseudo-random formulation to generate a string of numbers which statistically corresponds to a random sequence as otherwise found in nature. Pseudo-random algorithms use a set of techniques to approximate a random number distribution, but the vast majority rely on a seed, which effectively forms the basis of the formulation used in the RANDOM function present in the FORTRAN standard. In order to match results to the serial versions of the code, a number of possible random number generators had to be developed, extensible with more models and algorithms at a later stage. The solution was to develop, using the pre-allocation logic, arrays containing the random numbers, generated before they are needed in the calculation; this was done to provide for a separate optimization possibility. As mentioned before, and especially applicable to a completely distributed parallel program, should all processors initialize with the same seed, there will not be a true random number dispersion, with each processor producing the same numerical sequence. Coupled with the fact that, whenever possible, blocks are of identical or very similar size, this would imply that all processors were generating the same, or a very similar, random sequence, thereby betraying the purpose of using randomization. The solution to this problem relies on the creation of a parallel random number generator, which may then be altered to obtain different number distributions depending on the intended results. For this task, the first goal was to create a generator which, in effective terms, produced the same random numbers as the original serial version, for the same seed.


To do this, while keeping in mind that access to the RANDOM function must be kept, two routines were created, one filling a single large 3-dimensional array and another filling two large 3-dimensional arrays, ensuring that in both cases the generation was done in the same manner in which the serial version performed its calls to the RANDOM routine, for the same seed. Secondly, in order to pave the way for future developments, the choice of random number generator was tied to a run-time option variable, allowing any user to select among a range of implemented randomized algorithms at a later stage. A number of different algorithms were implemented and constructed, based on different ways of accessing the RANDOM routine. At some point in the future, further work may be done to more closely approximate other distributions for the randomized entries. For illustration, a simple rank-dependent seeding scheme is sketched below.
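One simple illustration of rank-dependent seeding (not the thesis's serial-matching generator; base_seed and rand_field are illustrative names):

   use decomp_2d, only : nrank    ! this rank's MPI id, kept by the library
   integer :: n
   integer, allocatable :: seed(:)
   call random_seed(size=n)
   allocate(seed(n))
   seed = base_seed + 37*nrank    ! offset a common seed by the rank
   call random_seed(put=seed)
   call random_number(rand_field) ! pre-fill the array of random numbers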

3.6 Statistics

During the development and execution of the code, several routines present varied statistical information to the user. These routines, in previous versions of the code, relied on the entirety of the memory being available for calculation in either a single processor (for the Serial version) or a single node (for the OpenMP version). For the MPI version, as mentioned before, this is not the case: the memory is physically separate from node to node, although shared by the processors within a given node. This implies that, for the simpler routines, local partial sums coupled to a global reduction are the most efficient way to proceed.

The routines calculate statistics, for the most part, in the spectral space, with only two or three cases involving an inverse Fourier transform to calculate the same statistics in the physical space. For more complex entities, such as vorticity, more complicated procedures exist in the Serial and OpenMP code versions, and their transcription into MPI, with its specific behaviours, is not immediate. These procedures were grouped together into an extensive routine responsible for calculating and delivering the relevant information to the user.


3.7 Converter

Due to the MPI file reading/saving procedures, which have an inherent focus on calculation speed, data is stored in MPI-native format, corresponding to a near-direct dump of RAM information onto the physical hard drive from each node, with memory displacements corresponding to the global information. This method of storing information is similar to FORTRAN's native means of saving calculation data to a file on the hard drive, but the two standards are neither identical nor directly compatible. Due to the need to run the code on large clusters, and the great difference between the RAM available at a cluster, even if distributed, and the RAM at a workstation, a converter is required to analyse any data saved during runtime.

Figure 24 – Converter MPI NATIVE to FORTRAN90 NATIVE Algorithm Flowchart (the global MPI file is cycled entry by entry in k-j-i order; entries matching the user's start/end information for the desired field section are stored in an intermediate buffer and then in the FORTRAN90-native output array).

This data access is user-defined at run time and reads only the necessary fields saved during runtime at the cluster. Further, any other details stored using MPI-derived operations can be accessed by the converter during code execution at a cluster, following a successful save, without affecting the data saved from the cluster.

The converter was thus created to read the entire field line by line; if the position/displacement of the current entry corresponds to a globally defined coordinate of interest, it is stored into the workstation's RAM, until the RAM holds all the relevant data. Storage of this information is then done in FORTRAN90's standard for file saving, enabling the user to use tools such as TECPLOT, or other codes, to investigate or manipulate the converted data. To complement this step, the reading and storing orders may differ from each other, the converter bridging this fact by delivering a saving order compatible with eventual reading by other relevant post-processing codes already available. The converter may read planes or smaller fields from the main memory block, depending on the start/end information provided by the user. A sketch of the read-and-filter loop follows.
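A sketch of that loop (schematic: the traversal order and the plane test depend on the displacement layout actually used; the unit numbers and names are illustrative, with 8-byte reals assumed):

   real(8) :: val
   integer :: i, j, k
   open(10, file='fields.dat', access='stream', form='unformatted')
   open(20, file='plane.dat', form='unformatted')
   do k = 1, nz
      do j = 1, ny
         do i = 1, nx
            read(10) val                      ! scan the MPI-native file entry by entry
            if (k == k_plane) write(20) val   ! keep only the requested section
         end do
      end do
   end do
   close(10)
   close(20)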


Chapter 4

Chapter outline:

Chapter 4 delivers the final results obtained following the successful implementation of all the algorithms presented in Chapter 3, for the subjects addressed by Chapters 1 and 2. It also addresses previous code versions produced during the coding of the solver, and presents timing results for the most important calculation routines and for the finalized solver.


Results

As mentioned in the document, at least two main versions of the MPI DNS solver were executed, with fundamental differences in how the memory pattern was handled in the spectral space. The first version, as a requirement for minimal error at the code-verification stage, was a physical x-pencil to spectral x-pencil solver. The majority of the Serial code, and by extension the OpenMP solver, operates in the x-wise direction in the spectral part of the code. In order to prevent code malfunction, the logic was to use the original solution parameters and processes, so that any further modification to the spectral-space memory arrangement could be verified at each step.

However, the fact that a pencil parallelization scheme is always fundamentally different from the Serial version implied that, regardless of the pencil orientation chosen for the memory-handling procedures, all participating routines needed to be re-created from scratch. Further, the entirety of the code demanded that all subroutines be interfaced, greatly changing the programming structure of the entire code, due to the possibility of switching the code from single to double precision using only a 2decomp-library-related compilation flag.

One fact that could not be removed at this stage was that the dimensions of the field were (N1/2+1) × N2 × N3 (of complex type). Even though, as previously mentioned, this is turned into an (N1+2) × N2 × N3 field (of real type) so that the remainder of the code can operate, the (N1+2) extent of the x-direction in global memory terms was unchanged at this time. Similarly to what happened in the Serial and OpenMP solver versions, the extra two memory planes were, following an inverse FFT operation, numerically forced to be 0.

While at this stage the necessity of avoiding the extra global memory transpositions could already be guessed at, numerical testing was performed with the aforementioned logic, using VAMPIR to extract information from the local IST clusters and from the local workstations. Once the program was verified, the highest temporal expenditures were confirmed to be the communication-related MPI operations, as well as local code operations. In order to remove the first time sink, the entirety of the code was re-structured so that a minimal communication model was achieved, and this promptly removed the communication times as the largest temporal bottleneck.


Following that, steps were taken to reduce the localized code operations, starting with the fixed-time, hardware-related allocation times. At this stage, the spectral-space dimensions of (N1+2) × N2 × N3 were adjusted so that, past the FFT calculation procedures, a spectral space of dimensions N1 × N2 × N3 was retained, which allowed all routines that cycle the field to stop depending on external information, and voided the need for the wrappers to force the extra two planes to zero.

At this stage, the majority of the code was deemed ready to advance to the more expensive resources made available at the MareNostrum cluster. Subsequently, a minor glitch was detected, due to the impossibility of testing larger meshes locally, related to FORTRAN90's maximum integer value; this was corrected in-situ at MareNostrum using the -backtrace flag, as the error had been unexpected. Once corrected, the MPI DNS solver code was extensively tested at MareNostrum. The code was functionally tested on a cubic mesh of 4096 points per direction using the x-pencil to z-pencil version, but this test was an isolated attempt without further progress, as the goal of this thesis was to enable at least a 2048 cubic mesh.

Further statistical routines were then adapted as temporal testing was performed in order to bring the project to fruition; once these were completed, the same logic of first checking code behaviour on a local workstation, then moving up to the internal IST cluster, and lastly to MareNostrum, was followed. At this stage, random number generation was added and tested at MareNostrum for the final temporal results presented here.

Past these initial steps, the MPI DNS solver code presents a near-ideal scalability curve, has several random number generator options, is verified against the Serial and OpenMP versions, and is fully able to run using a double-precision calculation procedure.


4.1 Tests and Speed Up

4.1.1 Development of X to X Version

The initial development of the MPI version of the Spectral code consisted of the creation of the wrappers and their application to the initial version of the code, which uses the x-dimension as the foundation for all DO cycles: FORTRAN, being column-major, is faster for a k-j-i memory access order, implying that the i-th index must be the innermost in any DO cycle.
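A minimal sketch of this access pattern follows (the array name and bounds are illustrative, not taken from the solver): Fortran stores arrays in column-major order, so the leftmost index must vary fastest.

program loop_order
   implicit none
   integer, parameter :: n1 = 64, n2 = 64, n3 = 64
   real(kind=8) :: u(n1, n2, n3)
   integer :: i, j, k
   u = 1.0d0
   ! k-j-i ordering: the i index (leftmost, contiguous in memory) is the
   ! innermost loop, giving stride-1 access to u
   do k = 1, n3
      do j = 1, n2
         do i = 1, n1
            u(i, j, k) = 2.0d0 * u(i, j, k)
         end do
      end do
   end do
   print *, u(n1, n2, n3)
end program loop_order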

However, closer inspection of all participating routines outside the wrappers led to the conclusion that such an approach, while desirable, would still require a major overhaul and re-writing of all participating routines.

Still, at this stage, the desired goal was the development of an MPI code with minimal functional modifications to the parent code, so as to keep taking advantage of the k-j-i order of access to the major arrays.

Regardless of this particular subset of instructions, the implementer decided to verify and prepare all basic participating subroutines for the later optimization steps aimed at reducing communications.

The initial steps taken relate directly to the usage of the 2DECOMP library itself, as all participating real variables had to be altered to the kind standard used by the library. Given the heavy communication patterns, the associated buffers and the internal processes, the code's programming structure had to be reorganized into one guided primarily by modules and interfaces.

Such an approach demands that all variables be of identical kind, with the kind defined by a 2DECOMP library parameter fixed at compilation time. This parameter, as known in the program produced, is dubbed mytype, and is responsible for ensuring that any variable, when declared, is of the proper kind/length in terms of memory. Companion definitions exist in the library to assist the user with MPI communication routines, ensuring proper variable identification: real_type and complex_type, which should be used in preference to explicit MPI_DOUBLE_PRECISION or MPI_COMPLEX instructions.
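A minimal sketch of this convention, assuming the standard 2DECOMP&FFT module name decomp_2d (which exports mytype and real_type); the routine and variable names are illustrative:

subroutine global_sum(local_field, n, total)
   use decomp_2d      ! exports mytype (kind) and real_type (MPI datatype)
   use mpi
   implicit none
   integer, intent(in) :: n
   real(mytype), intent(in) :: local_field(n)
   real(mytype), intent(out) :: total
   real(mytype) :: partial
   integer :: ierror
   partial = sum(local_field)
   ! real_type always matches mytype, so this call stays correct whether
   ! the library was compiled for single or double precision
   call MPI_Allreduce(partial, total, 1, real_type, MPI_SUM, &
                      MPI_COMM_WORLD, ierror)
end subroutine global_sum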

The utilization of this parameter, coupled with the modular nature of 2DECOMP, leads the user to construct the program with recourse to INTERFACE blocks, which ensure that any given variable, when modified or created within a subroutine, has the same kind as in the parent code. The INTERFACE ensures that the kind of a variable is correctly carried from the parent code into the functional subroutines.
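A minimal sketch of such an explicit interface, with illustrative names (init_field) and double precision standing in for mytype:

program interface_demo
   implicit none
   ! the INTERFACE block makes the allocatable attribute of the dummy
   ! argument visible to the caller
   interface
      subroutine init_field(u, n1, n2, n3)
         real(kind(1.0d0)), allocatable, intent(inout) :: u(:,:,:)
         integer, intent(in) :: n1, n2, n3
      end subroutine init_field
   end interface
   real(kind(1.0d0)), allocatable :: u(:,:,:)
   call init_field(u, 8, 8, 8)
   print *, shape(u)
end program interface_demo

subroutine init_field(u, n1, n2, n3)
   implicit none
   real(kind(1.0d0)), allocatable, intent(inout) :: u(:,:,:)
   integer, intent(in) :: n1, n2, n3
   if (.not. allocated(u)) allocate(u(n1, n2, n3))
   u = 0.0d0   ! the variable is allocated inside the routine, not the caller
end subroutine init_field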

This logic demanded that all re-created subroutines have an explicit external interface and be of a procedure nature, with all of the larger variables retaining the allocatable attribute regardless of the level at which they were invoked. The benefit is that all variables may be allocated, de-allocated and reshaped at any point in the program; while of debatable general usefulness, this allows the declaration and allocation of variables to be separated from the parent code into a different routine. This approach proved useful in one particular subroutine, which demanded the in-situ reshaping of a variable to its proper size during the first temporal cycle; the variable then retains this shape, and all correct indexing, thereafter.
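A minimal sketch of that first-cycle reshaping pattern, assuming 2DECOMP's xsize array for the local pencil extents; the flag and buffer names are illustrative:

subroutine ensure_buffer(buf, first_cycle)
   use decomp_2d      ! assumption: exports mytype and xsize(3)
   implicit none
   real(mytype), allocatable, intent(inout) :: buf(:,:,:)
   logical, intent(inout) :: first_cycle
   ! reshape in-situ on the first temporal cycle only; the buffer then
   ! keeps this shape and indexing on all later cycles
   if (first_cycle) then
      if (allocated(buf)) deallocate(buf)
      allocate(buf(xsize(1), xsize(2), xsize(3)))
      first_cycle = .false.
   end if
end subroutine ensure_buffer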

The heavy use of interfaces ensures the scoping of a given variable, allowing it to be treated in a more object-like fashion. This was in fact the intended goal of redoing all participating subroutines at this stage: ensuring that the memory blocks, when accessed, had the same kind and were de facto global entities, and were as such given Global Scope. The routine that consumed the most development effort at this stage was the Hermitian redundancy routine, which demanded heavy parallelization but minimal communications. This subroutine, as explained in its relevant section, demands a communication table set-up when written in an x-pencil to x-pencil logic, and would not be immediately transferrable to an x-pencil to z-pencil logic.

With the usage of objects containing all global information, all other subroutines were made dependent on the object and on the pencil orientation, so that the various access and operation types were correctly connected to globalized operations.
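A minimal sketch of a routine driven by such an object, assuming 2DECOMP&FFT's DECOMP_INFO derived type and its zst/zen local z-pencil index fields; the routine name and the scaling operation are illustrative:

subroutine scale_spectral(uh, sp)
   use decomp_2d      ! assumption: exports mytype and type(DECOMP_INFO)
   implicit none
   type(DECOMP_INFO), intent(in) :: sp
   complex(mytype), intent(inout) :: uh(sp%zst(1):sp%zen(1), &
                                        sp%zst(2):sp%zen(2), &
                                        sp%zst(3):sp%zen(3))
   integer :: i, j, k
   ! loop bounds come from the decomposition object rather than from
   ! hard-coded global sizes, so the routine works for any pencil layout
   do k = sp%zst(3), sp%zen(3)
      do j = sp%zst(2), sp%zen(2)
         do i = sp%zst(1), sp%zen(1)
            uh(i, j, k) = 0.5_mytype * uh(i, j, k)
         end do
      end do
   end do
end subroutine scale_spectral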

Once the x-pencil to x-pencil version was completed, it was subjected to testing on three different machines before deployment of the code on Instituto Superior Técnico's local cluster. During these tests, some difficulties and wrong assumptions were corrected, since, when testing on a node, some variables were not broadcast correctly. Once these steps were corrected, the program was tested using one node up to its maximum capacity, and then expanded to two nodes up to the cluster's capacity on those nodes, to demonstrate the proper implementation of the MPI version.

Numerous tests were done at this stage to compare the output of the MPI version against the Serial and OpenMP versions, ensuring the proper calculation of all intermediate steps. At this point, VAMPIR was required for further testing, and numerous trials were run to ascertain the greatest temporal sinks; decisions were then taken to advance into the optimization stage.

A representative table of temporal results (Table 2) from the tests done on the Galego cluster with the x-to-x MPI DNS solver is presented next, for a cubic 128³ mesh.

nprocs   T (s)
4        3,73
8        3,07
16       1,61
32       1,51

Table 2 – Temporal results (128³ mesh) in Galego

These values, when plotted and compared against ideal scaling behaviour, provide the visualisation in Figure 25, indicating a need for optimization, as well as room for it, given the intensive communication pattern.

[Log-log plot: time per cycle versus number of processes for the 128³ mesh, compared against the ideal scaling curve.]

Figure 25 – Temporal results (128³ mesh) in Galego (X to X)

Further testing for the same 128³ mesh, using 4 processes, focused on the primary cause of the temporal deviation as reported by VAMPIR: the FFT operations.


The average values over a sample of 100 cycles of FFT operations (Direct followed by Inverse), taking ten representative values, yield the following results (Table 3), with the average calculated over the listed iterations on a given field.

TEST   R→C    x→y    y→z    FFTI   Units
1      1,2    3,8    2,8    7,3    ×10⁻² s
2      1,4    3,6    3,0    7,2    ×10⁻² s
3      1,4    3,7    3,0    7,2    ×10⁻² s
4      1,4    3,6    3,3    7,0    ×10⁻² s
5      1,3    3,6    3,1    7,2    ×10⁻² s
6      1,4    3,7    3,1    7,2    ×10⁻² s
7      1,4    3,7    2,6    7,2    ×10⁻² s
8      1,4    3,9    2,8    7,2    ×10⁻² s
9      1,4    3,5    3,1    7,3    ×10⁻² s
10     1,3    3,6    3,0    7,2    ×10⁻² s
AVG    1,36   3,58   2,68   7,2    ×10⁻² s

Table 3 – Temporal results for FFTI (128³)

If we take the average values and sum the total time, we may estimate the average time consumption of the FFTI wrapper by its participating components, and take the percentages of the time expenditures, obtaining:

wrapper (FFTI)
R→C     9,2 %
x→y     24,2 %
y→z     18,1 %
FFTI    48,6 %

Table 4 – FFTI temporal usage breakdown

Table 4 indicates that, for the x-to-x version, 42,3% of the time is spent simply transposing memory between pencils, with a further 9,2% in the Real-to-Complex kind conversion, so over half of the wrapper time is spent outside the FFT operation itself.
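For clarity, these shares follow directly from the AVG row of Table 3; summing the wrapper components gives:

\[
t_{\mathrm{wrapper(FFTI)}} = (1{,}36 + 3{,}58 + 2{,}68 + 7{,}2)\times 10^{-2}\ \mathrm{s} = 14{,}82\times 10^{-2}\ \mathrm{s}
\]
\[
\frac{7{,}2}{14{,}82} \approx 48{,}6\,\%, \qquad \frac{3{,}58 + 2{,}68}{14{,}82} \approx 42{,}3\,\%
\]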


For the same sample of 100 iterations, but focusing on the FFT Direct component, the following times were obtained, as listed in Table 5:

TEST   FFTD   z→y    y→x    C→R    Units
1      7,7    2,6    3,8    1,8    ×10⁻² s
2      7,6    2,6    3,8    1,7    ×10⁻² s
3      7,7    2,6    3,8    1,8    ×10⁻² s
4      7,7    2,6    3,8    1,7    ×10⁻² s
5      7,7    2,6    3,7    1,7    ×10⁻² s
6      7,7    2,6    3,8    1,7    ×10⁻² s
7      7,7    2,6    3,8    1,7    ×10⁻² s
8      7,7    2,6    3,7    1,8    ×10⁻² s
9      7,7    2,6    3,8    1,7    ×10⁻² s
10     7,7    2,7    3,8    1,7    ×10⁻² s
AVG    7,69   2,61   3,78   1,73   ×10⁻² s

Table 5 – Temporal results for FFTD (128³)

Taking the average values and performing the same breakdown as before, we get:

wrapper (FFTD)
FFTD    48,6 %
z→y     16,5 %
y→x     23,9 %
C→R     10,9 %

Table 6 – FFTD temporal usage breakdown

Table 6 indicates that, on average, 40,4% of the time is spent transposing memory, with a further 10,9% in the Complex-to-Real kind conversion, so again over half of the wrapper time lies outside the FFT operation.

Further, by instrumenting the standard MPI calls with VAMPIR Trace, the results suggested that about 30% of the time was spent in FFT calculations (FFT plus kind conversion and allocation), with 70% being MPI-derived temporal usage, that is, communication time for the global communications and transpositions involved both externally and internally to the numerical calculation procedures.


The results, taken in conjunction, lean heavily towards the reduction of communications, even prior to an allocation-removal optimization step, which at this stage was not yet an obvious time sink. As such, the primary step was the development of a new wrapper which, using the same logic of allocation and de-allocation of intermediate buffers, would allow a direct comparison of the temporal expenditures. This wrapper, when developed and compared to the previous one, produced the following results for 1000 iterations (an arbitrarily large number), using a cubic 64³ mesh, on a workstation:

FFTD + FFTI   1000 iterations
X-TO-X        12,203 s
X-TO-Z        6,327 s

Table 7 – Total temporal expenditure

The results listed in Table 7 point towards a 48,15% speed-up from simply reducing the communications involved in the wrapper-related operations.
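The quoted figure follows from Table 7 as the relative reduction in total time:

\[
\mathrm{speed\text{-}up} = \frac{12{,}203 - 6{,}327}{12{,}203} \approx 48{,}15\,\%
\]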

Another set of tests, now for the entire program, provided similar results over 100 runs, with the time per cycle as follows:

SOLVER   100 iterations
X-TO-X   0,4 s/cycle
X-TO-Z   0,278 s/cycle

Table 8 – Time per cycle on the workstation

The results in Table 8 point towards a 29,5% expected speed-up on the workstation; while below the 48,15% speed-up presented earlier, both results taken in conjunction pointed towards the need to develop an alternate communication and calculation pattern.


4.1.2 Development of X-to-Z Version

Initially, the program was designed to behave, when in Spectral space, with an x-pencil logic, with the x-dimension being halved. However, the library creator opted, when delivering an x-pencil Spectral space, to reduce its dimension and therefore greatly alter the data distribution, with the z-dimension being halved instead. Such a fact was not acceptable, as the code relies on the data distribution along the x-dimension being halved, as already mentioned, so the possibilities were discussed and the option to progress to a Spectral z-pencil was taken.

Thanks to the work mentioned earlier, the translation of the vast majority of the code was made simpler by the variable definitions and procedural structure already in place, with the modifications consisting of redoing all cycle index information to report z-pencil rather than x-pencil information. At this stage, some instructions were opportunistically removed, and the objects pertaining to this information were made subroutine input variables rather than local copies, further cleaning up the code.

At this level of programming, the most time-consuming operations were the FFT operations, which indicated a heavy communication effort and an immediate gain for further work, owing to the initial x-pencil to x-pencil logic devised for the first version. Due to the way 2DECOMP delivers the x-pencil FFT output, it was unacceptable, for the reasons mentioned earlier, to use that version; therefore, in order to reduce the communication effort, a real-space x-pencil to Spectral-space z-pencil logic had to be pursued.
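A minimal sketch of this physical x-pencil to spectral z-pencil path, assuming the standard 2DECOMP&FFT interface (decomp_2d_init, decomp_2d_fft_init with PHYSICAL_IN_X, decomp_2d_fft_get_size and decomp_2d_fft_3d); the mesh and process-grid values are illustrative:

program fft_x_to_z
   use mpi
   use decomp_2d
   use decomp_2d_fft
   implicit none
   integer, parameter :: nx = 64, ny = 64, nz = 64
   integer, parameter :: p_row = 2, p_col = 2
   real(mytype), allocatable :: u(:,:,:)
   complex(mytype), allocatable :: uh(:,:,:)
   integer :: fft_start(3), fft_end(3), fft_size(3), ierror

   call MPI_Init(ierror)
   call decomp_2d_init(nx, ny, nz, p_row, p_col)
   call decomp_2d_fft_init(PHYSICAL_IN_X)                ! physical data held in x-pencils
   call decomp_2d_fft_get_size(fft_start, fft_end, fft_size)

   allocate(u(xsize(1), xsize(2), xsize(3)))             ! local x-pencil block
   allocate(uh(fft_size(1), fft_size(2), fft_size(3)))   ! spectral z-pencil block
   u = 1.0_mytype

   call decomp_2d_fft_3d(u, uh)                          ! real-to-complex, x- to z-pencil

   call decomp_2d_fft_finalize
   call decomp_2d_finalize
   call MPI_Finalize(ierror)
end program fft_x_to_z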

The vast majority of the participating subroutines had already been extensively reworked for the pencil-memory logic, so the change of logic demanded only minor modifications. The greatest complication was the truncation and Hermitian redundancy routine, which had to be extensively re-worked, due to its own internal logic, in order to enable a proper global reflection of a single plate. This had to be adjusted from an x-pencil to a z-pencil logic, demanding a different communication pattern, logic and implementation.


Given the results posted in the previous section, as well as the comparison tables, the program was completely altered to obey the new Spectral space distribution. As mentioned, this required the re-construction of the Hermitian redundancy routine; additionally, all cycles required an update to the new distribution, since the localised information was now organised differently.

With the program completed, and after verification and validation of the results, the timing tests could begin. Again, an arbitrary number of iterations was chosen and, to avoid cluttering computational resources, the mesh was downsized to 64³ to allow workstation usage. In this case 1500 iterations were performed, with the timed results listed in Table 9:

Version   1500 iterations
x→x       0,400 s/cycle
x→z       0,2953 s/cycle

Table 9 – Time per cycle on the workstation

While these temporal results fall in line with the FFT temporal results for a same-sized mesh, the speed-up was calculated by comparing the new temporal expenditure with the previous value, yielding a 26,2% speed-up.

For the FFT operations (Direct plus Inverse), VAMPIR reported the split between calculation time and communication time, for 1500 iterations, as:

Total              162,124 s (1500 iterations)
Calculation time   57,991 s
MPI                104,438 s

Table 10 – Total time expenditure

Table 10 shows MPI communication outside the numerical calculation procedures accounting for roughly 64% of the total time (some 36% is spent in calculation), which is a reduction from the previously obtained 70% time expenditure in global communications.


Turning again to VAMPIR, the results obtained matched the previous ones, with a slight disparity due to the added instrumentation effort, as listed in Table 11:

New (x→z)   0,3 s/cycle
Old (x→x)   0,4 s/cycle

Table 11 – Time per cycle, 64³ mesh, 4 cores

With VAMPIR automated instrumentation, the reported speed-up between the two competing versions is 25%. The relative expenditures are tabled next, in Table 12:

Runge-Kutta          33,8 %
AlltoAll             19,3 %
FFTI                 14,0 %
FFTD                 6,2 %
R→C                  2,7 %
C→R                  1,3 %
Non-parallelizable   22,7 %

Table 12 – Relative time expenditure

Table 12 lists the obtained results; at this stage, the only result requiring explanation is the one related to the Runge-Kutta calculation scheme. The FFT operations ceased to be the largest temporal expenditure, with a particular note pertaining to the higher FFTI usage: in the code, two FFTI operations are called per FFTD operation, which explains the temporal discrepancy.

The Runge-Kutta routine, now the largest temporal expenditure, is a complicated routine that accommodates several other routines. By analysing its constituent cycles and its processes, which had been remade during parallelization, these were found not to be at fault for the high temporal usage. At this stage, the only logical step was to assume that the fixed, hardware-related times were the responsible portion of this temporal sink. Given that all DO cycles had been optimized in the x-to-x to x-to-z transition, and even the calculation space had been optimized by adapting the complex-to-real and real-to-complex routines, only the several allocation and de-allocation operations (of calculation spaces, message buffers, and intermediate variables of large size) were still in need of optimization.


4.1.3 Pre-Allocation Optimization

[Bar chart of the percentage of time expenditure per routine: Runge-Kutta routine, MPI_AlltoAll, FFTI, FFTD, Real-to-Complex routine and Complex-to-Real routine.]

Figure 26 – VAMPIR results visualization, excluding non-parallelizable time expenditure

During the development of the x-pencil to z-pencil version of the code, there was the opportunity to remove some key objects that defined global structures. Following the completion of the x-pencil to z-pencil conversion, the program was verified to be around 25% faster, but that value fell short of expectations, even though VAMPIR data profiling indicated that communication time had ceased to be the major time-consuming operation. Instead, the largest time consumption now lay in non-MPI-related routines, as may be seen in Figure 26 above, where the Runge-Kutta routine presents the highest time sink.

Since the basic DO-loop order was left unchanged and there were no barriers to explain the excess time usage, the fault was traced to hardware-related mechanics, such as file opening and closing and memory allocation, of which an extensive amount had accumulated during code development. In order to verify the possibility of further time gains, all participating memory blocks were allocated externally at program start and de-allocated only at program conclusion, coupled with a reduction of memory usage by re-using temporary and auxiliary arrays, given the identical sizes and pencil orientations present in the sub-steps of the code's functional routines.

This step vastly improved the program's runtime execution; at its conclusion, the communication effort of the library occupied the majority of execution time, and while further optimization may be possible, the major limitation is now external to the code developed here. The various stages are presented next, in a format that includes a short explanation of each modification and presents the temporal results in both tabular and graphical form, with non-dimensional results included for easier visualization.


The initial stage, prior to the removal of the repeated allocation and de-allocation of the large three-dimensional variables, presented the following temporal results, as seen in Tables 13 and 14:

cores \ mesh   128³   256³   512³    1024³
4              1,9    18,1   155,3   1242,4
16             1,2    9,1    77,8    622,4
32             2,0    6,9    48,0    229,7
64*            0,5    4,1    48,9    152,1

Table 13 – Time (seconds per cycle) (X to Z) (Galego; with Allocation)

cores \ mesh   128³   256³   512³   1024³
4              1,86   2,26   2,43   2,43
16             1,17   1,13   1,22   1,22
32             1,98   0,86   0,75   0,45
64*            0,50   0,51   0,76   0,30

Table 14 – Non-dimensional time per cycle (X to Z) (Galego; with Allocation)

These results, visible in the charts (Figures 27 and 28) presented next, show a less-than-ideal scalability behaviour. Given that by this stage all cycles had been redone, and the vast majority of unnecessary operations removed from the code, only the repetitive allocation and de-allocation of intermediate buffers and communication send/receive buffers remained, as well as some global instructions pertaining to global information at the local level, necessary for specific routines.

As such, the guidelines to follow were established as (a sketch of their application is given after the list):

1. Communication buffers for MPI messages should be allocated at program start.
2. Intermediate buffers, if required, should be allocated at program start.
3. Buffers may be re-used if available, to reduce physical memory usage.
4. All global objects should be created at program start and referred to thereafter.
5. If a buffer requires re-shaping, its dimensions should be compatible from cycle to cycle.
6. Minimize memory expenditure where possible, but without sacrificing calculation speed.
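A minimal sketch of guidelines 1 to 4, assuming the decomp_2d module for mytype and the pencil extents; the module and buffer names are illustrative:

module work_buffers
   use decomp_2d      ! assumption: exports mytype, xsize(3) and zsize(3)
   implicit none
   real(mytype), allocatable :: rbuf(:,:,:)     ! shared real scratch space
   complex(mytype), allocatable :: cbuf(:,:,:)  ! shared complex scratch space
contains
   subroutine allocate_buffers()
      ! called once, right after the 2DECOMP initialization
      allocate(rbuf(xsize(1), xsize(2), xsize(3)))
      allocate(cbuf(zsize(1), zsize(2), zsize(3)))
   end subroutine allocate_buffers
   subroutine free_buffers()
      ! called once, at program conclusion
      deallocate(rbuf, cbuf)
   end subroutine free_buffers
end module work_buffers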

A note on all tables in this section pertaining to Galego: Galego's node structure has a maximum of 32 cores per node, so the 64-core results (marked with an asterisk) include inter-node communication.


[Log-log plot: time per cycle versus number of cores for the 128³, 256³, 512³ and 1024³ meshes, against the ideal scaling curve.]

Figure 27 – Temporal results (X to Z) (Galego; with Allocation)

[Log-log plot: non-dimensional time per cycle versus number of cores for the same meshes, against the ideal scaling curve.]

Figure 28 – Non-dimensional temporal results (X to Z) (Galego; with Allocation)


After following all the previously established guidelines, the results obtained are listed in Tables 15 and 16, and visualized in Figures 29 and 30:

cores \ mesh   128³   256³   512³    1024³
4              1,9    15,7   137,4   1099,2
16             0,6    4,8    46,0    324,4
32             0,3    3,3    27,6    258,0
64*            0,2    2,3    18,9    190,7

Table 15 – Time (seconds per cycle) (X to Z) (Galego; without Allocation)

cores \ mesh   128³   256³   512³   1024³
4              1,91   1,96   1,96   1,96
16             0,56   0,59   0,66   0,58
32             0,34   0,42   0,39   0,46
64*            0,24   0,29   0,27   0,34

Table 16 – Non-dimensional time per cycle (X to Z) (Galego; without Allocation)

While the magnitude of the calculation time is initially the same, the new results point towards better scalability, and no further optimization was possible without a much larger overhaul of the entire initial algorithm.

Given that no further parts of the code were seen as time sinks, the decision to progress to MareNostrum III was taken, since by this stage any remaining issues, if present, would manifest themselves on a large cluster. There was also the need to test the MPI DNS solver with a large number of processors and explore its viability in a production environment. Few problems were encountered, except when testing meshes with a grid size larger than 1024³. The reasons were trivial but had not been accounted for: the value 1024³ is close to the maximum integer value permitted by the FORTRAN90 standard. As such, the solver required slight modifications to some of the normalization procedures to side-step this limitation; once all results were numerically validated and confirmed for the mesh sizes possible at Galego and on a workstation, further testing was done up to a grid size of 2048³ as the standard for result gathering and presentation.
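A minimal sketch of the limitation and of the kind of side-step described (illustrative, not the actual solver modification): the total point count n³ overflows the default 32-bit integer, whose maximum is huge(1) = 2 147 483 647, once n = 2048, so the normalization factor must be built with a wider integer or in floating point.

program norm_factor
   implicit none
   integer, parameter :: n = 2048
   integer(kind=8) :: npoints
   real(kind=8) :: norm
   ! 2048**3 = 8 589 934 592 would overflow a default 32-bit integer,
   ! so the count is accumulated in a 64-bit integer
   npoints = int(n, kind=8)**3
   norm = 1.0d0 / real(npoints, kind=8)   ! normalization for the inverse FFT
   print *, 'points =', npoints, '  norm =', norm
end program norm_factor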


[Log-log plot: time per cycle versus number of cores for the 128³, 256³, 512³ and 1024³ meshes, against the ideal scaling curve.]

Figure 29 – Temporal results (X to Z) (Galego; without Allocation)

[Log-log plot: non-dimensional time per cycle versus number of cores for the same meshes, against the ideal scaling curve.]

Figure 30 – Non-dimensional temporal results (X to Z) (Galego; without Allocation)


For MareNostrum, the results obtained were as follows:

cores \ mesh   128³   256³   512³    1024³   2048³
4              1,60
16             0,25   2,59
32             0,15   1,55   10,80
64             0,09   0,80   5,90
128            0,06   0,43   3,12    14,00
256            0,05   0,23   1,80    6,86
512            0,06   0,18   1,09    3,91    29,90
1024           0,14   0,21   0,65    1,83    13,99
2048           0,52   0,52   0,80    1,46    20,54
4096           1,90   1,93   2,03    2,66    12,50

Table 17 – Time (seconds per cycle) (X to Z) (MareNostrum; without Allocation)

cores \ mesh   128³    256³    512³    1024³   2048³
4              1,600
16             0,250   0,250
32             0,145   0,150   0,150
64             0,090   0,077   0,082
128            0,060   0,042   0,043   0,043
256            0,050   0,022   0,025   0,021
512            0,060   0,017   0,015   0,012   0,008
1024           0,140   0,020   0,009   0,006   0,004
2048           0,520   0,050   0,011   0,005   0,006
4096           1,900   0,186   0,028   0,008   0,003

Table 18 – Non-dimensional time per cycle (X to Z) (MareNostrum; without Allocation)

These results, when plotted (Figures 31 and 32), point towards stable behaviour and, as the number of cores increases for the larger meshes, towards coherent scalability. Furthermore, when compared to the ideal scalability curve, the results present near-ideal scalability. These results also serve as a case study of the behaviour of smaller mesh sizes and of calculation time versus communication time: for a 128³ mesh, past 64 processing cores, the program actually becomes slower due to the intensive communication effort. Of particular note are the 2048³ results. Further testing was not possible due to the maximum of 4096 cores made available to the developer, but the results obtained point towards continued scalability past 4096 cores.


[Log-log plot: time per cycle versus number of cores for the 128³ to 2048³ meshes, against the ideal scaling curve.]

Figure 31 – Temporal results (X to Z) (MareNostrum; without Allocation)

[Log-log plot: non-dimensional time per cycle versus number of cores for the same meshes, against the ideal scaling curve.]

Figure 32 – Non-dimensional temporal results (X to Z) (MareNostrum; without Allocation)


4.1.4 Final Temporal Results

Following the successful transition from the serial solver to the x-to-x solver and then to the x-to-z solver, with the temporal results numerically confirmed and the optimization steps taken, further work was still required.

A new random number generator had to be created that provided number generation compatible with the previous versions, while the option of using different random engines was also put on the table. Following the same pre-allocation logic, the random number generator was built to stay separate from any calculation procedure, while providing the same number distribution, as well as other options. Further work implemented a set of statistical calculation routines, which also function during program execution, as in the original solver, but only at particular time steps. The results for the random-enabled version are plotted next, with no numerical table given, as the deviation from the temporal results already attained was negligible.

Figure 33 – Iteration Time Final Results

The results in Figure 33 have been confirmed by the research team several times and were delivered to the entity regulating and monitoring MareNostrum for subsequent projects.

With these results attained, all goals initially set out by the research team and the developer have been met, in almost the best way possible, given the near-ideal program behaviour.


4.2 Large-Scale DNS Testing

One of the stated goals of this work is to provide the capacity to generate large simulations producing finer results for research purposes. With the successful creation of a functional MPI PSM calculation engine, the program can be used to generate new data at finer resolutions. With the further successful creation of an MPI-to-FORTRAN data converter, and the subsequent possibility of accessing sections of the stored field, the old post-processing tools may be used until new post-processing tools are created. Although loading the file as performed in the converter may be conceptually slower, such a data upload must only be done once per field under investigation, with the added benefit of complete portability from workstation to workstation.

The steps taken in generating a randomizing routine allow for different initializations of the same statistical fields, so that different simulations may be run for a given data set. Even with the same previous results, the possibility of swapping the random number generator on the fly, and thereby generating different number sequences depending on the option, enables slightly different simulations with statistically identical results. Furthermore, the program is completely independent of the number of cores and nodes used, enabling the initialization or continuation of previous simulations under different memory patterns; if the appropriate options are chosen for the random number generator, this may produce yet another set of different results, allowing the research to progress.

With such possibilities, the generation of a large-scale DNS simulation was desired from the outset of the program design, with the goal of achieving larger mesh sizes.

This goal was successfully attained, the only difficulty being the fact that the maximum integer number in FORTRAN imposes a cap on the global mesh dimensions; a slight modification was made to solve the problem. This modification avoids any possible errors up to a cubic mesh size of 2 147 483 647 points per dimension, at which point the FORTRAN standard is unable to process the dimension itself.

Regardless, the stated goal was to successfully run a double-precision cubic mesh of 2048 points per dimension at MareNostrum, and to provide all data for posterior analysis using the existing post-processing tools. The previous highest achieved result was a cubic mesh of 1024 points per dimension, in double precision. This goal was routinely achieved during testing procedures, with tests of up to 4096³ cubic mesh sizes.


Chapter 5

Chapter outline:

i) Chapter 5 concludes the thesis by presenting the final results and by listing further functionalities that were not slated for parallelization at the present time.


Conclusions and Further Work

Following the creation of the DNS MPI solver, and the results obtained, this section

deals with further programs that may be required in the future, hopefully taking advantage of

the code designed.

The translation of the OpenMP version to the MPI version was not completed, as that was not the goal of the current thesis. Further work is required to complete the translation of the existing code, namely the jet portion, to MPI, but another developer has already started taking steps towards deploying this modification.

Since the solver was designed to be as simple and as expandable as possible, further development is expected, in the near future, to take advantage of the scalability achieved at this stage.

While an OpenMP version of the solver was indeed designed, an OpenMP version of the post-processing tools was not. A new MPI version may eventually be required and planned, in order to study fields on clusters directly, rather than on slower, less RAM-capable workstations with the existing post-processing capabilities.

Further, other smaller routines that allow scalar quantity measurement were not implemented, nor were other functionalities present in the code for other types of simulation, such as particle tracking. However, the work done in the present routines may, and should, be used as a basis for quickly devising functional variants of the routines mentioned in the previous considerations.


5.1 Main results and project considerations

The project taken as the basis of this Thesis was successfully accomplished, although the remainder of the code was not translated into MPI, due to architecture-specific parallelization issues, which usually demand a complete memory-handling paradigm shift from the serial implementation to a pencil implementation.

Scalability was a concern and was successfully achieved; further slight optimizations remain possible, although, given the code's current status, expansions are more readily required. Small adjustments to code speed may eventually take place, and there is certainly room for them, namely in the largest statistical routines, due to a lack of pre-allocation logic.

The remainder of the code attempts to use as few communications as required, but memory expenditure is a concern, given that the last statistical routine applied prevented IST's cluster, Galego, from reaching 1024³ mesh sizes in double precision.

A hardware expansion is required, and is currently under study, in order to facilitate the work on the next stages of code expansion.

Section 4.1.4 includes the main results, but these are repeated here, in Figure 34:

Figure 34 – Iteration Time Final Results

As can be readily seen, the scaling curves follow the ideal scaling curve rather closely as the number of participating cores increases, thereby allowing use of the program past the tested limits on larger, faster clusters, and providing access to European and world-wide simulation-size competition.


5.2 Future work

Due to the amount of work already done on the serial code version and on the OpenMP version, the entirety of the code could not be translated into the MPI version.

Not only was that not possible within the time-frame, but the real focus of the solver was to provide an optimized calculation engine capable of being further expanded by other researchers in the near future. Future work should focus both on expanding the code and on verifying all results by using no random number generation in both the serial version and the MPI version, with comparable mesh sizes, in order for it to be fully validated in terms of numerical computation. Furthermore, some work may be done in earlier routines where pre-allocation does not yet occur, such as the stat-run physical routine, where some variables (of arguably smaller size than the velocity fields, such as one-dimensional arrays) are allocated at run time.

These allocations take place on each core and, when all are taken in conjunction, for efficiency purposes it may be advisable to remove these allocation and de-allocation steps and include them in the already present structure of allocation prior to run time.

Minor tweaks may also eventually be performed, such as the removal of the complex outputs from the library at a later stage. A change heavily considered at all stages was to use the complex output from the library directly, instead of translating it from the output into the working variable; while a modest speed-up of around 10% might be obtained, the kind=complex structure may or may not in fact be slower to access at a larger scale. A finer, more optimal approach would be to contact the library developer, or to create an FFT wrapper set of routines returning a kind=real output directly after the FFT operation. The amount of work in this option would be staggering and would greatly reduce the portability of the code, and as such it was not pursued. Turning the entirety of the code into a complex kind, while also functionally possible, would imply a great change to the entire structure of the code, and as such, despite being initially planned, was not pursued either.

As such, future work should mostly concern itself with expanding, and validating the expansions of, the current code.
