
Parallel Fourier Transformations using shared memory nodes

Solon Pissis

October 6, 2008

MSc in High Performance Computing

The University of Edinburgh

Year of Presentation: 2008


Abstract

The Fast Fourier Transform (FFT) is of great importance for various scientific applications used in High Performance Computing (HPC). However, a detailed performance analysis shows that the FFT routines used in these applications prevent them from scaling to large processor counts. The All-to-All type communication required inside these transformation routines, which becomes extremely costly when large processor counts are involved, seems to be the limiting factor. In the scope of this dissertation, we mainly focus on whether and how the performance of the parallel two-dimensional (2D) FFT can be improved by exploiting the access to the shared memory nodes of HPCx, a cluster of POWER5 SMP nodes. In particular, we investigate how to efficiently transfer the data between the processing elements involved in the parallel 2D FFT. Different OpenMP strategies are proposed for the parallelisation of the 2D FFT. The results demonstrate that, for certain problem sizes between 16² and 8192², the access to the shared memory of an HPCx node (16 processors) can produce gains in performance compared to the MPI implementation. In addition, for large processor counts, we use our results from the 2D case to optimise the parallelisation of the three-dimensional (3D) FFT with the Hybrid, a mixed mode programming model between shared memory programming and message passing. In our implementation, we use the Master-only style, a version of the Hybrid model, where the MPI communication is handled only by the master thread, outside the OpenMP parallel regions. The results demonstrate a good scaling of the code for problem sizes between 64³ and 512³ up to 1024 processors. The performance comparisons illustrate that, in certain cases, the Hybrid model can prove beneficial compared to the 2D data decomposition with pure MPI.

Subject area: High Performance Computing

Keywords: Fast Fourier Transform, SMP nodes, Mixed mode programming


ἓν οἶδα ὅτι οὐδὲν οἶδα

Σωκράτης

”I know that I know nothing”

Socrates


Contents

1 Introduction

2 Fourier Transform
  2.1 Continuous Fourier Transform
  2.2 Discrete Fourier Transform
  2.3 Parallel Fourier Transform
  2.4 Fast Fourier Transform
      2.4.1 Cooley-Tukey Algorithm
      2.4.2 Fastest Fourier Transform in the West Library

3 Parallel 2D FFT inside a shared memory node
  3.1 The Strategies
      3.1.1 The Strided strategy
      3.1.2 The Transposition strategy
      3.1.3 The MPI strategy
  3.2 Experimental design
      3.2.1 The OpenMP implementation of the test program
      3.2.2 The MPI implementation of the test program
      3.2.3 Verification of the 2D FFT results
  3.3 Hardware Overview
      3.3.1 HPCx System
      3.3.2 EPCC HPC Service Ness
  3.4 Results and Analysis
      3.4.1 The 1st FFT
      3.4.2 The transposition
      3.4.3 The 2nd FFT
      3.4.4 Total execution time of the 2D FFT
      3.4.5 The Nested parallelism version
      3.4.6 Scaling, Optimisation flags and Data creation
  3.5 Synopsis

4 Parallel 3D FFT between nodes using the Hybrid model
  4.1 The Hybrid model
  4.2 Experimental design
      4.2.1 The Master-only implementation of the test program
      4.2.2 Verification of the 3D FFT results
  4.3 Results and Analysis
      4.3.1 Scaling
      4.3.2 MPI Tasks × OpenMP threads
      4.3.3 Hybrid model and pure MPI

5 Conclusions

A Parallel 2D FFT results

B Parallel 3D FFT results


List of Tables

3.1 Transposition factors influencing OpenMP versions

A.1 Average execution times (s) for the 1st FFT on 16 processors of an HPCx node
A.2 Average execution times (s) for the transposition on 16 processors of an HPCx node
A.3 Average execution times (s) for the 2nd FFT on 16 processors of an HPCx node
A.4 Average execution times (s) for the 2D FFT on 16 processors of an HPCx node
A.5 Average execution times (s) for scaling test of the 2D FFT with the first OpenMP version run on 1 up to 16 processors of an HPCx node
A.6 HPCx vs Ness percentage (%) of the total execution time for the 2D transposition with MPI on a 16 processor node

B.1 Average execution times (s) for scaling test of the 3D FFT with the Hybrid balanced scheme run on 1 up to 1024 processors of the HPCx system
B.2 Average execution times (s) with a different combination of MPI tasks × OpenMP threads on 4 nodes (64 processors) of the HPCx system
B.3 Average execution times (s) with the Hybrid model and pure MPI, as presented in [18], on 4 nodes (64 processors) of the HPCx system
B.4 Average execution times (s) with the Hybrid balanced scheme and pure MPI, as presented in [18], on 64 nodes (1024 processors) of the HPCx system


List of Figures

2.1 Processing steps for the parallel 3D FFT using the slab data decomposition
2.2 The first two processing steps for the parallel 3D FFT using the 2D data decomposition
2.3 The last processing step for the parallel 3D FFT using the 2D data decomposition
2.4 A recursive 8-point unordered FFT computation - Figure from [15]

3.1 The Strided strategy
3.2 The Transposition strategy
3.3 Parallelise the 1st loop version
3.4 Parallelise the 2nd loop version
3.5 Loops Exchange of the First version
3.6 Loops Exchange of the Second version
3.7 Nested parallelism version - 2 threads at each level
3.8 Loops Exchange of the Nested parallelism version - 2 threads at each level
3.9 The MPI Strategy
3.10 1st FFT results on 16 processors of an HPCx node
3.11 Transposition results on 16 processors of an HPCx node
3.12 MPI vs OpenMP transposition results on 16 processors of an HPCx node
3.13 MPI vs OpenMP transposition results on 16 processors of a Ness node
3.14 2nd FFT results on 16 processors of an HPCx node
3.15 Performance of the 2nd FFT
3.16 2D FFT total execution time results
3.17 2D FFT total execution time results with Nested parallelism on 16 processors of an HPCx node
3.18 2D FFT scaling test from 1 to 16 processors with the First OpenMP version
3.19 2D FFT total execution time results with Optimisation flags on 16 processors
3.20 2D FFT total execution time results with different Data creation on 16 processors

4.1 The first two processing steps for the parallel 3D FFT using the Hybrid (2 MPI Tasks × 2 OpenMP threads)
4.2 In preparation for the final processing step of the Hybrid model, a single All-to-All call between 2 MPI tasks to swap x and y dimension
4.3 3D FFT scaling test with the Hybrid balanced scheme run on 1 up to 1024 processors
4.4 3D FFT average execution time with a different combination of MPI Tasks × OpenMP threads on 4 nodes (64 processors) of HPCx system
4.5 3D FFT execution time with the Hybrid and pure MPI, as presented in [18], on 4 nodes (64 processors) of HPCx system
4.6 3D FFT execution time with the Hybrid and pure MPI, as presented in [18], on 64 nodes (1024 processors) of HPCx system


Listings

2.1 Cooley-Tukey Algorithm - Algorithm from [15]
3.1 The Strided strategy
3.2 The Transposition strategy
3.3 Parallelise the 1st loop version
3.4 Parallelise the 2nd loop version
3.5 Loops Exchange of the First version
3.6 Loops Exchange of the Second version
3.7 Nested parallelism version
3.8 Loops Exchange of the Nested parallelism version
3.9 2D dynamic allocation of contiguous memory
3.10 Measuring the accuracy of the omp_get_wtime() timer
4.1 Parallelising the 3D FFT with the Hybrid
4.2 3D dynamic allocation of contiguous memory for z and y dimension


Acknowledgements

I would like to thank my project supervisor Dr Joachim Hein for his moral support, excellent guidance and inventive ideas throughout this project.

I would also like to thank Dr Lorna Smith for her constructive comments on the final period of the project.

Special thanks go to my family for making this experience possible for me, and especially my mother, who has always been invaluably supportive during my time in Edinburgh.


Chapter 1

Introduction

In practice, time and again scientists need to determine the output of a system when it is stimulated by a signal, namely a physical property that changes with time, spatial extension or any other independent variable or variables. The output of a system often contains the same frequencies as the input signal, differing however with respect to amplitude and phase. It is therefore quite straightforward to determine the output of a system when the input signal can be analysed into signals of ordinary frequencies.

The Fourier Transform (FT) was developed in order to analyse such complex signals into signals of ordinary frequencies. Subsequently, by exploiting the linearity property, we can define the output of a system as a sum of signals that contain the same frequencies as the input signal, exhibiting, nonetheless, certain changes with respect to the amplitude and the phase that are caused by the system. The FT is a mathematical tool that defines a relationship between a signal in the time or spatial domain and its representation in the frequency domain, which is also known as Fourier space.

The birth and the roots of this theory are owed to the French mathematician Jean Baptiste Joseph Fourier (1768-1830), who introduced the decomposition of a function in terms of sinusoidal functions of different frequencies, which can be recombined to obtain the original function. The Fourier series were initially introduced for solving the heat equation in a metal plate. The analysis of a complex quantity into simpler components, so that a problem becomes easier to handle, is a more general scientific methodology, which led to a revolution in Mathematics.

The FT has many applications in many fields of the Physical Sciences such as Engineering, Physics, Applied Mathematics, and Chemistry. These applications, running on large processor counts, use Fast Fourier Transform (FFT) routines with multi-dimensional input data. As described in [17], the most direct applications of these routines are the convolution or deconvolution of data, correlation and autocorrelation, optimal filtering, power spectrum estimation, and the computation of Fourier integrals. However, as it turns out from a detailed performance analysis of these applications when using high performance resources, these routines introduce communication overheads. As expected, these overheads, introduced by the extremely costly All-to-All type communications called inside these routines, increase when large numbers of processors are involved, preventing the applications from scaling.

In the past, many different approaches were proposed for parallelising the FFT with a variety of high performance equipment. Regarding vector and parallel computers, a practical technique for computing the FFT, which appears to be near-optimal, is proposed in [1].

Nowadays, with the enormous increase in large-processor-count computing hardware and the development of multi-core processing elements, improving the scalability of the parallel FFT is a fundamental challenge. To address the scalability issues of the most commonly used parallelisation strategy for the parallel 3D FFT, the one-dimensional (1D) or slab data decomposition, an alternative approach based on a two-dimensional (2D) data decomposition is proposed in [4]. This approach does not suffer from the same limitations in terms of scalability to large numbers of processors.

In [13], a recent MSc dissertation, this alternative approach is implemented and compared to the slab decomposition using the message passing model. It is demonstrated that the implementation of the parallel three-dimensional (3D) FFT algorithm using the 2D decomposition scales well up to 1024 nodes for different problem sizes between 128³ and 512³. Furthermore, as a next step, it illustrates the importance of task placement for the parallel 3D FFT on the IBM BlueGene/L system, a massively parallel high performance computer organised as a 3D torus of compute nodes. It is shown that the overall performance of the algorithm is indeed improved. Choosing a careful task placement, taking into consideration the specifications of the BlueGene/L torus network, plays an important role in the performance improvement.

In [18], another recent MSc dissertation, the optimisation of the parallel 3D FFT by applying the alternative 2D data decomposition approach is investigated on the HPCx system, a cluster of POWER5 Symmetric Multi-Processor (SMP) nodes. The nodes of the system are connected via a hierarchical network. They have different latency and bandwidth properties for communications within an SMP node and between nodes. The properties of the IBM HPS interconnect are examined, as the All-to-All communication plays an important role in the parallel FFT. Based on the above, it is demonstrated, once again, how the mapping of the virtual 2D processor grid to the processors in the SMP nodes can improve the overall performance. The results, for problem sizes between 128³ and 512³ and up to 1024 processors, demonstrate a good scaling of the code when applying the 2D data decomposition approach.

In this dissertation, we investigate whether and how shared access to the memory in multi-core and shared-memory nodes can affect the performance of the parallel FFT. More specifically, we investigate how to efficiently transfer the data between the processing elements involved in the parallel 2D FFT, in order to reduce the introduced overheads by using the shared memory within an HPCx node. This includes the transposition of the initial 2D input array to swap the two dimensions. In Chapter 3, we propose 3 basic strategies for parallelising the 2D FFT inside a shared memory node. The first one is the Strided strategy, where no actual transposition of the initial input array takes place; the swapping between the two dimensions is achieved by strided FFT calls. The second one, the Transposition strategy, performs the actual transposition of the array, with different versions concerning the parallelisation of the expensive re-sort. The last one is the MPI strategy, where the swapping between the two dimensions is achieved by an All-to-All call.

For large processor counts, we use the results of Chapter 3 to optimise the parallel 3D FFT with the Hybrid, a mixed mode programming model between shared memory programming and message passing. The Hybrid model, investigated thoroughly in [16], attempts to exploit features of the SMP cluster architecture, resulting in a more efficient parallelisation strategy by combining advantages of both the OpenMP and the MPI parallelisation strategies. In Chapter 4, we propose the parallelisation of the 3D FFT with the Hybrid model, in order to avoid the limitations that the 1D data decomposition approach introduces and to reduce the extremely costly All-to-All type communications that the 2D data decomposition involves with pure MPI.

The rest of this document is organised as follows. Chapter 2 contains the basic mathematical background on the FT, followed by a discussion of the parallelisation strategies of the FFT with the message passing model and some relevant information on the FFT algorithm. In Chapter 3 we present our results on the parallelisation of the 2D FFT inside a shared memory node, along with an overview of the hardware used. Chapter 4 presents the Hybrid model and suggests how it can exploit possible benefits for parallelising the 3D FFT, followed by our results. In Chapter 5 we summarise the results and draw our conclusions, pointing out areas for future research that could advance the parallelisation of the 3D FFT further.


Chapter 2

Fourier Transform

In this chapter, we provide the basic theoretical information regarding the FT as well as some relevant information on the FFT algorithm. In the first two sections, the mathematical background of the Continuous and Discrete FT is presented. The next section presents the basic parallelisation strategies of the FT with the message passing model, followed by the last section, where we give a relatively detailed discussion of the FFT algorithm.

The FT is a linear transform which defines a relationship between a signal in a specific domain, e.g. the time or spatial domain, and its representation in the frequency domain.

Any periodic signal can be expressed as an infinite sum in the discrete case, or as an integral in the continuous case [10, 14]. These mathematical expressions contain trigonometric sines, which are related to the antisymmetric part of the signal, and cosines, which are related to the symmetric part. Throughout this document, instead of using trigonometric functions, we prefer to use Euler's formula, which is defined by (2.1).

$$
e^{i\theta} = \cos\theta + i\sin\theta
\;\;\Rightarrow\;\;
\cos\theta = \frac{1}{2}\left(e^{i\theta} + e^{-i\theta}\right),
\qquad
\sin\theta = \frac{1}{2i}\left(e^{i\theta} - e^{-i\theta}\right)
\qquad (2.1)
$$

2.1 Continuous Fourier Transform

For simplicity, we first present the 1D case. The FT converts a function f of a single variable x, i.e. f(x), from a specific domain into a function F of frequencies u, i.e. F(u), in the frequency domain. The FT of a continuous function f of a single variable x is defined by (2.2).

$$
F(u) = \int_{-\infty}^{+\infty} f(x)\, e^{2\pi i x u}\, dx
\qquad (2.2)
$$

Recalling that the FT is a linear transform, we can observe that (2.2) contains two representations of the same function.

The input data x of function f can either be real or complex. However, even if the original data is real, F(u) in (2.2) will generally be complex.

The inverse FT is used to reproduce the original function f from its FT and is defined by (2.3). As we can easily observe, (2.3) looks quite similar to (2.2), except that the exponential term has the opposite sign. Hence, if we are able to compute the FT of a given function f(x), we can also easily compute its inverse.

$$
f(x) = \int_{-\infty}^{+\infty} F(u)\, e^{-2\pi i x u}\, du
\qquad (2.3)
$$

The 2D and 3D FT equations, which include two and three variables respectively, can be developed from equations (2.2) and (2.3) in a very straightforward way. Equations (2.4) and (2.5) present the continuous FT and its inverse for the 2D case respectively.

$$
F(u,v) = \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} f(x,y)\, e^{2\pi i (xu + yv)}\, dx\, dy
\qquad (2.4)
$$

$$
f(x,y) = \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} F(u,v)\, e^{-2\pi i (xu + yv)}\, du\, dv
\qquad (2.5)
$$

In a very similar way, equations (2.6) and (2.7) present the 3D continuous FT and its inverse respectively.

$$
F(u,v,w) = \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} f(x,y,z)\, e^{2\pi i (xu + yv + zw)}\, dx\, dy\, dz
\qquad (2.6)
$$

$$
f(x,y,z) = \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} F(u,v,w)\, e^{-2\pi i (xu + yv + zw)}\, du\, dv\, dw
\qquad (2.7)
$$


2.2 Discrete Fourier Transform

It turns out that, many times, computer scientists are forced by the available architecture to use functions which are defined in discrete domains. The Discrete Fourier Transform (DFT) is crucial for a variety of both scientific and industrial applications. These applications, as described in [15], mainly focus on signal processing, image filtering, time series and waveform analysis, and solutions to linear PDEs. We first define the 1D DFT of N samples by (2.8) and its inverse by (2.9).

$$
F(u) = \sum_{x=0}^{N-1} f(x)\, e^{2\pi i \left(\frac{xu}{N}\right)}
\qquad (2.8)
$$

$$
f(x) = \frac{1}{N} \sum_{u=0}^{N-1} F(u)\, e^{-2\pi i \left(\frac{xu}{N}\right)}
\qquad (2.9)
$$

Once again, we can easily observe that (2.9) looks quite similar to (2.8), except for the 1/N factor and the fact that the exponential term has the opposite sign.

The 2D DFT and its inverse, for an Nx × Ny grid along the x and y dimensions, can be defined in a very similar way, as shown in (2.10) and (2.11).

$$
F(u,v) = \sum_{y=0}^{N_y-1} \sum_{x=0}^{N_x-1} f(x,y)\, e^{2\pi i \left(\frac{xu}{N_x} + \frac{yv}{N_y}\right)}
\qquad (2.10)
$$

$$
f(x,y) = \frac{1}{N_x N_y} \sum_{v=0}^{N_y-1} \sum_{u=0}^{N_x-1} F(u,v)\, e^{-2\pi i \left(\frac{xu}{N_x} + \frac{yv}{N_y}\right)}
\qquad (2.11)
$$

The 3D DFT and its inverse, for an Nx × Ny × Nz data grid along the x, y and z dimensions, are defined similarly by (2.12) and (2.13).

$$
F(u,v,w) = \sum_{z=0}^{N_z-1} \sum_{y=0}^{N_y-1} \sum_{x=0}^{N_x-1} f(x,y,z)\, e^{2\pi i \left(\frac{xu}{N_x} + \frac{yv}{N_y} + \frac{zw}{N_z}\right)}
\qquad (2.12)
$$

$$
f(x,y,z) = \frac{1}{N_x N_y N_z} \sum_{w=0}^{N_z-1} \sum_{v=0}^{N_y-1} \sum_{u=0}^{N_x-1} F(u,v,w)\, e^{-2\pi i \left(\frac{xu}{N_x} + \frac{yv}{N_y} + \frac{zw}{N_z}\right)}
\qquad (2.13)
$$


Figure 2.1 Processing steps for the parallel 3D FFT using the slab data decomposition

2.3 Parallel Fourier Transform

In this section, we describe the main parallelisation strategies of the FT with the message passing model, as these have been implemented in [13] and [18]. In order to present a more general case of the parallel FT, a 3D array of complex numbers is chosen as the input array, which has to be decomposed among the available processors. The two proposed decomposition strategies are the 1D or slab decomposition, and the 2D decomposition, also known as volumetric or pencil decomposition.

Consider f(x, y, z) as a 3D array of Nx × Ny × Nz complex numbers, as shown in (2.14).

$$
f(x,y,z) \in \mathbb{C}, \qquad
x \in \mathbb{Z},\; 0 \le x < N_x, \qquad
y \in \mathbb{Z},\; 0 \le y < N_y, \qquad
z \in \mathbb{Z},\; 0 \le z < N_z
\qquad (2.14)
$$

Concerning the slab decomposition (see Figure 2.1), the 3D input data array is decomposed along one dimension, e.g. x, across the P available processors. Hence, each processor gets the data of one slab in its local memory. For a problem size of Nx × Ny × Nz, the data is split along the x dimension into P slabs, so each slab contains Nx/P × Ny × Nz complex numbers.

$$
F(u,v,w) =
\sum_{x=0}^{N_x-1}
\Bigg(
\sum_{y=0}^{N_y-1}
\Bigg(
\underbrace{\sum_{z=0}^{N_z-1} f(x,y,z)\, e^{2\pi i \frac{wz}{N_z}}}_{\text{1st FT}}
\Bigg)
\underbrace{e^{2\pi i \frac{vy}{N_y}}}_{\text{2nd FT}}
\Bigg)
\underbrace{e^{2\pi i \frac{ux}{N_x}}}_{\text{3rd FT}}
\qquad (2.15)
$$

Equation (2.12) can be expressed as shown in (2.15), in order to demonstrate that the 3D case can be calculated in three 1D DFT steps. Each processor can then compute the 1st and the 2nd FT, along the z and y dimensions respectively, on the data in its local memory, without having to communicate with other processors. In effect, in the first two steps each processor computes a single 2D FFT on each of its Nx/P slices (see Figure 2.1).

At this point, since the data is distributed across the processors, the processors must perform, as a last step, an All-to-All call to swap the y and x dimensions, in order to be able to compute the 3rd FT along the x dimension.

The main problem when applying the slab decomposition is that the maximum number of processors is limited by the maximum number of available slabs N. Poor performance, in terms of scalability, can occur when the number of available processors is greater than the number of slabs. That is exactly what the 2D decomposition tries to address.

The second proposed strategy is the 2D decomposition (see Figures 2.2-2.3), which maps the 3D input data onto a 2D virtual processor grid. For the same problem size, the data is distributed on a Px × Py processor grid (where Px × Py = P), with each processor now holding Nx/Px × Ny/Py × Nz complex numbers in its local memory. As a first step, each processor calculates Nx/Px × Ny/Py 1D FFTs along the z dimension. Then, the processors within the same row of the virtual processor grid (e.g. P0 and P1 in Figure 2.2) perform an All-to-All communication to swap the y and z dimensions. As a next step, each processor calculates Nx/Px × Nz/Py 1D FFTs along the y dimension. In preparation for the final processing step, the processors within the same column of the virtual processor grid (e.g. P0 and P2 in Figure 2.3) must perform an All-to-All call to swap the y and x dimensions. The final result of the 3D FFT is calculated by performing Ny/Px × Nz/Py 1D FFTs along the x dimension.
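To make the virtual grid concrete, the sketch below shows one possible way to derive the row and column communicators on which these two All-to-All steps are performed; the grid layout and the function name are illustrative assumptions, not code from [13] or [18].

#include <mpi.h>

/* Hypothetical helper: split a communicator into the row and column
 * communicators of a Px x Py virtual processor grid. The first All-to-All
 * (swap y and z) is then performed on row_comm, the second (swap x and y)
 * on col_comm. */
void make_grid_comms(MPI_Comm comm, int Px, int Py,
                     MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    (void)Px;              /* Px = P / Py; kept in the signature only for clarity */

    int row = rank / Py;   /* row index in the virtual grid    */
    int col = rank % Py;   /* column index in the virtual grid */

    /* processors sharing the same row / column end up in the same communicator */
    MPI_Comm_split(comm, row, col, row_comm);
    MPI_Comm_split(comm, col, row, col_comm);
}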

Figure 2.2 The first two processing steps for the parallel 3D FFT using the 2D data decomposition

Figure 2.3 The last processing step for the parallel 3D FFT using the 2D data decomposition

Imagine that initially each processor holds a pencil piece of the entire data with the same colour. The All-to-All call within the rows transfers the data in the horizontal plane, while the All-to-All call within the columns transfers the data vertically. It is easily observed from Figure 2.2 that, in order to compute the 3D FFT of the Nx × Ny × Nz problem, a maximum of min(Nx × Ny, Nx × Nz, Ny × Nz) processors can be used [18]. Obviously, this approach comes at the extra cost of one additional All-to-All call. Since, generally speaking, All-to-All communication is slower than a local re-sort, we expect the 2D decomposition approach to be more fruitful only if more processors are available than we can utilise by applying the slab decomposition [18].

2.4 Fast Fourier Transform

The DFT is computationally expensive to calculate. Its computational cost can be understood, for the 1D case, from the fact that each of the N points in the FT is calculated in terms of all the N points in the original function [10, 14]. Therefore, in terms of complexity, the DFT appears to be a process of O(N²) operations. An FFT is an efficient algorithm to compute the DFT and its inverse in no more than O(N log N) operations. The basic idea of this algorithm, which is the factorisation of N, including its recursive application and contrary to popular conception, was invented around 1805 by Carl Friedrich Gauss [7].

2.4.1 Cooley-Tukey Algorithm

The Cooley-Tukey algorithm is the most well-known and common FFT algorithm among FT applications and HPC codes. It was named after J. W. Cooley and J. Tukey, who first proposed the algorithm in [3]. It is worthwhile to mention that, after the Cooley-Tukey algorithm, the same algorithm was rediscovered many times with minor differentiations. As already mentioned, the existing FFT algorithms are based on the factorisation of N; however, as described in [15], there are also algorithms with the same cost of O(N log N) operations for all N, including the set of prime numbers.

In equation (2.8), let ω = e^{2πi/N}, where e is the base of the natural logarithm.

$$
\begin{aligned}
F(u) &= \sum_{x=0}^{(N/2)-1} f[2x]\,\omega^{2xu} + \sum_{x=0}^{(N/2)-1} f[2x+1]\,\omega^{(2x+1)u} \\
     &= \sum_{x=0}^{(N/2)-1} f[2x]\,e^{2(2\pi i/N)xu} + \omega^{u}\sum_{x=0}^{(N/2)-1} f[2x+1]\,e^{2(2\pi i/N)xu} \\
     &= \sum_{x=0}^{(N/2)-1} f[2x]\,e^{2\pi i xu/(N/2)} + \omega^{u}\sum_{x=0}^{(N/2)-1} f[2x+1]\,e^{2\pi i xu/(N/2)}
\end{aligned}
\qquad (2.16)
$$

If N is a power of two, the basic step of the algorithm is that an N-point DFT computation can be split and expressed as the two (N/2)-point DFT computations shown in (2.16)¹.

$$
F(u) = \sum_{x=0}^{(N/2)-1} f[2x]\,\omega'^{\,xu} + \omega^{u} \sum_{x=0}^{(N/2)-1} f[2x+1]\,\omega'^{\,xu}
\qquad (2.17)
$$

Let ω′ = e^{2πi/(N/2)} = ω². Equation (2.16) can then be expressed in a different form, as illustrated in (2.17), where the right-hand side shows that each of the two summations is an (N/2)-point DFT computation [15]. The main idea, resulting in the recursive FFT algorithm, is that if N is indeed a power of two, then each of these DFT computations can be recursively split into smaller ones. As demonstrated in (2.17), at each level of recursion the input sequence can indeed be split into two equal halves.

¹ Note that all equations and mathematical proofs in this section are taken from [15].


Figure 2.4 A recursive 8-point unordered FFT computation - Figure from [15]

Given an 8-point sequence as an example, Figure 2.4 demonstrates the basic steps of the recursive Cooley-Tukey algorithm. At the deepest level of recursion, the relevant computations use those elements whose indices differ by N/2. Subsequently, in each of the remaining levels, we observe that the elements used for a computation have an index difference which decreases by a factor of two [15].

For the general case of an input sequence of length N, as the size of the sequence decreases by a factor of two at each level of recursion, the maximum number of levels of recursion is log N. Figure 2.4 illustrates that, given an 8-point sequence, 3 levels of recursion are indeed needed. Regarding the size and the number of FFTs to be computed at each level, it turns out that, at the mth level, the recursive algorithm calculates 2^m FFTs, each of size N/2^m [15]. Combining the fact that the maximum number of recursive levels in the algorithm is log N with the fact that the operations at each of those levels cost O(N), we conclude that the overall cost of the algorithm is O(N log N).
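As a concrete illustration of the recursion just described, the following is a minimal sketch of a recursive radix-2 Cooley-Tukey FFT in C99 (not code from this project), using the document's e^{+2πixu/N} sign convention and assuming N is a power of two.

#include <complex.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Recursive radix-2 Cooley-Tukey FFT: the N-point DFT of 'in' (sampled with
 * the given stride) is split into the DFTs of its even- and odd-indexed
 * halves, which are then combined with the twiddle factors w^u. */
static void fft_recursive(const double complex *in, double complex *out,
                          int N, int stride)
{
    if (N == 1) {                 /* a 1-point DFT is the identity */
        out[0] = in[0];
        return;
    }

    fft_recursive(in,          out,         N / 2, 2 * stride);  /* even terms */
    fft_recursive(in + stride, out + N / 2, N / 2, 2 * stride);  /* odd terms  */

    for (int u = 0; u < N / 2; u++) {
        double complex w    = cexp(2.0 * M_PI * I * u / N);      /* twiddle factor */
        double complex even = out[u];
        double complex odd  = w * out[u + N / 2];
        out[u]         = even + odd;   /* F(u)       */
        out[u + N / 2] = even - odd;   /* F(u + N/2) */
    }
}

The log N recursion depth and the O(N) work per level visible here give the O(N log N) cost derived above; in this project, however, the 1D kernels are provided by the FFTW library discussed in Section 2.4.2.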

A different version of the same algorithm, the iterative Cooley-Tukey algorithm, is presented in Listing 2.1 [15]. This algorithm performs log N iterations of the outer loop (see line 5), while on each iteration it performs a set of complex calculations, just as in each level of recursion in the recursive version.

1.  procedure ITERATIVE_FFT(X, Y, N)
2.  begin
3.     r = log N;
4.     for i = 0 to N - 1 do R[i] = X[i];
5.     for m = 0 to r - 1 do                 /* Outer loop */
6.     begin
7.        for i = 0 to N - 1 do S[i] = R[i];
8.        for i = 0 to N - 1 do              /* Inner loop */
9.        begin
              /* Let (b0 b1 ... b(r-1)) be the binary representation of i */
10.           j = (b0 ... b(m-1) 0 b(m+1) ... b(r-1));
11.           k = (b0 ... b(m-1) 1 b(m+1) ... b(r-1));
12.           R[i] = S[j] + S[k] × ω^(bm b(m-1) ... b0 0...0)
13.        endfor;   /* Inner loop */
14.     endfor;      /* Outer loop */
15.     for i = 0 to N - 1 do Y[i] = R[i];
16. end ITERATIVE_FFT

Listing 2.1 Cooley-Tukey Algorithm - Algorithm from [15]

As we can see in Listing 2.1, the algorithm consists mainly of two loops: the outer and the inner. The outer loop (see line 5) is executed log N times for an N-point FFT. The inner loop (see line 8) is executed N times for each iteration of the outer loop. Since the operations in the inner loop all take constant time, i.e. O(1), we conclude that the cost of the algorithm is O(N log N).

In addition, since the inverse DFT is very similar to the DFT, but with the opposite sign in the exponent term and the 1/N factor, any FFT algorithm can easily be adapted for it as well.

2.4.2 Fastest Fourier Transform in the West Library

In the scope of this project, all the applications call 1D single-processor FFT kernel routines. The Fastest Fourier Transform in the West (FFTW) 3.1.2 open-source library has been used for all the implementations in this project. FFTW is a library written in the C language that implements the Cooley-Tukey algorithm in order to compute the DFT of given input data. This input data may have arbitrary length and different structure, complex or real; other structures, such as arbitrary multi-dimensional arrays, may also be used. All the information regarding the possible structure of the input data can be found in [6]. Since the FFTW library implements the Cooley-Tukey algorithm, its computational cost is O(N log N) operations for any given N. The optimisation of the FFT calculation on different architectures and platforms is based on empirical approaches applied by the library. As described in [6], FFTW's performance is good and competitive with vendor-tuned codes. Its main advantage over vendor-tuned codes, though, is that its performance is portable and reliable across platforms.

The DFT computation of given input data is divided into two phases, as described in [6]. In the first phase, FFTW's planner tries to find the fastest way to compute the DFT on the specific platform used; the relevant information is kept in a data structure called a plan, created by the planner. In the second phase, the array of input data is transformed as prescribed by the plan, and the plan can be reused if required. In addition, as described in [6], heuristic methods and previously computed plans are used by the FFTW library in order to provide fast planners.
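For reference, the two-phase usage described above looks roughly as follows with FFTW 3's complex 1D interface; this is a generic illustrative sketch (compile with -lfftw3 -lm), not code taken from the dissertation.

#include <fftw3.h>

int main(void)
{
    int N = 1024;
    fftw_complex *in  = fftw_malloc(N * sizeof(fftw_complex));
    fftw_complex *out = fftw_malloc(N * sizeof(fftw_complex));

    /* phase 1: the planner searches for the fastest way to compute an
     * N-point DFT on this platform and records it in a plan */
    fftw_plan plan = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_MEASURE);

    /* ... fill 'in' with the input data here ... */

    /* phase 2: execute (and possibly re-execute) the plan */
    fftw_execute(plan);

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}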


Chapter 3

Parallel 2D FFT inside a shared memory node

In this chapter, we present the main idea, the implementation and the results of the three proposed strategies, with their different versions, along with an overview of the hardware used for the parallelisation of the 2D FFT inside a shared memory node. As illustrated in Figure 2.1 concerning the parallel 3D FFT, each processor can compute a single 2D FFT on each of its Nx/P slices in its local memory. The main idea for this project was that, if the rows of the virtual processor grid fit inside the memory of a shared memory node, it would be worth investigating, for the 2D case of the FFT, whether it is beneficial to exploit shared access to the memory of an SMP node in order to avoid the costly All-to-All communication of the first step.

3.1 The Strategies

Consider f(x, y) as a 2D array of Nx × Ny complex or real numbers, as in equation (2.10). The 2D FFT is then an array F(u, v) of Nx × Ny complex numbers. This computation can be performed in two stages within an SMP node: firstly, the 1D FFT can be computed along the x dimension, and secondly along the y dimension. The problem we need to investigate thoroughly, though, is the transposition between the two dimensions x and y. In this dissertation, we present three basic strategies for this transposition: the Strided strategy in OpenMP, where no transposition takes place; the Transposition strategy in OpenMP, with different versions; and the MPI strategy, where the transposition is achieved with an All-to-All call.


(a) Computation of the 1st 1D FFT along the x dimension   (b) Computation of the 2nd 1D FFT along the y dimension

Figure 3.1 The Strided strategy

3.1.1 The Strided strategy

For the Strided strategy, we can imagine (see Figure 3.1) the input parameters of an FFT routine as:

fft(size, howmany, in_array, in_stride, in_distance, out_array, out_stride, out_distance)

where size is the size of each transformation to be computed, howmany is the number of transformations to be computed by each processor, in_array is the starting point of the input array, in_stride is the distance between successive numbers of a transformation, in_distance is the distance between two successive of the howmany transformations, and out_array is the starting point of the output array. The same values used for the input parameters in_stride and in_distance can be used for the output parameters out_stride and out_distance. An algorithm to calculate the 2D FFT with the Strided strategy is presented in Listing 3.1.

for i = 0 to P do in parallel
    fft(Nx, Ny/P, f[i*Ny/P, 0], 1, Nx, Z[i*Ny/P, 0], 1, Nx);

for i = 0 to P do in parallel
    fft(Ny, Nx/P, Z[0, i*Nx/P], Nx, 1, F[0, i*Nx/P], Nx, 1);

Listing 3.1 The Strided strategy
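The generic fft(size, howmany, ...) call of Listing 3.1 maps closely onto FFTW's advanced interface. The sketch below shows how the 1st FFT of the Strided strategy could be expressed for one thread or loop index i; the function name, variable names and the use of FFTW_ESTIMATE are assumptions for illustration, not the thesis code.

#include <fftw3.h>

/* 1st FFT of the Strided strategy for loop index i: Ny/P transforms of
 * length Nx, stride 1, distance Nx, taken from a contiguous row-major
 * Nx*Ny array f and written to Z. */
void strided_first_fft(fftw_complex *f, fftw_complex *Z,
                       int Nx, int Ny, int P, int i)
{
    int n[1]    = { Nx };
    int howmany = Ny / P;
    fftw_complex *in  = f + (size_t)i * howmany * Nx;
    fftw_complex *out = Z + (size_t)i * howmany * Nx;

    fftw_plan p = fftw_plan_many_dft(1, n, howmany,
                                     in,  NULL, /* istride */ 1, /* idist */ Nx,
                                     out, NULL, /* ostride */ 1, /* odist */ Nx,
                                     FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);
    fftw_destroy_plan(p);
}

In a real timing run the plan would of course be created once outside the measured region and only fftw_execute() would be timed.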

Note that in Figure 3.1(a) the distance between two successive numbers in each of the howmany transformations of the initial input array is equal to one, thus in_stride = 1, whereas the distance between two successive transformations is Nx, thus in_distance = Nx. Figure 3.1(b) illustrates that, for the computation of the 2nd FFT, the distance between two successive numbers is Nx and the distance between two successive transformations is 1, thus in_stride = Nx and in_distance = 1.

(a) Computation of the 1st 1D FFT along the x dimension   (b) Computation of the 2nd 1D FFT along the y dimension, after the transposition

Figure 3.2 The Transposition strategy

Contrary to the previous figures for the distributed memory model, the initial transformation array is kept, in its entirety, inside the shared memory of the node and is therefore accessible by all processors. Figures 3.1(a)-(b) just present the elements that each processor will use to calculate the 2D FFT, which actually consists of two 1D FFTs.

The results in [13], for the message passing model, show that the overall performance of the strided fft() is poor compared to the unstrided in_stride = 1 case. The important factor for the Strided strategy is the phenomenon of false sharing: when multiple threads (i.e. many processors), as in the Strided case, try to write (or read) neighbouring addresses on the same cache line, each write will invalidate the other processors' copies, causing a lot of remote memory accesses. As described in [2], this factor depends on the processor count and the problem size. A possible symptom is a large number of cache misses and therefore poor performance. Hence, we should investigate whether alternative re-sort strategies with in_stride = 1, i.e. the actual transposition of the array, would give better performance.

3.1.2 The Transposition strategy

The main idea of the Transposition strategy is the implementation of as efficient a re-sort as possible, depending on the problem size. This re-sort takes place immediately after the calculation of the 1st 1D FFT along the x dimension and before the calculation of the 2nd 1D FFT along the y dimension (see Figure 3.2(a)-(b)). We expect that the key factor of this investigation is how to parallelise the transposition between the two dimensions x and y, i.e. the two for loops shown in Listing 3.2, in order to efficiently exploit shared access to the memory.

for i = 0 to P do in parallel
    fft(Nx, Ny/P, f[i*Ny/P, 0], 1, Nx, Z[i*Ny/P, 0], 1, Nx);

for i = 0 to Nx do
    for j = 0 to Ny do
        Ztrans[j, i] = Z[i, j];

for i = 0 to P do in parallel
    fft(Ny, Nx/P, Ztrans[i*Nx/P, 0], 1, Nx, F[i*Nx/P, 0], 1, Nx);

Listing 3.2 The Transposition strategy

OpenMP, which is designed for programming shared memory parallel computers, can be used for the first two strategies, the Strided and the Transposition. The shared memory programming model is based on the notion of threads. Shared data can be accessed by all threads, whereas private data can only be accessed by the thread which owns it. The only way that threads are able to communicate is via the data in shared memory [2]. Hence, we are able to use this parallel model in order to compute the 1st FFT along the x dimension, do the transposition, and compute the 2nd FFT along the y dimension.

The parallel region is the basic parallel construct in OpenMP [2]. When the first parallel region is encountered, the master thread generates a set of threads. Subsequently, each thread executes the statements which are inside the parallel region. Inside the parallel region, variables can be either shared or private. The same copy of a shared variable is accessible by each thread; in contrast, each thread has its own copy of a private variable. In our case, the shared variable is the transformation array, and the private variables are the loop indices, depending on whether an actual transposition takes place.

As in most HPC applications, loops are the main source of parallelism. If the iterations of a loop are independent, the compiler can distribute the iterations among the set of created threads. With the use of a parallel do/for loop we can divide up a loop's iterations between threads. As described in [2], with no additional clauses, the do/for directive will usually divide up the iterations as equally as possible. Hence, in our case, each thread computes Ny/P transformations for the 1st FFT and Nx/P transformations for the 2nd FFT. The different versions of the Transposition strategy which have been implemented distribute the transposition loop iterations differently.

(a) Input array after the 1st 1D FFT   (b) Transposed array

Figure 3.3 Parallelise the 1st loop version

When someone wants to parallelise a for loop, this can be achieved directly by combining the omp parallel directive, which opens a parallel region, and omp for, resulting in the omp parallel for directive [2]. Using this combined directive enables us to open a parallel region which contains a single for directive. Through the omp for directive, the compiler distributes the loop iterations to the available threads of the parallel region, resulting in the parallelisation of the for loop.
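In C, the combined directive applied to the first Transposition version looks roughly like the sketch below (illustrative, not the thesis code): the outer i loop is divided among the threads, each of which reads contiguous rows of Z and performs strided writes into Ztrans.

#include <omp.h>

/* Z is a contiguous Nx*Ny row-major block, Ztrans a contiguous Ny*Nx block. */
void transpose_first_loop(const double *Z, double *Ztrans, int Nx, int Ny)
{
    /* one parallel region; the iterations of the i loop are divided as
     * equally as possible between the threads */
    #pragma omp parallel for
    for (int i = 0; i < Nx; i++)
        for (int j = 0; j < Ny; j++)
            Ztrans[j * Nx + i] = Z[i * Ny + j];   /* continuous reads, strided writes */
}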

For the Transposition strategy, there are four basic versions. The first version involves parallelising the first for loop. That means a single parallel region containing a single for directive is defined. Figure 3.3(a) illustrates that each processor gets to read Ny/P rows of the input array, and Figure 3.3(b) illustrates that each processor gets Nx/P rows of the transposed array. Concerning the C memory layout, the processors read continuous chunks of memory but the writes on the output array are strided. In the first step, and in parallel, P0 reads the blue elements of the input array and writes them to the transposed array, P1 handles the red ones in the same way, P2 the green ones, and P3 the yellow ones. The algorithm in Listing 3.3 presents the first proposed version of the Transposition strategy.

for i = 0 to Nx do in parallel
    for j = 0 to Ny do
        Ztrans[j, i] = Z[i, j];

Listing 3.3 Parallelise the 1st loop version


(a) Input array after the 1st 1D FFT (b) Transposed array

Figure 3.4 Parallelise the 2nd loop version

The second proposed version involves the parallelisation of the second for loop. That means Nx parallel regions, each containing a single for directive, are defined. Figure 3.4(a) illustrates that each processor gets to read Nx/P columns of the input array, and Figure 3.4(b) illustrates that each processor gets Ny/P rows of the transposed array. The processors read continuous chunks of memory but the writes on the output array are strided. In the first step, and in parallel, P0 reads the blue elements from the input array and writes them to the transposed array, P1 handles the red ones in the same way, P2 the green ones, and P3 the yellow ones. The algorithm in Listing 3.4 presents the second version of the Transposition strategy.

for i = 0 to Nx do
    for j = 0 to Ny do in parallel
        Ztrans[j, i] = Z[i, j];

Listing 3.4 Parallelise the 2nd loop version

The third proposed version exchanges the loop order of the first version. That means a single parallel region containing a single for directive is defined. Figure 3.5(a) illustrates that each processor now gets to read Nx/P columns of the input array, and Figure 3.5(b) illustrates that each processor now gets Nx/P rows of the transposed array. Concerning the C memory layout, the processors read strided chunks of memory but the writes on the output array are continuous. In the first step, and in parallel, P0 reads the blue elements from the input array and writes them to the transposed array, P1 handles the red ones in the same way, P2 the green ones, and P3 the yellow ones. The algorithm in Listing 3.5 presents the third version of the Transposition strategy.

(a) Input array after the 1st 1D FFT   (b) Transposed array

Figure 3.5 Loops Exchange of the First version

(a) Input array after the 1st 1D FFT   (b) Transposed array

Figure 3.6 Loops Exchange of the Second version

for j = 0 to Ny do in parallel
    for i = 0 to Nx do
        Ztrans[j, i] = Z[i, j];

Listing 3.5 Loops Exchange of the First version

The fourth proposed version exchanges the loop order of the second version. That means Ny parallel regions, each containing a single for directive, are defined. Figure 3.6(a) illustrates that each processor now gets to read Ny/P rows of the input array, and Figure 3.6(b) illustrates that each processor now gets Ny/P columns of the transposed array. The processors read strided chunks of memory but the writes on the output array are continuous. In the first step, and in parallel, P0 reads the blue elements from the input array and writes them to the transposed array, P1 handles the red ones in the same way, P2 the green ones, and P3 the yellow ones. The algorithm in Listing 3.6 presents the fourth version of the Transposition strategy.

(a) Input array after the 1st 1D FFT   (b) Transposed array

Figure 3.7 Nested parallelism version - 2 threads at each level

for j = 0 to Ny do
    for i = 0 to Nx do in parallel
        Ztrans[j, i] = Z[i, j];

Listing 3.6 Loops Exchange of the Second version

An alternative way to parallelise the transposition of the initial array, with Nested parallelism, is provided by OpenMP. In particular, both for loops can be parallelised, as shown in Figure 3.7 and Figure 3.8. In the first step, and in parallel, P0 reads the blue numbers from the input array and writes them to the transposed array, P1 handles the red ones in the same way, P2 the green ones, and P3 the yellow ones.

In OpenMP, Nested parallelism is enabled with the OMP_NESTED environment variable. When a parallel directive is encountered within another parallel directive, a new set of threads is generated. The new set will contain only one thread unless the OMP_NESTED environment variable is enabled. As described in [2, 12], the only way to control the number of threads used at each level is the NUM_THREADS clause.

(a) Input array after the 1st 1D FFT   (b) Transposed array

Figure 3.8 Loops Exchange of the Nested parallelism version - 2 threads at each level

As with the loops exchange of the first and the second OpenMP versions, there are also two versions of Nested parallelism. The algorithms in Listings 3.7 and 3.8 present these two versions, where L1 is the number of threads used at the first level (Level 1) and P is the total number of threads.

for i = 0 to Nx do in parallel num_threads(L1)
    for j = 0 to Ny do in parallel num_threads(P/L1)
        Ztrans[j, i] = Z[i, j];

Listing 3.7 Nested parallelism version

for j = 0 to Ny do in parallel num_threads(L1)
    for i = 0 to Nx do in parallel num_threads(P/L1)
        Ztrans[j, i] = Z[i, j];

Listing 3.8 Loops Exchange of the Nested parallelism version
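For illustration, the Nested parallelism version of Listing 3.7 could be written in C roughly as follows, with L1 threads on the outer loop and P/L1 threads on each inner loop; this is a sketch under the assumption that nested parallelism has been enabled, and the names are illustrative rather than taken from the thesis code.

#include <omp.h>

void transpose_nested(const double *Z, double *Ztrans, int Nx, int Ny,
                      int P, int L1)
{
    omp_set_nested(1);                       /* same effect as OMP_NESTED=TRUE */

    #pragma omp parallel for num_threads(L1)
    for (int i = 0; i < Nx; i++) {
        #pragma omp parallel for num_threads(P / L1)
        for (int j = 0; j < Ny; j++)
            Ztrans[j * Nx + i] = Z[i * Ny + j];
    }
}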

Concerning the Transposition strategy, throughout all of our investigations we take into account 4 basic factors which differentiate the transposition: reading, writing, the number of parallel regions, and false sharing. Table 3.1 presents these factors for all the OpenMP versions. Note that the second column of the table refers to the same version as the first column, but with the two loops exchanged.


Parallelise 1st Loop         Loops Exchange 1
Continuous reads             Strided reads
Strided writes               Continuous writes
Single parallel region       Single parallel region
False sharing on writing     False sharing on reading

Parallelise 2nd Loop         Loops Exchange 2
Continuous reads             Strided reads
Strided writes               Continuous writes
Nx parallel regions          Ny parallel regions
False sharing on reading     False sharing on writing

Nested Parallelism           Loops Exchange 3
Continuous reads             Strided reads
Strided writes               Continuous writes
Nx + 1 parallel regions      Ny + 1 parallel regions
False sharing on both        False sharing on both

Table 3.1 Transposition factors influencing OpenMP versions

3.1.3 The MPI strategy

The third and last proposed strategy is the manual transpose in pure MPI, using All-to-All type communication to swap the x and y axes, as shown in Figure 3.9. For this strategy the message passing model is used. In the first step, each processor has the data in its local memory and is therefore able to compute the 1st FFT along the x dimension; Figure 3.9(a) shows that each processor gets to compute Ny/P rows for the 1st FFT. In the second step, in order to achieve the transposition between the two dimensions, each processor packs its local data array into an All-to-All send buffer. Subsequently, an All-to-All call is performed and, in order to complete the transposition, each processor unpacks its data from the receive buffer into its local transposed array. As a final step for the calculation of the parallel 2D FFT, the 2nd FFT along the y dimension is computed. Figure 3.9(b) illustrates that each processor gets to compute Nx/P rows for the 2nd FFT.
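A possible shape for this pack / All-to-All / unpack sequence is sketched below, using real data for brevity and hypothetical names and layout (divisible dimensions assumed; not the thesis code): each processor owns Ny/P rows of length Nx before the call and Nx/P rows of length Ny afterwards.

#include <mpi.h>
#include <stdlib.h>

void transpose_alltoall(const double *z, double *ztrans,
                        int Nx, int Ny, int P, MPI_Comm comm)
{
    int in_rows  = Ny / P;     /* local rows before the transpose */
    int out_rows = Nx / P;     /* local rows after the transpose  */
    int blk      = in_rows * out_rows;

    double *sendbuf = malloc((size_t)P * blk * sizeof(double));
    double *recvbuf = malloc((size_t)P * blk * sizeof(double));

    /* pack: block p holds the columns destined for processor p */
    for (int p = 0; p < P; p++)
        for (int i = 0; i < in_rows; i++)
            for (int j = 0; j < out_rows; j++)
                sendbuf[(p * in_rows + i) * out_rows + j] = z[i * Nx + p * out_rows + j];

    MPI_Alltoall(sendbuf, blk, MPI_DOUBLE, recvbuf, blk, MPI_DOUBLE, comm);

    /* unpack: transpose each received block into the local transposed array */
    for (int p = 0; p < P; p++)
        for (int i = 0; i < in_rows; i++)
            for (int j = 0; j < out_rows; j++)
                ztrans[j * Ny + p * in_rows + i] = recvbuf[(p * in_rows + i) * out_rows + j];

    free(sendbuf);
    free(recvbuf);
}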

Implementing and comparing the results of these three strategies can lead us to some important conclusions concerning the possibility of beneficially exploiting the shared access to the memory of shared memory nodes.


(a) Computation of the 1st 1D FFT along the x dimension   (b) Computation of the 2nd 1D FFT along the y dimension, after the All-to-All call

Figure 3.9 The MPI Strategy

3.2 Experimental design

In this section, we describe in detail the implementations we made for our investigations. We implemented two programs for a detailed performance analysis of the parallel 2D FFT. The first one implements the strategies that exploit the shared access to the memory of an SMP node, and the second one implements the MPI strategy, described in Section 3.1. The code is written in the C language, and we therefore use the C convention for the memory layout of arrays throughout the rest of the document.

3.2.1 The OpenMP implementation of the test program

Concerning the Strided and the Transposition strategies described in subsections 3.1.1-3.1.2, the OpenMP library was used for the implementation. The 2D input array of size Nx × Ny is declared dynamically by the master thread and occupies contiguous memory (see Listing 3.9 for details). The complex 2D input data for the parallel FFT algorithm is constructed using the FFTW library's fftw_complex type of double precision complex numbers. The data is created in parallel by each thread, using an easy-to-verify trigonometric function which we will see in detail in Section 3.2.3. The values for the complex input array are calculated based on the row numbers of the 2D input array that each thread undertakes to create during the parallel data creation. The procedure of initialising the input data is not included in our timing measurements. Hence, we simulate an environment where the SMP node has the input data already initialised in its shared memory.

complex **input;
input    = malloc(Nx * sizeof(complex *));
input[0] = malloc(Nx * Ny * sizeof(complex));
for (i = 1; i < Nx; i++)
    input[i] = input[0] + i * Ny;   /* row pointers into the contiguous block */

Listing 3.9 2D dynamic allocation of contiguous memory

In order to allow detailed performance analysis of the execution time of our implementation, we introduced a set of timers to measure the execution time of each step. The OpenMP timing function omp_get_wtime() was used for all the measurements made in the code.

The omp_get_wtime() function has a constant overhead of 1 µs for a complete start/stop sequence [13]. In order to check the timer's accuracy, on each run we make 3 successive calls to the omp_get_wtime() function, as shown in Listing 3.10, and check that (3.1) holds. The overhead of the function is not removed from the results; this does not seem to be a problem, as the timing overhead is constant and very small. However, for small problem sizes, which are computed on small processor counts, this overhead does become significant, and results with values close to the resolution of the function are discarded.

accuracy[0] = omp_get_wtime();
accuracy[1] = omp_get_wtime();
accuracy[2] = omp_get_wtime();

Listing 3.10 Measuring the accuracy of the omp_get_wtime() timer

(accuracy[1] − accuracy[0]) − (accuracy[2] − accuracy[0]) / 2 < 1µs    (3.1)

The timings are taken over a number of iterations, between 10 and 6250 depending on the problem size. On each run, we specify a number of iterations to be warm-up runs, which we do not include in the results. Note that the omp_get_wtime() function is called outside any parallel region. That means that it is called only by the Master thread, and an additional barrier is not needed to avoid distortion of the results.

On each iteration we collect timings for 4 step timers from the Master thread only. The step timers are: the beginning of the procedure, the end of the 1st FFT, the end of the transposition and the end of the whole procedure. The end of the 1st FFT minus the beginning of the procedure gives us the timing for the 1st FFT. The end of the transposition minus the end of the 1st FFT gives us the timing for the transposition. The end of the whole procedure minus the end of the transposition gives us the timing for the 2nd FFT. Finally, the end of the whole procedure minus the beginning of the procedure gives us the timing for the whole procedure.



Between the 1st and the 2nd FFT, there are 6 if/else statements. The program's flow is differentiated according to which one of the OpenMP versions is to be followed. Note that for the Strided strategy there is no transposition. In that case, the end of the transposition minus the end of the 1st FFT gives us the time that the thread needs in order to pass the if/else statements. That time turns out to be much less than 1µs and is therefore not considered significant.

Based on the above, the implementation reports as an output, for each timer, three different time values: the minimum, the average and the maximum. The minimum total timing is the time of the fastest iteration for the whole procedure. The values of all the other minimum individual timings are the minimum times measured by any iteration. For the average timing, a running total is kept for each timing result over all relevant iterations. This total is divided by the number of relevant iterations and finally reported as the average value of each timer. The maximum timing of the whole procedure is the time of the slowest iteration. The values of all the other maximum individual timings are the maximum measured by any iteration. The difference between the minimum and the average values is useful in the sense that it gives a feeling for the possible scattering of the timing results.
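To illustrate the reporting scheme, the following is a minimal sketch of such a timing loop; the names time_fft and run_2d_fft are hypothetical, only the total-time timer is shown instead of all four step timers, and it is not the exact harness used for the results reported below.

#include <float.h>
#include <stdio.h>
#include <omp.h>

void time_fft(int nwarmup, int niters)
{
    double tmin = DBL_MAX, tmax = 0.0, tsum = 0.0;

    for (int it = 0; it < nwarmup + niters; it++) {
        double t0 = omp_get_wtime();   /* called by the Master thread only */
        /* run_2d_fft();  1st FFT, transposition, 2nd FFT                  */
        double t1 = omp_get_wtime();

        if (it < nwarmup)              /* warm-up iterations are discarded */
            continue;

        double t = t1 - t0;
        if (t < tmin) tmin = t;
        if (t > tmax) tmax = t;
        tsum += t;                     /* running total for the average    */
    }
    printf("min %.3e s  avg %.3e s  max %.3e s\n", tmin, tsum / niters, tmax);
}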

3.2.2 The MPI implementation of the test program

The MPI strategy described in subsection 3.1.3 was implemented with the use of the MPI library. The 2D input array of size Nx/P × Ny is declared dynamically by each processor and occupies contiguous memory, in a similar way as in Listing 3.9. The complex 2D input data for the parallel FFT algorithm is again constructed using the FFTW library's fftw_complex type of double precision complex numbers. The data is created in parallel by each processor, using the same trigonometric function, which we will see in detail later on. The values for the complex input array are calculated based on the coordinates of the processor in the virtual 1D processor grid. The procedure of initialising the input data is not included in our timing measurements. Hence, we simulate an environment where the input data is already distributed on the processor grid. In order to avoid any possible distortion, the initialisation phase is enclosed within global barriers.

Once again, in order to allow a detailed performance analysis of the runtime, we introduced a set of timers to measure the time taken for each step of the whole procedure. The MPI function MPI_Wtime() was used for all the measurements made in the code. This function has a constant overhead of 1µs for a complete start/stop sequence. In order to check the timer's accuracy, the same method as for the omp_get_wtime() function was used.



The timings are again taken over a number of iterations, between 10 and 6250 depending on the problem size. On each run, we specify a number of iterations to be warm-up runs, which we do not include in the results. In order to avoid distortion, at the beginning of each iteration, and before any timer, the processes are synchronised with the use of a global barrier.

On each iteration we collect the time for 6 step timers from each processor. The step timers are: the beginning of the procedure, the end of the 1st FFT, the end of the packing of the All-to-All send buffer, the end of the All-to-All call, the end of the unpacking of the All-to-All receive buffer and the end of the whole procedure. The end of the 1st FFT minus the beginning of the procedure gives us the timing for the 1st FFT. The end of the packing minus the end of the 1st FFT gives us the timing for the packing. The end of the All-to-All minus the end of the packing gives us the timing for the All-to-All call. The end of the unpacking minus the end of the All-to-All call gives us the timing for the unpacking. The end of the whole procedure minus the end of the unpacking gives us the timing for the 2nd FFT. Finally, the end of the whole procedure minus the beginning of the procedure gives us the timing for the whole procedure. As an additional test, after the end of the procedure we insert a global barrier, followed by an additional timer. The difference between the timer after the barrier and the end of the procedure shows how well our code is balanced.

On each iteration, by using a global reduction, we collect the timings for each of the step timers separately from all processors on the Master. This value, divided by the number of processors, is the average time across all processors for one iteration. Based on these times, the implementation gives as an output, for each timer, three different values: the minimum, the average and the maximum, exactly as in the OpenMP case.

It has to be noted that, in order to avoid rounding errors, we do not divide the results of the global reduction by the number of processors at each iteration. Instead, we keep the values of the reduction and perform the relevant division only for the purpose of reporting.
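A minimal sketch of this per-iteration reduction is given below, assuming 6 step timers and hypothetical names (reduce_step_timers, steps, totals); the division by the processor count is deliberately deferred to the reporting stage.

#include <mpi.h>

#define NSTEPS 6   /* assumed number of step timers in the MPI test program */

void reduce_step_timers(double steps[NSTEPS], double totals[NSTEPS],
                        MPI_Comm comm)
{
    double sums[NSTEPS];
    int rank;

    /* Sum each step time across all processors onto the Master (rank 0). */
    MPI_Reduce(steps, sums, NSTEPS, MPI_DOUBLE, MPI_SUM, 0, comm);

    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        for (int i = 0; i < NSTEPS; i++)
            totals[i] += sums[i];   /* running totals; divide only when reporting */
}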

3.2.3 Verification of the 2D FFT results

Before any investigation can be made, we must ensure that the 2D FFT computation is correct. Therefore, as proposed in [13], synthetic input data is chosen to ensure reliable verification of the results. As noted in [13], making a correct choice for the input function not only ensures that the results can be easily verified, but also provides the advantage that the size of the problem can be modified in a straightforward way, by changing the values of Nx and Ny.



The trigonometric function used to create the 2D input data is shown in (3.2).

f(x, y) = sin(2πax/Nx + 2πby/Ny),    0 ≤ a < Nx, a ∈ Z,    0 ≤ b < Ny, b ∈ Z    (3.2)
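As an illustration, a minimal sketch of creating this synthetic input in parallel is shown below; it uses a flat row-major fftw_complex array for brevity (the test program itself uses the layout of Listing 3.9), and the function name create_input is hypothetical.

#include <math.h>
#include <fftw3.h>

void create_input(fftw_complex *input, int Nx, int Ny, int a, int b)
{
    #pragma omp parallel for
    for (int x = 0; x < Nx; x++)             /* each thread fills whole rows  */
        for (int y = 0; y < Ny; y++) {
            double v = sin(2.0 * M_PI * a * x / Nx + 2.0 * M_PI * b * y / Ny);
            input[x * Ny + y][0] = v;        /* real part: f(x, y) from (3.2) */
            input[x * Ny + y][1] = 0.0;      /* imaginary part                */
        }
}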

For each program run, different values for a and b can be chosen. The main property of this function is that only two values in the frequency domain are expected to be nonzero. Equation (3.3) presents the coordinates of those two values for the DFT of f(x, y).

F(u, v) = −(1/2) i NxNy   if a = u, b = v
F(u, v) = +(1/2) i NxNy   if a = Nx − u, b = Ny − v
F(u, v) = 0               otherwise                    (3.3)

Because the verification is done on the transposed data, the coordinates of the peak values, determined by the factors a and b, have to be transposed as well. The algorithm's steps in the test program are as follows:

• Create the input data: f(x, y)

• First FFT on the x-axis: f(x, y) → FX(u, y)

• Transposition of x- and y- axis: FX(u, y) → FX(y, u)

• Second FFT on the y-axis: FX(y, u) → F (v, u)

The peak values from (3.3) therefore have to be at F(v, u) and at F(Ny − v, Nx − u). As a straightforward solution, we choose to transpose the final output before checking. The verification routine is implemented to allow for numerical errors. Since the error depends on the problem size, for double precision complex numbers the maximum relative error is defined to lie between 1.0 × 10^-9 and 1.0 × 10^-5.
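The idea behind the verification can be sketched as follows, assuming the output has already been transposed back so that it is indexed as F(u, v) in row-major order, that 0 < a < Nx and 0 < b < Ny (so no wrap-around is needed), and hypothetical names verify_2d and tol; it is not the exact routine used in the test programs.

#include <math.h>
#include <fftw3.h>

int verify_2d(const fftw_complex *F, int Nx, int Ny, int a, int b, double tol)
{
    double peak = 0.5 * (double)Nx * (double)Ny;   /* expected peak magnitude */

    for (int u = 0; u < Nx; u++)
        for (int v = 0; v < Ny; v++) {
            double re = F[u * Ny + v][0];
            double im = F[u * Ny + v][1];
            /* Expected imaginary part from (3.3): -peak, +peak or zero. */
            double expected = (u == a && v == b)           ? -peak :
                              (u == Nx - a && v == Ny - b) ?  peak : 0.0;

            /* Allow numerical errors relative to the peak magnitude. */
            if (fabs(re) > tol * peak || fabs(im - expected) > tol * peak)
                return 0;   /* verification failed */
        }
    return 1;               /* all values as expected */
}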

3.3 Hardware Overview

In this section we give a concise overview of the hardware used to run our test programs. The two following subsections present the hardware architecture of the HPCx system and of the EPCC HPC service Ness.



3.3.1 HPCx System

Consisting of 160 IBM eServer 575 nodes for computational purposes, and 8 IBM eServer 575 nodes for login and disk I/O, the HPCx system is one of the UK national supercomputers and is categorised as an SMP cluster. Each eServer node is built from 16 1.5 GHz POWER5 64-bit RISC processors, the maximum that the hardware can support, resulting in a total of 2560 processors. The computational power of the HPCx system is estimated to be 15.3 Tflops [11]. Linpack, one of the most well-known benchmarks, run on the complete new platform, gave an Rmax value of 12.9 Tflops.

The POWER5 chip contains two cores. Each of the processors has its own L1 cache on the chip, which is divided into the instruction cache (32 kB) and the data cache (64 kB). Apart from L1, the chip also has the L2 cache (1.9 MByte) on board, shared between the two processors. A multi-chip module (MCM) consists of four chips, thus 8 processors, and it also contains the L3 cache (128 MB) and 16 GB of main memory. An HPCx node is built from two MCMs, thus 16 processors, and has a total main memory of 32 GB, which is shared between the 16 processors.

The nodes of the HPCx system are connected with a switch network using IBM's HPS [11], providing a fast multistage bidirectional connection from every node to every other node in the system. As noted in [13], HPCx provides two identical independent networks: either of the two networks can be used by a node to communicate with any other node of the machine through a maximum of 3 switch boards. One HPS is able to serve 16 nodes, thus two frames. For every two frames, there are two switches, one for each network, directly attached to these nodes; they are called Node Switch Boards (NSB) [13]. The Intermediate Switch Board (ISB) is a second level of the switch hierarchy connecting to the NSBs. The NSB facilitates the communication between all the nodes. Compute nodes on the same NSB switch are able to communicate inside this switch without engaging the ISB [13].

3.3.2 EPCC HPC Service Ness

The EPCC HPC service Ness [5] consists of a cluster of two SMP boxes, comprising a front-end and a back-end. The processors are 2.6 GHz AMD Opteron (AMD64e). Each processor has 2 GBytes of local memory. The front-end system, consisting of dual processors, is used for logging on, editing and compiling programs, which can then be run on the back-end system. The Sun Grid Engine (SGE) batch system is used for launching back-end jobs from the front-end. The back-end consists of 24 processors and is where HPC jobs run; it is made up of two SMP nodes.



One node contains 16 processors (8 dual nodes); each processor in that node has access to 32 GBytes of memory. The other node contains 8 processors (4 dual nodes).

3.4 Results and Analysis

In this section, we present the results of our test programs run mainly on one node of the HPCx system for problem sizes between 16² and 8192². The following subsections present the isolated times measured for each step described in the previous section, as well as the total time. In the following results, the execution times of the 4 basic versions of the Transposition strategy, the Strided strategy and the MPI strategy are presented comparatively. An extra subsection is dedicated to the Nested parallelism versions. In the last subsection, some extra results concerning the scaling of the code, the use of optimisation flags and the way that the initial data is created are also presented.

3.4.1 The 1st FFT

In Figure 3.10 we present the average execution time of each strategy for the computation of the 1st FFT on 16 processors of an HPCx node. Points to note:

• The larger the problem size is, the longer it takes to be computed.

• The second thing to note concerns the two strategies in OpenMP: the Transposition strategy with its four versions and the Strided strategy. We observe that all of them have almost identical execution times. This is not unexpected, as the data and plan creation (along the x dimension) for the 1st FFT are identical for all the OpenMP versions.

• The last observation concerns the MPI strategy. We note that for problem sizes of 16² up to 2048² it yields a better performance than OpenMP. The plan creation is again similar but not exactly the same. The main difference between them, which affects the execution time of the 1st FFT, is the way that the data is cached on the processors after the initial data creation. For small problem sizes MPI caches the data on a higher level of cache and that is why it is faster than OpenMP. With OpenMP, we cannot guarantee that the threads which create a set of data are the same ones used to calculate the 1st FFT. This can be understood from the fact that there are two different parallel regions, one for creating the initial data and one for calculating the 1st FFT. Hence, the distribution of the loops across the threads is not guaranteed to be the same.

Figure 3.10 1st FFT results on 16 processors of an HPCx node (average execution time in log s against N, problem size N²)

3.4.2 The transposition

In Figure 3.11 we present the average execution times for the transposition between the x and y dimensions on 16 processors of an HPCx node.

• Again, we observe that the larger the problem size is, the longer the time taken for the transposition.

• The version involving the parallelisation of the 2nd loop, and the same version with its loops exchanged, are relatively slower for small problem sizes up to 1024². That is because these 2 versions require the opening of N parallel regions for the transposition (see Table 3.1). This seems to become crucial for small problem sizes. However, as we increase the problem size and the time required for the transposition increases, the time spent opening parallel regions becomes less significant.

• The third thing we observe concerns the version where the 1st loop is parallelised and the same version with its loops exchanged. For small problem sizes up to 128², the version where the 1st loop is parallelised (strided reads, continuous writes, false sharing on writing) is better than the same version with the loops exchanged (continuous reads and strided writes, false sharing on reading). However, for problem sizes larger than 128² up to 8192², continuous reads, strided writes and false sharing on reading perform much better.

Even though it is difficult to observe from Figure 3.11, a more detailed look at the timings shows that the same holds for the other two versions. For problem sizes up to 2048², strided reads, continuous writes and false sharing on writing are slightly better, but for larger problem sizes, continuous reads, strided writes and false sharing on reading give better performance.

In Figure 3.12 we present the average transposition times for the MPI strategy compared to the two OpenMP versions, on 16 processors of an HPCx node. For the MPI strategy, we took the transposition time to be the sum of the time for packing the send buffer, the time for the All-to-All call and the time for unpacking. Figure 3.13 presents the same times on 16 processors of a Ness node.

• In Figure 3.12, we observe that transposing with the OpenMP version where the 1st loop is parallelised is clearly the fastest for problem sizes up to 1024². Furthermore, we note that the MPI strategy performs well and is quite competitive for the larger problem sizes.

• In Figure 3.13, we observe that the MPI strategy does not perform as well as on the HPCx node. As noted in [9], this can be explained by the fact that the MPI All-to-All call is much faster on the HPCx machine than on Ness. The gains from using the shared memory access for the transposition of small problem sizes depend on the performance of the All-to-All on the SMP node used.

3.4.3 The 2nd FFT

In Figure 3.14 we present the average execution time for the computation of the 2nd FFT on 16 processors of an HPCx node.

• Once again, we observe that the larger the problem size is, the longer it takes to compute the 2nd FFT.


Figure 3.11 Transposition results on 16 processors of an HPCx node (time in log s against N, problem size N²)

Figure 3.12 MPI vs OpenMP transposition results on 16 processors of an HPCx node (time in log s against N, problem size N²)


Figure 3.13 MPI vs OpenMP transposition results on 16 processors of a Ness node (time in log s against N, problem size N²)

• The second thing we observe is that the computation of the 2nd FFT with the Strided strategy is relatively slow. This is because there is no transposed array in the Strided strategy, and the computation of the 2nd FFT therefore involves many strided memory accesses on the initial array.

• Concerning the MPI strategy, we note once again that for problem sizes up to 4096² it performs better than all the OpenMP strategies.

Concerning the 4 basic OpenMP versions, it is not clear from Figure 3.14 whether there are significant differences between them. Figure 3.15 shows this more clearly. The performance of each version is calculated as the ratio Rperf of the execution time of the first version to each version's own execution time.

• As we observe, the version where we parallelise the second loop, and the first version with the loops exchanged, perform better compared to the other two versions. A closer look at the transposed array of these versions helps us realise why. Figures 3.4(b) and 3.5(b) clearly show that, during the transposition, the processors write the data which will be needed in order to calculate the 2nd FFT. In Figure 3.4(b), we observe that for the first two rows of the transposed array, all the blue elements will be written by the processor (P0) which will then use those rows to calculate the 2nd FFT. The same holds in Figure 3.5(b). Clearly, that is not the case for the other two versions (see Figures 3.3(b) and 3.6(b)).


Figure 3.14 2nd FFT results on 16 processors of an HPCx node (time in log s against N, problem size N²)


3.4.4 Total execution time of the 2D FFT

In Figure 3.16 we present the total average execution time for the computation of the 2D FFT on 16 processors of an HPCx node.

• As mentioned before, the larger the problem size is, the longer it takes to compute the 2D FFT.

• The second thing we observe is that the best version for small problem sizes up to 64² is the first version with its loops exchanged. Even though this version performs well on the calculation of the 2nd FFT, for larger problems the strided reads and the false sharing on reading become crucial to performance.

• The MPI version, the first OpenMP version and the same version with the loops exchanged clearly have the best performance for problem sizes between 64² and 512². This is because these three versions have both fast transposition and fast 2nd FFT timings compared to all the others.


Figure 3.15 Performance of the 2nd FFT (Rperf against N, problem size N²)

Figure 3.16 2D FFT total execution time results (time in log s against N, problem size N²)



• As the problem size increases up to 8192², the opening of N parallel regions becomes insignificant, and the second OpenMP version, along with the first one and the MPI strategy, have the best performance. False sharing on reading pushes the versions with the exchanged loops towards worse performance.

• As predicted, and also shown in [18], the Strided strategy has a very poor performance.

3.4.5 The Nested parallelism version

In Figure 3.17 we present the total average execution time for the computation of the 2D FFT on 16 processors of an HPCx node with Nested parallelism. There are 5 possible subversions of the Nested parallelism, depending on the number of threads at each level of parallelism. For comparison, we also present the first OpenMP version, where only the first loop is parallelised.

• It is clear that all the Nested parallelism subversions perform poorly for all problem sizes. A detailed performance analysis of the partial results shows that the execution times for the transposition and the calculation of the 2nd FFT are quite disappointing. The main limiting factor for the total results is the calculation of the 2nd FFT, which is up to 10 times slower compared to the other OpenMP versions. This can be explained by the fact that with the Nested parallelism, after the transposition of the initial array, the data are all cached on the wrong processors.

• The second thing we observe is that the 1 × 16 subversion has the worst performance for small problem sizes. This can be explained by the fact that (see Table 3.1) false sharing becomes crucial for small problem sizes. However, we observe that as the problem size increases, it becomes less and less significant, although the total performance remains at low levels.

Concerning the same version with the loops exchanged, the results and the analysis are quite similar.


Figure 3.17 2D FFT total execution time results with Nested parallelism on 16 processors of an HPCx node (time in log s against N, problem size N²)

3.4.6 Scaling, Optimisation flags and Data creation

In this subsection, and for the sake of completeness, we present results concerning the scalability of the code, the use of optimisation flags, and the way that the initial data is created.

When building a parallel code, it is not always guaranteed that the more processors we use, the better the performance we get. That is why we need to ensure that our code scales well when adding more and more processors. Figure 3.18 presents the parallel speed-up of our code for different problem sizes. The results for all the versions are very similar, so we choose to present the results for the first OpenMP version, which illustrates a good scaling of the code. We also observe that for small problem sizes, the parallel speed-up remains at low levels. However, as the problem size grows, the speed-up increases and then stabilises. As explained previously, this phenomenon can be attributed to false sharing, which depends on the problem size and the number of processors.
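For reference, the speed-up plotted in Figure 3.18 is taken here to be the usual ratio of single-processor to P-processor execution time (the standard definition, which the text does not restate explicitly):

S(P) = T(1) / T(P),

where T(P) denotes the execution time of the code on P processors.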

The next results presented concern the use of optimisation flags in our code. There are different flags for different levels of general optimisation. Note that the optimisation flags alter three important characteristics: the runtime of the compiled application, the length of time that the compilation takes, and the amount of debugging that is possible with the final binary.


Figure 3.18 2D FFT scaling test from 1 to 16 processors with the first OpenMP version (speed-up against N, problem size N²)

Figure 3.19 presents the execution time results with 4 different optimisation flags, -O2, -O3, -O3 -qhot and -O4, on 16 processors. Once again, the results for all the versions are very similar, so we choose to present the results for the 1st OpenMP version. Figure 3.19 demonstrates no significant differences in the execution times between the different optimisation flags.

The last results presented concern the way that the initial data is created: in serial or in parallel. Figure 3.20 presents the execution time results for the two different ways of implementing the data creation. In the serial case, the data is created only by the Master thread, while in the parallel version, the data is created by each thread. As shown clearly in Figure 3.20, there are no significant differences between the two implementations. Once again, the results for all the versions are quite similar, so we choose to present the results for the 1st OpenMP version.

3.5 Synopsis

We proposed and investigated three basic strategies for parallelising the 2D FFT. The performance of the Strided strategy was poor compared to the Transposition and MPI strategies. In addition, we illustrated how transposition factors can influence the performance of versions implemented in OpenMP.


Figure 3.19 2D FFT total execution time results with optimisation flags on 16 processors (time in log s against N, problem size N²)

Figure 3.20 2D FFT total execution time results with different data creation on 16 processors (time in log s against N, problem size N²)


The two versions with Nested parallelism demonstrated a poor performance compared to the other versions of the Transposition strategy. Moreover, the partial results of the 2D FFT demonstrated a superiority of the MPI strategy on the calculations of the 1st and 2nd FFT, for problem sizes up to 2048².

Concerning the overall performance of the 2D FFT, we illustrated that the best version for small problem sizes up to 64² is the first OpenMP version with the loops exchanged. The MPI version, the first OpenMP version and the same version with the loops exchanged clearly have the best performance for problem sizes between 64² and 512². As the problem size increases up to 8192², we demonstrated that the opening of N parallel regions becomes insignificant, and the second OpenMP version, along with the first one and the MPI, have the best overall performance.


Chapter 4

Parallel 3D FFT between nodes using the Hybrid model

In this chapter, we present our investigations on optimising the parallel 3D FFT with the Hybrid, a mixed mode programming model between shared memory programming and message passing, based on the results of the previous chapter.

One of the main advantages of pure MPI implementations is their high portability, which makes them easily transferable to a wide range of SMP systems. However, when we can utilise an SMP architecture, we must think carefully about whether a pure MPI implementation, with the communication it requires, is the best choice within an SMP node.

Theoretically speaking, within an SMP node, shared memory programming, which avoids the communication between the processors, should be more efficient than MPI. In a case where we are able to divide the whole procedure into two processing steps, where the first one only requires communication within the nodes and the second one between them, we could use shared memory programming for the first step, and the message passing model for the second one. This programming model, which combines shared memory and message passing, is called mixed mode programming. In certain cases, this model might prove beneficial, as it matches the SMP architecture better than the pure MPI model.

In conclusion, by utilising a mixed mode programming model, we should be able to exploit the benefits of both models. Concerning our conclusions from Chapter 3, many cases did show a better performance with OpenMP than with MPI, so it is possible that, overall, we may benefit from using a mixed mode programming model. In particular, on the one hand, we could exploit the shared access for the communication inside the nodes to calculate the 2D FFT, based on our results from the 2D case, and on the other hand, in preparation for the calculation of the 3D FFT, apply the message passing model by communicating between nodes.



4.1 The Hybrid model

The Hybrid model attempts to exploit features of the SMP cluster architecture, thus resulting in a more efficient parallelisation strategy, possibly by combining the advantages of both the OpenMP and MPI parallelisation strategies [19]. Hence, the main idea is to parallelise the 3D FFT with the Hybrid model, by extending our investigations on the 2D FFT. Even if the 2D case shows superiority of pure MPI for specific problem sizes, it is still very interesting to study the Hybrid model for the 3D case, because it is possible that it performs best in total.

There are three mixed mode styles implemented for project purposes in [16]. For our investigations we use one of them: the Master-only style. Master-only is the easiest way to implement the Hybrid model. The difference from other styles, like the Funneled and the Multiple model, is in the way the communication is handled. In Master-only, all communication between the tasks occurs outside the OpenMP parallel regions and is performed only by the master thread. Hence, all other threads are idle during the communication routine and, obviously, processor cycles are wasted. In contrast to the Master-only style, where the communication occurs outside the parallel region, the Funneled style allows the communication, still handled by only one OpenMP thread, to occur inside a parallel region. This also allows the other threads to compute during this communication. The last model proposed in [16] is the Multiple model, in which many threads are allowed to communicate.
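As a minimal sketch of the Master-only pattern (with hypothetical names master_only_step, compute_with_openmp, sbuf, rbuf and count), one communication step has roughly the following structure; the MPI call sits outside any OpenMP parallel region and is therefore executed by the master thread of each MPI task only.

#include <mpi.h>
#include <omp.h>

void master_only_step(double *sbuf, double *rbuf, int count, MPI_Comm comm)
{
    /* Computation phase: all OpenMP threads of the task work on shared data. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        (void)tid;   /* compute_with_openmp(tid) would go here */
    }

    /* Communication phase: outside the parallel region, so only the master
     * thread of each MPI task reaches this collective call; the remaining
     * threads sit idle until it completes.                                  */
    MPI_Alltoall(sbuf, count, MPI_DOUBLE, rbuf, count, MPI_DOUBLE, comm);
}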

It turns out that there are many situations where the Hybrid model parallelisation strategy could prove beneficial against a pure MPI implementation on an SMP system [19].

One of these situations regards implementations which perform badly when increasing the number of processors. In a case where the OpenMP version of the same problem scales well, then by applying a mixed mode programming scheme we expect to gain a performance improvement. It is also possible that both versions have poor scaling, but for different reasons: the MPI implementation may suffer from communication overheads, while the OpenMP version suffers from false sharing. Hence, by using a version where we apply a mixed mode programming scheme, it is possible that we gain a performance improvement as the number of processors increases, before either of the above problems becomes significant or even apparent [19].


Figure 4.1 The first two processing steps for the parallel 3D FFT using the Hybrid (2 MPI Tasks × 2 OpenMP threads)


Nowadays, the SMP architecture allows engineers to build systems with an increasing number of processors. There are certain situations where the scaling of existing pure MPI implementations does not match the increase in processor counts on new platforms. Implementing a mixed mode code may prove beneficial and, in certain cases, necessary. It would reduce the number of MPI processes, as these could be replaced with OpenMP threads.

One last possible benefit, which follows from the above, is that, as the number of MPI processes is decreased, the amount of communication needed between the processors is also decreased. This is likely to lead to larger messages between fewer MPI tasks, and thus lower demands on the network adapters of the SMP node.

Imagine that the first two steps, shown in Figure 2.2, could be implemented inside the SMP nodes. The data kept by each row of the virtual processor grid (e.g. first row: P0 and P1 in Figure 2.2) could be kept in the shared memory of an SMP node. Therefore, for the first two steps of the algorithm, OpenMP could be used to access the shared data and compute the 1st and 2nd FFT in parallel (see equation (2.15)) as shown in Figure 4.1. The results from the 2D study can be used to optimise this part of the calculation. After that, in preparation for the final step of the algorithm, where the processors within the same column of the virtual processor grid (e.g. first column: P0 and P2 in Figure 2.2) perform the All-to-All communication, the MPI tasks can instead perform an All-to-All call as shown in Figure 4.2 to swap the x and y dimensions. The whole procedure is presented in Listing 4.1.



1. 1D FFT along the z dimension
2. Transposition of the z- and y-axes using shared memory access
3. 1D FFT along the y dimension
4. Transposition of the y- and x-axes using a single All-to-All call:
   pack the All-to-All send buffer in parallel by the OpenMP threads,
   perform the All-to-All by the MPI tasks,
   unpack the All-to-All receive buffer in parallel by the OpenMP threads
5. 1D FFT along the x dimension

Listing 4.1 Parallelising the 3D FFT with the Hybrid

The first important argument for using the Hybrid model is that it gives us the flexibility to use the slab data decomposition in a problem where the number of available processors is smaller than the number of available slabs. Since the 2D decomposition approach pays off only if more processors are available than the slab decomposition can utilise, for the cases where both slab and 2D data decompositions are applicable, we would expect the slab decomposition with the Hybrid to have a better performance.

The second important argument for using the Hybrid model is that, at some point, we can avoid the limitations which the slab decomposition approach introduces. The maximum number of processors is clearly not limited by the maximum number of available slabs, since for each slice of data we can utilise as many threads as the hardware architecture can afford. In other words, for each SMP node we must use at least one MPI task, and the maximum number of OpenMP threads we can utilise is the number of processors per node. Hence, the number of available slabs must be greater than or equal to the number of MPI tasks used.

The third argument for using the Hybrid model is that there is only a single All-to-All call between the MPI tasks, in contrast to the 2D decomposition (see Figure 4.2). With the use of Master-only there is no longer an All-to-All in each of the columns of the virtual processor grid as in the 2D decomposition. Hence, overall, it could prove beneficial because the Hybrid model can be used for larger messages and can thus efficiently utilise the network adapters attached to the compute nodes of the HPCx system. With the Master-only style, only the master thread handles the communication and it is therefore able to send larger chunks of data. Furthermore, as there will not be multiple threads trying to send data, it could avoid conflicts on the network adapter.



Regarding the time distribution of the parallel 3D FFT using the 2D data decomposition, it is reported in [18] that on 256 processors for a 128³ problem size, in preparation for the calculation of the 3rd FFT, the packing and unpacking/rearranging of the columns alone take 20% of the total execution time. Hence, with the use of the Master-only style, if the number of MPI tasks is relatively small compared to the problem size, the straightforward solution of local packing and unpacking in serial by the MPI tasks could prove disastrous for the overall performance. By utilising the OpenMP threads, we are able to implement the local packing and unpacking of each MPI task in parallel, and thus not waste processor cycles.

In conclusion, the flexibility of the Hybrid model, the fact that we can avoid the limitations that the slab decomposition introduces, the single All-to-All call between the MPI tasks and the parallelisation of the local packing and unpacking lead us to the following hypothesis for the parallel 3D FFT:

Hypothesis

1. If the number of available processors P is smaller than the number of available slabs N, the slab decomposition with the Hybrid model is more likely to prove beneficial compared to the 2D data decomposition with pure MPI.

2. If the number of available processors P is greater than the number of available slabs N, the Hybrid model could prove beneficial compared to the 2D data decomposition with pure MPI.

In this chapter, we investigate the best possible combination of MPI Tasks × OpenMP threads in order to efficiently parallelise the 3D FFT. Recall that we must use at least one MPI task per node and that the number of MPI tasks multiplied by the number of OpenMP threads must equal the total number of processors used. The question is what the optimal number of MPI tasks is for different processor counts and problem sizes. Finally, we make an attempt to compare our results with the Hybrid model to the pure MPI implementation, as studied in [18], and to test our hypothesis.


Figure 4.2 In preparation for the final processing step of the Hybrid model, a single All-to-All call between 2 MPI tasks to swap the x and y dimensions


4.2 Experimental design

In this section, we describe in detail the implementation for our investigation. We implemented a program in C as a test for detailed performance analysis of the parallel 3D FFT, in order to exploit the possible benefits of the mixed mode parallel model as described in section 4.1.

4.2.1 The Master-only implementation of the test program

The 3D input array of size Nx/P × Ny × Nz is declared dynamically by each MPI task and occupies contiguous memory for each of the Nx/P slices (y and z dimensions), as presented in Listing 4.2. The complex 3D input data for the parallel FFT algorithm is constructed again using the FFTW library's fftw_complex type of double precision complex numbers. The data is created in parallel by each MPI task, using an easy-to-verify trigonometric function which we will discuss later on. The values for the complex 3D input array are calculated based on the x dimension of the 3D input array that each MPI task will handle to compute the 1st and 2nd FFT along the y and z dimensions. Once again, the initialisation of the input data is not included in our timing measurements. Hence, we simulate an environment where the SMP nodes already have the input data in their shared memory.

complex ***input;

input = malloc((Nx/P) * sizeof(complex **));
for (int j = 0; j < Nx/P; j++) {
    input[j] = malloc(Nz * sizeof(complex *));
    input[j][0] = malloc(Nz * Ny * sizeof(complex));
    for (int i = 1; i < Nz; i++)
        input[j][i] = input[j][0] + i * Ny;
}

Listing 4.2 3D dynamic allocation of contiguous memory for the z and y dimensions

In order to allow a detailed performance analysis of our test program, we introduced a set of timers to measure the time taken to execute each step of the whole procedure. The MPI function MPI_Wtime() was used for all the measurements made in the code.

The timings are taken over a number of iterations, between 20 and 200 depending on the problem size. In the beginning, a number of iterations are considered to be warm-up runs and therefore they are not included in the results.


Note that the MPI_Wtime() function is called outside any parallel region. That means that it is called only by the MPI tasks. In order to avoid distortion, at the beginning of each iteration and before any timer, the MPI tasks are synchronised with the use of a global barrier.

On each iteration we collect the timing for 8 step timers from the MPI tasks only. The step timers are: the beginning of the procedure, the end of the 1st FFT, the end of the transposition, the end of the 2nd FFT, the end of the packing of the All-to-All send buffer, the end of the All-to-All call, the end of the unpacking of the All-to-All receive buffer and the end of the whole procedure. The end of the 1st FFT minus the beginning of the procedure gives us the timing for the 1st FFT. The end of the transposition minus the end of the 1st FFT gives us the timing for the transposition. The end of the 2nd FFT minus the end of the transposition gives us the timing for the 2nd FFT. The end of the packing minus the end of the 2nd FFT gives us the timing for the packing. The end of the All-to-All minus the end of the packing gives us the timing for the All-to-All call. The end of the unpacking minus the end of the All-to-All call gives us the timing for the unpacking. The end of the whole procedure minus the end of the unpacking gives us the timing for the 3rd FFT. Finally, the end of the whole procedure minus the beginning of the procedure gives us the timing for the whole procedure. As an additional test, after the end of the procedure we insert a global barrier followed by an additional timer. Exactly as in the 2D case, the difference between the timer after the barrier and the end of the procedure shows how well our code is balanced.

On each iteration, by using a global reduction, we collect the timings for each of the step timers separately from all MPI tasks on the Master processor. This value, divided by the number of MPI tasks, is the average time across all MPI tasks for one iteration. Based on these times, the program reports for each timer three different values: the minimum, the average and the maximum, exactly as in the 2D MPI case.

4.2.2 Verification of the 3D FFT results

The 3D FFT using the Hybrid model requires a large amount of copy operations within shared memory, as well as communication between MPI tasks. It is therefore important to verify the results of the transpositions and transformations, which can be done in a straightforward way by extending the 2D verification routine. This verification is done on the final result of the 3D FFT. The output data, which is in the frequency domain, is verified against the results we expect from the given input data [13]. Just as in the 2D case, this requires that the input data is created by using a function whose behaviour under the FT is well known. The trigonometric function used to create the 3D input data is shown in (4.1).



f(x, y, z) = sin(2πax/Nx + 2πby/Ny + 2πcz/Nz),    0 ≤ a < Nx, a ∈ Z,    0 ≤ b < Ny, b ∈ Z,    0 ≤ c < Nz, c ∈ Z    (4.1)

For each program run, different values for a, b and c are chosen. This function has the property (see full mathematical details in [13]) that only two values in the output are expected to be nonzero. Equation (4.2) presents the coordinates of those two values for the DFT of f(x, y, z).

F(u, v, w) = −(1/2) i NxNyNz   if a = u, b = v, c = w
F(u, v, w) = +(1/2) i NxNyNz   if a = Nx − u, b = Ny − v, c = Nz − w
F(u, v, w) = 0                 otherwise                              (4.2)

Because the verification is done on the transposed data, the coordinates of the peak values, determined by the factors a, b and c, have to be transposed as well. The order of the transpositions in the test program is as follows:

• Create the input data: f(x, y, z)

• First FFT on the z-axis: f(x, y, z) → FZ(x, y, w)

• Transposition of the y- and z-axes: FZ(x, y, w) → FZ(x, w, y)

• Second FFT on the y-axis: FZ(x, w, y) → FYZ(x, w, v)

• Transposition of the x- and y-axes: FYZ(x, w, v) → FYZ(w, v, x)

• Third FFT on the x-axis: FYZ(w, v, x) → F(w, v, u)

The peak values from (4.2) have to be at F(w, v, u) and at F(Ny − w, Nz − v, Nx − u). As a solution to this, we implement the verification routine so as to check for the peak values at the expected transposed points (i.e. at F(w, v, u) and F(Ny − w, Nz − v, Nx − u)). The verification routine is implemented to allow for numerical errors. For double precision complex numbers the maximum relative error is defined as 1.0 × 10^-5.


4.3 Results and Analysis

In this section, we present the results of our test program run on 1 (16 processors) to 64 (1024 processors) HPCx nodes for problem sizes between 64³ and 512³. The first subsection presents our results concerning the scaling of our code to large processor counts. The next subsection presents the average execution times of our code run on the same number of processors but with different combinations of MPI Tasks × OpenMP threads. In the last subsection, we attempt to make a comparison between the results of our model and the one studied in [18].

Note that Figures 3.11 and 3.16 demonstrate a superiority of two OpenMP versions compared to the MPI strategy for problem sizes between 64³ and 512³. These two versions, which are the first OpenMP version and the same version with the two loops exchanged, were used in our implementation for the transposition of the z and y dimensions (see line 2 in Listing 4.1). Since both give quite similar performance, in this section we only present the 3D FFT results of transposing z and y with the first OpenMP version.

4.3.1 Scaling

Figure 4.3 presents the parallel speed-up of our code run on 1 to 1024 processors of the HPCx system for problem sizes between 64³ and 512³ (the reported results concern the Hybrid balanced scheme, proposed in section 4.3.3).

Generally speaking, it demonstrates a good scaling of the code for all problem sizes. For each problem size, from a certain point onwards, we observe a tail-off in performance. Especially for small problem sizes, the use of large processor counts does not offset the communication cost required between those processors. That is because the potential time saving is quite small, and so the cost of communication is greater than the gains from increasing the number of processors. However, in general, we can say that as the problem size grows, the speed-up increases. As noted in [8], any sufficiently large problem can be efficiently parallelised, and by increasing the problem size we can gain the performance improvement.

4.3.2 MPI Tasks × OpenMP threads

In Figure 4.4 we present the average execution time for the calculation of the parallel 3D FFT with different combinations of MPI Tasks × OpenMP threads on 4 nodes (64 processors) of the HPCx system.



Figure 4.3 3D FFT scaling test with the Hybrid balanced scheme run on 1 up to 1024 processors (speed-up against P, for problem sizes 64³, 128³, 256³ and 512³)

Figure 4.4 3D FFT average execution time with different combinations of MPI Tasks × OpenMP threads on 4 nodes (64 processors) of the HPCx system (time in log s against N)


It demonstrates a better performance as the number of MPI Tasks is increased. We observe that the difference in performance between different combinations decreases as the problem size gets larger. This is expected from our implementation: as the number of MPI Tasks is decreased and the number of OpenMP threads increased, Nx/Tasks, the number of slabs to be computed by each task, is increased. Hence, the number of parallel regions to be opened is also increased.

This can be explained by the fact that all the 3D arrays in our code are declared with only the z and y dimensions in contiguous memory, and not x. This means that for each slab, a parallel region must be opened by the task. From our detailed performance analysis, it is clear that this is crucial for all five steps of the algorithm (see Listing 4.1). However, as in the 2D FFT case, this effect lessens as the problem size gets larger, and the opening of parallel regions is not significant overall.

An interesting aspect, though, is what happens when larger and larger processor counts are involved. From our results, we observe that for large processor counts it is not clear that, for the same number of processors, increasing the number of MPI Tasks is the best choice. This is most likely because increasing the number of MPI Tasks also increases the All-to-All communication cost between the tasks (see line 4 in Listing 4.1).

Future work might be able to find the best possible combination of MPI Tasks × OpenMP threads by taking into consideration the cost of opening parallel regions in OpenMP and the cost of the All-to-All call between the MPI tasks.

4.3.3 Hybrid model and pure MPI

Looking at the above from a different angle, that is, the comparison between the Hybrid model and pure MPI, we can say that, in general, the flexibility of the Hybrid model is an important advantage. Since the model gives us the choice to use the slab decomposition (e.g. 64 × 1) for any problem size in which the number of available processors P is less than or equal to the number of available slabs N, then, with further optimisation of our code, it is possible that the slab decomposition with the Hybrid will perform roughly the same as the same decomposition with pure MPI.

Moreover, if P ≤ N holds, our hypothesis argues that the Hybrid will be faster than the 2D data decomposition. Recall that the 2D data decomposition is likely to pay off only if more processors are available than we can utilise using the slab decomposition. If not, the Hybrid is expected to perform better, because the first transposition of the z and y dimensions is achieved by a local re-sort, and for the second transposition between x and y there is only a single All-to-All call between the MPI tasks (see line 4 in Listing 4.1).


It is also quite interesting to see what happens with the Hybrid when P > N, and to compare it to the 2D data decomposition approach.

In order to compare our results with the Hybrid to a pure MPI implementation, the ideal and scientifically correct approach would be to have both implementations. Unfortunately, due to lack of time, it was not feasible to produce a pure MPI implementation with the 2D data decomposition. However, in order to get a feeling, rather than a strong conclusion, we do not think that it is a bad idea to compare our results to those presented in [18]. We must note that both implementations were run on the same system, HPCx, with approximately the same number of iterations, the same trigonometric function as input, the same timing function for measurements and the same FFT library, albeit a different version of it. In addition, we must note that some minor software upgrades will have happened on the HPCx system.

Figure 4.5 presents the average execution time for the calculation of the parallel 3D FFT with the Hybrid and pure MPI, as presented in [18], on 4 nodes (64 processors) of the HPCx system, for problem sizes between 64³ and 512³. Since for all problem sizes P ≤ N, we choose the combination of P MPI tasks × 1 OpenMP thread in order to check our hypothesis that it should be faster. Indeed, as expected, Figure 4.5 demonstrates a clear superiority of the slab decomposition with the Hybrid compared to the 2D data decomposition with pure MPI. The results show that our hypothesis is confirmed. We also observe that, as the problem size gets larger, the balanced combination of 8 MPI tasks × 8 OpenMP threads has a better performance than pure MPI. In certain cases we gain as much as approximately 40% in execution time when the Hybrid model is applied. Once again, for small problem sizes, the opening of N parallel regions is crucial to the overall performance.

Figure 4.5 3D FFT execution time with the Hybrid and pure MPI, as presented in [18], on 4 nodes (64 processors) of the HPCx system

Regarding the P > N case, let Pnode be the number of processors per node. As mentioned before, with the hardware architecture used, at least one MPI task must be used for each node. Concerning the combination of MPI tasks × OpenMP threads, the ideal would be to propose a scheme based on the results of the previous section. However, because we did not have the time to investigate this thoroughly in this dissertation, we propose a balanced scheme as presented below:

• If P/Pnode ≤ √P and P/Pnode ≤ N:

Consider S the ordered set of powers of 2,

S = {1, 2, 4, 8, 16, 32, ..., 2^i, ...},  i ∈ Z,  with P ∈ S.

There exists one and only one a ∈ S such that

b < √P ≤ a,

where b is the element of S that precedes a. Then

∃ c ∈ S : a · c = P and c = P/a.

Proof: there exists an even number c such that

a · c = P → mod(P, a) = 0 → P, a ∈ S.

Let P = 2^n and a = 2^k with n, k ∈ Z and n > k; then

a · c = P → c = 2^(n−k) → c ∈ S.

• If P/Pnode > √P and P/Pnode ≤ N:

a = P/Pnode and c = P/a = P/(P/Pnode) = Pnode.

Hence, for our tests, we take a MPI tasks × c OpenMP threads as the most balanced combination for the hardware architecture used. We apply this scheme for problem sizes between 64³ and 512³ on up to 64 nodes (1024 processors) of the HPCx system.
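A minimal sketch of this balanced scheme, assuming that P, P_node and N are powers of two and that P ≤ P_node · N (the function name and interface are ours, for illustration only):

/* Return the number of MPI tasks a for the balanced scheme; the number of
 * OpenMP threads per task is then c = P / a.  Assumes P and P_node are
 * powers of two and P <= P_node * N, so that P / P_node <= N. */
static int balanced_mpi_tasks(int P, int P_node)
{
    const int tasks_min = P / P_node;   /* at least one MPI task per node */
    int a = 1;

    while (a * a < P)                   /* smallest power of two a        */
        a *= 2;                         /* with sqrt(P) <= a              */

    if (a < tasks_min)                  /* case P/P_node > sqrt(P):       */
        a = tasks_min;                  /* exactly one MPI task per node  */

    return a;                           /* threads per task: c = P / a    */
}

For example, P = 64 and P_node = 16 give a = 8 and c = 8, the balanced 8 × 8 combination of Figure 4.5, while P = 1024 and P_node = 16 give a = 64 and c = 16, the 64 × 16 combination used for Figure 4.6.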

Figure 4.6 presents the average execution times on 1024 processors of the HPCx system, with 64 MPI tasks × 16 OpenMP threads, for problem sizes between 64³ and 512³; thus P > N and P/P_node > √P. It demonstrates a clear superiority of pure MPI for problem sizes between 64³ and 256³. However, using the Hybrid for the problem size of 512³ yields approximately a 50% improvement in execution time. The results for the P > N and P/P_node ≤ √P case are similar to Figure 4.6. From our detailed performance analysis and the one in [18], we note that the loss of performance in our implementation, for small problem sizes, arises mainly during the transposition of the x and y dimensions.

The results lead us to the conclusion that our second hypothesis is only confirmed for the problem size of 512³. However, the effort of mapping the virtual 2D processor grid onto the processors within an SMP node, which affects the overall performance in [18], as well as possible further optimisation of the packing and unpacking of the data before and after the All-to-All call between the MPI tasks (see line 4 in Listing 4.1) in our implementation, might be interesting aspects to investigate in the future.


[Figure 4.6: 3D FFT execution time (seconds) against N for the Hybrid 64 × 16 combination and for pure MPI, as presented in [18], on 64 nodes (1024 processors) of the HPCx system]


Chapter 5

Conclusions

As we have seen, the parallelisation of the 2D FFT consists of three basic stages: the calculation of the 1st FFT, the transposition between the two dimensions, and the calculation of the 2nd FFT. In general, our implementations with the Strided, the Transposition and the MPI strategies demonstrated good scaling of the parallel 2D FFT on one node (16 processors) of both HPCx and Ness, for problem sizes up to 8192².

The results of the proposed strategies demonstrated a superiority of the MPI strategy on the calculations of the 1st and 2nd FFT, both on HPCx and on Ness, for problem sizes up to 2048². As the problem size gets larger, this superiority seems to recede. This can be explained by the fact that, during the parallel creation of the input data, the MPI version caches the data at a higher cache level than the OpenMP versions do. Regarding the transposition time, things are more complicated, as many factors come into play.

A detailed performance analysis of the time distribution between the three basic stages of the algorithm showed that the transposition time of the MPI version always stays above 50% of the total execution time, for all problem sizes, on one HPCx node. The same percentage is always above 70% on Ness. The remainder is split equally between the calculations of the 1st and 2nd FFT. That is not the case for the OpenMP versions, where, for certain problem sizes, this percentage oscillates between 35% and 90%, depending on the version used. As we have seen, many factors can influence this percentage among the OpenMP versions, such as the problem size, false sharing, strided reads/writes and the opening of many parallel regions. However, it was demonstrated that, for the same problem size, the calculations of the 1st and 2nd FFT with the different OpenMP versions take similar time.

Let R_FFTs be the ratio between the time for the calculation of the two FFTs in the MPI version and the corresponding time in an OpenMP version, and let R_transp be the ratio between the transposition time of the MPI version and the corresponding time in an OpenMP version. Since the percentage of the transposition time in the MPI version always stays above 50% for all problem sizes, and taking into account the superiority of the MPI version on the calculations of the 1st and 2nd FFT for small problem sizes, we conclude that, for small problem sizes, an OpenMP version performs better in total only if R_transp > 1/R_FFTs. However, as the problem size gets larger and R_FFTs → 1, R_transp > 1 seems to be a quite satisfactory condition.
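As a rough sanity check of this condition (treating the measured averages of Tables A.1-A.4 for the 512² problem size as exact, first OpenMP version against the MPI version):

R_FFTs = (0.000411 + 0.000354) / (0.000458 + 0.000470) ≈ 0.82, so 1/R_FFTs ≈ 1.21,
R_transp = 0.000805 / 0.000669 ≈ 1.20.

Here R_transp falls just short of 1/R_FFTs and, consistently with the condition, Table A.4 shows the MPI version marginally ahead in total (0.001570 s against 0.001589 s).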

The gains acquired from exploiting the access to the shared memory of a node, and thus R_transp, depend on the performance of the All-to-All call on the SMP node used. Even on a system like HPCx, as long as the overall performance of the MPI strategy keeps the percentage of the transposition time above 50%, it is likely that we can exploit the shared access to the memory of multi-core and shared-memory nodes for the parallelisation of the 2D FFT.

The choice between the OpenMP versions clearly depends on the problem size. The version in which we parallelise the 1st loop, and the same version with the two loops exchanged, perform much better for problem sizes up to 512². The main reason is that the other two versions are limited by the cost of opening N parallel regions. As the problem size gets larger, this no longer seems to be a limiting factor. From a certain point onwards, and up to 8192², the versions with contiguous reads and strided writes, which are the first two versions, have the best performance.
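For illustration, a minimal sketch of the version that parallelises the 1st loop, i.e. the row FFTs of an N × N array, is given below. This is not the dissertation's exact code; it assumes FFTW3, with a single in-place plan created once (planning is not thread-safe) and re-executed by the threads through the thread-safe new-array function fftw_execute_dft (the rows must satisfy the plan's alignment requirements).

#include <fftw3.h>

/* "Parallelise the 1st loop": each OpenMP thread transforms whole rows of a
 * row-major n x n array of fftw_complex, in place. */
void row_ffts(fftw_complex *data, int n)
{
    fftw_plan plan = fftw_plan_dft_1d(n, data, data,
                                      FFTW_FORWARD, FFTW_ESTIMATE);

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        fftw_execute_dft(plan, data + (size_t)i * n, data + (size_t)i * n);

    fftw_destroy_plan(plan);
}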

We examined the parallelisation of the 3D FFT with the Hybrid, a mixed mode programming model. Our implementation uses the Master-only style, where all communication is performed only by the master thread, outside the OpenMP parallel regions. The parallelisation of the 3D FFT consists of five basic stages: the calculation of the 1st FFT, the transposition between the z and y dimensions with a local re-sort by the OpenMP threads, the calculation of the 2nd FFT, the transposition between the x and y dimensions using a single All-to-All call between the MPI tasks, and finally the calculation of the 3rd FFT. In general, our implementation demonstrated good scaling of the parallel 3D FFT on up to 64 nodes (1024 processors) of the HPCx system for problem sizes between 64³ and 512³.

We also investigated how, for the same number of processors P, different combinations of MPI tasks × OpenMP threads affect the average execution time of the parallel 3D FFT. Our results illustrated that, for a relatively small P, performance improves as the number of MPI tasks is increased. This is due to the fact that, when the number of slabs to be computed by each task is increased, the number of parallel regions to be opened is decreased. We also observed that the difference in performance between the different combinations decreases as the problem size gets larger and the opening of parallel regions becomes less significant overall. However, as P gets larger, it is not clear, within this dissertation, whether the opening of parallel regions or the All-to-All call for the last step of the algorithm is more expensive.

Furthermore, an attempt was made to compare the Hybrid model to the pure MPI implementation in [18]. Let N be the number of available slabs of the problem. On the one hand, if P ≤ N, the Hybrid gives us the flexibility to use the slab decomposition, or a more balanced combination, and so avoid the All-to-All call between the rows, which comes as an extra cost with the 2D data decomposition. On the other hand, if P > N, it also gives us the flexibility to use, for each slice of data, as many OpenMP threads as the hardware architecture can afford inside a node in order to optimise the calculation, which is a limiting factor for the slab decomposition.

The results demonstrated that, if P < N, in certain cases we gain approximately 40% in performance compared to the 2D data decomposition. In addition, we demonstrated a clear superiority of the slab decomposition with the Hybrid compared to the 2D data decomposition with pure MPI. Regarding the P > N case, we proposed the most balanced combination of MPI tasks × OpenMP threads. The results demonstrated a superiority of pure MPI for small problem sizes, up to 256³. This is most likely due to the task placement policy applied in [18] and to the fact that our implementation can be further optimised with respect to the minimisation of parallel regions and the transposition of the x and y dimensions. However, for a problem size of 512³, we gain approximately 50% in performance compared to the 2D data decomposition.

As a next step to this dissertation, concerning the parallelisation of the 2D FFT using the access to shared memory, it would be interesting to investigate in more detail, using specialised performance analysis tools, how the phenomenon of false sharing, both on reading and on writing, can influence the overall performance. Another interesting aspect of the parallel 2D FFT, which needs a more detailed investigation, is the way the data is created by the OpenMP threads. This investigation could prove beneficial for optimising the calculation of the 1st FFT for small problem sizes.

Moreover, concerning the parallelisation of the 3D FFT with the Hybrid, it would be interesting to see how the dynamic allocation of 3D arrays in contiguous memory can improve the overall performance, since the number of parallel regions would be decreased (a minimal allocation sketch is given at the end of this chapter). The possibility of optimising the parallel packing and unpacking of the data for the All-to-All call, as well as the use of other Hybrid styles, are two further aspects we would expect to improve performance. Since we demonstrated a clear superiority of the Hybrid compared to the 2D data decomposition with pure MPI for a problem size of 512³, the above optimisations could improve the performance for smaller problem sizes as well. Finally, we strongly believe that, within the scope of an academic paper, the invention of a scheme that determines the optimal combination of MPI tasks × OpenMP threads for given N and P would be a scientific innovation for the parallelisation of the 3D FFT using the shared memory nodes of modern computing equipment.
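As an illustration of the first suggestion above (a generic technique, not code from this dissertation; error handling omitted), an N × N × N array of fftw_complex can be held in one contiguous block while keeping the usual triple-index addressing:

#include <stdlib.h>
#include <fftw3.h>

/* Allocate an n x n x n array of fftw_complex in a single contiguous block,
 * plus the pointer tables that preserve a[i][j][k] indexing.  The caller can
 * later release it with fftw_free(a[0][0]), free(a[0]) and free(a). */
fftw_complex ***alloc3d(int n)
{
    fftw_complex *block    = fftw_malloc(sizeof(fftw_complex) * n * n * n);
    fftw_complex **rows    = malloc(sizeof(fftw_complex *) * n * n);
    fftw_complex ***planes = malloc(sizeof(fftw_complex **) * n);

    for (int i = 0; i < n; i++) {
        planes[i] = &rows[i * n];
        for (int j = 0; j < n; j++)
            rows[i * n + j] = &block[((size_t)i * n + j) * n];
    }
    return planes;   /* planes[i][j][k] addresses contiguous memory */
}

Because the data then sits in one block, a whole slab can presumably be handed to a single parallel region, FFTW plan or collective call without re-entering OpenMP for every plane, which is where we would expect the reduction in the number of parallel regions to come from.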


Appendix A

Parallel 2D FFT results


Problem   Parallelise   Parallelise   Loops        Loops        Strided    MPI
size      1st Loop      2nd Loop      Exchange 1   Exchange 2

16²       0.000006      0.000006      0.000006     0.000006     0.000006   0.000001
32²       0.000007      0.000007      0.000007     0.000008     0.000007   0.000001
64²       0.000010      0.000010      0.000010     0.000010     0.000010   0.000004
128²      0.000027      0.000027      0.000027     0.000027     0.000027   0.000021
256²      0.000099      0.000100      0.000099     0.000099     0.000097   0.000090
512²      0.000458      0.000461      0.000462     0.000456     0.000429   0.000411
1024²     0.002166      0.002270      0.002206     0.002205     0.002283   0.001926
2048²     0.013124      0.013280      0.013036     0.012962     0.011987   0.012340
4096²     0.061012      0.061726      0.061756     0.061154     0.059272   0.062942
8192²     0.265268      0.266548      0.264847     0.266263     0.266378   0.278328

Table A.1 Average execution times (s) for the 1st FFT on 16 processors of an HPCx node
Note: we report results to this level of accuracy because the problem size of 16² always executes in less than 1 × 10⁻⁵ s.


Problem   Parallelise   Parallelise   Loops        Loops        MPI
size      1st Loop      2nd Loop      Exchange 1   Exchange 2

16²       0.000007      0.000084      0.000007     0.000083     0.000018
32²       0.000010      0.000176      0.000009     0.000186     0.000017
64²       0.000019      0.000342      0.000014     0.000338     0.000026
128²      0.000041      0.000707      0.000031     0.000681     0.000052
256²      0.000079      0.001489      0.000154     0.001399     0.000237
512²      0.000669      0.003483      0.001622     0.003282     0.000805
1024²     0.003790      0.008939      0.007004     0.007645     0.005522
2048²     0.030986      0.026366      0.033985     0.026477     0.020891
4096²     0.145065      0.118780      0.195496     0.170411     0.151084
8192²     0.808449      0.652805      1.354438     0.880533     0.684787

Table A.2 Average execution times (s) for the transposition on 16 processors of an HPCx node


Problem   Parallelise   Parallelise   Loops        Loops        Strided    MPI
size      1st Loop      2nd Loop      Exchange 1   Exchange 2

16²       0.000006      0.000006      0.000006     0.000006     0.000009   0.000001
32²       0.000008      0.000007      0.000007     0.000008     0.000015   0.000001
64²       0.000011      0.000010      0.000010     0.000012     0.000034   0.000004
128²      0.000030      0.000026      0.000025     0.000029     0.000099   0.000019
256²      0.000105      0.000090      0.000091     0.000107     0.000482   0.000080
512²      0.000470      0.000411      0.000412     0.000467     0.003676   0.000354
1024²     0.002315      0.002211      0.002058     0.002407     0.016489   0.001768
2048²     0.013131      0.011299      0.011212     0.013534     0.093538   0.009050
4096²     0.064796      0.063387      0.065437     0.065107     0.601845   0.059644
8192²     0.257456      0.255105      0.257282     0.256828     3.411185   0.272548

Table A.3 Average execution times (s) for the 2nd FFT on 16 processors of an HPCx node


Problem   Parallelise   Parallelise   Loops        Loops        Strided    MPI
size      1st Loop      2nd Loop      Exchange 1   Exchange 2

16²       0.000020      0.000097      0.000020     0.000096     0.000015   0.000019
32²       0.000024      0.000190      0.000023     0.000201     0.000022   0.000020
64²       0.000039      0.000361      0.000033     0.000358     0.000044   0.000034
128²      0.000096      0.000760      0.000083     0.000737     0.000125   0.000092
256²      0.000282      0.001680      0.000344     0.001606     0.000580   0.000406
512²      0.001589      0.004356      0.002495     0.004206     0.004106   0.001570
1024²     0.008271      0.013420      0.011268     0.012257     0.018772   0.009215
2048²     0.057241      0.0509445     0.058233     0.052973     0.105525   0.042280
4096²     0.270873      0.2438926     0.322688     0.296671     0.661117   0.273670
8192²     1.331173      1.1744577     1.876566     1.403624     3.677564   1.235662

Table A.4 Average execution times (s) for the 2D FFT on 16 processors of an HPCx node


Processors   16²        32²        64²        128²       256²       512²       1024²      2048²      4096²      8192²

1            0.000009   0.000031   0.000119   0.000648   0.003352   0.017743   0.114502   0.599953   3.238710   14.587872
2            0.000014   0.000025   0.000073   0.000355   0.001733   0.009318   0.054626   0.328531   1.632488   7.432540
4            0.000015   0.000023   0.000047   0.000195   0.000876   0.004702   0.026887   0.184591   0.833760   3.834367
8            0.000016   0.000023   0.000038   0.000115   0.000483   0.002515   0.014010   0.101344   0.443282   1.980915
16           0.000020   0.000024   0.000039   0.000096   0.000282   0.001589   0.008271   0.057241   0.270873   1.331173

Table A.5 Average execution times (s) for scaling test of the 2D FFT with the first OpenMP version run on 1 up to 16 processors of an HPCx node


Problem   HPCx   Ness
size

16²       92%    99%
32²       86%    98%
64²       76%    97%
128²      57%    98%
256²      58%    90%
512²      51%    74%
1024²     60%    71%
2048²     50%    72%
4096²     55%    74%
8192²     55%    77%

Table A.6 HPCx vs Ness percentage (%) of the total execution time for the 2D transposition with MPI on a 16 processor node


Appendix B

Parallel 3D FFT results


Processors   64³      128³     256³     512³

1            0.0193   0.2815   2.5926   23.4577
2            0.0090   0.1257   1.3327   16.4956
4            0.0047   0.0598   0.6377   7.6696
8            0.0028   0.0262   0.3186   3.2078
16           0.0028   0.0176   0.2047   2.0439
32           0.0022   0.0164   0.1323   1.2715
64           0.0019   0.0090   0.0868   0.8170
128          0.0014   0.0063   0.0531   0.4033
256          0.0019   0.0064   0.0307   0.2558
512          0.0020   0.0046   0.0178   0.1628
1024         0.0055   0.0052   0.0184   0.1102

Table B.1 Average execution times (s) for scaling test of the 3D FFT with the Hybrid balanced scheme run on 1 up to 1024 processors of HPCx system

Problem   4 × 16   8 × 8    16 × 4   32 × 2   64 × 1
size

64³       0.0019   0.0019   0.0017   0.0018   0.0011
128³      0.0107   0.0090   0.0098   0.0090   0.0066
256³      0.0972   0.0868   0.0954   0.0887   0.0726
512³      0.9908   0.8170   0.6771   0.6489   0.6716

Table B.2 Average execution times (s) with a different combination of MPI tasks × OpenMP threads on 4 nodes (64 processors) of HPCx system

Problem   64 × 1   8 × 8    MPI
size

64³       0.0011   0.0019   0.0010
128³      0.0066   0.0090   0.0073
256³      0.0726   0.0868   0.1231
512³      0.6716   0.8170   1.4773

Table B.3 Average execution times (s) with the Hybrid model and pure MPI, as presented in [18], on 4 nodes (64 processors) of HPCx system


Problem   Hybrid   MPI
size

64³       0.0055   0.0006
128³      0.0052   0.0015
256³      0.0184   0.0085
512³      0.1102   0.1908

Table B.4 Average execution times (s) with the Hybrid balanced scheme and pure MPI, as presented in [18], on 64 nodes (1024 processors) of HPCx system


Bibliography

[1] Bailey D.H., A High-Performance FFT Algorithm for Vector Supercomputers, International Journal of Supercomputer Applications, vol. 2, no. 1, 1988, pp. 82-87.

[2] Bull M., Shared Memory Programming, EPCC Course Slides, Version 1.4, 2008.

[3] Cooley J.W. and Tukey J., An algorithm for the machine calculation of complex Fourier series, Math. Comput. 19, 1965, pp. 297-301.

[4] Eleftheriou M., Moreira J.E., Fitch B.G., Germain R.S., A Volumetric FFT for BlueGene/L, Lecture Notes in Computer Science, Volume 2913, 2003, pp. 194-203.

[5] EPCC website, http://www.epcc.ed.ac.uk

[6] Frigo M. and Johnson S.G., FFTW Documentation, Version 3.1.2, 2006.

[7] Gauss C.F., Nachlass: Theoria interpolationis methodo nova tractata, Werke, Band 3, Göttingen: Königliche Gesellschaft der Wissenschaften, 1866, pp. 265-327.

[8] Gustafson J.L., Reevaluating Amdahl's Law, CACM, 31(5), 1988, pp. 532-533.

[9] Hein J., Performance Scaling on Modern HPC Architectures, EPCC Course Slides, 2008.

[10] Henty D., Applied Numerical Algorithms, EPCC Course Slides, 2008.

[11] HPCx website, http://www.hpcx.ac.uk

[12] IBM website, http://www.ibm.com

[13] Jacode H., Fourier Transforms for the BlueGene/L Communication Network, EPCC MSc Dissertation, 2006.

[14] James J.F., A Student's Guide to Fourier Transforms: With Applications in Physics and Engineering, Cambridge University Press, 2002.

[15] Kumar V., Grama A., Gupta A., Karypis G., Introduction to Parallel Computing: Design and Analysis of Parallel Algorithms, Benjamin-Cummings Pub. Co., 1994.

[16] Piotrowski M., Mixed Mode Programming on Clustered SMP Systems, EPCC MSc Dissertation, 2005.

[17] Press W.H., Teukolsky S.A., Vetterling W.T., Flannery B.P., Numerical Recipes 3rd Edition: The Art of Scientific Computing, 2007.

[18] Sigrist U., Optimising parallel 3D Fast Fourier Transformations for a cluster of IBM POWER SMP nodes, EPCC MSc Dissertation, 2007.

[19] Smith L.A. and Bull M., Development of mixed mode MPI / OpenMP applications, Scientific Programming 9, 2001, pp. 83-98.