TRANSCRIPT
Parallelisation of Random Number Generation in PLACET
Approaches of parallelisation in PLACET
Martin Blaha, University of Vienna, AT
CERN 25.09.2013
Additions to centralised RNG
• Tcl command RandomReset
o sets seeds for all streams individually:
RandomReset -stream Misalignments -seed 1234
o sets default seeds (= reset) if called without arguments
o sets the generators
o replaces redundant Tcl commands that set seeds (e.g. Groundmotion_init)
o Help option that lists all streams
• Benchmarks on non-parallelised code
o GSL causes a slowdown of at most 3%, depending on the generator
Motivation for parallel execution
Runtimes of simulations are slow!
“Low-performance” functions:
● SBEND
● QUADRUPOLE
● MULTIPOLE
● ELEMENT
→ they call the RNGs for synchrotron radiation emission
Profile by Yngve Levinsen, Feb. 2013
Parallel Random Number Generation
Problems
requesting random numbers from one sequential stream for parallel use is uncontrollable
it must be controllable and reproducible
GSL random number generators do not support parallel generation by themselves
Methods for parallel random number generation
● centralized generation
● replicated generation
● distributed generation
● existing libraries
Centralized RNG
One generator produces all numbers
Advantages:
only one RNG with a good sequence is needed
easy implementation
Disadvantages:
race conditions occur
fair play is not guaranteed, or the program crashes (not stable)
slow if queueing (can be even slower than a single thread)
Replicated RNG
Initial RNG is copied for each thread
Advantages:
more efficient
easy implementation
Disadvantage:
can suffer from correlations between threads
Distributed RNG
Each thread has its own generator
Advantages:
efficient: each thread can work standalone
threadsafe
reproducible
Disadvantage:
can suffer from correlations
Existing Libraries
SPRNG (Florida State University)
hard to find good documentation on how to combine it with parallel code, e.g. OpenMP
PRAND
for CUDA environments on GPU and CPU
good documentation on RNGs in general
Disadvantage: yet another library
Distributed RNG
Summary:
distributed generation is considered the best fit for our needs
Common methods known to produce satisfactory outcomes:
1. Random Tree Method
2. Block Splitting
3. Leapfrog Method
Random Tree Method
• Global RNG for seeding
• Standalone RNG per thread
• Reproducible for a known number of threads
new Tcl command to set the number of threads
→ only runs fair for the same number of threads, not for dynamic thread assignment
Block Splitting
Splits a sequence of random numbers into blocks
Advantages:
no overlap in random numbers
plays fair
Disadvantages:
allocates a huge array of numbers
number of RNs has to be known in advance
Leapfrog Method
Distributes a sequence of random numbers over several threads, one number at a time
Advantages:
the number of RNs need not be known in advance
guarantees no overlap of RNs
plays fair, though the order of calls may be permuted
Disadvantage:
drawing random numbers is costly
Block splitting vs. Leapfrog
Block splitting and leapfrog run fair with dynamic thread assignment
Problem: implementation in a distributed, non-centralised way
Period per thread is the period of the RNG divided by the number of threads
Testing parallel RNG methods
SPEEDUP: runtime reduced by up to 33.3% for the random tree method
only overhead remains for nosynrad and small numbers of particles
SLOWDOWN: runtime increased by up to 120% for the leapfrog method
due to drawing more numbers than needed
Testing via test-bds-track for 300 000 particles, with quadrupoles and multipoles
Preparation
Tool for parallelisation: OpenMP
easy implementation
control of variable scope, assignment schedule, critical sections
Preparation: Centralising the synrad functions
Two functions calculate synchrotron radiation emission:
synrad.cc
photon_spectrum.cc
Centralised for easier and reproducible use of the parallel RNG
synrad.cc has been removed
Tested via test-bds-track for 3e5 particles; same outcome
Implementation of new class
New class PARALLEL_RNG
Inherits all methods from RANDOM_NEW
Initialises the parallel RNG on the maximum number of available threads by default
New Tcl command ParallelThreads -num val to choose the number of threads
The RNG stream Radiation now runs completely in parallel by default
Testing – BDS tracking
Covariance Matrix of test-bds-track
Testing via test-bds-track for 300 000 particles, with quadrupoles and multipoles
Testing – CLIC beam tracking
Beam tracking with no correction; beam tracking with simple correction (plot panels)
Testing via test-clic-3 for 3500 machines
Time Profile
Total runtime on 32 cores: 27 s
Total runtime on 1 core: 1 min 21 s
Total runtime of current PLACET: 58 s
BDS tracking:
PLACET: 39 s
PLACET-NEW: ~9 s
BDS tracking is 3.5 times faster
Time profile for BDS tracking of 300 000 particles
2× Intel Xeon E5-2650 2.00 GHz 8-core (16 with hyper-threading) (95 W, 20 MB, 2.8 GHz Turbo, Sandy Bridge EP)
Profiling
BDS:
SBEND, ELEMENT, MULTIPOLE and QUADRUPOLE are still the most time-consuming functions
Linac:
the OMP library causes a slowdown in simple-correction routines (e.g. test-clic-4)
76% of the time consumption is caused by OpenMP in wait_sleep
It was necessary to find a compromise!
Conclusion
BDS runs ~30 % faster (total runtime)
CLIC 4 runs ~13 % faster
Compared to the current PLACET in the trunk
OpenMP is a quick and easy way to parallelise existing functions
Future Plan
• Need to understand the overhead when running sequentially
• Benchmark performance of quick functions e.g. dipoles, drifts, step-in, BPMs
• Adjust automatically to current configuration
• Write technical/user documentation
• Merge into trunk