TRANSCRIPT
Parallelisation of Random Number Generation in PLACET
Approaches of parallelisation in PLACET
Martin Blaha, University of Vienna, AT
CERN 25.09.2013
Additions to centralised RNG
• Tcl command RandomReset
o sets seeds for all streams individually:
RandomReset -stream Misalignments -seed 1234
o sets default seeds (= reset) if called without arguments
o sets the generators
o replaces redundant Tcl commands that set seeds (e.g. Groundmotion_init)
o Help option that lists all streams
• Benchmarks on non-parallelised code
o GSL causes a slowdown of at most 3%, depending on the generator
Motivation for parallel execution
Runtimes of simulations are slow!
“Low-performance” functions:
● SBEND
● QUADRUPOLE
● MULTIPOLE
● ELEMENT
→ they call the RNGs for synchrotron radiation emission
Profile by Yngve Levinsen, Feb. 2013
Parallel Random Number Generation
Problems
requesting random numbers from one sequential stream for parallel use is uncontrollable
it must be controllable and reproducible
GSL random number generators do not support parallel generation by themselves
Methods for parallel random number generation
● centralized generation
● replicated generation
● distributed generation
● existing libraries
Centralized RNG
One generator produces all numbers
Advantages:
only one RNG with a good sequence is needed
easy implementation
Disadvantages:
race conditions occur
fair play is not guaranteed, or the program crashes (not stable)
slow if queueing (can be even slower than a single thread)
Replicated RNG
Initial RNG is copied for each thread
Advantages:
more efficient
easy implementation
Disadvantage:
can suffer from correlations between threads
Distributed RNG
Each thread has its own generator
Advantages:
efficient: each thread can work standalone
threadsafe
reproducible
Disadvantage:
can suffer from correlations
Existing Libraries
SPRNG (Florida State University)
hard to find good documentation on how to combine it with parallel code, e.g. OpenMP
PRAND
for CUDA environments on GPU and CPU
good documentation on RNGs in general
Disadvantage: yet another library
Distributed RNG
Summary:
distributed generation is considered the best fit for our needs
Common methods known to produce satisfactory outcomes:
1. Random Tree Method
2. Block Splitting
3. Leapfrog Method
Random Tree Method
• Global RNG for seeding
• Standalone RNG per thread
• Reproducible for a known number of threads
new Tcl command to set the number of threads
→ only runs fair for the same number of threads, not for dynamic thread assignment
Block Splitting
Splits a sequence of random numbers into blocks
Advantages:
no overlap in random numbers
plays fair
Disadvantages:
allocates a huge array of numbers
number of RNs has to be known in advance
Leapfrog Method
Distributes a sequence of random numbers over several threads, one number at a time
Advantages:
the number of RNs need not be known in advance
guarantees no overlap of RNs
plays fair, though the order of calls may be permuted
Disadvantage:
drawing random numbers is costly
Block splitting vs. Leapfrog
Block splitting and leapfrog run fair with dynamic thread assignment
Problem: implementation in a distributed, non-centralised way
Period per thread is the period of the RNG divided by the number of threads
Testing parallel RNG methods
SPEEDUP: runtime reduced by up to 33.3% for the random tree method
only overhead remains for nosynrad and small numbers of particles
SLOWDOWN: runtime increased by up to 120% for the leapfrog method
due to drawing more numbers than needed
Testing via test-bds-track for 300 000 particles, with quadrupoles and multipoles
Preparation
Tool for parallelisation: OpenMP
easy implementation
control of variable scope, assignment schedule, critical sections
Preparation: Centralising the synrad functions
Two functions calculate synchrotron radiation emission:
synrad.cc
photon_spectrum.cc
Centralised for easier and reproducible use of the parallel RNG
synrad.cc has been removed
Tested via test-bds-track for 3e5 particles; same outcome
Implementation of new class
New class PARALLEL_RNG
Inherits all methods from RANDOM_NEW
Initialises the parallel RNG on the maximum number of available threads by default
New Tcl command ParallelThreads -num val to choose the number of threads
The RNG stream Radiation now runs completely in parallel by default
Testing – BDS tracking
Covariance Matrix of test-bds-track
Testing via test-bds-track for 300 000 particles, with quadrupoles and multipoles
Testing – CLIC beam tracking
Beam tracking with no correction; beam tracking with simple correction (plot panels)
Testing via test-clic-3 for 3500 machines
Time Profile
Total runtime on 32 cores: 27 s
Total runtime on 1 core: 1 min 21 s
Total runtime of current PLACET: 58 s
BDS tracking:
PLACET: 39 s
PLACET-NEW: ~9 s
BDS tracking is 3.5 times faster
Time profile for BDS tracking of 300 000 particles
2× Intel Xeon E5-2650 2.00 GHz 8-core (16 with hyper-threading) (95 W, 20 MB, 2.8 GHz Turbo, Sandy Bridge EP)
Profiling
BDS:
SBEND, ELEMENT, MULTIPOLE and QUADRUPOLE are still the most time-consuming functions
Linac:
the OMP library causes a slowdown in simple-correction routines (e.g. test-clic-4)
76% of the time consumption is caused by OpenMP in wait_sleep
It was necessary to find a compromise!
Conclusion
BDS runs ~30 % faster (total runtime)
CLIC 4 runs ~13 % faster
Compared to the current PLACET in the trunk
OpenMP is a quick and easy way to parallelise existing functions
Future Plan
• Need to understand the overhead when running sequentially
• Benchmark performance of quick functions e.g. dipoles, drifts, step-in, BPMs
• Adjust automatically to current configuration
• Write technical/user documentation
• Merge into trunk