high performance computing - tu wien · 2016. 7. 1. · ss16 ©jesper larsson träff high...
©Jesper Larsson Träff SS16
High Performance Computing Introduction, overview
Jesper Larsson Träff, Sascha Hunold, Alexandra Carpen-Amarie {traff,hunold,carpenamarie}@par.tuwien.ac.at
Parallel Computing, 184-5
Favoritenstrasse 16, 3rd floor
Office hours: by appointment via email
High Performance Computing: A (biased) overview
Concerns: Either
1. Achieving highest possible performance as needed by some application(s)
2. Getting highest possible performance out of given (highly parallel) system
•Ad 1: Anything goes, including designing and building new systems; raw (application) performance matters
•Ad 2: Understanding and exploiting details at all levels of the given system
Ad 2:
•Understanding modern processors: processor architecture, memory system, single-core performance, multi-core parallelism
•Understanding parallel computers: communication networks
•Programming parallel systems: algorithms, interfaces, tools, tricks
All issues at all levels are relevant
but not always to the same extent and at the same time
Typical “Scientific Computing” applications
•Climate (simulations: coupled models, multi-scale, multi-physics)
•Earth Science
•Long-term weather forecast
•Nuclear physics
•Computational chemistry
•Computational astronomy
•Computational fluid dynamics
•Protein folding, Molecular Dynamics (MD)
•Cryptography (code-breaking)
•Weapons (design, nuclear stockpile), defense (“National Security”)
Qualified estimates say these problems require Teraflops, Petaflops, …
Other, newer “High-Performance Computing” applications
Data analytics (Google, Amazon, FB, …), “big data”
Irregular data (graphs), irregular access patterns (graph algorithms)
Ad. 1: Special purpose HPC systems for Molecular Dynamics
Special purpose computers have a history in HPC
“Colossus” replica, Tony Sale 2006
N-body computations of forces between molecules to determine movements: special type of computation with specialized algorithms that could potentially be executed orders of magnitude more efficiently on special-purpose hardware
MDGRAPE-3: PetaFLOPS performance in 2006, more than 3 times faster than BlueGene/L (Top500 #1 at that time)
MDGRAPE-4: Last in the series of a Japanese project of MD supercomputers (RIKEN)
[Ohmura I, Morimoto G, Ohno Y, Hasegawa A, Taiji M. 2014. MDGRAPE-4: A special-purpose computer system for molecular dynamics simulations. Phil. Trans. R. Soc. A 372: 20130387. http://dx.doi.org/10.1098/rsta.2013.0387]
Anton (van Leeuwenhoek): Another special purpose MD system
512-node (8x8x8 torus) Anton machine
D. E. Shaw Research (DESRES)
Special purpose Anton chip (ASIC)
From Encyclopedia of Parallel Computing: “Prior to Anton’s completion, few reported all-atom protein simulations had reached 2μs, the longest being a 10-μs simulation that took over 3 months on the NCSA Abe supercomputer […]. On June 1, 2009, Anton completed the first millisecond-long simulation – more than 100 times longer than any reported previously.”
[Brian Towles, J. P. Grossman, Brian Greskamp, David E. Shaw: Unifying on-chip and inter-node switching within the Anton 2 network. ISCA 2014: 1-12]
[David E. Shaw, Martin M. Deneroff, Ron O. Dror, Jeffrey Kuskin, Richard H. Larson, John K. Salmon, Cliff Young, Brannon Batson, Kevin J. Bowers, Jack C. Chao, Michael P. Eastwood, Joseph Gagliardo, J. P. Grossman, Richard C. Ho, Doug Ierardi, István Kolossváry, John L. Klepeis, Timothy Layman, Christine McLeavey, Mark A. Moraes, Rolf Mueller, Edward C. Priest, Yibing Shan, Jochen Spengler, Michael Theobald, Brian Towles, Stanley C. Wang: Anton, a special-purpose machine for molecular dynamics simulation. Commun. ACM 51(7): 91-97 (2008)]
[Ron O. Dror, Cliff Young, David E. Shaw: Anton, A Special-Purpose Molecular Simulation Machine. Encyclopedia of Parallel Computing 2011: 60-71]
Ad 1.: Special purpose to general purpose
Special purpose advantages:
•Higher performance (FLOPS) for special types of computations/applications
•More efficient (energy, number of transistors, …)
Special-purpose processors sometimes have wider applicability:
•Graphics processing units (GPU) for general purpose computing (GPGPU)
•Field Programmable Gate Arrays (FPGA)
HPC systems: Special purpose processors as accelerators
General purpose MD packages
•GROMACS, www.gromacs.org
•NAMD, www.ks.uiuc.edu/Research/namd/
Other typical components in scientific computing applications
•Dense and sparse matrices, linear equations
•PDE (“Partial Differential Equations”, multi-grid methods)
•N-body problems
•…
[M. Snir: “A Note on N-Body Computations with Cutoffs”. Theory Comp. Syst. 37(2): 295-318, 2004]
Many (parallel) support libraries:
•BLAS -> LAPACK -> ScaLAPACK
•Intel’s MKL (Math Kernel Library)
•MAGMA/PLASMA
•FLAME/Elemental/PLAPACK [R. van de Geijn]
•PETSc (“Portable, Extensible Toolkit for Scientific Computation”)
Trends in High-Performance Computing Architectures
by looking at Top500 list, www.top500.org
Ranks supercomputer performance by LINPACK benchmark (HPL), updated twice yearly (June, ISC Germany; November ACM/IEEE Supercomputing)
Serious background of Top500: Benchmarking to evaluate (super)computer performance
In HPC: often based on one single benchmark, High Performance LINPACK (HPL) that solves a system of linear equations under some specified constraints (minimum number of operations)
HPL performs well (high computational efficiency) on many architectures; benchmark allows a wide range of optimizations
HPL is less demanding on communication performance
HPL does not give a balanced view of “overall” system performance or capabilities
HPL is politically important… (much money lost because of HPL…)
Performance measured in FLOPS (Floating Point Operations per Second)
Floating Point: 64-bit IEEE Floating Point number (32-bits often too little)
FLOPS
M(ega)FLOPS 10^6
G(iga)FLOPS 10^9
T(era)FLOPS 10^12
P(eta)FLOPS 10^15
E(xa)FLOPS 10^18
Z(etta)FLOPS 10^21
Y(otta)FLOPS 10^24
Computing system peak Floating Point Performance (Rpeak):
Rpeak = ClockFrequency x #FLOP/Cycle x #CPUs x #Cores/CPU
Optimistic, best case upper bound
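As a worked instance of the peak-performance product, the sketch below multiplies out the four factors. The function name `rpeak_gflops` is hypothetical, and the sample decomposition is an assumption: a 0.5 GHz clock with 16 FLOP per cycle is one way to arrive at the 8 GFLOPS per processor quoted later in this transcript for the Earth Simulator's 5120 vector processors.

```c
#include <assert.h>

/* Theoretical peak in GFLOPS:
   clock (GHz) x FLOP/cycle x #CPUs x #cores/CPU.
   This is an upper bound only; no real code sustains it. */
double rpeak_gflops(double clock_ghz, int flop_per_cycle,
                    int cpus, int cores_per_cpu)
{
    assert(clock_ghz > 0.0 && flop_per_cycle > 0);
    assert(cpus > 0 && cores_per_cpu > 0);
    return clock_ghz * flop_per_cycle * cpus * cores_per_cpu;
}
```

Under those assumed inputs, `rpeak_gflops(0.5, 16, 5120, 1)` evaluates to 40960 GFLOPS, which agrees with the Rpeak value listed for the Earth Simulator in the June 2002 table.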
Linpack performance
Rmax: FLOPS measured by solving large Linpack instance
Nmax: Problem size for reaching Rmax
N/2: Problem size for reaching Rmax/2
Rpeak: System Peak Performance as computed by owner
Number of double precision floating point operations needed for solving the linear system must be (at least) 2/3 n^3 + O(n^2)
Excludes
•Strassen and other “fast” matrix-matrix multiplication methods
•Algorithms that compute with less than 64-bit precision
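To make the operation count concrete, here is a minimal sketch that turns a problem size and a measured run time into a FLOPS rate. It keeps only the leading 2/3 n^3 term and drops the O(n^2) term, which is a simplification made here and not the exact accounting used by the benchmark; the function names are hypothetical.

```c
#include <assert.h>

/* Leading term of the operation count for an n x n system: (2/3) n^3.
   The O(n^2) lower-order term is deliberately ignored in this sketch. */
double hpl_flop_leading(double n)
{
    assert(n > 0.0);
    return 2.0 / 3.0 * n * n * n;
}

/* Achieved rate in GFLOPS for a solve of size n that took `seconds`. */
double hpl_gflops(double n, double seconds)
{
    assert(seconds > 0.0);
    return hpl_flop_leading(n) / seconds / 1e9;
}
```

For example, n = 3000 gives about 1.8 x 10^10 operations, so a solve that takes 18 seconds corresponds to roughly 1 GFLOPS.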
[Figure: Top500 list, November 2015, with curves for the #1 system and the #500 system]
What are the systems at the jumps?
HPL is politically important… (much money lost because of HPL…)
HPL is used to make projections on supercomputing performance trends (as Moore’s “Law”)
HPL is partly (to a large extent?) a driver for supercomputing “performance” development: It is very hard to defend building a system that will not rank highly on Top500
Towards Exascale: PetaFLOPS was achieved in 2008, ExaFLOPS expected ca. 2018-2020
November 2015
2018/19 ExaFlop prediction will not hold
Other HPC systems benchmarks
HPL (TOP500) www.top500.org
HPCC, www.hpcchallenge.org
STREAM www.cs.virginia.edu/stream/
NASPAR (LU, QR factorizations)
Graph search (BFS): Graph500 www.graph500.org
Energy consumption/efficiency: Green500 www.green500.org
Important to know: Many research papers use these benchmarks for evaluation (which may or may not be fair)
Using top500: Broad trends
•Very early days: single-processor supercomputers (vector)
•After ‘94, all supercomputers are parallel computers
•Earlier days: custom-made, unique – highest performance processor + highest performance network
•Later days, now: custom convergence, weaker standard processors, but more of them, weaker networks (InfiniBand, Tori, …)
•Recent years: accelerators (again): GPUs, FPGA, MIC, …
Example: the Earth Simulator 2002-2004 (#1)
June 2002, Earth Simulator

| System | Vendor | Cores | Rmax (GFLOPS) | Rpeak (GFLOPS) | Power (kW) |
|---|---|---|---|---|---|
| Earth-Simulator | NEC | 5120 | 35860.00 | 40960.00 | 3200.00 |

Rmax: Performance achieved on HPL
Rpeak: “Theoretical Peak Performance”, best case, all processors busy
Power: processors only?
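Two ratios are commonly read off such a table row: the fraction of peak achieved on HPL, and the HPL performance per unit of power. The sketch below computes both; the function names are hypothetical, and the power figure is taken at face value even though the slide itself questions what it covers.

```c
#include <assert.h>

/* Fraction of the theoretical peak that was achieved on HPL. */
double hpl_efficiency(double rmax, double rpeak)
{
    assert(rpeak > 0.0);
    return rmax / rpeak;
}

/* HPL performance per unit of power. With rmax in GFLOPS and power
   in kW the result is GFLOPS/kW, numerically equal to MFLOPS/W. */
double perf_per_power(double rmax, double power)
{
    assert(power > 0.0);
    return rmax / power;
}
```

With the Earth Simulator values (Rmax 35860, Rpeak 40960, power 3200), this gives an efficiency of about 0.875 and about 11.2 GFLOPS per kW.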
Power supply
•~40TFLOPS
•5120 vector processors
•8 (NEC SX6) processors per node
•640 nodes, 640x640 full crossbar interconnect
BUT: Energy expensive
•~15MW
Earth Simulator 2 (2009) only vector system on Top500
Vector processor can operate on (long) vectors instead of on scalars only
Prototypical SIMD/data parallel architecture
Peak performance: 8GFlops (with all vector pipes active)
256 element (double/long) vectors
[G. Blelloch: “Vector Models for Data Parallel Computing”, MIT Press, 1990]
Pioneered by Cray; other vendors Convex, Fujitsu, NEC, …
Simple “data parallel (SIMD) loop”, no dependencies
int a[n], b[n], c[n];
double x[n], y[n], z[n];
double xx[n], yy[n], zz[n];
for (i=0; i<n; i++) {
  a[i] = b[i]+c[i];
  x[i] = y[i]+z[i];
  xx[i] = (yy[i]*zz[i])/xx[i];
}
n independent operations broken down into n/v operations on v-element vectors (v=256, e.g.)
Roughly translates to:
for (i=0; i<n; i+=v) {
  vadd(a+i,b+i,c+i);    /* integer vector add */
  vdadd(x+i,y+i,z+i);   /* double vector add */
  vdmul(t,yy+i,zz+i);   /* t: temporary vector */
  vddiv(xx+i,t,xx+i);   /* double vector divide */
}
Can keep both integer and floating point pipes busy
n>>v: iteration i can prefetch vector for iteration i+v
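The breakdown into n/v vector operations can be sketched in plain C as follows. This is only an illustration: `vdadd` here is an ordinary function standing in for a single vector instruction, the vector length `V` is fixed to the v=256 of the example, and the handling of a final shorter strip is an addition that the slide's pseudocode leaves out.

```c
#include <assert.h>
#include <stddef.h>

#define V 256  /* assumed vector length, as in the v=256 example */

/* Stand-in for one vector instruction: adds len <= V elements. */
static void vdadd(double *x, const double *y, const double *z, size_t len)
{
    assert(len <= V);
    for (size_t k = 0; k < len; k++)
        x[k] = y[k] + z[k];
}

/* Strip-mined x[i] = y[i] + z[i]: full V-element vector operations,
   plus one shorter operation when V does not divide n. */
void strip_mined_add(double *x, const double *y, const double *z, size_t n)
{
    for (size_t i = 0; i < n; i += V) {
        size_t len = (n - i < V) ? n - i : V;
        vdadd(x + i, y + i, z + i, len);
    }
}
```

Calling `strip_mined_add(x, y, z, 600)` issues two full 256-element operations followed by one 88-element operation.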
Large n means long sequences of instructions with no branches: deep pipelines are viable
Each pipe of each vector unit produces a result in every cycle
But: Sufficient memory bandwidth must be available!
High memory bandwidth achieved by organizing memory into banks (NEC SX-6: 2K banks)
Element i, i+1, i+2, … in different banks, element i and i+2K in same bank: bank conflict, expensive because of serialization
32 Memory units, each with 64 banks
Special communication processor (RCU) directly connected to memory system
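The effect of the access pattern on the banks can be explored with a small model. The sketch assumes the simplest interleaving, element index modulo the number of banks, which is consistent with the statement that elements i and i+2K share a bank but is not confirmed here as the actual SX-6 address mapping. The function names are hypothetical.

```c
#include <assert.h>

#define NBANKS 2048  /* 32 memory units x 64 banks = 2K banks */

/* Assumed interleaving: consecutive elements go to consecutive banks. */
unsigned bank_of(unsigned long index)
{
    return (unsigned)(index % NBANKS);
}

/* Number of distinct banks touched by `len` accesses that start at
   element 0 and advance by `stride` elements. Fewer distinct banks
   than accesses means bank conflicts, i.e. serialized accesses. */
unsigned banks_touched(unsigned long stride, unsigned len)
{
    static unsigned char seen[NBANKS];
    unsigned count = 0;
    for (unsigned b = 0; b < NBANKS; b++)
        seen[b] = 0;
    for (unsigned k = 0; k < len; k++) {
        unsigned b = bank_of(k * stride);
        if (!seen[b]) {
            seen[b] = 1;
            count++;
        }
    }
    return count;
}
```

In this model a 256-element access with stride 1 touches 256 different banks, stride 16 touches only 128, and stride 2048 hits the same bank every time.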
Banked memories also found in GPUs
[Harris, Sengupta, Owens: “Parallel Prefix Sum (Scan) with CUDA”, 2007]
Vectorizable loop structures
Simple loop, integer (long) and floating point operations
for (i=0; i<n; i++) {
  a[i] = b[i]+c[i];
}
DAXPY, fused multiply add
for (i=0; i<n; i++) {
  a[i] = a[i]+b[i]*c[i];
}
for (i=0; i<n; i++) {
  if (cond[i]) a[i] = b[i]+c[i];
}
Conditional execution handled by masking
Roughly translates to:
for (i=0; i<n; i++) {
  R[i] = b[i]+c[i];
  MASK[i] = cond[i];
  if (MASK[i]) a[i] = R[i];
}
MASK special register for conditional store, R temporary register
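A scalar C rendering of the masking idea, given as a sketch only: the sum is computed for every element, and a mask then selects which value is stored. Unlike a true masked store, this version rewrites `a[i]` with its old value where the mask is off; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Branch-free form of: if (cond[i]) a[i] = b[i] + c[i];
   The sum is computed unconditionally; the mask picks the stored value. */
void masked_add(int *a, const int *b, const int *c,
                const int *cond, size_t n)
{
    assert(a && b && c && cond);
    for (size_t i = 0; i < n; i++) {
        int r = b[i] + c[i];        /* R: temporary, always computed */
        int mask = (cond[i] != 0);  /* MASK: 1 where the store applies */
        a[i] = mask ? r : a[i];     /* select instead of branch */
    }
}
```

Because every iteration now executes the same instruction sequence, the loop body has the regular shape that vector hardware needs.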
Gather/Scatter operations
#pragma vdir vector,nodep
for (i=0; i<n; i++) {
  a[ixa[i]] = b[ixb[i]]+c[ixc[i]];
}
Compiler may need help, dependency analysis not sufficient
Can cause bank conflicts
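Why the compiler needs help here can be seen from the scatter side: if two iterations write the same target index, the result depends on execution order, and the index contents are unknown at compile time. The sketch below shows the loop as a function together with a runtime check that the scatter targets are pairwise distinct. The check is a hypothetical helper for illustration; it covers only repeated targets and says nothing about `a` overlapping `b` or `c`.

```c
#include <assert.h>
#include <stdlib.h>
#include <stddef.h>

/* Gather from b and c through index vectors, scatter the sums into a. */
void gather_add_scatter(double *a, const size_t *ixa,
                        const double *b, const size_t *ixb,
                        const double *c, const size_t *ixc, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[ixa[i]] = b[ixb[i]] + c[ixc[i]];
}

/* Returns 1 if the n scatter targets in ixa (all < m) are pairwise
   distinct, 0 if a target repeats, -1 if scratch allocation fails. */
int scatter_targets_distinct(const size_t *ixa, size_t n, size_t m)
{
    unsigned char *seen = calloc(m ? m : 1, 1);
    int distinct = 1;
    if (!seen)
        return -1;
    for (size_t i = 0; i < n && distinct; i++) {
        assert(ixa[i] < m);
        if (seen[ixa[i]])
            distinct = 0;
        else
            seen[ixa[i]] = 1;
    }
    free(seen);
    return distinct;
}
```

A programmer who knows from the application that the targets never repeat can state this with a directive such as the `nodep` shown on the slide, instead of paying for a check like this at run time.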
Prefix-sums
#pragma vdir vector
for (i=1; i<n; i++) {
  a[i] = a[i-1]+a[i];
}
Min/max operations
min = a[0];
#pragma vdir vector
for (i=0; i<n; i++) {
  if (a[i]<min) min = a[i];
}
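The prefix-sums loop carries a dependency from iteration i-1 to iteration i, so it cannot be split into independent element operations as written. One standard way to expose vector-sized independent work is a log-step scan, sketched below. This is a swapped-in textbook technique for illustration, not a description of what the `vdir` directive makes the NEC compiler generate, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* In-place inclusive prefix sums in about log2(n) passes. In pass d,
   every a[i] with i >= d adds the value a[i-d] held before the pass;
   all updates within one pass are independent of each other. */
void prefix_sums_logstep(long *a, size_t n)
{
    assert(a != NULL || n == 0);
    for (size_t d = 1; d < n; d *= 2) {
        /* Descending i reads a[i-d] before this pass overwrites it. */
        for (size_t i = n; i-- > d; )
            a[i] += a[i - d];
    }
}
```

The price is more additions in total, on the order of n log n instead of n, in exchange for passes that consist of long independent vector operations.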
Strided access
#pragma vdir vector,nodep
for (i=0; i<n; i++) {
  a[s*i] = b[s*i]+c[s*i];
}
Can cause bank conflicts, some strides always bad
Large vector processors for High-Performance Computing currently out of fashion, almost non-existent
NEC SX-8 (2005), NEC SX-9 (2008), NEC SX-ACE (2013)
For a while, no NEC vector processors
Small scale vectorization
•MMX, SSE, AVX… (128 bit vectors, 256 bit vectors)
•Intel MIC/Xeon Phi: 512 bit vectors, new, special vector instructions (2013: compiler support not yet mature)
2, 4, 8 Floating Point operations simultaneously by one vector instruction (no integers?)
High performance on standard processors:
•Be aware of/exploit vectorization potential
•Check whether loops were indeed vectorized (gcc -ftree-vectorizer-verbose=n …, in combination with architecture specific optimizations)
Many scientific codes fit well with vector model; irregular, non-numerical code often not
Mature compiler technology for vectorization and optimization (loop splitting, loop fusion – to keep vector pipes busy)
Standard textbook: [Allen, Kennedy: “Optimizing Compilers for Modern Architectures”, MKP 2002]
Programming model: loops, sequential control flow, compiler handles parallelism (implicit) by vectorizing loops (some help from programmer)
Scalar (non-vectorizable) code carried out by standard, scalar processor; limits performance (Amdahl’s Law)
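Amdahl's Law, named above as the limit imposed by scalar code, can be made concrete with a short calculation. In this sketch `f` is the fraction of the run time that vectorizes and `s` is the speedup of that fraction; both names, and the function name, are chosen here for illustration.

```c
#include <assert.h>

/* Amdahl's Law: overall speedup when a fraction f of the run time
   is sped up by a factor s and the remaining 1 - f is unchanged. */
double amdahl_speedup(double f, double s)
{
    assert(f >= 0.0 && f <= 1.0);
    assert(s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}
```

With 90% of the time vectorizable and a vector speedup of 10, the overall speedup is only about 5.3, and no vector speedup, however large, can push it past 10.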
Explicit Parallelism
•8-way SMP (8 vector processors per shared-memory node)
•Not cache-coherent
•Nodes connected by full crossbar
2-level explicit parallelism:
•Intra-node with shared-memory communication
•Inter-node with communication over crossbar
Lack of cache-coherence: Earth Simulator/NEC SX
•Scalar unit has cache, caches of scalar units on node not coherent
•Vector units read/write directly to memory, no caches
•Write-through cache
•Nodes must coordinate and synchronize
•Parallel programming model (OpenMP, MPI) can help
Aside: Cray X1 (vector computer, early 2000s) had a different, cache-coherent design
[D. Abts, S. Scott, D. J. Lilja: “So Many States, So Little Time: Verifying Memory Coherence in the Cray X1”, IPDPS 2003: 11]
Example: MPI and cache non-coherence
Processes i and j on same node
Rank i: MPI_Send(&x,…,comm);
Rank j: MPI_Recv(&y,…,comm,&status);
[Figure: x in memory of rank i is written by a vectorized memcpy into y in memory of rank j; the copy of y in the cache of j is left in an incoherent state]
Coherency/consistency needed after MPI_Recv: rank j must invalidate cache(lines) at the point where MPI requires coherence (at MPI_Recv)
Coherency/consistency needed after MPI_Recv: clear_cache instruction invalidates all cache lines
Expensive: 1) clear_cache itself; 2) all cached values lost!
Further complication with MPI: structured data/data types; the address &y alone does not tell where the data are
Example: OpenMP and cache non-coherence
#pragma omp parallel for
for (i=0; i<n; i++) {
  x[i] = f(y[i]);
}
Sequential region: All x[i]’s visible to all threads
OpenMP: All regions (parallel, critical, …) require memory in a consistent state (caches coherent); explicit flush/fence constructs to force visibility
Observation: Higher-level programming models may alleviate need for low-level, fine-grained cache coherency.
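The OpenMP loop from this example can be written as a complete function. This is a minimal sketch: `f` is a hypothetical stand-in for the unspecified function on the slide, and the function name `apply_f` is likewise an assumption. The point is that the programmer writes no cache management; the implicit barrier at the end of the parallel loop, which includes a flush, is where the runtime has to make all writes visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical f; the slide does not say what f computes. */
static double f(double v)
{
    return 2.0 * v + 1.0;
}

/* x[i] = f(y[i]) for all i, with iterations shared among threads.
   Without OpenMP support the pragma is ignored and the loop simply
   runs sequentially with the same result. */
void apply_f(double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        x[i] = f(y[i]);
    /* Implicit barrier and flush here: from this point on, every
       x[i] is visible to the code that follows. */
}
```

On a machine without hardware coherence, an OpenMP implementation can concentrate the expensive invalidations at such synchronization points instead of keeping caches coherent on every access.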
Cache debate
Caches (idea: exploit temporal and spatial locality in applications) have been a major factor in single-processor performance increase (since sometime in the 1980s)
Many new challenges for caches in parallel processors:
•Coherency
•Scalability
•Resource consumption (logic=transistors=chip area; energy)
•…
[Milo M. K. Martin, Mark D. Hill, Daniel J. Sorin: Why on-chip cache coherence is here to stay. Commun. ACM 55(7): 78-89 (2012)]
MPI and OpenMP
Still most widely used programming interfaces/models for parallel HPC (there are contenders)
MPI: Message-Passing Interface, see www.mpi-forum.org
•MPI processes (ranks) communicate explicitly: point-to-point-communication, one-sided communication, collective communication, parallel I/O
•Subgrouping and encapsulation (communicators)
•Much support functionality
OpenMP: shared-memory interface (C/Fortran pragma-extension), data (loops) and task parallel support; see www.openmp.org
Partitioned Global Address Space (PGAS): alternative to MPI
Addressing mechanism by which part of the processor-local address space can be shared between processes; referencing non-local parts of the partitioned space leads to implicit communication
Language or library supported: Some data structures (typically arrays) can be declared as shared (partitioned) across (all) threads
Note: PGAS not same as Distributed Shared Memory (DSM). PGAS explicitly controls which data structures (arrays) are partitioned, and to some extent how they are partitioned
PGAS: Data structures (simple arrays) partitioned (shared) over the memory of p threads
[Figure: global array a divided into blocks, with the blocks owned by thread k marked]
Each block of global array in local memory of some process/thread
Simple, block cyclic distribution of array a
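A block-cyclic distribution can be pinned down by two index computations: which thread owns a global index, and where that element sits in the owner's local memory. The sketch below assumes the usual round-robin dealing of fixed-size blocks; the block size `b`, the thread count `p`, and the function names are illustrative choices, not taken from a specific PGAS language.

```c
#include <assert.h>
#include <stddef.h>

/* Owner of global index i when blocks of b consecutive elements
   are dealt out round-robin to p threads. */
size_t owner(size_t i, size_t b, size_t p)
{
    assert(b > 0 && p > 0);
    return (i / b) % p;
}

/* Position of global index i within the owner's local part. */
size_t local_index(size_t i, size_t b, size_t p)
{
    assert(b > 0 && p > 0);
    return (i / (b * p)) * b + i % b;
}
```

With b = 2 and p = 3, indices 0 and 1 belong to thread 0, indices 2 and 3 to thread 1, indices 4 and 5 to thread 2, and indices 6 and 7 wrap around to thread 0 again, where they occupy local positions 2 and 3.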
Thread k:
b = a[i];
a[j] = b;
entails communication if index i or index j is not owned by thread k
Memory model: Defines when update becomes visible to other threads
Global array(s):
[Figure: global array a divided into blocks; thread k owns some of the blocks]
Thread k:
a[i] = b[j];
even if neither a[i] nor b[j] owned by k
Memory model: Defines when update becomes visible to other threads
Global array(s): partitioned (shared) over the memory of p threads
[Figure: global array a divided into blocks; thread k owns some of the blocks]
forall(i=0; i<n; i++) {
  a[i] = f(x[i]);
}
Owner computes rule: Thread k performs updates only on the elements (indices) owned by/local to k
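The owner computes rule can be sketched in plain C as the loop that thread k would execute. This is only an illustration under assumptions: a block-cyclic distribution with block size `b`, ordinary local arrays standing in for the shared arrays, and the hypothetical names `owner_computes` and `sample_f`. In a real PGAS program the read of `x[i]` might itself be a remote access.

```c
#include <assert.h>
#include <stddef.h>

/* Example element function, used only for illustration. */
double sample_f(double v) { return v * v; }

/* Work of thread k (0 <= k < p) in "forall i: a[i] = f(x[i])":
   visit only the blocks of size b that k owns. Returns the number
   of elements this thread updated. */
size_t owner_computes(int k, int p, size_t b, size_t n,
                      double *a, const double *x, double (*f)(double))
{
    assert(p > 0 && k >= 0 && k < p && b > 0);
    size_t updated = 0;
    for (size_t start = (size_t)k * b; start < n; start += (size_t)p * b) {
        size_t end = (start + b < n) ? start + b : n;
        for (size_t i = start; i < end; i++) {
            a[i] = f(x[i]);   /* a[i] is local to k by construction */
            updated++;
        }
    }
    return updated;
}
```

Running the function once for each k in 0..p-1 updates every index exactly once, since the owned index sets are disjoint and cover 0..n-1.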
Typical PGAS features:
•Array assignments/operations translated into communication when necessary based on ownership
•Mostly simple, block-cyclic distributions of (multi-dimensional) arrays
•Collective communication support for redistribution, collective data transfer (transpositions, gather/scatter) and reduction-type operations
•Bulk-operations, array operations
Even more extreme: SIMD array languages, array operations parallelized by library and runtime
Often less support for library building (process subgroups) than MPI
Some PGAS languages/interfaces:
•UPC: Unified Parallel C, C language extension; collective communication support; severe limitations
•CaF: Co-array Fortran, standardized, but limited PGAS extension to Fortran
•CAF2: considerably more powerful, non-standardized Fortran extension
•X10 (Habanero): IBM asynchronous PGAS language
•Chapel: Cray, powerful data structure support
•Titanium: Java-extension
•Global Arrays (GA): older, PGAS-like library for array programming, see http://hpc.pnl.gov/globalarrays/
HPF: High-Performance Fortran
Back to the Earth Simulator: Interconnect
Full crossbar:
•Each node has a direct link (cable) to each other node
•Full bidirectional communication over each link
•All pairs of nodes can communicate simultaneously without having to share bandwidth
•Processors on node share crossbar bandwidth
•BUT: 12.6 Gbyte/s BW vs. 64 GFLOPS/node
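As a rough worked figure for the bandwidth versus FLOPS remark, taking the two numbers from the slide at face value and assuming 8-byte double-precision operands (an assumption not stated on the slide), the communication balance of a node is roughly:

```latex
\[
\frac{12.6\ \text{GByte/s}}{64\ \text{GFLOPS}} \approx 0.2\ \text{Byte/FLOP},
\qquad
\frac{64\ \text{GFLOPS}}{12.6\ \text{GByte/s}} \times 8\ \text{Byte}
\approx 41\ \text{FLOP per communicated double}
\]
```

So even with a full crossbar, a node would have to perform on the order of 40 floating-point operations per double sent or received to stay compute bound.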
Hierarchical/Hybrid communication subsystems
•Processors on same node “closer” than processors on different nodes – nodes can be organized in a hierarchy
•Different communication media within nodes (e.g., shared-memory) and between nodes (e.g., crossbar network)
•Processors on same node share bandwidth of inter-node network
[Figure: four nodes, each with a shared memory M and four processors P, connected by a communication network]
Hierarchical/Hybrid communication subsystems
[Figure: four nodes, each with a shared memory M and four processors P, connected by a communication network]
Many more hierarchy levels:
•Processors have cache (and memory) hierarchy: L1 (data/instruction) -> L2 -> L3 (…)
•Processors (multi-core) share caches at certain levels (e.g., AMD and Intel differ)
•Network may itself be hierarchical (Clos/fat tree: InfiniBand)
•Vector/accelerators
•…
“Pure”, homogeneous programming models typically oblivious to hierarchy
•MPI (no performance model, only indirect mechanisms for grouping processes according to system structure: MPI topologies)
•UPC (local/global, no grouping at all)
•…
Implementation challenge for compiler/library implementer to take hierarchy into account:
•Point-to-point communication uses closest path, e.g. shared memory when possible
•Efficient, hierarchical collective communication algorithms exist (for some cases, still incomplete and immature)
“Pure”, homogeneous programming models typically oblivious to hierarchy
Application programmer relies on language/library to efficiently exploit system:
•Portability!
•Performance portability?! All library/language functions give good performance on (any) given system, thus an application whose performance is dominated by library/language functions will perform predictably when ported to another system
Sensible to analyse performance in terms of collective operations (building blocks), e.g., $T(n,p) = T_{\mathrm{Allreduce}}(p) + T_{\mathrm{Alltoall}}(n) + T_{\mathrm{Bcast}}(np) + O(n)$
Hybrid/heterogeneous programming models
•Conscious of certain aspects/levels of hierarchy
•Possibly more efficient application code:
•Example: MPI+OpenMP
•Less portable, less performance portable
•Sometimes unavoidable (accelerators): OpenCL, OpenMP, OpenACC, …
[Figure: four nodes, each with a shared memory M and four processors P, connected by a communication network; OpenMP within each node, MPI between master threads]
Earth Simulator 2/SX-9, 2009
•More pipes
•Special pipes (square root)
Peak performance >100GFLOPS/processor
Earth Simulator 2/SX-9 system
| Quantity | Value |
|---|---|
| Peak performance/CPU | 102.4 Gflops |
| CPUs/PN | 8 |
| Peak performance/PN | 819.2 Gflops |
| Shared memory/PN | 128 GByte |
| Total number of PNs | 160 |
| Total number of CPUs | 1280 |
| Total peak performance | 131 Tflops |
| Total main memory | 20 TByte |
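The totals follow from the per-CPU and per-node (PN) figures; a quick consistency check, assuming 1 TByte = 1024 GByte for the memory total:

```latex
\begin{align*}
8 \times 102.4\ \text{Gflops} &= 819.2\ \text{Gflops per PN} \\
160 \times 8 &= 1280\ \text{CPUs} \\
160 \times 819.2\ \text{Gflops} &= 131072\ \text{Gflops} \approx 131\ \text{Tflops} \\
160 \times 128\ \text{GByte} &= 20480\ \text{GByte} = 20\ \text{TByte}
\end{align*}
```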
Cheaper communication network than full crossbar: Fat Tree
Fat Tree: Indirect (multi-stage), hierarchical network
[Figure: binary tree network; processors P in pairs at the leaves, switch nodes N at the inner levels up to a single root]
Tree network, max 2 log p “hops” between processors, p-1 “wires”
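The 2 log p bound on hops can be illustrated with a small sketch. It assumes, for illustration only, that p is a power of two, that processors are the leaves of a complete binary tree numbered 0..p-1 from left to right, and that a "hop" is one link on the path up to the lowest common ancestor and down again. The function name `tree_hops` is hypothetical.

```c
#include <assert.h>

/* Number of links on the tree path between leaf processors i and j.
   The path climbs until both leaves lie in the same subtree, then
   descends the same number of levels. */
int tree_hops(unsigned i, unsigned j)
{
    int levels = 0;
    unsigned diff = i ^ j;      /* highest differing bit fixes the common ancestor */
    while (diff != 0) {
        diff >>= 1;
        levels++;
    }
    return 2 * levels;          /* up 'levels' links, down 'levels' links */
}
```

For p = 8 the worst case is between processors 0 and 7: 6 hops, which is 2 log2 8.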
Fat-tree: typical, indirect (multi-stage), hierarchical network
[Figure: tree network with processors P at the leaves and switch nodes N above; links become “fatter” towards the root]
Bandwidth increases, “fatter” wires
[C. E. Leiserson: Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Trans. Computers 34(10): 892-901, 1985]
Fat-tree: typical, indirect (multi-stage), hierarchical network
[Figure: tree network with processors P at the leaves and switch nodes N above]
Thinking Machines CM5, on first, unofficial Top500
[C. E. Leiserson: Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Trans. Computers 34(10): 892-901, 1985]
Fat-tree: typical, indirect (multi-stage), hierarchical network
[Figure: processors P at the leaves; upper levels of the tree built from several small switch nodes N]
Realization with N small crossbar switches
Example: InfiniBand
Example: the Blue Genes, 2004 (#1)
November 2004, Blue Gene/L
| System | Vendor | Cores | Rmax (GFLOPS) | Rpeak (GFLOPS) |
|---|---|---|---|---|
| BlueGene/L DD2 beta-System (0.7 GHz PowerPC 440) | IBM | 32768 | 70720.00 | 91750.00 |
Large number of cores (2012: 1572864 – Sequoia system), weaker cores, limited memory per core/node
IBM Blue Gene L
•~200.000 processing cores
•256MBytes to 1G/core
Note: Not possible to locally maintain state of whole system, 256MBytes/200.000 ~ 1KBytes
•Applications that need to maintain state information for each other process in trouble
•Libraries (e.g., MPI) that need to maintain state information for each process in (big) trouble
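The 1 KByte estimate above is a plain division; a tiny sketch makes the bound explicit. The function name `state_bytes_per_process` is an illustrative assumption, and the figure is an upper bound that would consume the entire local memory for bookkeeping.

```c
#include <assert.h>

/* Bytes of local memory available per peer if a process keeps some
   state for every process in the system (integer division, rounds down). */
unsigned long state_bytes_per_process(unsigned long local_memory_bytes,
                                      unsigned long processes)
{
    assert(processes > 0);
    return local_memory_bytes / processes;
}
```

With 256 MBytes (taken as 256 * 1024 * 1024 bytes) and 200000 processes this gives 1342 bytes per peer, in line with the "~ 1KBytes" on the slide.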
•“slow” processors, 700-800MHz
•Simpler processors, limited out-of-order, branch-prediction
•BG/L: 2-core, not cache-coherent
•BG/P: 4-core, cache-coherent
•BG/Q: ?
•Very memory constrained (512MB to 4GB/node)
•Simple, low-bisection 3d-torus network
Energy efficient, heavily present on Green500
[Figure: processors P arranged in a 4 x 4 grid]
Note: Torus is not hierarchical
Example: the Road Runner, 2008 (#1)
First PetaFLOP system, seriously accelerated
Decommissioned 31.3.2013
November 2008, Road Runner
| System | Vendor | Cores | Rmax (TF) | Rpeak (TF) | Power (KW) |
|---|---|---|---|---|---|
| BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 Ghz / Opteron DC 1.8 GHz, Voltaire Infiniband | IBM | 129600 | 1105.0 | 1456.7 | 2483.00 |
What counts as a “core”?
3240 Nodes
•2x2-core AMD processors
•2 IBM Cell Broadband Engine (CBE)
•InfiniBand interconnect (single rail, 288 port IB switch)
[Figure: node attached to the InfiniBand interconnect]
Note:
•Node performance: ~600GFLOPS
•Communication Bandwidth/node: few Gbytes/s
25,6GByte/s Total BW>300GByte/s
Standard IBM scalar PowerPC architecture
Multiple ring network with atomic operations
~total 250GFLOPS
25.6 GFlops (32-bit)
•SIMD (128-bit vectors, 4 32-bit words)
•Single-issue, no out-of-order capabilities, limited (no?) branch prediction
Small local storage, 256KB, no cache
Complex, heterogeneous system: complex programming model (?)
•MPI communication between nodes, either all processors per node or one process per node
•Possibly OpenMP/shared memory model on nodes
•Offload to CBE of compute-intensive kernels
•CBE programming: PPE/SPE, vectorization, explicit communication between SPE’s, PPE, node-memory
Possibly suited to very (very) compute intensive applications
Performance model (extended Roofline) would (help) tell which
MPI communication
•Let the SPEs of the Cell be full-fledged MPI processes
•Offload to CPUs as needed/possible
[Pakin et al.: The reverse-acceleration model for programming petascale hybrid systems. IBM J. Res. and Dev., (5): 8, 2009]
Drawbacks:
•Latency high (SPE -> PPE -> CPU -> IB)
•Supports only subset of MPI
Example: the Fujitsu K Computer, 2011 (#1)
June 2011, K-Computer
| System | Vendor | Cores | Rmax (TF) | Rpeak (TF) | Power (KW) |
|---|---|---|---|---|---|
| K computer, SPARC64 VIIIfx 2.0GHz, Tofu interconnect | Fujitsu | 548352 | 8162.0 | 8773.6 | 9898.56 |
•High-end, multithreaded, scalar processor (SPARC64 VIIIfx)
•Many special instructions
•6-dimensional torus
[Yuichiro Ajima, Tomohiro Inoue, Shinya Hiramoto, Yuzo Takagi, Toshiyuki Shimizu: The Tofu Interconnect. IEEE Micro 32(1): 21-31 (2012)]
[Yuichiro Ajima, Shinji Sumimoto, Toshiyuki Shimizu: Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers. IEEE Computer 42(11): 36-40 (2009)]
Examples: Other accelerator-based systems
November 2013, TianHe-2
| System | Vendor | Cores | Rmax (TF) | Rpeak (TF) | Power (KW) |
|---|---|---|---|---|---|
| TH-IVB-FEP Cluster, Intel Xeon E5-2692 12C 2.200GHz, TH Express-2, Intel Xeon Phi 31S1P | NUDT | 3,120,000 | 33,862.7 | 54,902.4 | 17,808.00 |
November 2012, Cray Titan
| System | Vendor | Cores | Rmax (TF) | Rpeak (TF) | Power (KW) |
|---|---|---|---|---|---|
| Cray XK7, Opteron 6274 16C 2.200GHz, Cray Gemini interconnect, NVIDIA K20x | Cray | 560640 | 17590.0 | 27112.5 | 8209.00 |
November 2012, Stampede (#7)
| System | Vendor | Cores | Rmax | Rpeak | Power (KW) |
|---|---|---|---|---|---|
| PowerEdge C8220, Xeon E5-2680 8C 2.700GHz, Infiniband FDR, Intel Xeon Phi | Dell | 462462 | 5,168.1 | 8,520.1 | 4,510.00 |
November 2010, Tianhe
| System | Vendor | Cores | Rmax | Rpeak | Power |
|---|---|---|---|---|---|
| NUDT TH MPP, X5670 2.93Ghz 6C, NVIDIA GPU, FT-1000 8C | NUDT | 186368 | 2566.0 | 4701.0 | 4040.00 |
Hybrid architectures with accelerator support (GPU, MIC)
•High-performance and low energy consumption through accelerators
•GPU accelerator: highly parallel “throughput architecture”, lightweight cores, complex memory hierarchy, banked memory
•MIC accelerator: lightweight x86 cores, extended vectorization, ring-network on chip
Hybrid architectures with accelerator support (GPU, MIC)
Issues with accelerator: currently (2013) limited on-chip memory (MIC 8GByte), PCIe connection to main processor
Programming: Kernel offload, explicitly with OpenCL/CUDA
MIC: some “reverse acceleration” projects, MPI between MIC cores
Accelerators for Exascale?
Energy consumption and cooling perceived as biggest obstacles for ExaFLOPS
Energy consumed in
•Processor (heat, leak)
•Memory system
•Interconnect
“Solution”: Massive amount of simple, low-frequency processors; weak(er) interconnects; deep memory hierarchy
Run-of-the-mill
VSC-2, June 2011, November 2012: #162
| System | Vendor | Cores | Rmax | Rpeak | Power |
|---|---|---|---|---|---|
| Megware Saxonid 6100, Opteron 8C 2.2 GHz, Infiniband QDR | Megware | 20776 | 152.9 | 182.8 | 430.00 |
Similar to TU Wien, Parallel Computing group “jupiter”
Run-of-the-mill
VSC-3 November 2014 #85; November 2015 #138
| System | Vendor | Cores | Rmax | Rpeak | Power |
|---|---|---|---|---|---|
| Oil blade server, Intel Xeon E5-2650v2 8C 2.6GHz, Intel TrueScale Infiniband | Cluster Vision | 32,768 | 596.0 | 681.6 | 450.00 |
•Innovative oil cooling
•Dual link InfiniBand
Memory in HPC systems (2015)
| System | #Cores | Memory (GB) | Memory/Core (GB) |
|---|---|---|---|
| TianHe-2 | 3,120,000 | 1,024,000 | 0.33 |
| Titan (Cray XK) | 560,640 | 710,144 | 1.27 |
| Sequoia (IBM BG/Q) | 1,572,864 | 1,572,864 | 1 |
| K (Fujitsu SPARC) | 705,024 | 1,410,048 | 2 |
| Stampede (Dell) | 462,462 | 192,192 | 0.42 |
| Roadrunner (IBM) | 129,600 | ? | |
| Pleiades (SGI) | 51,200 | 51,200 | 1 |
| BlueGene/L (IBM) | 131,072 | 32,768 | 0.25 |
| Earth Simulator (SX9) | 1,280 | 20,480 | 16 |
| Earth Simulator (SX6) | 5,120 | ~10,000 | 1.95 |
Memory/core in HPC systems
•What is a core (GPU SIMD core)?
•Memory a scarce resource, not possible to keep state information for all cores
•Hybrid, shared memory programming models may help to keep shared structures once/node
•Algorithms must use memory efficiently: in-place, no O(n²) representations for O(n+m) sized graphs, …
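To illustrate the last point, a graph with n vertices and m edges can be stored in O(n+m) space in compressed sparse row (CSR) form rather than as an n x n adjacency matrix. The sketch below is one common way of doing this; the type and function names (`csr_graph`, `csr_build`, `csr_degree`, `csr_free`) are illustrative assumptions, not taken from the lecture.

```c
#include <assert.h>
#include <stdlib.h>

/* CSR graph: neighbours of v are adj[offset[v]] .. adj[offset[v+1]-1]. */
typedef struct {
    int n;        /* number of vertices */
    int m;        /* number of directed edges */
    int *offset;  /* n+1 entries */
    int *adj;     /* m entries */
} csr_graph;

/* Build from an edge list src[e] -> dst[e]; returns 0 on success, -1 on failure. */
int csr_build(csr_graph *g, int n, int m, const int *src, const int *dst)
{
    assert(g && n >= 0 && m >= 0);
    g->n = n;
    g->m = m;
    g->offset = calloc((size_t)n + 1, sizeof(int));
    g->adj = malloc((m > 0 ? (size_t)m : 1) * sizeof(int));
    int *cursor = malloc((n > 0 ? (size_t)n : 1) * sizeof(int));
    if (!g->offset || !g->adj || !cursor) {
        free(g->offset); free(g->adj); free(cursor);
        return -1;
    }
    for (int e = 0; e < m; e++)          /* out-degree of v stored at offset[v+1] */
        g->offset[src[e] + 1]++;
    for (int v = 0; v < n; v++)          /* prefix sums turn degrees into offsets */
        g->offset[v + 1] += g->offset[v];
    for (int v = 0; v < n; v++)
        cursor[v] = g->offset[v];
    for (int e = 0; e < m; e++)          /* place each edge in its source's range */
        g->adj[cursor[src[e]]++] = dst[e];
    free(cursor);
    return 0;
}

int csr_degree(const csr_graph *g, int v)
{
    return g->offset[v + 1] - g->offset[v];
}

void csr_free(csr_graph *g)
{
    free(g->offset);
    free(g->adj);
    g->offset = NULL;
    g->adj = NULL;
}
```

The two arrays hold n+1 and m integers, so memory grows with the size of the graph itself, not with the square of the vertex count.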
Not in Top500 list
Details on interconnect only indirectly available:
•Bandwidth/node, bandwidth/core
•Bisection bandwidth
•Number of communication ports/node
Fully connected, direct: high bisection, low diameter, contention free
(Fat)tree: logarithmic diameter, high bisection possible, contention possible
Torus/Mesh: low bisection, high diameter
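The qualitative comparison of topologies can be made concrete with the standard textbook formulas for diameter and bisection width, counted in links. The sketch below assumes a fully connected network on an even number p of nodes and a d-dimensional torus with k nodes per dimension (k even, k >= 4); all function names are hypothetical.

```c
#include <assert.h>

/* Fully connected network on p nodes: every pair has a direct link. */
int full_diameter(int p)
{
    assert(p >= 2);
    return 1;
}

long full_bisection_links(int p)
{
    assert(p >= 2 && p % 2 == 0);
    return (long)(p / 2) * (p / 2);   /* every node in one half links to every node in the other */
}

static long ipow(long base, int exp)
{
    long r = 1;
    while (exp-- > 0)
        r *= base;
    return r;
}

/* d-dimensional torus with k nodes per dimension. */
long torus_nodes(int k, int d)
{
    assert(k >= 4 && k % 2 == 0 && d >= 1);
    return ipow(k, d);
}

int torus_diameter(int k, int d)
{
    assert(k >= 4 && k % 2 == 0 && d >= 1);
    return d * (k / 2);               /* at most k/2 steps per dimension thanks to wrap-around */
}

long torus_bisection_links(int k, int d)
{
    assert(k >= 4 && k % 2 == 0 && d >= 1);
    return 2 * ipow(k, d - 1);        /* cut one dimension in half; wrap-around doubles the cut */
}
```

For 4096 nodes, a 16 x 16 x 16 torus has diameter 24 and bisection width 512 links, whereas a fully connected network has diameter 1 and more than four million links across the bisection.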
#cores?
Summary: Exploiting (HPC) systems well
•Understand computer architecture: processor capabilities (pipeline, branch predictor, speculation, vectorization, …), memory system (cache-hierarchy, memory network)
•Understand communication networks (structure: diameter, bisection width; practical realization: NIC, communication processors)
•Understand programming model, and realization: language, interface, framework; algorithms and data structures
Co-design: application, programming model, architecture
Summary: What is HPC?
Study of
•Computer architecture, memory systems
•Communication networks
•Programming models and interfaces
•(Parallel) Algorithms and data structures, for applications and for interface support
•Assessment of computer systems: performance models, rigorous benchmarking
For Scientific Computing (applications):
•Tools, libraries, packages
•(Parallel) Algorithms and data structures
Processor architecture models
Roofline model: this lecture
[Hennessy, Patterson: Computer Architecture – A Quantitative Approach (5 Ed.). Morgan Kaufmann, 2012]
[Bryant, O’Halloran: Computer Systems. Prentice-Hall, 2003]
[Georg Hager, Jan Treibig, Johannes Habich, Gerhard Wellein: Exploring performance and power properties of modern multi-core chips via simple machine models. Concurrency and Computation: Practice and Experience 28(2): 189-210 (2016)]
[Georg Hager, Gerhard Wellein: Introduction to High Performance Computing for Scientists and Engineers. Chapman and Hall / CRC computational science series, CRC Press 2011, ISBN 978-1-439-81192-4, pp. I-XXV, 1-330]
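For orientation, the basic Roofline bound says that attainable performance is limited either by peak floating-point rate or by memory bandwidth times the arithmetic intensity of the code. The sketch below states only that basic form; the function name and units are assumptions for illustration, and the extended variants mentioned in the lecture are not covered.

```c
#include <assert.h>

/* Basic Roofline bound: min(peak, bandwidth * arithmetic intensity).
   peak_gflops:     peak floating-point rate in GFLOP/s
   mem_bw_gbyte_s:  sustained memory bandwidth in GByte/s
   flops_per_byte:  arithmetic intensity of the kernel in FLOP/Byte */
double roofline_gflops(double peak_gflops, double mem_bw_gbyte_s,
                       double flops_per_byte)
{
    assert(peak_gflops > 0.0 && mem_bw_gbyte_s > 0.0 && flops_per_byte >= 0.0);
    double memory_bound = mem_bw_gbyte_s * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}
```

With made-up example values of 100 GFLOP/s peak and 10 GByte/s bandwidth, a kernel with 2 FLOP/Byte is memory bound at 20 GFLOP/s, while any kernel above 10 FLOP/Byte is limited by the peak.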
Memory system
Cache system basics
[Georg Hager, Gerhard Wellein: Introduction to High Performance Computing for Scientists and Engineers. Chapman and Hall / CRC computational science series, CRC Press 2011, ISBN 978-1-439-81192-4, pp. I-XXV, 1-330]
•Cache-aware algorithm: Algorithm that uses memory (cache) hierarchy efficiently, under knowledge of the number of levels, cache and cache line sizes
•Cache-oblivious algorithm: Algorithm that uses memory hierarchy efficiently, without explicitly knowing cache system parameters (cache and line sizes)
•Cache-replacement strategies
Not this year
[Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran: Cache-Oblivious Algorithms. ACM Trans. Algorithms 8(1): 4 (2012), results dating back to FOCS 1999]
•Cache-aware algorithm: Algorithm that uses memory (cache) hierarchy efficiently, under knowledge of the number of levels, cache and cache line sizes
•Cache-oblivious algorithm: Algorithm that uses memory hierarchy efficiently, without explicitly knowing cache system parameters (cache and line sizes)
•Cache-replacement strategies
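The difference between the two notions can be sketched with matrix transposition, a standard example in the cache-oblivious literature. The blocked version takes a tile size `B` that would be tuned to the cache, while the recursive version only splits the larger dimension in half and never refers to cache parameters. Function names and the base-case size of 8 are illustrative assumptions, and the code is a sketch rather than a tuned implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Cache-aware: tile size B is chosen with knowledge of cache and line sizes. */
void transpose_blocked(const double *a, double *b, size_t n, size_t B)
{
    assert(B > 0);
    for (size_t ii = 0; ii < n; ii += B)
        for (size_t jj = 0; jj < n; jj += B)
            for (size_t i = ii; i < ii + B && i < n; i++)
                for (size_t j = jj; j < jj + B && j < n; j++)
                    b[j * n + i] = a[i * n + j];
}

/* Cache-oblivious: recursively halve the larger dimension of the
   submatrix rows [r0,r1) x columns [c0,c1); no cache parameter appears. */
static void transpose_rec(const double *a, double *b, size_t n,
                          size_t r0, size_t r1, size_t c0, size_t c1)
{
    size_t rows = r1 - r0, cols = c1 - c0;
    if (rows <= 8 && cols <= 8) {        /* small base case to limit recursion overhead */
        for (size_t i = r0; i < r1; i++)
            for (size_t j = c0; j < c1; j++)
                b[j * n + i] = a[i * n + j];
        return;
    }
    if (rows >= cols) {
        size_t rm = r0 + rows / 2;
        transpose_rec(a, b, n, r0, rm, c0, c1);
        transpose_rec(a, b, n, rm, r1, c0, c1);
    } else {
        size_t cm = c0 + cols / 2;
        transpose_rec(a, b, n, r0, r1, c0, cm);
        transpose_rec(a, b, n, r0, r1, cm, c1);
    }
}

void transpose_oblivious(const double *a, double *b, size_t n)
{
    if (n > 0)
        transpose_rec(a, b, n, 0, n, 0, n);
}
```

Both functions compute the same out-of-place transpose of an n x n row-major matrix; they differ only in how the traversal order is chosen.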
Memory system: multi-core memory systems (NUMA)
[Georg Hager, Gerhard Wellein: Introduction to High Performance Computing for Scientists and Engineers. Chapman and Hall / CRC computational science series, CRC Press 2011, ISBN 978-1-439-81192-4, pp. I-XXV, 1-330]
Memory efficient algorithms: external memory model, in-place algorithms, …
Not this year
Communication networks
•Network topologies
•Routing
•Modeling
By need only
Communication library
Efficient communication algorithms for given network assumptions inside MPI
Completely different case-study: context allocation in MPI
Process i: MPI_Send(&x,c,MPI_INT,j,TAG,comm);
Process j: MPI_Recv(&y,c,MPI_INT,i,TAG,comm,&status);
Process j receives messages from process i with TAG on comm in the order in which they were sent
MPI_Send(…,j,TAG,other); no match: no communication if comm!=other
Implementation of point-to-point communication: Message envelope contains communication context, unique to comm, to distinguish messages on different communicators
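The role of the communication context in matching can be sketched in plain C, without MPI. The struct layout and the names envelope_t and envelope_matches are hypothetical and only illustrate the idea; real MPI libraries use their own envelope formats and also handle wildcards such as MPI_ANY_SOURCE.

```c
#include <stdbool.h>

/* Hypothetical message envelope; field names are illustrative only. */
typedef struct {
    int context; /* communication context, unique to a communicator */
    int source;  /* rank of the sending process */
    int tag;     /* user-supplied message tag */
} envelope_t;

/* A posted receive matches an incoming message only if context, source,
   and tag all agree (wildcards are ignored in this sketch). */
static bool envelope_matches(const envelope_t *msg, const envelope_t *recv)
{
    return msg->context == recv->context &&
           msg->source  == recv->source  &&
           msg->tag     == recv->tag;
}
```

A message sent with the same source and tag on a different communicator carries a different context and therefore never matches.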
Tradeoff: number of bits for communication context vs. number of communication contexts. Sometimes: 12 bits, 14 bits, 16 bits… (4K, 16K, 64K possible communicators)
Implementation challenges: small envelope
Recall:
•Communicators in MPI essential for safe parallel libraries, tags not sufficient (library routines written by different people might use same tags)
•Communicators in MPI essential for algorithms that require collective communication on subsets of processes
[Figure: processes i and j in MPI_Comm MPI_COMM_WORLD]
MPI_Comm: local structure representing distributed communicator object
MPI_Recv(…,comm,&status); MPI_Send(…,comm);
MPI_COMM_WORLD: default communicator, all processes
MPI_Comm_create(), MPI_Comm_split(), MPI_Graph_create(), …: collective operations to create new communicators out of old ones
Algorithm scheme, process i:
1. Determine which other processes will belong to new communicator
2. Allocate context id: maintain global bitmap of used ids
Standard implementation: Use a bit vector (bitmap) of 4K to 16K bits to keep track of free communication contexts. If bit i of the bitmap is 0, then i is a free communication context
unsigned long bitmap[words];
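Since the bitmap is stored as an array of words, context id i lives in word i/wordlength at bit position i%wordlength. A minimal sketch of the bookkeeping, with hypothetical helper names (ctx_in_use, ctx_mark_used, ctx_mark_free) that do not come from any MPI library:

```c
#include <limits.h>
#include <stdbool.h>

/* Number of bits in one bitmap word. */
#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Is context id ctx marked as used (bit set to 1)? */
static bool ctx_in_use(const unsigned long *bitmap, unsigned ctx)
{
    return (bitmap[ctx / WORD_BITS] >> (ctx % WORD_BITS)) & 1UL;
}

/* Mark context id ctx as used. */
static void ctx_mark_used(unsigned long *bitmap, unsigned ctx)
{
    bitmap[ctx / WORD_BITS] |= 1UL << (ctx % WORD_BITS);
}

/* Mark context id ctx as free again, e.g. when a communicator is freed. */
static void ctx_mark_free(unsigned long *bitmap, unsigned ctx)
{
    bitmap[ctx / WORD_BITS] &= ~(1UL << (ctx % WORD_BITS));
}
```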
[Figure: MPI_COMM_WORLD and several other MPI_Comm communicators]
Problem: ensure that all processes in new communicator have same communication context by using same bitmap
Step 2.1: Since all communicator-creating operations are collective, use collective MPI_Allreduce() with bitwise OR (MPI_BOR) to generate global bitmap representing all used communication contexts
unsigned long bitmap[words], newmap[words];
MPI_Allreduce(bitmap,newmap,words,MPI_UNSIGNED_LONG,MPI_BOR,comm);
Important fact (will see later in lecture): For any reasonable network N, it holds that
Time(MPI_Allreduce(m)) = O(max(diam(N),log p)+m)
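What the reduction computes can be illustrated without MPI by a sequential stand-in: the global bitmap is the word-wise OR of the local bitmaps of all p processes. The function or_reduce and its flat array layout are assumptions made for this sketch only; in an MPI library the combination is done by MPI_Allreduce itself.

```c
#include <stddef.h>

/* Sequential stand-in for MPI_Allreduce with MPI_BOR.
   local holds p bitmaps of 'words' words each, stored one after the other;
   newmap receives the bitwise OR over all p bitmaps. */
static void or_reduce(const unsigned long *local, size_t p, size_t words,
                      unsigned long *newmap)
{
    for (size_t w = 0; w < words; w++) {
        unsigned long acc = 0UL;
        for (size_t r = 0; r < p; r++)
            acc |= local[r * words + w]; /* a context used anywhere is used globally */
        newmap[w] = acc;
    }
}
```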
[Figure: Typical MPI_Allreduce performance as a function of problem size m, fixed number of processes p=26*16, on the “jupiter” IB cluster at TU Wien; “minimum recorded time, no error bars”]
Time is constant for m≤K, for some small K
Use K as size of bitmap?
Step 2.2: Find first word with 0-bit
for (i=0; i<words; i++) if (newmap[i]!=~0UL) break; // ~0UL: all bits set (0xF…F)
O(words) operations
Step 2.3: Find rightmost (first) 0-bit in word
unsigned long x = newmap[i]; // valid only if i<words, that is, if a free context exists
for (z=0; z<8*sizeof(x); z++)
if ((x&0x1)==0x0) break; else x>>=1;
O(wordlength), dominates if words<wordlength
64 words of 64 bits = 4K communication contexts
Can we do better?
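As a baseline for the alternatives, steps 2.2 and 2.3 can be combined into one function. This is a minimal sketch; the name first_free_context and the return value -1 for a full bitmap are choices made here, not part of the lecture code.

```c
#include <limits.h>
#include <stddef.h>

/* Returns the index of the first 0-bit in the bitmap (the first free
   communication context), or -1 if every bit is set. */
static long first_free_context(const unsigned long *newmap, size_t words)
{
    const size_t wordlength = sizeof(unsigned long) * CHAR_BIT;
    size_t i, z;

    /* Step 2.2: find the first word that is not all ones, O(words). */
    for (i = 0; i < words; i++)
        if (newmap[i] != ~0UL) break;
    if (i == words) return -1; /* no free context */

    /* Step 2.3: find the rightmost 0-bit in that word, O(wordlength). */
    unsigned long x = newmap[i];
    for (z = 0; z < wordlength; z++) {
        if ((x & 0x1UL) == 0) break;
        x >>= 1;
    }
    return (long)(i * wordlength + z);
}
```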
Find “first 0 from right”, alternative methods
Here: 16-bit word
Method 1: Architecture has lsb(x) instruction (“least significant bit of x”), giving the index of the rightmost 1-bit
z = lsb(~x);
x = 0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1
~x = 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0
z = lsb(~x) = 7
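In C, such an instruction is usually reached through a compiler builtin. The sketch below assumes GCC or Clang, where __builtin_ctz (count trailing zeros) plays the role of lsb; it is not standard C, and its result is undefined for argument 0, so the all-ones word is handled separately. The name first_zero_ctz is hypothetical.

```c
#include <stdint.h>

/* Index of the rightmost 0-bit of x, or 32 if x has no 0-bit.
   Assumes GCC/Clang: __builtin_ctz is a compiler builtin, undefined for 0. */
static unsigned first_zero_ctz(uint32_t x)
{
    uint32_t inv = ~x; /* rightmost 0-bit of x becomes rightmost 1-bit */
    return inv == 0 ? 32u : (unsigned)__builtin_ctz(inv);
}
```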
Method 2: Architecture has “popcount” instruction pop(x) (population count, number of 1’s in x)
x = x&~(x+1);
z = pop(x);
Here: 16-bit word
x = 0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1
x+1 = 0 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0
~(x+1) = 1 1 0 1 0 0 1 0 0 1 1 1 1 1 1 1
x&~(x+1) = 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1
z = pop(x) = 7;
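A sketch of Method 2 in C, again assuming GCC or Clang, where __builtin_popcount stands in for the pop instruction (a compiler builtin, not standard C). The function name first_zero_pop is hypothetical.

```c
#include <stdint.h>

/* Index of the rightmost 0-bit of x via population count.
   x & ~(x+1) keeps exactly the 1-bits below the first 0-bit, so counting
   them gives the index. For x with all bits set the result is 32. */
static unsigned first_zero_pop(uint32_t x)
{
    uint32_t ones_below = x & ~(x + 1u);
    return (unsigned)__builtin_popcount(ones_below);
}
```

Unlike the lsb variant, no special case is needed for a word without 0-bits: x+1 wraps around to 0, the mask is all ones, and the count is 32.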
Method 3: direct; binary search
for 32-bit word (x must contain at least one 0-bit):
z = 0;
if ((x&0x0000FFFF) == 0x0000FFFF) { z = 16; x >>= 16; }
if ((x&0x000000FF) == 0x000000FF) { z += 8; x >>= 8; }
if ((x&0x0000000F) == 0x0000000F) { z += 4; x >>= 4; }
if ((x&0x00000003) == 0x00000003) { z += 2; x >>= 2; }
z += (x&0x1);
Here: 16-bit word
0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1
Trace of Method 3 on the 16-bit example:
x = 0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1
x&0xFF = 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1, not all ones in the low 8 bits: z = 0
x&0xF all ones: z = 4, x>>=4 gives 0 0 0 0 0 0 1 0 1 1 0 1 0 1 1 1
x&0x3 all ones: z = 6, x>>=2 gives 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 1
z += (x&0x1): z = 7
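Method 3 wrapped as a function might look as follows. The explicit check for the all-ones word is an addition made in this sketch (the binary search alone would return 31 for it); the name first_zero_bsearch is hypothetical.

```c
#include <stdint.h>

/* Index of the rightmost 0-bit of a 32-bit word by binary search:
   if the lower half of the remaining bits is all ones, the first 0-bit
   must be in the upper half, so skip the lower half. */
static unsigned first_zero_bsearch(uint32_t x)
{
    unsigned z = 0;
    if (x == 0xFFFFFFFFu) return 32; /* no 0-bit at all */
    if ((x & 0x0000FFFFu) == 0x0000FFFFu) { z = 16; x >>= 16; }
    if ((x & 0x000000FFu) == 0x000000FFu) { z += 8; x >>= 8; }
    if ((x & 0x0000000Fu) == 0x0000000Fu) { z += 4; x >>= 4; }
    if ((x & 0x00000003u) == 0x00000003u) { z += 2; x >>= 2; }
    z += (x & 0x1u); /* bit 0 set means the 0-bit is at position 1 */
    return z;
}
```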
Method 3a: direct; binary search to find lsb
for 32-bit word:
x = ~x; // invert bits: first 0-bit of the original x is now the least significant 1-bit
if (x==0) z = 32; else { // original x had no 0-bit
z = 1; // start at 1 so that the final subtraction gives the correct index
if ((x&0x0000FFFF) == 0x0) { z += 16; x >>= 16; }
if ((x&0x000000FF) == 0x0) { z += 8; x >>= 8; }
if ((x&0x0000000F) == 0x0) { z += 4; x >>= 4; }
if ((x&0x00000003) == 0x0) { z += 2; x >>= 2; }
z -= (x&0x1);
}
Might be better because masks needed only once
Here: 16-bit word
0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1
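A sketch of Method 3a as a function. Note that the counter starts at 1 here, as in the usual number-of-trailing-zeros routine, so that the final subtraction of the lowest bit yields the index; with a start value of 0 the result would be off by one. The name first_zero_lsb is hypothetical.

```c
#include <stdint.h>

/* Index of the rightmost 0-bit of x: invert x and binary-search for the
   least significant 1-bit. Returns 32 if x has no 0-bit. */
static unsigned first_zero_lsb(uint32_t x)
{
    unsigned z;
    x = ~x;                 /* first 0-bit of the input is now the lowest 1-bit */
    if (x == 0) return 32;  /* input was all ones */
    z = 1;
    if ((x & 0x0000FFFFu) == 0) { z += 16; x >>= 16; }
    if ((x & 0x000000FFu) == 0) { z += 8; x >>= 8; }
    if ((x & 0x0000000Fu) == 0) { z += 4; x >>= 4; }
    if ((x & 0x00000003u) == 0) { z += 2; x >>= 2; }
    return z - (x & 0x1u);  /* lowest bit set: the index is one less */
}
```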
Method 4: implement popcount
for 32-bit word:
x = x&~(x+1); // 1’s exactly at the positions below the first 0-bit
// popcount of x:
x = (x&0x55555555) + ((x>>1)&0x55555555);
x = (x&0x33333333) + ((x>>2)&0x33333333);
x = (x&0x0F0F0F0F) + ((x>>4)&0x0F0F0F0F);
x = (x&0x00FF00FF) + ((x>>8)&0x00FF00FF);
x = (x&0x0000FFFF) + ((x>>16)&0x0000FFFF); // x now holds z
Exploits word parallelism and is branch-free
Here: 16-bit word
0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1
Idea:
pop(0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1) = pop(0 0 1 0 1 1 0 1) + pop(0 1 1 1 1 1 1 1)
…and recurse
Observation: pop(x) for k-bit word x at most k; so pop(x) fits in word x
For 2-bit fields (written in binary):
pop(10) = ((10>>1)&0x1)+(10&0x1) = 1
pop(11) = ((11>>1)&0x1)+(11&0x1) = 2
The first popcount line of Method 4 computes this for all sixteen 2-bit fields of the 32-bit word in parallel; the following lines add neighboring fields of 2, 4, 8 and 16 bits
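The word-parallel population count of Method 4 can be packaged as a function and combined with the mask x&~(x+1). A minimal sketch; the names pop32 and first_zero_pop32 are chosen here for illustration.

```c
#include <stdint.h>

/* Population count of a 32-bit word: each step adds neighboring fields
   of 1, 2, 4, 8 and 16 bits, all fields of the word in parallel. */
static unsigned pop32(uint32_t x)
{
    x = (x & 0x55555555u) + ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x & 0x0F0F0F0Fu) + ((x >> 4) & 0x0F0F0F0Fu);
    x = (x & 0x00FF00FFu) + ((x >> 8) & 0x00FF00FFu);
    x = (x & 0x0000FFFFu) + ((x >> 16) & 0x0000FFFFu);
    return x;
}

/* Index of the rightmost 0-bit: count the 1-bits below it. */
static unsigned first_zero_pop32(uint32_t x)
{
    return pop32(x & ~(x + 1u));
}
```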
Method 4a: implement popcount, improved
for 32-bit word:
x = x&~(x+1); // 1’s exactly at the positions below the first 0-bit
x = x-((x>>1)&0x55555555);
x = (x&0x33333333) + ((x>>2)&0x33333333);
x = (x+(x>>4)) & 0x0F0F0F0F;
x += (x>>8);
x += (x>>16);
z = x&0x0000003F;
Here: 16-bit word
0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1
Exercise: Figure out what this does and why it works
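For experimenting with the exercise, the improved population count can be isolated in a function and checked against known values. A sketch under the assumption that the word is masked with x&~(x+1) before counting, as in Method 4; the function names are hypothetical.

```c
#include <stdint.h>

/* Improved word-parallel population count for a 32-bit word: fewer masks,
   because intermediate sums are known to fit into their fields. */
static unsigned pop32_fast(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit field sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit field sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* byte sums, at most 8 */
    x += x >> 8;
    x += x >> 16;
    return x & 0x3Fu;                                 /* total is at most 32 */
}

/* Index of the rightmost 0-bit, 32 if there is none. */
static unsigned first_zero_pop32_fast(uint32_t x)
{
    return pop32_fast(x & ~(x + 1u));
}
```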
Preprocessing for FFT: bit reversal
For efficient Fast Fourier Transform (FFT) implementations, a bit-reversal permutation is needed: B[r(i)] = A[i], where r(i) is the number obtained by reversing the bits in the binary representation of i
Examples: r(111000) = 000111, r(10111) = 11101, r(101101) = 101101
General: r(ab) = r(b)r(a), where ab is the concatenation of the bit strings a and b
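A first-shot, linear-time version of r(i) and of the permutation B[r(i)] = A[i] might look like this. It is a minimal sketch for arrays of n = 2^k elements; the function names and the element type double are assumptions made for illustration.

```c
#include <stddef.h>

/* Reverse the lowest k bits of i, one bit per iteration (k steps). */
static size_t reverse_bits(size_t i, unsigned k)
{
    size_t r = 0;
    for (unsigned b = 0; b < k; b++) {
        r = (r << 1) | (i & 1u); /* append the lowest bit of i to r */
        i >>= 1;
    }
    return r;
}

/* Bit-reversal permutation B[r(i)] = A[i] for arrays of 2^k elements. */
static void bit_reversal_permutation(const double *A, double *B, unsigned k)
{
    size_t n = (size_t)1 << k;
    for (size_t i = 0; i < n; i++)
        B[reverse_bits(i, k)] = A[i];
}
```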
r(a) for 32-bit word: Recursively, in parallel; branch-free:
x = ((x&0x55555555)<<1) | ((x&0xAAAAAAAA)>>1);
x = ((x&0x33333333)<<2) | ((x&0xCCCCCCCC)>>2);
x = ((x&0x0F0F0F0F)<<4) | ((x&0xF0F0F0F0)>>4);
x = ((x&0x00FF00FF)<<8) | ((x&0xFF00FF00)>>8);
x = ((x&0x0000FFFF)<<16) | ((x&0xFFFF0000)>>16);
Here: 16-bit word (first four steps):
0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 (x)
0 0 0 1 1 1 1 0 1 0 1 1 1 1 1 1 (neighboring bits swapped)
0 1 0 0 1 0 1 1 1 1 1 0 1 1 1 1 (neighboring 2-bit fields swapped)
1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 0 (neighboring 4-bit fields swapped)
1 1 1 1 1 1 1 0 1 0 1 1 0 1 0 0 (bytes swapped: r(x))
Note: the assignments can be done in any order
And perhaps even better (reuse of constants), for 32-bit word:
x = ((x&0x55555555)<<1) | ((x>>1)&0x55555555);
x = ((x&0x33333333)<<2) | ((x>>2)&0x33333333);
x = ((x&0x0F0F0F0F)<<4) | ((x>>4)&0x0F0F0F0F);
x = (x<<24) | ((x&0xFF00)<<8) | ((x>>8)&0xFF00) | (x>>24); // reverses the four bytes; x must be an unsigned 32-bit word
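Both variants of the 32-bit reversal can be written as functions on uint32_t and compared against each other. A minimal sketch; the names reverse32 and reverse32_reuse are chosen here. The fixed-width unsigned type matters, since the final byte reversal relies on bits shifted beyond bit 31 being discarded.

```c
#include <stdint.h>

/* Bit reversal of a 32-bit word by swapping neighboring fields of
   1, 2, 4, 8 and 16 bits. */
static uint32_t reverse32(uint32_t x)
{
    x = ((x & 0x55555555u) << 1)  | ((x & 0xAAAAAAAAu) >> 1);
    x = ((x & 0x33333333u) << 2)  | ((x & 0xCCCCCCCCu) >> 2);
    x = ((x & 0x0F0F0F0Fu) << 4)  | ((x & 0xF0F0F0F0u) >> 4);
    x = ((x & 0x00FF00FFu) << 8)  | ((x & 0xFF00FF00u) >> 8);
    x = ((x & 0x0000FFFFu) << 16) | ((x & 0xFFFF0000u) >> 16);
    return x;
}

/* Same result with each mask constant used for both halves of a swap
   and a single final step that reverses the four bytes. */
static uint32_t reverse32_reuse(uint32_t x)
{
    x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
    x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
    x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
    x = (x << 24) | ((x & 0xFF00u) << 8) | ((x >> 8) & 0xFF00u) | (x >> 24);
    return x;
}
```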
Exercises (project):
1. On processor xyz (Intel, AMD, …) write the fastest code to find the rightmost 0-bit in a k-word bitmap. Use bit-parallelism and vectorization.
2. Write the fastest code for bit-reversal of 64-bit word on processor xyz
• Measure execution time, count number of generated assembly instructions
• How much better than first-shot linear-time algorithm?
“If you write optimizing compilers or high-performance code, you must read this book”, Guy L. Steele, Foreword to “Hacker’s Delight”, 2002
[D. E. Knuth: “The Art of Computer Programming”, Vol. 4, Section 7.1.3, Addison-Wesley, 2011]
[D. E. Knuth: “MMIXware: A RISC Computer for the Third Millennium”, LNCS 1750, 1999 (new edition 2014)]
See also http://graphics.stanford.edu/~seander/bithacks.html
Are such things relevant?
Interesting further reading
[Michael Pippig: PFFT: An Extension of FFTW to Massively Parallel Architectures. SIAM J. Scientific Computing 35(3) (2013)]
[Matteo Frigo, Steven G. Johnson: FFTW: an adaptive software architecture for the FFT. ICASSP 1998: 1381-1384]
[Matteo Frigo: A Fast Fourier Transform Compiler. PLDI 1999: 169-180]
[Kang Su Gatlin, Larry Carter: Memory Hierarchy Considerations for Fast Transpose and Bit-Reversals. HPCA 1999: 33-42]
[Larry Carter, Kang Su Gatlin: Towards an Optimal Bit-Reversal Permutation Program. FOCS 1998: 544-555]
Not the end of the story
MPI_Comm_split(oldcomm,color,key,&newcomm);
All processes (in oldcomm) that supply the same color will belong to the same newcomm, ordered by key, ties broken by rank in oldcomm; processes that supply MPI_UNDEFINED as color get MPI_COMM_NULL
Problem: rank i supplying color c needs to determine which other processes also supplied color c
Trivial solution: all processes gather all colors and keys (MPI_Allgather), sort lexicographically to determine rank in newcomm
Early mpich: bubblesort!!!
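The local part of the trivial solution can be sketched without MPI: assume the (color, key) pairs of all p processes have already been gathered (in an MPI library, by MPI_Allgather), sort them lexicographically, and count how many entries of the same color precede the calling process. The type entry_t and the function new_rank are hypothetical, qsort replaces the bubblesort, and MPI_UNDEFINED colors are not handled in this sketch.

```c
#include <stdlib.h>

/* One gathered entry per process of oldcomm. */
typedef struct { int color, key, oldrank; } entry_t;

/* Lexicographic order on (color, key, oldrank). */
static int cmp_entry(const void *a, const void *b)
{
    const entry_t *x = a, *y = b;
    if (x->color != y->color) return x->color < y->color ? -1 : 1;
    if (x->key != y->key) return x->key < y->key ? -1 : 1;
    return (x->oldrank > y->oldrank) - (x->oldrank < y->oldrank);
}

/* Rank of process me in its new communicator. all[] holds the p gathered
   entries with all[r].oldrank == r on entry; it is sorted in place. */
static int new_rank(entry_t *all, int p, int me)
{
    int mycolor = all[me].color;
    int rank = 0;
    qsort(all, (size_t)p, sizeof *all, cmp_entry); /* O(p log p) */
    for (int i = 0; i < p; i++) {
        if (all[i].color != mycolor) continue;
        if (all[i].oldrank == me) return rank;
        rank++;
    }
    return -1; /* not reached if me is a valid rank */
}
```

Every process does O(p log p) local work on O(p) gathered data, which is what the better solutions try to avoid.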
Better solutions:
•Different, O(p log p) sort
•Modified allgather algorithm to merge on the fly
•…
[Siebert, Wolf: “Parallel Sorting with Minimal Data”. EuroMPI 2011, LNCS 6960: 170-177]
[A. Moody, D. H. Ahn, B. R. de Supinski: Exascale Algorithms for Generalized MPI_Comm_split. EuroMPI 2011, LNCS 6960: 9-18]