PCI Parallel Computing Institute
CSE Computational Science & Engineering · XPACC
Center for Exascale Simulation of Plasma-Coupled Combustion
Multigrid on Heterogeneous Systems
A Look at Structured Stencil Computations
MPI + OpenMP + OpenCL
Lukas Spies1, Luke Olson1, Scott MacLachlan2, Michael Campbell1, Daniel Bodony1, William Gropp1
1 The Center for Exascale Simulation of Plasma-Coupled Combustion (XPACC), University of Illinois at Urbana-Champaign, USA, http://xpacc.illinois.edu
2 Department of Mathematics and Statistics, Memorial University of Newfoundland, Canada
Introduction
Why target heterogeneous systems?
- Combine the compute power of both CPUs and GPUs
- Avoid having one sit idle while the other is busy
- Chance for a performance increase
Challenges when working with BLAS-1 (basic linear algebra subprograms, level 1, e.g., "axpy"-type operations):
- BLAS-1 has low computational intensity
- BLAS-1 requires a lot of communication
Image source: https://bluewaters.ncsa.illinois.edu/image/image_gallery?img_id=11748
Model problem
- Model problem: Laplace's equation in two dimensions, ∇²u = 0
- Dirichlet boundary conditions (set to 0)
- Domain: [0, 1] × [0, 1]
- Four-color Gauss-Seidel: take a weighted average of each point and its 8 surrounding points (Mehrstellen FD stencil)
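As a rough illustration (not the authors' code), one smoothing sweep of this kind could look like the sketch below; the 2×2 coloring and the standard Mehrstellen weights (4 for edge neighbors, 1 for corner neighbors, 20 for the center) are assumptions made for the example.

```cpp
// Sketch: one four-color Gauss-Seidel sweep with the compact 9-point (Mehrstellen)
// stencil for Laplace's equation with zero right-hand side.
// u is an (n+2) x (n+2) row-major grid including the Dirichlet boundary.
#include <vector>

void four_color_gs_sweep(std::vector<double>& u, int n)
{
    auto idx = [n](int i, int j) { return i * (n + 2) + j; };

    // Colors 0..3 follow the parity pattern (i % 2, j % 2); points of one color
    // do not depend on each other, so each color can be updated in parallel.
    for (int color = 0; color < 4; ++color) {
        int ioff = color / 2, joff = color % 2;
        #pragma omp parallel for schedule(static)
        for (int i = 1 + ioff; i <= n; i += 2) {
            for (int j = 1 + joff; j <= n; j += 2) {
                double edge   = u[idx(i-1,j)] + u[idx(i+1,j)]
                              + u[idx(i,j-1)] + u[idx(i,j+1)];
                double corner = u[idx(i-1,j-1)] + u[idx(i-1,j+1)]
                              + u[idx(i+1,j-1)] + u[idx(i+1,j+1)];
                u[idx(i,j)] = (4.0 * edge + corner) / 20.0;  // weighted 8-neighbor average
            }
        }
    }
}
```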
Coarse partitioning
[Figure: an N × M grid on the unit square, coarsely partitioned into four rectangular blocks, one per MPI rank (MPI #1 through MPI #4).]
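A minimal sketch of such a coarse partitioning, assuming a 2D MPI Cartesian communicator (the global grid size and the printed block report are illustrative, not taken from the slides):

```cpp
// Sketch: one rectangular block of an N x M grid per MPI rank via a 2D Cartesian grid.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int nranks, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int dims[2] = {0, 0}, periods[2] = {0, 0}, coords[2];
    MPI_Dims_create(nranks, 2, dims);                  // e.g. 4 ranks -> 2 x 2
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
    MPI_Cart_coords(cart, rank, 2, coords);

    const int N = 40000, M = 40000;                    // illustrative global grid
    int local_n = N / dims[0], local_m = M / dims[1];  // block owned by this rank
    std::printf("rank %d owns block (%d,%d) of size %d x %d\n",
                rank, coords[0], coords[1], local_n, local_m);

    MPI_Finalize();
    return 0;
}
```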
CPU/GPU partitioning designs
[Figure: three ways to split each MPI block of the N × M grid between CPU and GPU: the halo design, the slab design, and the L-shape design; the halo region of the block is marked.]
Blue Waters
[Figure: Blue Waters node types; XE6 nodes run MPI + OpenMP, the XK7 node runs MPI + OpenCL.]
XE6 node: 2 AMD 6276 Interlagos
XK7 node: 1 AMD 6276 Interlagos + 1 NVIDIA K20X
- each Interlagos: 2.3 GHz, 4 memory channels, 32 GB physical memory
- each K20X: 1.31 TFLOPS peak performance, 6 GB physical memory
Threading for hybrid approach
[Figure: the main program launches two tasks via std::launch::async, one CPU/OpenMP task and one CPU task that drives the GPU through blocking OpenCL calls; both operate on shared memory, and the main thread joins them with std::future::wait().]
- Using the C++11 future class for asynchronous threading
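A minimal sketch of this pattern, assuming hypothetical worker functions smooth_cpu_part() (OpenMP-threaded) and smooth_gpu_part() (issuing blocking OpenCL calls):

```cpp
// Sketch: overlap the OpenMP CPU work with the thread driving the GPU via OpenCL.
#include <future>

void smooth_cpu_part();   // hypothetical: OpenMP sweep over the CPU portion
void smooth_gpu_part();   // hypothetical: enqueues OpenCL kernels, blocks until done

void hybrid_iteration()
{
    // Launch the GPU-driving task on its own thread.
    std::future<void> gpu = std::async(std::launch::async, smooth_gpu_part);

    // The calling thread (and its OpenMP team) handles the CPU portion meanwhile.
    smooth_cpu_part();

    // Join before the halo exchange; std::future::wait() blocks until the GPU task is done.
    gpu.wait();
}
```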
Numerical experiments
1. Test various OpenMP thread counts and OpenMP schedulers (see the sketch after this list):
   • static: equal chunks for each thread
   • dynamic: assign small chunks to threads, let each thread check back for more work
   • guided: like dynamic, but with decreasing chunk size
2. Test various device-to-host ratios
3. Scaling behavior of the hybrid code
4. Analyse the roofline model
5. Performance model
Notation: host-to-device ratio := ω
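The scheduler variants from item 1 are selected through the OpenMP schedule clause; a minimal sketch (the loop body is a stand-in, and schedule(runtime) plus OMP_SCHEDULE is one convenient way to switch variants without recompiling):

```cpp
// Sketch: comparing OpenMP schedulers by setting the schedule at run time.
#include <vector>

void sweep_rows(std::vector<double>& work)
{
    // schedule(static): equal chunks; schedule(dynamic, c) / schedule(guided, c):
    // idle threads pull chunks of (at most) c iterations as they finish earlier ones.
    #pragma omp parallel for schedule(runtime)
    for (std::size_t i = 0; i < work.size(); ++i) {
        work[i] *= 0.5;   // stand-in for updating one row of the grid
    }
}
// e.g. OMP_SCHEDULE="dynamic,10" OMP_NUM_THREADS=8 ./benchmark
```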
CPU-only benchmark
[Figure: actual timings in seconds (left, run time [s]) and efficiency T1/(Tt·t) (right) vs. OMP_NUM_THREADS (2 to 16), for schedulers static; static,1; dynamic,1; dynamic,10; dynamic,100; guided,10.]
- Varying number of OpenMP threads for various OpenMP schedulers
- 40,000 × 40,000 points, 4 MPI ranks, 5 iterations
- Fastest time: 6356.43 ms
Hybrid OpenMP benchmarks
[Figure: actual timings in seconds (left, run time [s]) and efficiency T1/(Tt·t) (right) vs. OMP_NUM_THREADS (2 to 16), for the same schedulers as above.]
- Varying number of OpenMP threads for various OpenMP schedulers
- 40,000 × 40,000 points, 4 MPI ranks, 5 iterations, ω = 0.9
- Fastest time: 1898.2 ms
Varying device-to-host ratio
[Figure: run time [s] vs. portion of points handled by the GPU (0.1 to 0.99); bars are split into CPU and GPU contributions.]
- 40,000 × 40,000 points, 4 MPI ranks, 5 iterations, 8 OpenMP threads per rank
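How the work could be divided by such a ratio is illustrated by the following sketch; a slab-style split by rows is assumed here, and the function name is hypothetical:

```cpp
// Sketch: divide the local rows of one MPI block between GPU and CPU by the
// ratio omega (the portion of points handled by the GPU).
#include <algorithm>

struct Split { int gpu_rows; int cpu_rows; };

Split split_rows(int local_rows, double omega)
{
    int gpu_rows = static_cast<int>(omega * local_rows + 0.5);   // round to nearest
    gpu_rows = std::min(std::max(gpu_rows, 0), local_rows);      // clamp to valid range
    return { gpu_rows, local_rows - gpu_rows };
}
// e.g. split_rows(10000, 0.9) gives the GPU 9000 rows and the CPU 1000.
```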
Weak and strong scaling
[Figure: (a) hybrid weak scaling and (b) hybrid strong scaling vs. ideal strong scaling; run time [s] vs. number of MPI ranks (1 to 1024).]
- weak scaling: 20,000 × 20,000 points per MPI rank
- strong scaling: 20,000 × 20,000 points overall problem size
- both: 5 iterations, ω = 0.9
Roofline model
[Figure: roofline plot, GFlop/s vs. arithmetic intensity (0.03125 to 16); the kernel's ceiling sits at 81 GFlop/s.]
- Arithmetic intensity: measure of FLOPs vs. the amount of memory accesses; we measured it with Oclgrind
- Theoretical peak performance of the model problem: 81 GFLOPS, about 6.2% of the peak performance of the GPU (1.31 TFLOPS)
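For reference, the usual roofline bound is written out below as a worked example; the arithmetic intensity of 0.4375 flop/byte comes from the takeaway slide, while the ≈185 GB/s sustained device bandwidth is back-computed here from the quoted 81 GFLOP/s ceiling, not a number stated on the slides:

```latex
P_{\text{attainable}} = \min\bigl(P_{\text{peak}},\ \mathrm{AI}\cdot B\bigr)
\approx \min\bigl(1310,\ 0.4375\cdot 185\bigr)\ \text{GFlop/s}
\approx 81\ \text{GFlop/s}
```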
Performance model
[Figure: measured vs. predicted run time [s]; left: vs. portion of points handled by the GPU (0.1 to 0.99); right: vs. square root of the problem dimension (1.25k to 25k).]
T = max(T_c, T_g)·k + (T_{c→c} + T_{c→g} + T_{g→c})·(k − 1) + T_{moveSoln}
- max(T_c, T_g)·k: kernel execution
- (T_{c→c} + T_{c→g} + T_{g→c})·(k − 1): halo exchanges
- T_{moveSoln}: download of the solution from the GPU
- k := iteration count (here: k = 5)
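A small sketch of evaluating this model (the struct and field names are illustrative; the times are whatever was measured or estimated for each component, in seconds):

```cpp
// Sketch: evaluate the performance model T for k iterations.
#include <algorithm>

struct ModelTimes {
    double Tc, Tg;          // CPU / GPU kernel time per iteration
    double Tcc, Tcg, Tgc;   // CPU<->CPU, CPU->GPU, GPU->CPU halo-exchange times
    double TmoveSoln;       // final download of the solution from the GPU
};

double predicted_runtime(const ModelTimes& t, int k)
{
    return std::max(t.Tc, t.Tg) * k
         + (t.Tcc + t.Tcg + t.Tgc) * (k - 1)
         + t.TmoveSoln;
}
```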
Why not use package abc?
- Frameworks/Libraries:
  • Maestro: provides automatic data transfer, task decomposition across multiple devices, and autotuning of certain dynamic parameters
  • SnuCL: provides an API for using all compute units in a cluster as if they were located in the host node (no need for MPI)
  • ViennaCL: linear algebra library for computations on many-core architectures (GPUs, MIC) and multi-core CPUs
- Key challenges for frameworks/libraries:
  • Finding parameters for autotuning
  • Fine-grained data movement is tricky
- We needed the ability to control fine-grained data movement between compute units
- Results and performance modelling can give important information for improving existing libraries/frameworks
Takeaway
- Heterogeneous systems provide the possibility of using CPUs and GPUs simultaneously
- The C++11 future class provides easy access to threading
- OpenMP alone can only provide limited speedup (low efficiency)
- The hybrid code achieved a speedup of 3.3 over the CPU-only code
- It is optimal for the GPU to handle a large portion of the points
- The very low arithmetic intensity of 0.4375 leads to a peak performance of the code of 81 GFLOPS, i.e., more than 90% of the GPU remains idle
- Improving the data bandwidth between CPU and GPU can improve the performance of the code, e.g., by using one GPU for multiple partitions
Future work
- apply the concept to a multigrid V-cycle and eventually GMRES with a multigrid preconditioner
- improve the performance model
- extend to three dimensions
- use higher-order stencils (e.g., Q2 FEM) and coupled systems
- increase the halo width
Thank you!
This material is based in part upon work supported by the Department of Energy, National Nuclear Security Administration, under Award Number DE-NA0002374.