optimisation of parallel scientific applications on highly
TRANSCRIPT
Optimisation of Parallel Scientific Applicationson Highly Heterogeneous Modern Multicore and
Multi-GPU Computing Nodes
Vladimir Rychkov Ziming Zhong
Heterogeneous Computing LaboratorySchool of Computer Science and Informatics
University College Dublin
May 11, 2012
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 1 / 16
Introduction
Introduction
Hybrid multi-CPU/GPU architectures in HPCHow to utilise highly heterogeneous hardware and software stack?
Target platforms and applicationsDedicated multicore and multi-GPU nodes and clustersData-parallel applications dependent on data locality
Performance modeling and model-based optimisationRealistic models of processors executing data-parallel applicationAccurate measurement of performance on each processorOptimal data partitioning based on the models
Functional performance modelsHeterogeneous uniprocessor clusters
How to model performance of devices on a hybrid platform?Different programming models and softwareShared resources of different speed and capacity
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 2 / 16
Introduction
Introduction
Hybrid multi-CPU/GPU architectures in HPCHow to utilise highly heterogeneous hardware and software stack?
Target platforms and applicationsDedicated multicore and multi-GPU nodes and clustersData-parallel applications dependent on data locality
Performance modeling and model-based optimisationRealistic models of processors executing data-parallel applicationAccurate measurement of performance on each processorOptimal data partitioning based on the models
Functional performance modelsHeterogeneous uniprocessor clusters
How to model performance of devices on a hybrid platform?Different programming models and softwareShared resources of different speed and capacity
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 2 / 16
Introduction
Introduction
Hybrid multi-CPU/GPU architectures in HPCHow to utilise highly heterogeneous hardware and software stack?
Target platforms and applicationsDedicated multicore and multi-GPU nodes and clustersData-parallel applications dependent on data locality
Performance modeling and model-based optimisationRealistic models of processors executing data-parallel applicationAccurate measurement of performance on each processorOptimal data partitioning based on the models
Functional performance modelsHeterogeneous uniprocessor clusters
How to model performance of devices on a hybrid platform?Different programming models and softwareShared resources of different speed and capacity
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 2 / 16
Introduction
Hybrid HPC Platform
Device Memory
GPU
PCIe
Interface
PCIe
Interface
Multicore Host
...
Main Memory Device Memory
GPU
core 1
cache cache
core n
shared cache
Highly heterogeneous devices
Memory: shared and distributed
Resource contention: memory, PCI Express
Can be modelled as a heterogeneous distributed-memory systemFunctional performance models of devicesIntra-node data partitioning between devices
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 3 / 16
Introduction
Hybrid HPC Platform
Device Memory
GPU
PCIe
Interface
PCIe
Interface
Multicore Host
...
Main Memory Device Memory
GPU
core 1
cache cache
core n
shared cache
Highly heterogeneous devices
Memory: shared and distributed
Resource contention: memory, PCI Express
Can be modelled as a heterogeneous distributed-memory systemFunctional performance models of devicesIntra-node data partitioning between devices
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 3 / 16
Introduction
Outline
Functional performance models of multiple cores and GPUs
Accurate measurement of performance of devices
Optimisation of computational kernels
Intra-node data partitioning based on FPMs of devices
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 4 / 16
Functional Performance Models of Multiple Cores and GPUs
Heterogeneous Data-Parallel Application
Heterogeneous matrix multiplicationA B n x bCb
b
n x b
Beaumont, O., Boudet, V., Rastello, F., Robert, Y.: Matrix Multiplication onHeterogeneous Platforms. IEEE Trans. Parallel Distrib. Syst. 12(10), 1033-1051 (2001)
Representative kernel, GEMM: Ci+ = A(b) × B(b)
+=bX
b
m x b
n x b
i m x bi
i
n x bi
A B C i(b) (b)
Computation is proportional to the area of submatrix Ci
The same memory access pattern as the whole application
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 5 / 16
Functional Performance Models of Multiple Cores and GPUs
Functional Performance Models of Devices
Optimised GEMM routines: MKL/ACML(CPU), CUBLAS(GPU)
CPU cores compete with each other within a socketsc(x) - speed of a socket executing CPU GEMM on c cores,with submatrices x/c
GPU depends on a host process, which handles data transfers
GPU performance can be measured only within some rangeg(x) - combined speed of a GPU and its dedicated core executing CUBLAS,can be defined at [0,∞) for out-of-core GEMM
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 6 / 16
Functional Performance Models of Multiple Cores and GPUs
Functional Performance Models of Devices
Optimised GEMM routines: MKL/ACML(CPU), CUBLAS(GPU)
CPU cores compete with each other within a socketsc(x) - speed of a socket executing CPU GEMM on c cores,with submatrices x/c
GPU depends on a host process, which handles data transfers
GPU performance can be measured only within some rangeg(x) - combined speed of a GPU and its dedicated core executing CUBLAS,can be defined at [0,∞) for out-of-core GEMM
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 6 / 16
Functional Performance Models of Multiple Cores and GPUs
Functional Performance Models of Devices
Optimised GEMM routines: MKL/ACML(CPU), CUBLAS(GPU)
CPU cores compete with each other within a socketsc(x) - speed of a socket executing CPU GEMM on c cores,with submatrices x/c
GPU depends on a host process, which handles data transfers
GPU performance can be measured only within some rangeg(x) - combined speed of a GPU and its dedicated core executing CUBLAS,can be defined at [0,∞) for out-of-core GEMM
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 6 / 16
Performance Measurement on Hybrid Platforms
Performance Measurement on Hybrid Platforms
Bind processes to cores- to avoid potential performance degradation resulting fromautomatic rearranging of processes by operating system
Synchronise processes- to ensure resources are shared between processes
Repeat experiments multiple times- until the results are proved statistically reliable
Performance of multiple cores and GPUs can be measured separately
Simultaneous benchmark of multiple cores on a socket
Combined benchmark for a GPU and its dedicated core
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 7 / 16
Performance Measurement on Hybrid Platforms
Performance Measurement on Hybrid Platforms
Bind processes to cores- to avoid potential performance degradation resulting fromautomatic rearranging of processes by operating system
Synchronise processes- to ensure resources are shared between processes
Repeat experiments multiple times- until the results are proved statistically reliable
Performance of multiple cores and GPUs can be measured separately
Simultaneous benchmark of multiple cores on a socket
Combined benchmark for a GPU and its dedicated core
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 7 / 16
Performance Measurement on Hybrid Platforms
Experimental Platform
Multicore multi-GPU node ig.eecs.utk.eduInnovative Computing Laboratory, University of Tennessee, USA
CPU (AMD) GPUs (NVIDIA)
Architecture Opteron 8439SE GeForce GTX480 Tesla C870Core Clock 2.8 GHz 700 MHz 600 MHzNumber of Cores 8×6 cores 480 cores 128 coresMemory Size 64 GB 1536 MB 1536 MBMemory Bandwidth 177.4 GB/s 76.8 GB/s
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 8 / 16
Functional Performance Models of Multiple Cores
Functional Performance Models of Multiple Cores
Speed functions of multiple cores
60
80
100
120
0 1e+08 2e+08 3e+08 4e+08
Sp
ee
d (
GF
lop
s)
Matrix area (d)
socket
5 cores, b 640 6 cores, b 640
s5(x), s6(x) - speed functions for5 and 6 cores on a socket
Resource contention with GPU
60
80
100
120
0 1e+08 2e+08 3e+08 4e+08 5e+08
Sp
ee
d (
GF
lop
s)
Matrix area (d)
5 cores
CPU-onlycores:GPU = 1:3
cores:GPU = 1:8
s5(x) remains stable when sharingresource with GPU
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 9 / 16
Functional Performance Models of GPUs
Functional Performance Models of GPUs
Speed functions of GeForce GTX480
0
200
400
600
800
0 4e+08 8e+08 1.2e+09 1.6e+09
Sp
ee
d (
GF
lop
s)
Matrix area (d)
memory limit
b 640 b 640, overlap
Out-of-core implementationextends the range of problem sizes
Overlapping improves performance
Resource contention
0
200
400
600
800
0 4e+08 8e+08 1.2e+09 1.6e+09
Sp
ee
d (
GF
lop
s)
Matrix area (d)
GPU-onlycores:GPU = 1:3
cores:GPU = 1:8
Performance decreases around 10%from resource contention
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 10 / 16
Functional Performance Models of GPUs
Optimisation of the Kernel
Out-of-core computations
Overlap data transfers and computation
A0 B0 C0
A1
GEMM
GEMM C1
A0 C0GEMM
Stream 0
Stream 1
........ ........ ........ ........
time
C0
C1
C0Stream 2
...
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 11 / 16
FPM-based Data Partitioning on Hybrid Platform
FPM-based Data Partitioning on Hybrid Platform
Execution time of the parallel matrix multiplication applicationon different configurations on the hybrid server
Execution time (sec)Matrix size CPUs GPUs CPUs+GPUs
12800× 12800 14.6 10.5 5.819200× 19200 43.4 32.5 16.225600× 25600 99.8 147.9 38.232000× 32000 189.2 265.3 114.1
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 12 / 16
Ongoing and Future Study
Ongoing and Future Study
Target platform: multi-GPU servers - more resource contention
Accurate measurement technique
FPM of multiple GPUs on the same server
Design of computational kernel for multi-GPU servers
Experiments on data partitioning with different models
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 13 / 16
Output
Publications
Zhong, Z., Rychkov, V., Lastovetsky, A. ”Data Partitioning onHeterogeneous Multicore Platforms”, Cluster 2011, Austin, Texas,USA, IEEE Computer Society, pp. 580-584 (2011).
Zhong, Z., Rychkov, V., Lastovetsky, A. ”Functional PerformanceModels of Scientific Applications for Heterogeneous Multicore andMulti-GPU Systems” (submitted to Euro-Par 2012, Rhodes Island,Greece).
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 14 / 16
Output
Output
Project web page: http://hcl.ucd.ie/project/fupermod
Software
Implemented in the FuPerMod package, developed at HCL
Based on system and mathematical software: C/C++, MPI,Autotools, GNU Scientific Library, Boost C++ libraries, BLAS,CUDA Toolkit
Team
1 postdoctoral researcher: Vladimir Rychkov
2 PhD students: David Clarke, Ziming Zhong
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 15 / 16
Output
CollaborationHardware
Multicore multi-GPU servers (Innovative Computing Laboratory, Universityof Tennessee, USA)
Multicore multi-GPU cluster (High Performance Computing & Architecturesgroup, University Jaume I, Spain)
Grid’5000 (INRIA, CNRS, RENATER, France)
Collaboration
Leonel Sousa, Alexandar Ilic (Institute of System Engineering, ComputerResearch and Development, Portugal)
Enrique Quintana Orti (High Performance Computing & Architectures,University Jaume I, Spain)
Financial Support
Science Foundation UCD CSI China Scholarship Complex HPC actionIreland Council COST
Vladimir Rychkov (HCL/UCD) Optimisation of Applications on Multicore and Multi-GPU May 11, 2012 16 / 16