
IWOMP'2010, June 15th 2010
Enabling Low-Overhead Hybrid MPI/OpenMP Parallelism with MPC
Patrick Carribault, Marc Pérache and Hervé Jourdren
CEA, DAM, DIF, F-91297 Arpajon, France

Introduction/Context
HPC Architecture: Petaflop/s Era
Multicore processors as basic blocks
Clusters of ccNUMA nodes
Parallel programming models
MPI: distributed-memory model
OpenMP: shared-memory model
Hybrid MPI/OpenMP (or mixed-mode programming)
Promising solution (benefit from both models for data parallelism)
How to hybridize an application?
Contributions
Approaches for hybrid programming
Unified MPI/OpenMP framework (MPC) for lower hybrid overhead

Outline
Introduction/Context
Hybrid MPI/OpenMP Programming
Overview
Extended taxonomy
MPC Framework
OpenMP runtime implementation
Hybrid optimization
Experimental Results
OpenMP performance
Hybrid performance
Conclusion & Future Work

Hybrid MPI/OpenMP Programming Overview
MPI (Message Passing Interface)
Inter-node communication
Implicit locality
Unnecessary data duplication and unnecessary shared-memory transfers
OpenMP
Fully exploit shared-memory data parallelism
No inter-node standard
No data-locality standard (ccNUMA node)
Hybrid Programming
Mix MPI and OpenMP inside an application
Benefit from Pure-MPI and Pure-OpenMP modes
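To make the mixing concrete, here is a minimal hybrid sketch (illustrative only, not code from the talk): an OpenMP parallel loop exploits the cores inside each MPI task, and MPI combines the per-task results across nodes. All names and constants are placeholders.

```c
/* Minimal hybrid MPI/OpenMP sketch (illustrative): OpenMP handles the
 * intra-node loop, MPI combines partial results across tasks. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local_sum = 0.0, global_sum = 0.0;

    /* Shared-memory data parallelism inside one MPI task */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = rank; i < 1000000; i += size)
        local_sum += 1.0 / (1.0 + i);

    /* Distributed-memory combination across MPI tasks */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```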

Hybrid MPI/OpenMP Programming Approaches
Traditional approaches
Exploit one core with one execution flow
E.g., MPI for inter-node communication, OpenMP elsewhere
E.g., multicore CPU socket exploitation with OpenMP
Oversubscribing approaches
Exploit one core with several execution flows
Load balancing across the whole node
Adaptive behavior between parallel regions
Mixing depth
Communications outside parallel regions: network bandwidth saturation
Communications inside parallel regions: MPI thread-safety required
(both depths are illustrated in the sketch below)
Extended taxonomy from [Hager09]
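The two mixing depths map onto the MPI thread-support level requested at initialization. The following is a hedged sketch of both patterns (not the benchmarks' actual code; the peer rank, sizes, and tags are arbitrary): communication outside parallel regions only needs MPI_THREAD_FUNNELED, while per-thread communication inside a parallel region needs MPI_THREAD_MULTIPLE.

```c
/* Sketch of the two mixing depths (illustrative):
 * - communications outside parallel regions: MPI_THREAD_FUNNELED suffices
 * - communications inside parallel regions: MPI_THREAD_MULTIPLE required */
#include <mpi.h>
#include <omp.h>

void comm_outside(double *buf, int n, int peer)
{
    /* Only one thread communicates, outside any parallel region */
    MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, peer, 0, peer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        buf[i] *= 2.0;
}

void comm_inside(double *buf, int n, int peer)
{
    /* Every thread issues its own message: needs MPI_THREAD_MULTIPLE */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int chunk = n / omp_get_num_threads();   /* remainder ignored here */
        MPI_Sendrecv_replace(buf + tid * chunk, chunk, MPI_DOUBLE,
                             peer, tid, peer, tid,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    /* ... check 'provided', allocate buffers, call one of the variants ... */
    MPI_Finalize();
    return 0;
}
```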

Hybrid MPI/OpenMP Extended Taxonomy

MPC Framework
User-level thread library [EuroPar’08]
Pthreads API, debugging with GDB [MTAAP’2010]
Thread-based MPI [EuroPVM/MPI’09]
MPI 1.3 compliant
Optimized to save memory
NUMA-aware memory allocator (for multithreaded applications)
Contribution: hybrid representation inside MPC
Implementation of an OpenMP runtime (2.5 compliant)
Compiler part with patched GCC (4.3.X and 4.4.X)
Optimizations for hybrid applications
Efficient oversubscribed OpenMP (more threads than cores)
Unified representation of MPI tasks and OpenMP threads
Scheduler-integrated polling methods
Message-buffer privatization and parallel message reception
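As a rough illustration of the oversubscribed case (plain OpenMP, nothing MPC-specific; the 4x factor is arbitrary): the application simply requests more threads than cores, which only pays off when thread creation and context switching are cheap, as with MPC's user-level thread scheduler.

```c
/* Oversubscription sketch (illustrative): request more OpenMP threads than
 * cores; cheap user-level threads make the extra threads affordable. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int ncores = omp_get_num_procs();

    omp_set_dynamic(0);   /* honor the thread-count request exactly */

    #pragma omp parallel num_threads(4 * ncores)
    {
        #pragma omp single
        printf("%d threads running on %d cores\n",
               omp_get_num_threads(), ncores);
        /* ... oversubscribed work would go here ... */
    }
    return 0;
}
```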

MPC’s Hybrid Execution Model (Fully Hybrid)
Application with 1 MPI task per node

MPC’s Hybrid Execution Model (Fully Hybrid)
Initialization of OpenMP regions (on the whole node)

MPC’s Hybrid Execution Model (Fully Hybrid)
Entering OpenMP parallel region w/ 6 threads

MPC’s Hybrid Execution Model (Simple Mixed)
2 MPI tasks + OpenMP parallel region w/ 4 threads (on 2 cores)

Experimental Environment
Architecture
Dual-socket Quad-core Nehalem-EP machine
24 GB of memory, Linux 2.6.31 kernel
Programming model implementations
MPI: MPC, IntelMPI 3.2.1, MPICH2 1.1, OpenMPI 1.3.3
OpenMP: MPC, ICC 11.1, GCC 4.3.0 and 4.4.0, SunCC 5.1
Best option combination:
OpenMP thread pinning (KMP_AFFINITY, GOMP_CPU_AFFINITY)
OpenMP wait policy (OMP_WAIT_POLICY, SUN_MP_THR_IDLE=spin)
MPI task placement (I_MPI_PIN_DOMAIN=omp)
Benchmarks
EPCC suite (Pure OpenMP/Fully Hybrid) [Bull et al. 01]
Microbenchmarks for mixed-mode OpenMP/MPI [Bull et al. IWOMP’09]

EPCC: OpenMP Parallel-Region Overhead
[Chart: execution time (us, 0-50) vs. number of threads (1, 2, 4, 8); implementations: MPC, ICC 11.1, GCC 4.3.0, GCC 4.4.0, SunCC 5.1]
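For reference, the EPCC suite estimates this overhead by timing many repetitions of a parallel region around a fixed delay and subtracting the cost of the same delay run sequentially. Below is a simplified sketch of that measurement principle (the real benchmark adds outer repetitions and statistics; the constants here are arbitrary).

```c
/* Simplified sketch of the EPCC parallel-region overhead measurement:
 * overhead ~ (time of REPS parallel regions - time of REPS sequential
 * delays) / REPS. */
#include <omp.h>
#include <stdio.h>

#define REPS 10000

static void delay(int iters)
{
    volatile double a = 0.0;
    for (int i = 0; i < iters; i++)
        a += i * 0.5;
}

int main(void)
{
    int iters = 1000;

    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++)
        delay(iters);               /* sequential reference */
    double ref = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel
        delay(iters);               /* same work replicated on every thread */
    }
    double par = omp_get_wtime() - t0;

    printf("parallel-region overhead: %.3f us\n",
           1e6 * (par - ref) / REPS);
    return 0;
}
```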

EPCC: OpenMP Parallel-Region Overhead (cont.)
[Chart: execution time (us, 0-5) vs. number of threads (1, 2, 4, 8); implementations: MPC, ICC 11.1, GCC 4.4.0, SunCC 5.1]

EPCC: OpenMP Parallel-Region Overhead (cont.)
[Chart: execution time (us, 0-350) vs. number of threads (8, 16, 32, 64); implementations: MPC, ICC 11.1, GCC 4.4.0, SunCC 5.1]

Hybrid Funneled Ping-Pong (1KB)
[Chart: ratio (log scale, 1-1000) vs. number of OpenMP threads (2, 4, 8); implementations: MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1, OpenMPI/GCC 4.4.0, OpenMPI/ICC 11.1]
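In the funneled ping-pong microbenchmark, all OpenMP threads write and then read the message buffer, but only the master thread calls MPI. The sketch below illustrates one ping-pong step under that pattern (it assumes MPI was initialized with at least MPI_THREAD_FUNNELED; the function name and buffer handling are illustrative, not the benchmark's actual code).

```c
/* Funneled ping-pong step (illustrative): every thread touches the buffer,
 * only the master thread communicates (MPI_THREAD_FUNNELED). */
#include <mpi.h>
#include <omp.h>

#define NBYTES 1024   /* 1 KB message, as in the slide */

void funneled_pingpong_step(char *buf, int rank, int peer)
{
    #pragma omp parallel
    {
        /* every thread writes its part of the outgoing message */
        #pragma omp for
        for (int i = 0; i < NBYTES; i++)
            buf[i] = (char)(i + rank);

        #pragma omp master
        {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            }
        }
        #pragma omp barrier   /* master has no implicit barrier */

        /* every thread reads its part of the received message */
        #pragma omp for
        for (int i = 0; i < NBYTES; i++)
            buf[i] += 1;
    }
}
```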

Hybrid Multiple Ping-Pong (1KB)
[Chart: ratio (0-20) vs. number of OpenMP threads (2, 4); implementations: MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1]
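In the multiple ping-pong variant, each OpenMP thread runs its own concurrent ping-pong with the partner MPI task, so the MPI library must provide MPI_THREAD_MULTIPLE and match messages per thread, here by using the thread id as the tag. A hedged sketch with illustrative names:

```c
/* Multiple ping-pong step (illustrative): one independent ping-pong per
 * OpenMP thread, messages matched by thread id used as the MPI tag.
 * Requires MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...). */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define NBYTES 1024   /* 1 KB per thread, as in the slide */

void multiple_pingpong_step(int rank, int peer)
{
    #pragma omp parallel
    {
        int tag = omp_get_thread_num();      /* one tag per thread pair */
        char *buf = calloc(NBYTES, 1);

        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_CHAR, peer, tag, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, NBYTES, MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_CHAR, peer, tag, MPI_COMM_WORLD);
        }
        free(buf);
    }
}
```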

Hybrid Multiple Ping-Pong (1KB) (cont.)
[Chart: ratio (0-60) vs. number of OpenMP threads (2, 4, 8); implementations: MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1]

Hybrid Multiple Ping-Pong (1MB)
[Chart: ratio (0-3.5) vs. number of OpenMP threads (2, 4, 8); implementations: MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1]

Alternating (MPI Tasks Waiting)
[Chart: ratio (0-9) vs. number of OpenMP threads (2, 4, 8, 16); implementations: MPC, Intel MPI, MPICH2 1.1/GCC 4.4.0, MPICH2 1.1/ICC 11.1, OpenMPI/GCC 4.4.0]

Conclusion
Mixing MPI and OpenMP is a promising solution for next-generation computer architectures
How to avoid large overheads?
Contributions
Taxonomy of hybrid approaches
MPC: a framework unifying both programming models
Lower hybrid overhead
Fully compliant MPI 1.3 and OpenMP 2.5 (with patched GCC)
Freely available at http://mpc.sourceforge.net (version 2.0)
Contact: [email protected] or [email protected]
Future Work
Optimization of OpenMP runtime (e.g., NUMA barrier)
OpenMP 3.0 (tasks)
Thread/data affinity (thread placement, data locality)
Tests on large applications