
IWOMP'2010, June 15th 2010
Enabling Low-Overhead Hybrid MPI/OpenMP Parallelism with MPC
Patrick Carribault, Marc Pérache and Hervé Jourdren
CEA, DAM, DIF, F-91297 Arpajon, France

Introduction/Context
HPC Architecture: Petaflop/s Era
Multicore processors as basic blocks
Clusters of ccNUMA nodes
Parallel programming models
MPI: distributed-memory model
OpenMP: shared-memory model
Hybrid MPI/OpenMP (or mixed-mode programming)
Promising solution (benefit from both models for data parallelism)
How to hybridize an application?
Contributions
Approaches for hybrid programming
Unified MPI/OpenMP framework (MPC) for lower hybrid overhead

Outline
Introduction/Context
Hybrid MPI/OpenMP Programming
Overview
Extended taxonomy
MPC Framework
OpenMP runtime implementation
Hybrid optimization
Experimental Results
OpenMP performance
Hybrid performance
Conclusion & Future Work

Hybrid MPI/OpenMP Programming Overview
MPI (Message Passing Interface)
Inter-node communication
Implicit locality
Unnecessary data duplication and unnecessary shared-memory transfers
OpenMP
Fully exploit shared-memory data parallelism
No inter-node standard
No data-locality standard (ccNUMA node)
Hybrid Programming
Mix MPI and OpenMP inside an application
Benefit from Pure-MPI and Pure-OpenMP modes
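To make the mixing concrete, here is a minimal hybrid sketch (illustrative only, not code from the talk): an OpenMP parallel loop exploits the cores inside each MPI task, and MPI combines the per-task results across nodes. All names and constants are placeholders.

```c
/* Minimal hybrid MPI/OpenMP sketch (illustrative): OpenMP handles the
 * intra-node loop, MPI combines partial results across tasks. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local_sum = 0.0, global_sum = 0.0;

    /* Shared-memory data parallelism inside one MPI task */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = rank; i < 1000000; i += size)
        local_sum += 1.0 / (1.0 + i);

    /* Distributed-memory combination across MPI tasks */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```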

Hybrid MPI/OpenMP Programming Approaches
Traditional approaches
Exploit one core with one execution flow
E.g., MPI for inter-node communication, OpenMP elsewhere
E.g., multicore CPU socket exploitation with OpenMP
Oversubscribing approaches
Exploit one core with several execution flows
Load balancing across the whole node
Adaptive behavior between parallel regions
Mixing depth
Communications outside parallel regions: network bandwidth saturation
Communications inside parallel regions: MPI thread-safety required
(both depths are illustrated in the sketch below)
Extended taxonomy from [Hager09]
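The two mixing depths map onto the MPI thread-support level requested at initialization. The following is a hedged sketch of both patterns (not the benchmarks' actual code; the peer rank, sizes, and tags are arbitrary): communication outside parallel regions only needs MPI_THREAD_FUNNELED, while per-thread communication inside a parallel region needs MPI_THREAD_MULTIPLE.

```c
/* Sketch of the two mixing depths (illustrative):
 * - communications outside parallel regions: MPI_THREAD_FUNNELED suffices
 * - communications inside parallel regions: MPI_THREAD_MULTIPLE required */
#include <mpi.h>
#include <omp.h>

void comm_outside(double *buf, int n, int peer)
{
    /* Only one thread communicates, outside any parallel region */
    MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, peer, 0, peer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        buf[i] *= 2.0;
}

void comm_inside(double *buf, int n, int peer)
{
    /* Every thread issues its own message: needs MPI_THREAD_MULTIPLE */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int chunk = n / omp_get_num_threads();   /* remainder ignored here */
        MPI_Sendrecv_replace(buf + tid * chunk, chunk, MPI_DOUBLE,
                             peer, tid, peer, tid,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    /* ... check 'provided', allocate buffers, call one of the variants ... */
    MPI_Finalize();
    return 0;
}
```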

Hybrid MPI/OpenMP Extended Taxonomy

MPC Framework
User-level thread library [EuroPar’08]
Pthreads API, debugging with GDB [MTAAP’2010]
Thread-based MPI [EuroPVM/MPI’09]
MPI 1.3 compliant
Optimized to save memory
NUMA-aware memory allocator (for multithreaded applications)
Contribution: hybrid representation inside MPC
Implementation of an OpenMP runtime (2.5 compliant)
Compiler part with patched GCC (4.3.X and 4.4.X)
Optimizations for hybrid applications
Efficient oversubscribed OpenMP (more threads than cores)
Unified representation of MPI tasks and OpenMP threads
Scheduler-integrated polling methods
Message-buffer privatization and parallel message reception
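As a rough illustration of the oversubscribed case (plain OpenMP, nothing MPC-specific; the 4x factor is arbitrary): the application simply requests more threads than cores, which only pays off when thread creation and context switching are cheap, as with MPC's user-level thread scheduler.

```c
/* Oversubscription sketch (illustrative): request more OpenMP threads than
 * cores; cheap user-level threads make the extra threads affordable. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    int ncores = omp_get_num_procs();

    omp_set_dynamic(0);   /* honor the thread-count request exactly */

    #pragma omp parallel num_threads(4 * ncores)
    {
        #pragma omp single
        printf("%d threads running on %d cores\n",
               omp_get_num_threads(), ncores);
        /* ... oversubscribed work would go here ... */
    }
    return 0;
}
```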

MPC’s Hybrid Execution Model (Fully Hybrid)
Application with 1 MPI task per node

MPC’s Hybrid Execution Model (Fully Hybrid)
Initialization of OpenMP regions (on the whole node)

MPC’s Hybrid Execution Model (Fully Hybrid)
Entering OpenMP parallel region w/ 6 threads

MPC’s Hybrid Execution Model (Simple Mixed)
2 MPI tasks + OpenMP parallel region w/ 4 threads (on 2 cores)

Experimental Environment
Architecture
Dual-socket Quad-core Nehalem-EP machine
24 GB of memory, Linux 2.6.31 kernel
Programming model implementations
MPI: MPC, IntelMPI 3.2.1, MPICH2 1.1, OpenMPI 1.3.3
OpenMP: MPC, ICC 11.1, GCC 4.3.0 and 4.4.0, SunCC 5.1
Best option combination:
OpenMP thread pinning (KMP_AFFINITY, GOMP_CPU_AFFINITY)
OpenMP wait policy (OMP_WAIT_POLICY, SUN_MP_THR_IDLE=spin)
MPI task placement (I_MPI_PIN_DOMAIN=omp)
Benchmarks
EPCC suite (Pure OpenMP/Fully Hybrid) [Bull et al. 01]
Microbenchmarks for mixed-mode OpenMP/MPI [Bull et al. IWOMP’09]

EPCC: OpenMP Parallel-Region Overhead
[Chart: execution time (us, 0-50) vs. number of threads (1, 2, 4, 8); implementations: MPC, ICC 11.1, GCC 4.3.0, GCC 4.4.0, SunCC 5.1]
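For reference, the EPCC suite estimates this overhead by timing many repetitions of a parallel region around a fixed delay and subtracting the cost of the same delay run sequentially. Below is a simplified sketch of that measurement principle (the real benchmark adds outer repetitions and statistics; the constants here are arbitrary).

```c
/* Simplified sketch of the EPCC parallel-region overhead measurement:
 * overhead ~ (time of REPS parallel regions - time of REPS sequential
 * delays) / REPS. */
#include <omp.h>
#include <stdio.h>

#define REPS 10000

static void delay(int iters)
{
    volatile double a = 0.0;
    for (int i = 0; i < iters; i++)
        a += i * 0.5;
}

int main(void)
{
    int iters = 1000;

    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++)
        delay(iters);               /* sequential reference */
    double ref = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel
        delay(iters);               /* same work replicated on every thread */
    }
    double par = omp_get_wtime() - t0;

    printf("parallel-region overhead: %.3f us\n",
           1e6 * (par - ref) / REPS);
    return 0;
}
```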

EPCC: OpenMP Parallel-Region Overhead (cont.)
[Chart: execution time (us, 0-5) vs. number of threads (1, 2, 4, 8); implementations: MPC, ICC 11.1, GCC 4.4.0, SunCC 5.1]

EPCC: OpenMP Parallel-Region Overhead (cont.)
[Chart: execution time (us, 0-350) vs. number of threads (8, 16, 32, 64); implementations: MPC, ICC 11.1, GCC 4.4.0, SunCC 5.1]

Hybrid Funneled Ping-Pong (1KB)
[Chart: ratio (log scale, 1-1000) vs. number of OpenMP threads (2, 4, 8); implementations: MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1, OpenMPI/GCC 4.4.0, OpenMPI/ICC 11.1]
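In the funneled ping-pong microbenchmark, all OpenMP threads write and then read the message buffer, but only the master thread calls MPI. The sketch below illustrates one ping-pong step under that pattern (it assumes MPI was initialized with at least MPI_THREAD_FUNNELED; the function name and buffer handling are illustrative, not the benchmark's actual code).

```c
/* Funneled ping-pong step (illustrative): every thread touches the buffer,
 * only the master thread communicates (MPI_THREAD_FUNNELED). */
#include <mpi.h>
#include <omp.h>

#define NBYTES 1024   /* 1 KB message, as in the slide */

void funneled_pingpong_step(char *buf, int rank, int peer)
{
    #pragma omp parallel
    {
        /* every thread writes its part of the outgoing message */
        #pragma omp for
        for (int i = 0; i < NBYTES; i++)
            buf[i] = (char)(i + rank);

        #pragma omp master
        {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, NBYTES, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            }
        }
        #pragma omp barrier   /* master has no implicit barrier */

        /* every thread reads its part of the received message */
        #pragma omp for
        for (int i = 0; i < NBYTES; i++)
            buf[i] += 1;
    }
}
```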

Hybrid Multiple Ping-Pong (1KB)
[Chart: ratio (0-20) vs. number of OpenMP threads (2, 4); implementations: MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1]
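In the multiple ping-pong variant, each OpenMP thread runs its own concurrent ping-pong with the partner MPI task, so the MPI library must provide MPI_THREAD_MULTIPLE and match messages per thread, here by using the thread id as the tag. A hedged sketch with illustrative names:

```c
/* Multiple ping-pong step (illustrative): one independent ping-pong per
 * OpenMP thread, messages matched by thread id used as the MPI tag.
 * Requires MPI_Init_thread(..., MPI_THREAD_MULTIPLE, ...). */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define NBYTES 1024   /* 1 KB per thread, as in the slide */

void multiple_pingpong_step(int rank, int peer)
{
    #pragma omp parallel
    {
        int tag = omp_get_thread_num();      /* one tag per thread pair */
        char *buf = calloc(NBYTES, 1);

        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_CHAR, peer, tag, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, NBYTES, MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_CHAR, peer, tag, MPI_COMM_WORLD);
        }
        free(buf);
    }
}
```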

Hybrid Multiple Ping-Pong (1KB) (cont.)
[Chart: ratio (0-60) vs. number of OpenMP threads (2, 4, 8); implementations: MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1]

Hybrid Multiple Ping-Pong (1MB)
[Chart: ratio (0-3.5) vs. number of OpenMP threads (2, 4, 8); implementations: MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1]

Alternating (MPI Tasks Waiting)
[Chart: ratio (0-9) vs. number of OpenMP threads (2, 4, 8, 16); implementations: MPC, Intel MPI, MPICH2 1.1/GCC 4.4.0, MPICH2 1.1/ICC 11.1, OpenMPI/GCC 4.4.0]

Conclusion
Mixing MPI and OpenMP is a promising solution for next-generation computer architectures
How to avoid large overheads?
Contributions
Taxonomy of hybrid approaches
MPC: a framework unifying both programming models
Lower hybrid overhead
Fully compliant MPI 1.3 and OpenMP 2.5 (with patched GCC)
Freely available at http://mpc.sourceforge.net (version 2.0)
Contact: [email protected] or [email protected]
Future Work
Optimization of OpenMP runtime (e.g., NUMA barrier)
OpenMP 3.0 (tasks)
Thread/data affinity (thread placement, data locality)
Tests on large applications