Page 1: Enabling Low-Overhead Hybrid MPI/OpenMP Parallelism with MPC · Patrick Carribault, Marc Pérache · IWOMP'2010, June 15th 2010


Enabling Low-Overhead Hybrid MPI/OpenMP Parallelism with MPC

Patrick Carribault, Marc Pérache and Hervé Jourdren

CEA, DAM, DIF, F-91297 Arpajon France

Page 2: Introduction/Context

HPC architecture: Petaflop/s era
- Multicore processors as basic blocks
- Clusters of ccNUMA nodes

Parallel programming models
- MPI: distributed-memory model
- OpenMP: shared-memory model

Hybrid MPI/OpenMP (or mixed-mode programming)
- Promising solution (benefit from both models for data parallelism)
- How to hybridize an application?

Contributions
- Approaches for hybrid programming
- Unified MPI/OpenMP framework (MPC) for lower hybrid overhead

Page 3: Outline

- Introduction/Context
- Hybrid MPI/OpenMP Programming
  - Overview
  - Extended taxonomy
- MPC Framework
  - OpenMP runtime implementation
  - Hybrid optimization
- Experimental Results
  - OpenMP performance
  - Hybrid performance
- Conclusion & Future Work

Page 4: Hybrid MPI/OpenMP Programming Overview

MPI (Message Passing Interface)

Inter-node communication

Implicit locality

Unnecessary data duplication and shared-memory transfers within a node

OpenMP

Fully exploit shared-memory data parallelism

No inter-node standard

No data-locality standard (ccNUMA node)

Hybrid Programming

Mix MPI and OpenMP inside an application

Benefit from Pure-MPI and Pure-OpenMP modes

Page 5: Hybrid MPI/OpenMP Programming Approaches

Traditional approaches: exploit one core with one execution flow
- E.g., MPI for inter-node communication, OpenMP otherwise
- E.g., multicore CPU socket exploitation with OpenMP

Oversubscribing approaches: exploit one core with several execution flows
- Load balancing on the whole node
- Adaptive behavior between parallel regions

Mixing depth
- Communications outside parallel regions: network bandwidth saturation
- Communications inside parallel regions: MPI thread-safety

Extended taxonomy from [Hager09]

Page 6: Hybrid MPI/OpenMP Extended Taxonomy

Page 7: MPC Framework

User-level thread library [EuroPar'08]: Pthreads API, debugging with GDB [MTAAP'2010]

Thread-based MPI [EuroPVM/MPI'09]: MPI 1.3 compliant, optimized to save memory

NUMA-aware memory allocator (for multithreaded applications)

Contribution: hybrid representation inside MPC
- Implementation of OpenMP runtime (2.5 compliant)
- Compiler part w/ patched GCC (4.3.X and 4.4.X)
- Optimizations for hybrid applications

- Efficient oversubscribed OpenMP (more threads than cores)
- Unified representation of MPI tasks and OpenMP threads
- Scheduler-integrated polling methods
- Message-buffer privatization and parallel message reception

Page 8: MPC's Hybrid Execution Model (Fully Hybrid)

Application with 1 MPI task per node

Page 9: MPC's Hybrid Execution Model (Fully Hybrid)

Initialization of OpenMP regions (on the whole node)

Page 10: MPC's Hybrid Execution Model (Fully Hybrid)

Entering OpenMP parallel region w/ 6 threads

Page 11: MPC's Hybrid Execution Model (Simple Mixed)

2 MPI tasks + OpenMP parallel region w/ 4 threads (on 2 cores)

Page 12: Experimental Environment

Architecture

Dual-socket Quad-core Nehalem-EP machine

24 GB of memory, Linux 2.6.31 kernel

Programming model implementations

MPI: MPC, IntelMPI 3.2.1, MPICH2 1.1, OpenMPI 1.3.3

OpenMP: MPC, ICC 11.1, GCC 4.3.0 and 4.4.0, SunCC 5.1

Best option combination
- OpenMP thread pinning (KMP_AFFINITY, GOMP_CPU_AFFINITY)
- OpenMP wait policy (OMP_WAIT_POLICY, SUN_MP_THR_IDLE=spin)
- MPI task placement (I_MPI_PIN_DOMAIN=omp)
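The option combinations above translate into environment settings such as the following. Only the variable names come from the slide; the specific values (`compact`, `0-7`) are illustrative and would be tuned per machine:

```shell
# Pin OpenMP threads (Intel and GNU runtimes)
export KMP_AFFINITY=compact       # ICC
export GOMP_CPU_AFFINITY=0-7      # GCC, one entry per core
# Keep waiting threads spinning instead of sleeping
export OMP_WAIT_POLICY=active
export SUN_MP_THR_IDLE=spin       # SunCC
# Place one MPI task per OpenMP domain (Intel MPI)
export I_MPI_PIN_DOMAIN=omp
```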

Benchmarks

EPCC suite (Pure OpenMP/Fully Hybrid) [Bull et al. 01]

Microbenchmarks for mixed-mode OpenMP/MPI [Bull et al. IWOMP’09]

Page 13: EPCC: OpenMP Parallel-Region Overhead

[Chart: execution time (µs, 0–50) vs. number of threads (1, 2, 4, 8) for MPC, ICC 11.1, GCC 4.3.0, GCC 4.4.0, SunCC 5.1]

Page 14: EPCC: OpenMP Parallel-Region Overhead (cont.)

[Chart: execution time (µs, 0–5) vs. number of threads (1, 2, 4, 8) for MPC, ICC 11.1, GCC 4.4.0, SunCC 5.1]

Page 15: EPCC: OpenMP Parallel-Region Overhead (cont.)

[Chart: execution time (µs, 0–350) vs. number of threads (8, 16, 32, 64) for MPC, ICC 11.1, GCC 4.4.0, SunCC 5.1]

Page 16: Hybrid Funneled Ping-Pong (1KB)

[Chart: ratio (1–1000) vs. number of OpenMP threads (2, 4, 8) for MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1, OpenMPI/GCC 4.4.0, OpenMPI/ICC 11.1]

Page 17: Hybrid Multiple Ping-Pong (1KB)

[Chart: ratio (0–20) vs. number of OpenMP threads (2, 4) for MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1]

Page 18: Hybrid Multiple Ping-Pong (1KB) (cont.)

[Chart: ratio (0–60) vs. number of OpenMP threads (2, 4, 8) for MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1]

Page 19: Hybrid Multiple Ping-Pong (1MB)

[Chart: ratio (0–3.5) vs. number of OpenMP threads (2, 4, 8) for MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1]

Page 20: Alternating (MPI Tasks Waiting)

[Chart: ratio (0–9) vs. number of OpenMP threads (2, 4, 8, 16) for MPC, Intel MPI, MPICH2 1.1/GCC 4.4.0, MPICH2 1.1/ICC 11.1, OpenMPI/GCC 4.4.0]

Page 21: Conclusion

Mixing MPI and OpenMP is a promising solution for next-generation computer architectures, but how can large overheads be avoided?

Contributions

Taxonomy of hybrid approaches

MPC: a framework unifying both programming models

Lower hybrid overhead

Fully compliant with MPI 1.3 and OpenMP 2.5 (with a patched GCC)

Freely available at http://mpc.sourceforge.net (version 2.0)
Contact: [email protected] or [email protected]

Future Work

Optimization of OpenMP runtime (e.g., NUMA barrier)

OpenMP 3.0 (tasks)

Thread/data affinity (thread placement, data locality)

Tests on large applications

