
  • Enabling Low-Overhead Hybrid MPI/OpenMP Parallelism with MPC

    Patrick Carribault, Marc Pérache and Hervé Jourdren

    CEA, DAM, DIF, F-91297 Arpajon, France

  • Introduction/Context


    HPC Architecture: Petaflop/s Era
      Multicore processors as basic blocks
      Clusters of ccNUMA nodes

    Parallel programming models
      MPI: distributed-memory model
      OpenMP: shared-memory model

    Hybrid MPI/OpenMP (or mixed-mode programming)
      Promising solution (benefit from both models for data parallelism)
      How to hybridize an application?

    Contributions
      Approaches for hybrid programming
      Unified MPI/OpenMP framework (MPC) for lower hybrid overhead

  • Outline


    Introduction/Context

    Hybrid MPI/OpenMP Programming
      Overview
      Extended taxonomy

    MPC Framework
      OpenMP runtime implementation
      Hybrid optimization

    Experimental Results
      OpenMP performance
      Hybrid performance

    Conclusion & Future Work

  • Hybrid MPI/OpenMP Programming Overview


    MPI (Message Passing Interface)

    Inter-node communication

    Implicit locality

    Unnecessary data duplication and shared-memory transfers

    OpenMP

    Fully exploit shared-memory data parallelism

    No inter-node standard

    No data-locality standard (ccNUMA node)

    Hybrid Programming

    Mix MPI and OpenMP inside an application

    Benefit from Pure-MPI and Pure-OpenMP modes
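
    To make the mixed mode concrete, here is a minimal hybrid sketch (written for this summary, not taken from the talk): an OpenMP loop exploits the shared-memory cores of each node while MPI performs the inter-node reduction, and MPI_THREAD_FUNNELED is requested because only the master thread calls MPI.

```c
/* Minimal hybrid MPI/OpenMP sketch (illustration only): OpenMP for the
 * intra-node data parallelism, MPI for the inter-node reduction. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    int provided, rank, size;
    static double a[N];

    /* FUNNELED: only the master thread will issue MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Shared-memory data parallelism inside the node. */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++) {
        a[i] = (double)(i + rank);
        local += a[i];
    }

    /* Distributed-memory reduction across nodes, outside the parallel region. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d MPI tasks = %g\n", size, global);

    MPI_Finalize();
    return 0;
}
```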

  • Hybrid MPI/OpenMP Programming Approaches


    Traditional Approaches
      Exploit one core with one execution flow
      E.g., MPI for inter-node communication, OpenMP otherwise
      E.g., socket-level exploitation of multicore CPUs with OpenMP

    Oversubscribing Approaches
      Exploit one core with several execution flows
      Load balancing on the whole node
      Adaptive behavior between parallel regions

    Mixing Depth
      Communications outside parallel regions: network bandwidth saturation
      Communications inside parallel regions: MPI thread safety
      (both placements are sketched in the code below)

    Extended Taxonomy from [Hager09]
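
    The two mixing depths map onto MPI's thread-support levels. A hedged fragment (function names and buffer layout are invented for illustration): communicating outside parallel regions needs no more than MPI_THREAD_FUNNELED, whereas letting every thread communicate requires MPI_THREAD_MULTIPLE, i.e. a fully thread-safe MPI.

```c
#include <mpi.h>
#include <omp.h>

/* Mixing depth "outside": compute in parallel, then the single master
 * thread communicates. MPI_THREAD_FUNNELED is sufficient. */
void exchange_outside(double *buf, int n, int peer)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        buf[i] *= 2.0;                      /* node-local computation */

    MPI_Sendrecv_replace(buf, n, MPI_DOUBLE, peer, 0, peer, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* Mixing depth "inside": every thread issues its own messages, which
 * requires MPI_THREAD_MULTIPLE. buf must hold nthreads * chunk doubles,
 * and the OpenMP thread id is used as the tag to pair threads. */
void exchange_inside(double *buf, int chunk, int peer)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        MPI_Sendrecv_replace(buf + tid * chunk, chunk, MPI_DOUBLE,
                             peer, tid, peer, tid,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
```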

  • Hybrid MPI/OpenMP Extended Taxonomy


  • MPC Framework


    User-level thread library [EuroPar’08]
      Pthreads API, debugging with GDB [MTAAP’2010]

    Thread-based MPI [EuroPVM/MPI’09]
      MPI 1.3 compliant
      Optimized to save memory

    NUMA-aware memory allocator (for multithreaded applications)

    Contribution: hybrid representation inside MPC
      Implementation of OpenMP runtime (2.5 compliant)
      Compiler part w/ patched GCC (4.3.X and 4.4.X)
      Optimizations for hybrid applications

    Efficient oversubscribed OpenMP (more threads than cores; see the sketch below)
    Unified representation of MPI tasks and OpenMP threads
    Scheduler-integrated polling methods
    Message-buffer privatization and parallel message reception
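
    Oversubscription itself can be expressed with standard OpenMP calls; what differs between runtimes is its cost, which MPC's user-level scheduler is designed to keep low. A small sketch (the 4x factor is an arbitrary choice for illustration):

```c
#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Ask for more OpenMP threads than physical cores; with user-level
     * threads (as in MPC) this oversubscription is meant to stay cheap. */
    int ncores = omp_get_num_procs();
    omp_set_num_threads(4 * ncores);        /* 4x oversubscription, arbitrary */

    #pragma omp parallel
    {
        #pragma omp single
        printf("%d threads running on %d cores\n",
               omp_get_num_threads(), ncores);
    }
    return 0;
}
```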

  • MPC’s Hybrid Execution Model (Fully Hybrid)


    Application with 1 MPI task per node

  • MPC’s Hybrid Execution Model (Fully Hybrid)


    Initialization of OpenMP regions (on the whole node)

  • MPC’s Hybrid Execution Model (Fully Hybrid)


    Entering OpenMP parallel region w/ 6 threads

  • MPC’s Hybrid Execution Model (Simple Mixed)


    2 MPI tasks + OpenMP parallel region w/ 4 threads (on 2 cores)

  • Experimental Environment


    Architecture

    Dual-socket Quad-core Nehalem-EP machine

    24 GB of memory, Linux 2.6.31 kernel

    Programming model implementations

    MPI: MPC, IntelMPI 3.2.1, MPICH2 1.1, OpenMPI 1.3.3

    OpenMP: MPC, ICC 11.1, GCC 4.3.0 and 4.4.0, SunCC 5.1

    Best option combination
      OpenMP thread pinning (KMP_AFFINITY, GOMP_CPU_AFFINITY)
      OpenMP wait policy (OMP_WAIT_POLICY, SUN_MP_THR_IDLE=spin)
      MPI task placement (I_MPI_PIN_DOMAIN=omp)

    Benchmarks

    EPCC suite (Pure OpenMP/Fully Hybrid) [Bull et al. 01]

    Microbenchmarks for mixed-mode OpenMP/MPI [Bull et al. IWOMP’09]

  • EPCC: OpenMP Parallel-Region Overhead

    [Chart: execution time (µs, 0 to 50) vs. number of threads (1, 2, 4, 8) for MPC, ICC 11.1, GCC 4.3.0, GCC 4.4.0 and SunCC 5.1]
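
    For reference, the EPCC syncbench methodology behind these numbers times many repetitions of a nearly empty parallel region and subtracts a sequential reference. A simplified sketch of that measurement (repetition count and delay length are arbitrary choices here):

```c
#include <omp.h>
#include <stdio.h>

#define REPS 100000

/* A few instructions of busy work; volatile keeps it from being optimized away. */
static void delay(void)
{
    static volatile int dummy;
    for (int i = 0; i < 100; i++)
        dummy = i;
}

int main(void)
{
    /* Reference: the delay alone, executed sequentially. */
    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++)
        delay();
    double ref = omp_get_wtime() - t0;

    /* Test: the same delay, but each repetition opens a parallel region
     * (every thread runs the delay, as in the EPCC benchmark). */
    t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel
        delay();
    }
    double par = omp_get_wtime() - t0;

    /* Overhead estimate: extra time per parallel region, in microseconds. */
    printf("parallel-region overhead: %.3f us\n", 1e6 * (par - ref) / REPS);
    return 0;
}
```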

  • EPCC: OpenMP Parallel-Region Overhead (cont.)

    [Chart: execution time (µs, 0 to 5) vs. number of threads (1, 2, 4, 8) for MPC, ICC 11.1, GCC 4.4.0 and SunCC 5.1]

  • EPCC: OpenMP Parallel-Region Overhead (cont.)

    [Chart: execution time (µs, 0 to 350) vs. number of threads (8, 16, 32, 64) for MPC, ICC 11.1, GCC 4.4.0 and SunCC 5.1]

  • Hybrid Funneled Ping-Pong (1KB)

    [Chart: ratio (1 to 1000) vs. number of OpenMP threads (2, 4, 8) for MPC, IntelMPI, MPICH2/GCC 4.4.0, MPICH2/ICC 11.1, OpenMPI/GCC 4.4.0 and OpenMPI/ICC 11.1]
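
    The funneled ping-pong pattern behind this chart, reconstructed as a hedged sketch (buffer size, repetition count and function name are placeholders): ranks 0 and 1 exchange a 1 KB buffer from inside an OpenMP parallel region, only the master thread calls MPI, and all threads write the outgoing data and read the incoming data.

```c
#include <mpi.h>
#include <omp.h>

#define COUNT 128   /* 128 doubles = 1 KB */

/* Funneled ping-pong sketch: requires at least MPI_THREAD_FUNNELED. */
void funneled_pingpong(int rank, int reps)
{
    double buf[COUNT];

    #pragma omp parallel
    {
        for (int r = 0; r < reps; r++) {
            /* All threads touch the outgoing data. */
            #pragma omp for
            for (int i = 0; i < COUNT; i++)
                buf[i] = rank + r + i;

            /* Only the master thread communicates. */
            #pragma omp master
            {
                if (rank == 0) {
                    MPI_Send(buf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
                }
            }
            #pragma omp barrier   /* make the received data visible to all */

            /* All threads read the incoming data. */
            #pragma omp for
            for (int i = 0; i < COUNT; i++)
                buf[i] += 1.0;
        }
    }
}
```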

  • Hybrid Multiple Ping-Pong (1KB)

    [Chart: ratio (0 to 20) vs. number of OpenMP threads (2, 4) for MPC, IntelMPI, MPICH2/GCC 4.4.0 and MPICH2/ICC 11.1]
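
    The multiple ping-pong differs in that every OpenMP thread runs its own exchange with the matching thread of the peer MPI task, which requires MPI_THREAD_MULTIPLE. A sketch under that assumption (both tasks must spawn the same number of threads; the thread id doubles as the message tag so the pairs do not mix):

```c
#include <mpi.h>
#include <omp.h>

#define COUNT 128   /* 128 doubles = 1 KB per thread */

/* Multiple ping-pong sketch: requires MPI_Init_thread with
 * MPI_THREAD_MULTIPLE so that all threads may call MPI concurrently. */
void multiple_pingpong(int rank, int reps)
{
    #pragma omp parallel
    {
        double buf[COUNT] = {0};
        int tag = omp_get_thread_num();   /* pairs thread i with thread i */

        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, COUNT, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
                MPI_Recv(buf, COUNT, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, COUNT, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, COUNT, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
            }
        }
    }
}
```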

  • Hybrid Multiple Ping-Pong (1KB) (cont.)

    [Chart: ratio (0 to 60) vs. number of OpenMP threads (2, 4, 8) for MPC, IntelMPI, MPICH2/GCC 4.4.0 and MPICH2/ICC 11.1]

  • Hybrid Multiple Ping-Pong (1MB)

    [Chart: ratio (0 to 3.5) vs. number of OpenMP threads (2, 4, 8) for MPC, IntelMPI, MPICH2/GCC 4.4.0 and MPICH2/ICC 11.1]

  • Alternating (MPI Tasks Waiting)

    [Chart: ratio (0 to 9) vs. number of OpenMP threads (2, 4, 8, 16) for MPC, Intel MPI, MPICH2 1.1/GCC 4.4.0, MPICH2 1.1/ICC 11.1 and OpenMPI/GCC 4.4.0]

  • Conclusion


    Mixing MPI+OpenMP is a promising solution for next-generation computer architectures
      How to avoid large overhead?

    Contributions

    Taxonomy of hybrid approaches

    MPC: a framework unifying both programming models

    Lower hybrid overhead

    Fully compliant MPI 1.3 and OpenMP 2.5 (with patched GCC)

    Freely available at http://mpc.sourceforge.net (version 2.0)
    Contact: [email protected] or [email protected]

    Future Work

    Optimization of OpenMP runtime (e.g., NUMA barrier)

    OpenMP 3.0 (tasks)

    Thread/data affinity (thread placement, data locality)

    Tests on large applications
