
High-Performance MPI and Deep Learning on OpenPOWER Platform

Talk at OpenPOWER SUMMIT (Mar ‘18)

by

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Hari Subramoni

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~subramon

OpenPOWER Summit (Mar ’18) 2Network Based Computing Laboratory

Increasing Usage of HPC, Big Data and Deep Learning

• HPC (MPI, RDMA, Lustre, etc.)
• Big Data (Hadoop, Spark, HBase, Memcached, etc.)
• Deep Learning (Caffe, TensorFlow, BigDL, etc.)

Convergence of HPC, Big Data, and Deep Learning!

Increasing need to run these applications on the Cloud!!

OpenPOWER Summit (Mar ’18) 3Network Based Computing Laboratory

Parallel Programming Models Overview

(Figure: three abstract machine models – shared memory, distributed memory, and logically shared memory layered over distributed memories.)

• Shared Memory Model – SHMEM, DSM
• Distributed Memory Model – MPI (Message Passing Interface); a minimal example follows below
• Partitioned Global Address Space (PGAS) – Global Arrays, UPC, Chapel, X10, CAF, …

• Programming models provide abstract machine models
• Models can be mapped on different types of systems – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and Hybrid MPI+PGAS models are gradually gaining importance
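As a minimal illustration of the message-passing model above, the sketch below moves one integer from rank 0 to rank 1 with standard MPI calls; the variable names and tag value are only illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* Rank 0 owns the data and sends it explicitly */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 has its own private memory and must post a receive */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}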

OpenPOWER Summit (Mar ’18) 4Network Based Computing Laboratory

Partitioned Global Address Space (PGAS) Models

• Key features
  – Simple shared memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages
    • Unified Parallel C (UPC)
    • Co-Array Fortran (CAF)
    • X10
    • Chapel
  – Libraries (see the OpenSHMEM sketch below)
    • OpenSHMEM
    • UPC++
    • Global Arrays
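As a deliberately minimal illustration of the one-sided PGAS style listed above, the OpenSHMEM sketch below lets PE 0 write directly into a symmetric variable on PE 1 without the target posting a receive; the variable names are illustrative.

#include <shmem.h>
#include <stdio.h>

int main(void) {
    static int dest = 0;          /* symmetric: same address on every PE */
    int src = 42;
    shmem_init();
    if (shmem_my_pe() == 0)
        shmem_int_put(&dest, &src, 1, 1);   /* one-sided write into PE 1 */
    shmem_barrier_all();                    /* make the put visible everywhere */
    if (shmem_my_pe() == 1)
        printf("PE 1 sees %d\n", dest);
    shmem_finalize();
    return 0;
}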

OpenPOWER Summit (Mar ’18) 5Network Based Computing Laboratory

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics

• Benefits:
  – Best of Distributed Computing Model
  – Best of Shared Memory Computing Model

(Figure: an HPC application whose sub-kernels – Kernel 1, Kernel 2, Kernel 3, …, Kernel N – are each implemented in either MPI or PGAS according to their communication characteristics; a hybrid sketch follows below.)
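A minimal hybrid sketch of this idea, assuming an MVAPICH2-X-style unified runtime in which MPI and OpenSHMEM can be initialized and used in the same program (the kernels are placeholders):

#include <mpi.h>
#include <shmem.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    shmem_init();                      /* both models share one runtime */

    /* Kernel with bulk-synchronous exchange -> MPI collective */
    int local = shmem_my_pe(), sum = 0;
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* Kernel with irregular, fine-grained updates -> one-sided OpenSHMEM */
    static long counter = 0;           /* symmetric variable */
    shmem_long_inc(&counter, 0);       /* every PE increments PE 0's counter */
    shmem_barrier_all();

    shmem_finalize();
    MPI_Finalize();
    return 0;
}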

OpenPOWER Summit (Mar ’18) 6Network Based Computing Laboratory

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

• Application Kernels/Applications
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, OpenACC, Cilk, Hadoop (MapReduce), Spark (RDD, DAG), etc.
• Communication Library or Runtime for Programming Models: Point-to-point Communication, Collective Communication, Energy-Awareness, Synchronization and Locks, I/O and File Systems, Fault Tolerance
• Networking Technologies (InfiniBand, 40/100GigE, Aries, and Omni-Path)
• Multi-/Many-core Architectures
• Accelerators (GPU and FPGA)

Middleware co-design offers opportunities and challenges across these layers for performance, scalability, and resilience.

OpenPOWER Summit (Mar ’18) 7Network Based Computing Laboratory

• Message Passing Interface (MPI) Support

• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)

• Exploiting Accelerators (NVIDIA GPGPUs)

• Support for Deep Learning

MPI, PGAS and Deep Learning Support for OpenPOWER

Support for Big Data (Hadoop and Spark) on OpenPOWER (Talk at 2:15 pm)

OpenPOWER Summit (Mar ’18) 8Network Based Computing Laboratory

Overview of the MVAPICH2 Project

• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015

– Used by more than 2,875 organizations in 85 countries

– More than 455,000 (> 0.45 million) downloads from the OSU site directly

– Empowering many TOP500 clusters (Nov ‘17 ranking)

• 1st, 10,649,600-core (Sunway TaihuLight) at National Supercomputing Center in Wuxi, China

• 9th, 556,104 cores (Oakforest-PACS) in Japan

• 12th, 368,928-core (Stampede2) at TACC

• 17th, 241,108-core (Pleiades) at NASA

• 48th, 76,032-core (Tsubame 2.5) at Tokyo Institute of Technology

– Available with software stacks of many vendors, Linux Distros (RedHat and SuSE), and OpenHPC

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade

OpenPOWER Summit (Mar ’18) 9Network Based Computing Laboratory

MVAPICH2 Software Family

High-Performance Parallel Programming Libraries

MVAPICH2 Support for InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE

MVAPICH2-X Advanced MPI features, OSU INAM, PGAS (OpenSHMEM, UPC, UPC++, and CAF), and MPI+PGAS programming models with unified communication runtime

MVAPICH2-GDR Optimized MPI for clusters with NVIDIA GPUs

MVAPICH2-Virt High-performance and scalable MPI for hypervisor and container based HPC cloud

MVAPICH2-EA Energy aware and High-performance MPI

MVAPICH2-MIC Optimized MPI for clusters with Intel KNC (being updated with KNL-based designs)

Microbenchmarks

OMB Microbenchmarks suite to evaluate MPI and PGAS (OpenSHMEM, UPC, and UPC++) libraries for CPUs and GPUs

Tools

OSU INAM Network monitoring, profiling, and analysis for clusters with MPI and scheduler integration

OEMT Utility to measure the energy consumption of MPI applications

OpenPOWER Summit (Mar ’18) 10Network Based Computing Laboratory

MVAPICH2 Release Timeline and Downloads

(Figure: cumulative number of downloads from the OSU site, Sep-04 through Jan-18, climbing past 450,000, annotated with release milestones from MV 0.9.4 and MV2 0.9.0 through MV2 2.3rc1, MV2-X 2.3b, MV2-GDR 2.3a, MV2-Virt 2.2, and OSU INAM 0.9.3.)

OpenPOWER Summit (Mar ’18) 11Network Based Computing Laboratory

Architecture of MVAPICH2 Software Family

• High Performance Parallel Programming Models: Message Passing Interface (MPI), PGAS (UPC, OpenSHMEM, CAF, UPC++), and Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High Performance and Scalable Communication Runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, job startup, and introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, Omni-Path)
  – Transport protocols: RC, XRC, UD, DC
  – Modern features: UMR, ODP, SR-IOV, multi-rail
• Support for modern multi-/many-core architectures (Intel Xeon, OpenPOWER, Xeon-Phi, ARM, NVIDIA GPGPU)
  – Transport mechanisms: shared memory, CMA, IVSHMEM, XPMEM*
  – Modern features: MCDRAM*, NVLink, CAPI*

* Upcoming

OpenPOWER Summit (Mar ’18) 12Network Based Computing Laboratory

OpenPOWER Platform Support in MVAPICH2 Libraries

• MVAPICH2 – Basic MPI support
  – Since MVAPICH2 2.2rc1 (March 2016)
• MVAPICH2-X – PGAS (OpenSHMEM and UPC) and Hybrid MPI+PGAS support
  – Since MVAPICH2-X 2.2b (March 2016)
  – Advanced collective support with CMA since MVAPICH2-X 2.3b (Oct 2017)
• MVAPICH2-GDR – NVIDIA GPGPU support with NVLink (CORAL-like systems)
  – Since MVAPICH2-GDR 2.3a (Nov 2017)

OpenPOWER Summit (Mar ’18) 13Network Based Computing Laboratory

• Message Passing Interface (MPI) Support
  – Point-to-point Inter-node and Intra-node

– CMA-based Collectives

– Upcoming XPMEM Support

• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)

• Exploiting Accelerators (NVIDIA GPGPUs)

• Support for Deep Learning

MPI, PGAS and Deep Learning Support for OpenPOWER

OpenPOWER Summit (Mar ’18) 14Network Based Computing Laboratory

Intra-node Point-to-Point Performance on OpenPOWER

Platform: Two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA

(Figure: intra-socket small-message latency, intra-socket large-message latency, intra-socket bandwidth, and intra-socket bi-directional bandwidth vs. message size, comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0; MVAPICH2 achieves 0.30 us small-message latency.)

OpenPOWER Summit (Mar ’18) 15Network Based Computing Laboratory

Inter-node Point-to-Point Performance on OpenPOWER

Platform: Two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA

(Figure: small-message latency, large-message latency, bandwidth, and bi-directional bandwidth vs. message size, comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0.)

OpenPOWER Summit (Mar ’18) 16Network Based Computing Laboratory

Multi-Pair Latency Comparison

• Basic point-to-point multi-pair latency comparison
  – Up to 48% and 45% performance improvement over Spectrum MPI and OpenMPI for intra-node and inter-node, respectively

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

(Figure: intra-node (PPN=20) and inter-node (Nodes=2, PPN=20) multi-pair latency vs. message size for MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM; annotated gains range from 32% to 48%.)

OpenPOWER Summit (Mar ’18) 17Network Based Computing Laboratory

Multi-Pair Bandwidth Comparison

• Basic point-to-point multi-pair bandwidth comparison
  – Up to 83% and 62% performance improvement over Spectrum MPI and OpenMPI for intra-node and inter-node, respectively

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

(Figure: intra-node (PPN=20) and inter-node (Nodes=2, PPN=20) multi-pair bandwidth vs. message size for MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM; annotated gains range from 1.2X to 1.8X.)

OpenPOWER Summit (Mar ’18) 18Network Based Computing Laboratory

Process Mapping Support in MVAPICH2 for Multi-Core Platforms

• Process-mapping support in MVAPICH2 (available since v1.4)
• MVAPICH2 detects the processor architecture at job launch
• Preset binding policies (MPI rank-to-core binding)
  – Policy: bunch (default), scatter, hybrid
  – Granularity: core (default), socket, numanode
• User-defined binding is also supported

OpenPOWER Summit (Mar ’18) 19Network Based Computing Laboratory

Process and Thread Binding Policies in Hybrid MPI+Threads

• A new process binding policy – “hybrid”
  – MV2_CPU_BINDING_POLICY = hybrid
• A new environment variable for co-locating threads with MPI processes
  – MV2_THREADS_PER_PROCESS = k
  – Automatically set to OMP_NUM_THREADS if OpenMP is being used
  – Provides a hint to the MPI runtime to spare resources for application threads
• New variable for thread binding with respect to the parent process and architecture (see the example launch line below)
  – MV2_THREADS_BINDING_POLICY = {linear | compact | spread}
  – Linear – binds MPI ranks and OpenMP threads sequentially (one after the other); recommended on non-hyperthreaded systems with MPI+OpenMP
  – Compact – binds each MPI rank to a physical core and places its OpenMP threads on the hardware threads; recommended on multi-/many-core and hyper-threaded systems, e.g., KNL, POWER8, and hyper-threaded Xeon
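A hypothetical launch line combining these knobs, written in the same style as the commands on the next slide and assuming 2 nodes, 4 MPI ranks per node, and 5 OpenMP threads per rank on 20-core POWER8 nodes:

$ mpirun_rsh -np 8 -hostfile hosts OMP_NUM_THREADS=5 MV2_CPU_BINDING_POLICY=hybrid MV2_THREADS_PER_PROCESS=5 MV2_THREADS_BINDING_POLICY=compact ./a.out

Here “compact” keeps each rank’s OpenMP threads on the hardware threads of its own cores, which is the recommendation above for hyper-threaded POWER8.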

OpenPOWER Summit (Mar ’18) 20Network Based Computing Laboratory

User-Defined Process Mapping

• User has complete control over process mapping
• To run 4 processes on cores 0, 1, 4, 5:
  – $ mpirun_rsh -np 4 -hostfile hosts MV2_CPU_MAPPING=0:1:4:5 ./a.out
• Use ‘,’ or ‘-’ to bind to a set of cores:
  – $ mpirun_rsh -np 64 -hostfile hosts MV2_CPU_MAPPING=0,2-4:1:5:6 ./a.out
• Is process binding working as expected?
  – MV2_SHOW_CPU_BINDING=1
    • Displays CPU binding information
    • Launcher independent
    • Example output with MV2_SHOW_CPU_BINDING=1 MV2_CPU_BINDING_POLICY=scatter:
      -------------CPU AFFINITY-------------
      RANK:0 CPU_SET: 0
      RANK:1 CPU_SET: 8
• Refer to the “Running with Efficient CPU (Core) Mapping” section of the MVAPICH2 user guide for more information
  – http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3rc1-userguide.html#x1-600006.5

OpenPOWER Summit (Mar ’18) 21Network Based Computing Laboratory

• Message Passing Interface (MPI) Support
  – Point-to-point Inter-node and Intra-node

– CMA-based Collectives

– Upcoming XPMEM Support

• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)

• Exploiting Accelerators (NVIDIA GPGPUs)

• Support for Deep Learning

MPI, PGAS and Deep Learning Support for OpenPOWER

OpenPOWER Summit (Mar ’18) 22Network Based Computing Laboratory

Scalable Collectives with CMA (Intra-node Reduce & Alltoall)

• Up to 5X and 3X performance improvement by MVAPICH2-X 2.3b for small and large messages, respectively

(Figure: MPI_Reduce and MPI_Alltoall latency on one node (PPN=20) for small (4B–4KB) and large (8KB–1MB) messages, comparing MVAPICH2-X-2.3b, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0; annotated gains range from 1.2X to 5.2X.)

OpenPOWER Summit (Mar ’18) 23Network Based Computing Laboratory

Scalable Collectives with CMA (Intra-node Gather & Scatter)

• Up to 24X and 15X performance improvement by MVAPICH2-X 2.3b for medium to large messages, respectively

(Figure: MPI_Gather and MPI_Scatter latency on one node (PPN=20) for small (4B–4KB) and large (8KB–1MB) messages, comparing MVAPICH2-X-2.3b, SpectrumMPI-10.1.0.2, and OpenMPI-3.0.0; annotated gains range from 1.5X to 24.5X.)

OpenPOWER Summit (Mar ’18) 24Network Based Computing Laboratory

Scalable Collectives with CMA (Multi-node Reduce & Alltoall)

• Up to 12.4X and 8.5X performance improvement by MVAPICH2-X 2.3b for small and large messages, respectively

(Figure: MPI_Reduce and MPI_Alltoall latency on 4 nodes (PPN=20) for small (4B–4KB) and large (8KB–1MB) messages, comparing MVAPICH2-X-2.3b and OpenMPI-3.0.0.)

OpenPOWER Summit (Mar ’18) 25Network Based Computing Laboratory

Scalable Collectives with CMA (Multi-node Gather & Scatter)

• Up to 17.8X and 3.9X performance improvement by MVAPICH2-X 2.3b for medium to large messages, respectively

(Figure: MPI_Gather and MPI_Scatter latency on 4 nodes (PPN=20) for small (4B–4KB) and large (8KB–1MB) messages, comparing MVAPICH2-X-2.3b and OpenMPI-3.0.0.)

OpenPOWER Summit (Mar ’18) 26Network Based Computing Laboratory

• Message Passing Interface (MPI) Support
  – Point-to-point Inter-node and Intra-node

– CMA-based Collectives

– Upcoming XPMEM Support

• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)

• Exploiting Accelerators (NVIDIA GPGPUs)

• Support for Deep Learning

MPI, PGAS and Deep Learning Support for OpenPOWER

OpenPOWER Summit (Mar ’18) 27Network Based Computing Laboratory

Optimized MVAPICH2 All-Reduce with XPMEM

• Optimized MPI All-Reduce design in MVAPICH2
  – Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

(Figure: MPI_Allreduce latency for 16KB–128KB and 256KB–2MB messages, intra-node (Nodes=1, PPN=20) and inter-node (Nodes=2, PPN=20), comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM; annotated gains range from 34% to 4X.)

OpenPOWER Summit (Mar ’18) 28Network Based Computing Laboratory

Optimized MVAPICH2 All-Reduce with XPMEM (multi-node)

• Optimized MPI All-Reduce design in MVAPICH2
  – Up to 2X performance improvement over OpenMPI for inter-node (Spectrum MPI didn’t run for >2 processes)

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

(Figure: MPI_Allreduce latency for 16KB–128KB and 256KB–2MB messages on 3 nodes (PPN=20) and 4 nodes (PPN=20), comparing MVAPICH2-2.3rc1, OpenMPI-3.0.0, and MVAPICH2-XPMEM; annotated gains range from 11% to 2X.)

OpenPOWER Summit (Mar ’18) 29Network Based Computing Laboratory

Optimized MVAPICH2 Reduce with XPMEM

• Optimized MPI Reduce design in MVAPICH2
  – Up to 3.1X performance improvement over OpenMPI at scale and up to 37% over Spectrum MPI for intra-node

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

(Figure: MPI_Reduce latency for 16KB–256KB and 512KB–16MB messages, intra-node (Nodes=1, PPN=20) and inter-node (Nodes=2, PPN=20), comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM; annotated gains range from 26% to 5.6X.)

OpenPOWER Summit (Mar ’18) 30Network Based Computing Laboratory

Optimized MVAPICH2 Reduce with XPMEM (multi-node)

• Optimized MPI Reduce design in MVAPICH2
  – Up to 7X performance improvement over OpenMPI at scale and up to 25% over MVAPICH2-2.3rc1

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch

(Figure: MPI_Reduce latency for 16KB–256KB and 512KB–16MB messages on 3 nodes (PPN=20) and 4 nodes (PPN=20), comparing MVAPICH2-2.3rc1, SpectrumMPI-10.1.0, OpenMPI-3.0.0, and MVAPICH2-XPMEM; annotated gains range from 8% to 7X.)

OpenPOWER Summit (Mar ’18) 31Network Based Computing Laboratory

MiniAMR Performance using Optimized XPMEM-based Collectives

• MiniAMR application execution time comparing MVAPICH2-2.3rc1 and the optimized All-Reduce design
  – Up to 45% improvement over MVAPICH2-2.3rc1 in mesh-refinement time of the MiniAMR application for a weak-scaling workload on up to four POWER8 nodes

Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=scatter

(Figure: execution time (s) for 10, 20, 40, and 60 MPI processes (proc-per-sock=10, proc-per-node=20), MVAPICH2-2.3rc1 vs. MVAPICH2-XPMEM; improvements of 36–45% across the four process counts.)

OpenPOWER Summit (Mar ’18) 32Network Based Computing Laboratory

• Message Passing Interface (MPI) Support

• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)

• Exploiting Accelerators (NVIDIA GPGPUs)

• Support for Deep Learning

MPI, PGAS and Deep Learning Support for OpenPOWER

OpenPOWER Summit (Mar ’18) 33Network Based Computing Laboratory

MVAPICH2-X for Hybrid MPI + PGAS Applications

• Current Model – Separate runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI
  – Possible deadlock if both runtimes are not progressed
  – Consumes more network resources
• Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, and CAF
  – Available since 2012 (starting with MVAPICH2-X 1.9) – http://mvapich.cse.ohio-state.edu

OpenPOWER Summit (Mar ’18) 34Network Based Computing Laboratory

OpenSHMEM PUT/GET using MVAPICH2-X on OpenPOWER

• OpenSHMEM one-sided PUT/GET benchmarks using MVAPICH2-X 2.3b on an OpenPOWER cluster
  – Near-native performance benefits are observed on OpenPOWER

(Figure: shmem_put and shmem_get latency vs. message size, intra-node (PPN=2) and inter-node (PPN=1), for small and large messages.)

OpenPOWER Summit (Mar ’18) 35Network Based Computing Laboratory

UPC PUT/GET benchmarks using MVAPICH2-X on OpenPOWER

• UPC one-sided memput/memget benchmarks using MVAPICH2-X 2.3b on an OpenPOWER cluster
  – Near-native performance benefits are observed on OpenPOWER

(Figure: upc_putmem and upc_getmem latency vs. message size, intra-node (PPN=2) and inter-node (PPN=1), for small and large messages.)

OpenPOWER Summit (Mar ’18) 36Network Based Computing Laboratory

UPC++ PUT/GET benchmarks using MVAPICH2-X on OpenPOWER

• UPC++ one-sided async_copy put/get benchmarks using MVAPICH2-X 2.3b on an OpenPOWER cluster
  – Near-native performance benefits are observed on OpenPOWER

(Figure: upcxx_async_put and upcxx_async_get latency vs. message size, intra-node (PPN=2) and inter-node (PPN=1), for small and large messages.)

OpenPOWER Summit (Mar ’18) 37Network Based Computing Laboratory

• Message Passing Interface (MPI) Support

• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)

• Exploiting Accelerators (NVIDIA GPGPUs)
  – CUDA-aware MPI
  – Optimized GPUDirect RDMA (GDR) Support
  – Optimized MVAPICH2 for OpenPOWER

• Support for Deep Learning

MPI, PGAS and Deep Learning Support for OpenPOWER

OpenPOWER Summit (Mar ’18) 38Network Based Computing Laboratory

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

At Sender: MPI_Send(s_devbuf, size, …);
At Receiver: MPI_Recv(r_devbuf, size, …);
(the GPU data movement is handled inside MVAPICH2; see the sketch below)

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers

High Performance and High Productivity
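Expanded into a compilable sketch (buffer size and tag are illustrative), the sender/receiver pattern above looks like the following; the application passes device pointers straight to MPI and the library handles the data movement internally.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    const int count = 1 << 20;            /* 1M floats, illustrative */
    int rank;
    float *devbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&devbuf, count * sizeof(float));

    /* No explicit cudaMemcpy staging in application code:
       the device pointer goes directly into MPI_Send/MPI_Recv */
    if (rank == 0)
        MPI_Send(devbuf, count, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(devbuf, count, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}

With MVAPICH2-GDR, such a program is typically launched with MV2_USE_CUDA=1, as in the HOOMD-blue runtime settings shown later in this deck.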

OpenPOWER Summit (Mar ’18) 39Network Based Computing Laboratory

CUDA-Aware MPI: MVAPICH2-GDR 1.8–2.3 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host, and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication with multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
• Unified memory

OpenPOWER Summit (Mar ’18) 40Network Based Computing Laboratory

Optimized MVAPICH2-GDR Design

(Figure: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size, comparing MV2 (no GDR) and MV2-GDR 2.3a; GPUDirect RDMA brings small-message latency down to 1.88 us (11X better) and improves bandwidth and bi-bandwidth by roughly 9–10X.)

Platform: MVAPICH2-GDR-2.3a on an Intel Haswell (E5-2687W @ 3.10 GHz) node with 20 cores, an NVIDIA Volta V100 GPU, and a Mellanox Connect-X4 EDR HCA, with CUDA 9.0 and Mellanox OFED 4.0 with GPU-Direct-RDMA

OpenPOWER Summit (Mar ’18) 41Network Based Computing Laboratory

Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

(Figure: average time steps per second (TPS) for 4–32 processes with 64K and 256K particles, MV2 vs. MV2+GDR; up to 2X higher TPS with GDR in both cases.)

OpenPOWER Summit (Mar ’18) 42Network Based Computing Laboratory

Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

• 2X improvement on 32 GPU nodes
• 30% improvement on 96 GPU nodes (8 GPUs/node)

(Figure: normalized execution time on the Wilkes GPU cluster (4–32 GPUs) and the CSCS GPU cluster (16–96 GPUs) for the Default, Callback-based, and Event-based designs.)

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee , H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS’16

On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application

Cosmo model: http://www2.cosmo-model.org/content/tasks/operational/meteoSwiss/

OpenPOWER Summit (Mar ’18) 43Network Based Computing Laboratory

MVAPICH2-GDR: Performance on OpenPOWER (NVLink + Pascal)

Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and EDR InfiniBand interconnect

(Figure: intra-node latency (small and large messages), intra-node bandwidth, inter-node latency (small and large messages), and inter-node bandwidth vs. message size, showing intra-socket (NVLink) and inter-socket results.)

• Intra-node latency: 14.6 us (without GPUDirect RDMA)
• Intra-node bandwidth: 33.9 GB/sec (NVLink)
• Inter-node latency: 23.8 us (without GPUDirect RDMA)
• Inter-node bandwidth: 11.9 GB/sec (EDR)

Available in MVAPICH2-GDR 2.3a

OpenPOWER Summit (Mar ’18) 44Network Based Computing Laboratory

Compilation of Best Practices

• The MPI runtime has many parameters
• Tuning a set of parameters can help you extract higher performance
• We have compiled a list of such contributions through the MVAPICH website
  – http://mvapich.cse.ohio-state.edu/best_practices/
• List of applications (16 contributions so far)
  – Amber, Cloverleaf, GAPgeofem, HOOMD-blue, HPCG, LAMMPS, LU, Lulesh, MILC, Neuro, POP2, SMG2000, SPEC MPI, TERA_TF, UMR, and WRF2
• Soliciting additional contributions; send your results to mvapich-help at cse.ohio-state.edu
• We will link these results with credits to you

OpenPOWER Summit (Mar ’18) 45Network Based Computing Laboratory

• Message Passing Interface (MPI) Support

• Support for PGAS and MPI + PGAS (OpenSHMEM, UPC)

• Exploiting Accelerators (NVIDIA GPGPUs)

• Support for Deep Learning
  – New challenges for MPI Run-time

– Optimizing Large-message Collectives

– Co-designing DL Frameworks

MPI, PGAS and Deep Learning Support for OpenPOWER

OpenPOWER Summit (Mar ’18) 46Network Based Computing Laboratory

Deep Learning: New Challenges for MPI Runtimes

• Deep Learning frameworks are a different game altogether
  – Unusually large message sizes (on the order of megabytes)
  – Most communication based on GPU buffers
• Existing state of the art
  – cuDNN, cuBLAS, NCCL --> scale-up performance
  – CUDA-Aware MPI --> scale-out performance, for small and medium message sizes only!
• Proposed: can we co-design the MPI runtime (MVAPICH2-GDR) and the DL framework (Caffe) to achieve both?
  – Efficient overlap of computation and communication
  – Efficient large-message communication (reductions); a gradient-Allreduce sketch follows the reference below
  – What application co-designs are needed to exploit communication-runtime co-designs?

(Figure: scale-up vs. scale-out performance landscape – cuDNN, cuBLAS, and NCCL sit high on scale-up; MPI, gRPC, and Hadoop sit high on scale-out; the proposed co-designs target both.)

A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)
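To make the large-message reduction pattern concrete, here is a minimal sketch of one communication step of data-parallel training, assuming a CUDA-aware MPI library (such as MVAPICH2-GDR) that accepts GPU-resident buffers in collectives; the gradient buffer, its length, and the function name are illustrative.

#include <mpi.h>

/* d_grad points to gradients in GPU memory; count is the number of floats.
   Every rank contributes its local gradients and gets back the global sum. */
void allreduce_gradients(float *d_grad, int count)
{
    MPI_Allreduce(MPI_IN_PLACE, d_grad, count, MPI_FLOAT,
                  MPI_SUM, MPI_COMM_WORLD);
    /* Averaging (dividing by the number of ranks) would follow on the GPU. */
}

The same call works unchanged for host buffers; the co-design work above is about making this single call fast for multi-megabyte, GPU-resident gradients.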

OpenPOWER Summit (Mar ’18) 47Network Based Computing Laboratory

Large-Message Optimized Collectives for Deep Learning

• MV2-GDR provides optimized collectives for large message sizes
• Optimized Reduce, Allreduce, and Bcast
• Good scaling with a large number of GPUs
• Available since MVAPICH2-GDR 2.2GA

(Figure: latency of Reduce, Allreduce, and Bcast for 2–128 MB messages on 64–192 GPUs, and latency of 64 MB / 128 MB operations as the number of GPUs grows.)

OpenPOWER Summit (Mar ’18) 48Network Based Computing Laboratory

MVAPICH2: Allreduce Comparison with Baidu and OpenMPI

• Initial evaluation shows promising performance gains for MVAPICH2-GDR 2.3a*
• 8 GPUs (2 nodes): MVAPICH2-GDR vs. Baidu-allreduce and OpenMPI 3.0

(Figure: Allreduce latency vs. message size over three ranges – 4B–256KB, 512KB–4MB, and 8MB–512MB – for MVAPICH2, Baidu, and OpenMPI; annotations show MVAPICH2 up to ~10X and ~3.5X better than OpenMPI, MVAPICH2 ~20% better than Baidu, and OpenMPI ~4X slower than Baidu.)

*Available with MVAPICH2-GDR 2.3a and higher

Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and FDR InfiniBand interconnect

OpenPOWER Summit (Mar ’18) 49Network Based Computing Laboratory

MVAPICH2: Allreduce Comparison with Baidu and OpenMPI (16 GPUs)

• 16 GPUs (4 nodes): MVAPICH2-GDR vs. Baidu-allreduce and OpenMPI 3.0

(Figure: Allreduce latency vs. message size over three ranges – 4B–256KB, 512KB–4MB, and 8MB–512MB – for MVAPICH2, Baidu, and OpenMPI; annotations show MVAPICH2 up to ~30X, ~10X, and ~4X better than OpenMPI, MVAPICH2 ~2X better than Baidu, and OpenMPI ~5X slower than Baidu.)

*Available with MVAPICH2-GDR 2.3a and higher

Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and FDR InfiniBand interconnect

OpenPOWER Summit (Mar ’18) 50Network Based Computing Laboratory

MVAPICH2-GDR vs. NCCL2

• Optimized designs in MVAPICH2-GDR 2.3b* offer better or comparable performance for most cases
• MPI_Bcast (MVAPICH2-GDR) vs. ncclBcast (NCCL2) on 16 K-80 GPUs

(Figure: broadcast latency vs. message size (4B–64MB) with 1 GPU/node and 2 GPUs/node; annotated gains of ~10X and ~4X for MVAPICH2-GDR over NCCL2.)

*Will be available with the upcoming MVAPICH2-GDR 2.3b

Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 2 K-80 GPUs, and EDR InfiniBand interconnect

OpenPOWER Summit (Mar ’18) 51Network Based Computing Laboratory

OSU-Caffe: Scalable Deep Learning

• Caffe: a flexible and layered Deep Learning framework
• Benefits and weaknesses
  – Multi-GPU training within a single node
  – Performance degradation for GPUs across different sockets
  – Limited scale-out
• OSU-Caffe: MPI-based parallel training
  – Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
  – Scale-out on 64 GPUs for training the CIFAR-10 network on the CIFAR-10 dataset
  – Scale-out on 128 GPUs for training the GoogLeNet network on the ImageNet dataset

(Figure: training time (seconds) for GoogLeNet (ImageNet) on 8–128 GPUs for Caffe, OSU-Caffe (1024), and OSU-Caffe (2048); “Invalid use case” marks configurations not supported by default Caffe.)

OSU-Caffe is publicly available from http://hidl.cse.ohio-state.edu/

Support on OpenPOWER will be available soon

OpenPOWER Summit (Mar ’18) 52Network Based Computing Laboratory

• Next generation HPC systems need to be designed with a holistic view of Big Data and Deep Learning

• OpenPOWER platform is emerging with many novel features

• Presented some of the approaches and results along these directions from the MVAPICH2 and HiDL Projects

• Solutions Enable High-Performance and Scalable HPC and Deep Learning on OpenPOWER Platforms

Concluding Remarks

OpenPOWER Summit (Mar ’18) 53Network Based Computing Laboratory

High-Performance Hadoop and Spark on OpenPOWER Platform

Talk at 2:15 pm

OpenPOWER Summit (Mar ’18) 54Network Based Computing Laboratory

Funding Acknowledgments

(Logos of funding and equipment sponsors.)

OpenPOWER Summit (Mar ’18) 55Network Based Computing Laboratory

Personnel Acknowledgments

Current Students (Graduate)

– A. Awan (Ph.D.)

– R. Biswas (M.S.)

– M. Bayatpour (Ph.D.)

– S. Chakraborthy (Ph.D.)

– C.-H. Chu (Ph.D.)

– S. Guganani (Ph.D.)

Past Students

– A. Augustine (M.S.)

– P. Balaji (Ph.D.)

– S. Bhagvat (M.S.)

– A. Bhat (M.S.)

– D. Buntinas (Ph.D.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– R. Rajachandrasekar (Ph.D.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

Past Research Scientist

– K. Hamidouche

– S. Sur

Past Post-Docs

– D. Banerjee

– X. Besseron

– H.-W. Jin

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– J. Jose (Ph.D.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– K. Kulkarni (M.S.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– M. Li (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– A. Moody (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.)

– J. Hashmi (Ph.D.)

– H. Javed (Ph.D.)

– P. Kousha (Ph.D.)

– D. Shankar (Ph.D.)

– H. Shi (Ph.D.)

– J. Zhang (Ph.D.)

– J. Lin

– M. Luo

– E. Mancini

Current Research Scientists

– X. Lu

– H. Subramoni

Past Programmers

– D. Bureddy

– J. Perkins

Current Research Specialist

– J. Smith

– M. Arnold

– S. Marcarelli

– J. Vienne

– H. Wang

Current Post-doc

– A. Ruhela

Current Students (Undergraduate)

– N. Sarkauskas (B.S.)

OpenPOWER Summit (Mar ’18) 56Network Based Computing Laboratory

Thank You!

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

[email protected]

The High-Performance MPI/PGAS Project
http://mvapich.cse.ohio-state.edu/

The High-Performance Deep Learning Project
http://hidl.cse.ohio-state.edu/

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/