
Page 1:

Overview of the MVAPICH Project: Latest Status and Future Roadmap

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

MVAPICH2 User Group (MUG) Meeting


Page 2:

Drivers of Modern HPC Cluster Architectures

• Multi-core processors are ubiquitous

• InfiniBand is very popular in HPC clusters

• Accelerators/Coprocessors becoming common in high-end systems

• Pushing the envelope for Exascale computing

[Figure: drivers pictured on the slide: Multi-core Processors; High Performance Interconnects - InfiniBand (<1 usec latency, >100 Gbps bandwidth); Accelerators / Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); example TOP500 systems: Tianhe-2 (1), Titan (2), Stampede (6), Tianhe-1A (10)]

Page 3:

Parallel Programming Models Overview

[Figure: three abstract machine models, each with processes P1-P3: Shared Memory Model (SHMEM, DSM) with a single shared memory; Distributed Memory Model (MPI, Message Passing Interface) with a private memory per process; Partitioned Global Address Space (PGAS) with per-process memories forming a logical shared memory (Global Arrays, UPC, Chapel, X10, CAF, ...)]

• Programming models provide abstract machine models

• Models can be mapped on different types of systems

– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.

• PGAS models and Hybrid MPI+PGAS models are gradually gaining importance
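As a concrete illustration of the distributed memory model above, here is a minimal MPI two-sided exchange in C (an illustrative sketch, not taken from the slides):

    /* minimal_sendrecv.c - illustrative two-sided MPI exchange (sketch, not from the slides) */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* each process owns its memory; data moves only via explicit messages */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }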

Page 4:

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

[Figure: co-design opportunities and challenges across various layers: Application Kernels/Applications; Programming Models (MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.); Middleware: Communication Library or Runtime for Programming Models, covering Point-to-point Communication (two-sided & one-sided), Collective Communication, Synchronization & Locks, I/O & File Systems, and Fault Tolerance; Networking Technologies (InfiniBand, 10/40GigE, Aries, BlueGene); Multi/Many-core Architectures; Accelerators (NVIDIA and MIC)]

Page 5:

• Scalability for million to billion processors

– Support for highly efficient inter-node and intra-node communication (both two-sided and one-sided)

– Extremely small memory footprint

• Hybrid programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, …); see the sketch below this list

• Balancing intra-node and inter-node communication for next generation multi-core (128-1024 cores/node)

– Multiple end-points per node

• Support for efficient multi-threading

• Scalable collective communication

– Offload

– Non-blocking

– Topology-aware

– Power-aware

• Support for MPI-3 RMA Model

• Support for GPGPUs and Accelerators

• Fault-tolerance/resiliency

• QoS support for communication and I/O

Designing (MPI+X) at Exascale
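To make the hybrid (MPI+X) and multi-threading items above concrete, here is a minimal MPI+OpenMP skeleton (an illustrative sketch, not from the slides); it requests MPI_THREAD_MULTIPLE, the level of thread support that efficient runtime multi-threading is meant to serve:

    /* hybrid_skeleton.c - illustrative MPI+OpenMP structure (sketch, not from the slides) */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;

        /* Ask for full thread support; an MPI+X runtime must handle concurrent MPI calls efficiently */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* intra-node parallelism via OpenMP threads, inter-node parallelism via MPI ranks */
            printf("rank %d, thread %d of %d\n", rank,
                   omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }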


Page 6:

• http://mvapich.cse.ohio-state.edu

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002

– MVAPICH2-X (MPI + PGAS), Available since 2012

– Used by more than 2,200 organizations (HPC Centers, Industry and

Universities) in 73 countries

MVAPICH2/MVAPICH2-X Software


D. K. Panda, K. Tomko, K. Schulz and A. Majumdar, The MVAPICH Project: Evolution and Sustainability of an Open Source Production Quality MPI Library for HPC, Int'l Workshop on Sustainable Software for Science: Practice and Experiences, held in conjunction with Int'l Conference on Supercomputing (SC '13), November 2013.

Page 7:

MVAPICH Team Projects

[Figure: MVAPICH team project timeline from Oct-00 through Aug-14, with milestones at Jan-02, Nov-04, Jan-10, and Nov-12 for the MVAPICH, MVAPICH2, OMB, MVAPICH2-X, and MVAPICH2-GDR projects; MVAPICH has reached EOL]

Page 8:

MVAPICH/MVAPICH2 Release Timeline and Downloads

• Download counts from MVAPICH2 website

[Figure: number of downloads from the MVAPICH2 website over time, Sep-04 through Apr-14, on a 0-350,000 scale, with release markers for MV 0.9.4, MV 1.0, MV 1.1, MV2 0.9.0, MV2 0.9.8, MV2 1.0, MV2 1.0.3, MV2 1.4, MV2 1.5, MV2 1.6, MV2 1.7, MV2 1.8, MV2 1.9, MV2 2.0a, MV2 2.0b, MV2 2.0rc2, and MV2 2.0]

Page 9:

• Available with software stacks of many IB, HSE, and server

vendors including Linux Distros (RedHat and SuSE)

• Empowering many TOP500 clusters

– 7th ranked 519,640-core cluster (Stampede) at TACC

– 13th, 74,358-core (Tsubame 2.5) at Tokyo Institute of Technology

– 23rd, 96,192-core (Pleiades) at NASA and many others

• Partner in the U.S. NSF-TACC Stampede System

MVAPICH2/MVAPICH2-X Software (Cont’d)


Page 10:

• Research is done for exploring new designs

• Designs are first presented in conference/journal publications

• Best performing designs are incorporated into the codebase

• Rigorous Q&A procedure before making a release

– Exhaustive unit testing

– Various test procedures on a diverse range of platforms and interconnects

– Performance tuning

– Applications-based evaluation

– Evaluation on large-scale systems

• Even alpha and beta versions go through the above testing

Strong Procedure for Design, Development and Release


Page 11:

MVAPICH2 Architecture (Latest Release 2.0)

Major Computing Platforms: IA-32, EM64T, Ivybridge, Westmere, Sandybridge, Opteron, Magny, ..

[Figure: MVAPICH2 architecture diagram spanning all different PCI interfaces]

Page 12:

MVAPICH2 2.0 and MVAPICH2-X 2.0


• Released on 06/20/2014

• Major Features and Enhancements

– Based on MPICH-3.1

– Extended support for MPI-3 RMA in OFA-IB-CH3, OFA-IWARP-CH3, and OFA-RoCE-CH3 interfaces

– Multiple optimizations and performance enhancements for RMA in OFA-IB-CH3 channel

– MPI-3 RMA support for CH3-PSM channel

– CMA support is now enabled by default

– Enhanced intra-node SMP performance

– Dynamic CUDA initialization

– Support for running on heterogeneous clusters with GPU and non-GPU nodes

– Multi-rail support for GPU communication and UD-Hybrid channel

– Improved job-startup performance for large-scale mpirun_rsh jobs

– Reduced memory footprint

– Updated compiler wrappers to remove application dependency on network and other extra libraries

– Updated hwloc to version 1.9

• MVAPICH2-X 2.0GA supports hybrid MPI + PGAS (UPC and OpenSHMEM) programming models.

– Based on MVAPICH2 2.0GA including MPI-3 features; Compliant with UPC 2.18.0 and OpenSHMEM v1.0f

– Enhanced optimization of OpenSHMEM collectives and optimized shmalloc routine

– Optimized UPC collectives and support for GUPC translator

Page 13:

One-way Latency: MPI over IB with MVAPICH2

[Figure: Small Message Latency and Large Message Latency, latency (us) vs. message size (bytes), for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; annotated small-message latencies: 1.66, 1.56, 1.64, 1.82, 0.99, 1.09, and 1.12 us]

Test beds: DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
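These one-way latencies are measured with an OSU-Micro-Benchmarks-style ping-pong; a stripped-down illustrative version (a sketch of the methodology, not the actual osu_latency source) is:

    /* pingpong_latency.c - illustrative osu_latency-style measurement (sketch, not OMB source) */
    #include <mpi.h>
    #include <stdio.h>

    #define ITERS 10000
    #define SIZE  8          /* message size in bytes */

    int main(int argc, char **argv)
    {
        char buf[SIZE] = {0};
        int rank, i;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        /* one-way latency = half the average round-trip time */
        if (rank == 0)
            printf("%d bytes: %.2f us one-way\n", SIZE, (t1 - t0) * 1e6 / (2.0 * ITERS));

        MPI_Finalize();
        return 0;
    }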

Page 14:

Bandwidth: MPI over IB with MVAPICH2

[Figure: Unidirectional Bandwidth and Bidirectional Bandwidth, bandwidth (MBytes/sec) vs. message size (bytes), for the same HCAs; annotated unidirectional bandwidths: 1706, 1917, 3280, 3385, 6343, 12485, and 12810 MBytes/sec; annotated bidirectional bandwidths: 3341, 3704, 4407, 6521, 11643, 21025, and 24727 MBytes/sec]

Test beds: DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch

Page 15:

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))

Latest MVAPICH2 2.0, Intel Ivy-bridge

[Figure: intra-socket and inter-socket latency (us) vs. message size (0 bytes to 1K); annotated small-message latencies: 0.18 us and 0.45 us]

[Figure: inter-socket and intra-socket bandwidth (MB/s) vs. message size (bytes) for CMA, shared-memory (Shmem), and LiMIC channels; peak bandwidths of 14,250 MB/s and 13,749 MB/s]
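For background on the kernel-assisted zero-copy path named in the title (CMA, Cross Memory Attach), the underlying Linux primitive is the process_vm_readv/process_vm_writev system call pair; the fragment below is an illustrative sketch of that mechanism, not MVAPICH2 internals:

    /* cma_copy_sketch.c - illustrative use of Linux Cross Memory Attach (not MVAPICH2 code) */
    #define _GNU_SOURCE
    #include <sys/types.h>
    #include <sys/uio.h>

    /* Copy 'len' bytes from 'remote_addr' in process 'pid' into 'local_buf'
     * with a single kernel-mediated copy (no intermediate shared-memory staging). */
    ssize_t cma_read(pid_t pid, void *local_buf, void *remote_addr, size_t len)
    {
        struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
        struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
        return process_vm_readv(pid, &local, 1, &remote, 1, 0);
    }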

Page 16:

MPI-3 RMA Get/Put with Flush Performance

Latest MVAPICH2 2.0, Intel Sandy-bridge with Connect-IB (single-port)

[Figure: inter-node Get/Put latency (us) vs. message size (1 byte to 4K); annotated small-message latencies: 2.04 us and 1.56 us]

[Figure: intra-socket Get/Put latency (us) vs. message size, down to 0.08 us; intra-socket Get/Put bandwidth (MBytes/sec) vs. message size (1 byte to 256K), reaching 14926 and 15364 MBytes/sec; inter-node Get/Put bandwidth (MBytes/sec) vs. message size (1 byte to 256K), reaching 6881 and 6876 MBytes/sec]
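The Get/Put-with-flush pattern measured above can be sketched as follows in C (illustrative only, not the benchmark source); run it with at least two ranks:

    /* rma_put_flush.c - illustrative MPI-3 RMA Put with flush (sketch); needs >= 2 ranks */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double *buf;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* expose a window of 1024 doubles on every rank */
        MPI_Win_allocate(1024 * sizeof(double), sizeof(double),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &buf, &win);

        MPI_Win_lock_all(0, win);          /* passive-target epoch covering all ranks */

        if (rank == 0) {
            double value = 3.14;
            MPI_Put(&value, 1, MPI_DOUBLE, 1 /*target*/, 0 /*disp*/, 1, MPI_DOUBLE, win);
            MPI_Win_flush(1, win);         /* completes the Put at origin and target */
        }

        MPI_Win_unlock_all(win);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }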

Page 17:

MVAPICH2-X for Hybrid MPI + PGAS Applications


[Figure: MPI applications, OpenSHMEM applications, UPC applications, and hybrid (MPI + PGAS) applications issue OpenSHMEM calls, MPI calls, and UPC calls into the Unified MVAPICH2-X Runtime, which runs over InfiniBand, RoCE, and iWARP]

• Unified communication runtime for MPI, UPC, OpenSHMEM available with

MVAPICH2-X 1.9 onwards!

– http://mvapich.cse.ohio-state.edu

• Feature Highlights

– Supports MPI(+OpenMP), OpenSHMEM, UPC, MPI(+OpenMP) + OpenSHMEM,

MPI(+OpenMP) + UPC

– MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard

compliant (with initial support for UPC 1.3)

– Scalable Inter-node and intra-node communication – point-to-point and collectives

Page 18:

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based

on communication characteristics

• Benefits:

– Best of Distributed Computing Model

– Best of Shared Memory Computing Model

• Exascale Roadmap*:

– “Hybrid Programming is a practical way to

program exascale systems”

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420

[Figure: an HPC application composed of kernels 1 through N, all written in MPI; kernels 2 and N are re-written in PGAS based on their communication characteristics]

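A rough illustration of the hybrid model pictured above, assuming MPI and OpenSHMEM share one unified runtime as in MVAPICH2-X (initialization ordering and interoperability details are runtime-specific, so treat this as a sketch):

    /* hybrid_mpi_shmem.c - illustrative MPI + OpenSHMEM mix (sketch; interop details are runtime-specific) */
    #include <mpi.h>
    #include <shmem.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, npes;

        MPI_Init(&argc, &argv);          /* MPI part of the application */
        start_pes(0);                    /* OpenSHMEM v1.0-style initialization */

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        npes = _num_pes();

        /* symmetric heap allocation: the same address is usable by remote PEs */
        long *counter = (long *) shmalloc(sizeof(long));
        *counter = 0;
        shmem_barrier_all();

        /* one-sided PGAS-style atomic update targeting PE 0 */
        shmem_long_inc(counter, 0);
        shmem_barrier_all();

        if (rank == 0)
            printf("counter on PE 0 = %ld (npes = %d)\n", *counter, npes);

        shfree(counter);
        MPI_Finalize();
        return 0;
    }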

Page 19:

Hybrid MPI+OpenSHMEM Graph500 Design Execution Time

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models,

International Supercomputing Conference (ISC’13), June 2013

[Figure: Strong Scalability and Weak Scalability: billions of traversed edges per second (TEPS) vs. scale (26-29) and vs. number of processes (1024-8192) for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM)]

J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance

Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

[Figure: execution time (s) at 4K, 8K, and 16K processes for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM), with annotated 7.6X and 13X improvements over MPI-Simple]

• Performance of Hybrid (MPI+OpenSHMEM) Graph500 Design

• 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple

• 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple

Page 20:

MVAPICH2 2.1a and MVAPICH2-X 2.1a


• Will be available later this week or next week

• Features

– Based on MPICH 3.1.2

– PMI-2-based startup with SLURM

– Improved job-startup, etc.

Page 21:

MVAPICH2-GDR 2.0


• Released on 08/23/2014

• Major Features and Enhancements

– Based on MVAPICH2 2.0 (OFA-IB-CH3 interface)

– High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU) using

GPUDirect RDMA and pipelining

– Support for MPI-3 RMA (one-sided) communication and atomics using GPU buffers with GPUDirect RDMA and pipelining

– Efficient small message inter-node communication using the new NVIDIA GDRCOPY module

– Efficient small message inter-node communication using loopback design

– Multi-rail support for inter-node point-to-point GPU communication

– High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)

using CUDA IPC and pipelining

– Automatic communication channel selection for different GPU communication modes (DD, HH and HD) in different

configurations (intra-IOH and inter-IOH)

– Automatic selection and binding of the best GPU/HCA pair in multi-GPU/HCA system configuration

– Optimized and tuned support for collective communication from GPU buffers

– Enhanced and Efficient support for datatype processing on GPU buffers including support for vector/h-vector, index/h-index,

array and subarray

– Dynamic CUDA initialization. Support GPU device selection after MPI_Init

– Support for non-blocking streams in asynchronous CUDA transfers for better overlap

– Efficient synchronization using CUDA Events for pipelined device data transfers
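The GPU-GPU paths listed above are exercised by passing CUDA device pointers directly to MPI calls; the fragment below is an illustrative sketch of CUDA-aware MPI usage (not MVAPICH2-GDR internals), with MV2_USE_CUDA=1 set at run time as shown later on the HOOMD-blue slide:

    /* cuda_aware_send.c - illustrative CUDA-aware MPI point-to-point (sketch) */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        int rank;
        const size_t n = 1 << 20;
        double *d_buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **) &d_buf, n * sizeof(double));

        /* A CUDA-aware MPI library (e.g., MVAPICH2-GDR with MV2_USE_CUDA=1) accepts device
         * pointers directly; the runtime selects GDR, pipelined, or host-staged paths. */
        if (rank == 0)
            MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }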

Page 22:

Performance of MVAPICH2 with GPU-Direct-RDMA: Latency (GPU-GPU Internode MPI Latency)


MVAPICH2-GDR 2.0; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 6.5; Mellanox OFED 2.1 with GPU-Direct-RDMA

[Figure: GPU-GPU inter-node small-message MPI latency (us) vs. message size (0 bytes to 2K) for MV2-GDR 2.0, MV2-GDR 2.0b, and MV2 without GDR; MV2-GDR 2.0 reaches 2.18 usec, with annotated improvements of 71% and 90%]

Page 23:

Performance of MVAPICH2 with GPU-Direct-RDMA: Bandwidth (GPU-GPU Internode MPI Uni-Directional Bandwidth)


[Figure: GPU-GPU inter-node small-message MPI bandwidth (MB/s) vs. message size (1 byte to 4K) for MV2-GDR 2.0, MV2-GDR 2.0b, and MV2 without GDR; annotated improvements of 10x and 2.2x]

MVAPICH2-GDR 2.0; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 6.5; Mellanox OFED 2.1 with GPU-Direct-RDMA

Page 24:

Performance of MVAPICH2 with GPU-Direct-RDMA: Bi-Bandwidth (GPU-GPU Internode MPI Bi-directional Bandwidth)


[Figure: GPU-GPU inter-node small-message MPI bi-directional bandwidth (MB/s) vs. message size (1 byte to 4K) for MV2-GDR 2.0, MV2-GDR 2.0b, and MV2 without GDR; annotated improvements of 10x and 2x]

MVAPICH2-GDR 2.0; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 6.5; Mellanox OFED 2.1 with GPU-Direct-RDMA

Page 25:

Performance of MVAPICH2 with GPU-Direct-RDMA: MPI-3 RMA (GPU-GPU Internode MPI Put Latency; RMA Put Operation, Device to Device)


MVAPICH2-GDR 2.0; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB Dual-FDR HCA; CUDA 6.5; Mellanox OFED 2.1 with GPU-Direct-RDMA

MPI-3 RMA provides flexible synchronization and completion primitives

[Figure: GPU-GPU inter-node RMA Put small-message latency (us) vs. message size (0 bytes to 8K) for MV2-GDR 2.0 and MV2-GDR 2.0b; MV2-GDR 2.0 reaches 2.88 us, an annotated 83% improvement]

Page 26:

Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)

• MV2-GDR 2.0 (released on 08/23/14): try it out!

• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0

MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768

MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1

MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Figure: HOOMD-blue Strong Scaling and HOOMD-blue Weak Scaling]

Page 27:

MPI Applications on MIC Clusters

[Figure: execution modes for MPI programs on Xeon + Xeon Phi nodes, ranging from multi-core centric to many-core centric: Host-only; Offload (and reverse offload) of computation to the Xeon Phi; Symmetric, with MPI programs on both host and coprocessor; and Coprocessor-only]

• MPI (+X) continues to be the predominant programming model in HPC

• Flexibility in launching MPI jobs on clusters with Xeon Phi
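In symmetric mode, host and coprocessor ranks are ordinary MPI peers; the short illustrative program below (not from the slides) simply reports where each rank runs:

    /* symmetric_mode.c - illustrative symmetric-mode check: ranks may run on Xeon hosts or Xeon Phi cards */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        /* In symmetric mode the same source is built for host and MIC, launched on both,
         * and all ranks communicate through the same MPI library. */
        printf("rank %d of %d running on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }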


Page 28:

MVAPICH2-MIC Design for Clusters with IB and MIC

• Offload Mode

• Intranode Communication

– Coprocessor-only Mode

– Symmetric Mode

• Internode Communication

– Coprocessors-only

– Symmetric Mode

• Multi-MIC Node Configurations


Page 29:

MIC-Remote-MIC P2P Communication


[Figure: MIC-to-remote-MIC point-to-point performance, intra-socket and inter-socket: large-message latency (usec, 8K to 2M bytes) and bandwidth (MB/sec, 1 byte to 1M), with peak bandwidths of 5236 and 5594 MB/sec; higher is better for bandwidth, lower for latency]

Page 30:

Latest Status on MVAPICH2-MIC


• Running on three major systems

– Stampede: module swap mvapich2 mvapich2-mic/20130911

– Blueridge (Virginia Tech): module swap mvapich2 mvapich2-mic/1.9

– Beacon (UTK): module unload intel-mpi; module load mvapich2-mic/1.9

• A new version based on MVAPICH2 2.0 is being worked on

• Will be available in a few weeks

Page 31:

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)


A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS’14, May 2014

[Figure: 32-node Allgather small-message latency (16H + 16M), 32-node Allgather large-message latency (8H + 8M), and 32-node Alltoall large-message latency (8H + 8M), comparing MV2-MIC and MV2-MIC-Opt; P3DFFT execution time (communication and computation) on 32 nodes (8H + 8M, size = 2K*2K*1K); annotated improvements of 76%, 58%, and 55%]

Page 32:

• Started in 2004 and continuing steadily

• Allows MPI developers and users to

– Test and evaluate MPI libraries

• Has a wide-range of benchmarks

– Two-sided (MPI-1, MPI-2 and MPI-3)

– One-sided (MPI-2 and MPI-3)

– RMA (MPI-3)

– Collectives (MPI-1, MPI-2 and MPI-3)

– Extensions for GPU-aware communication (CUDA and OpenACC)

– UPC (Pt-to-Pt)

– OpenSHMEM (Pt-to-Pt and Collectives)

• Widely-used in the MPI community

OSU Micro-Benchmarks (OMB)


Page 33:

OMB (OSU Micro-Benchmarks) 4.4

• Released on 08/23/2014

• Include benchmarks for MPI, OpenSHMEM, and UPC Programming Models

• MPI

– Support MPI-1, MPI-2 and MPI-3 standards

– Point-to-Point Benchmarks: Latency, Multi-threaded latency, Multi-pair latency, Multiple bandwidth / Message rate test, Bandwidth, Bidirectional bandwidth (see the sketch after this list)

– Collective Benchmarks: Collective latency test for MPI_Allgather, MPI_Alltoall, MPI_Allreduce, MPI_Barrier, MPI_Bcast,

MPI_Gather, MPI_Reduce, MPI_Reduce_Scatter, MPI_Scatter, and vector collectives

– One-sided Benchmarks: Put latency, Put bandwidth, Put bidirectional bandwidth, Get latency, Get bandwidth,

Accumulate latency, Atomics latency, and Get_accumulate latency

– CUDA Extensions to evaluate MPI communication performance from/to buffers on NVIDIA GPU devices

– OpenACC Extensions to evaluate MPI communication performance from/to buffers on NVIDIA GPU devices

– Support for MPI-3 RMA operations using GPU buffers

• OpenSHMEM

– Point-to-Point benchmarks: Put latency, Get latency, Put message rate, and Atomics latency

– Collective benchmarks: Collect latency, FCollect latency, Broadcast Latency, Reduce latency, and Barrier latency

• Unified Parallel C (UPC)

– Point-to-Point benchmarks: Put latency, Get latency

– Collective benchmarks: Barrier latency, Exchange latency, Gatherall latency, Gather latency, Reduce latency, and Scatter

latency
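For intuition, the bandwidth and message-rate style tests listed above keep a window of non-blocking transfers in flight before waiting; a condensed illustrative loop (a sketch, not the OMB source) is:

    /* bw_window.c - illustrative osu_bw-style windowed bandwidth loop (sketch, not the OMB source) */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define WINDOW 64
    #define ITERS  100
    #define MSGSZ  (1 << 20)   /* 1 MB messages */

    static char sbuf[MSGSZ], rbuf[MSGSZ];

    int main(int argc, char **argv)
    {
        int rank, i, j;
        MPI_Request req[WINDOW];
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(sbuf, 1, MSGSZ);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < ITERS; i++) {
            if (rank == 0) {
                for (j = 0; j < WINDOW; j++)   /* keep many sends in flight */
                    MPI_Isend(sbuf, MSGSZ, MPI_CHAR, 1, j, MPI_COMM_WORLD, &req[j]);
                MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
                MPI_Recv(rbuf, 1, MPI_CHAR, 1, 999, MPI_COMM_WORLD, MPI_STATUS_IGNORE); /* ack */
            } else if (rank == 1) {
                for (j = 0; j < WINDOW; j++)
                    MPI_Irecv(rbuf, MSGSZ, MPI_CHAR, 0, j, MPI_COMM_WORLD, &req[j]);
                MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
                MPI_Send(rbuf, 1, MPI_CHAR, 0, 999, MPI_COMM_WORLD);                    /* ack */
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("bandwidth: %.1f MB/s\n",
                   (double) MSGSZ * WINDOW * ITERS / (t1 - t0) / 1e6);

        MPI_Finalize();
        return 0;
    }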

Page 34:

Virtualization Support with MVAPICH2 using SR-IOV: Intra-node Inter-VM Point-to-Point Latency and Bandwidth

• 1 VM per Core

• MVAPICH2-SR-IOV-IB brings only 3-7% (latency) and 3-8% (BW) overheads, compared to MVAPICH2 over native InfiniBand verbs (MVAPICH2-Native-IB)

[Figure: intra-node inter-VM latency (us) and bandwidth (MB/s) vs. message size (1 byte to 1M) for MVAPICH2-SR-IOV-IB and MVAPICH2-Native-IB]


Page 35:

Performance Evaluations with NAS and Graph500

• 8 VMs across 4 nodes, 1 VM per socket, 64 cores in total

• MVAPICH2-SR-IOV-IB brings 3-7% and 3-9% overheads for

NAS Benchmarks and Graph500, respectively, compared to

MVAPICH2-Native-IB

[Figure: execution time (s) for NAS benchmarks MG-B-64, CG-B-64, EP-B-64, LU-B-64, and BT-B-64, and execution time (ms) for Graph500 problem sizes 20,10 / 20,16 / 22,16 / 24,16, comparing MVAPICH2-SR-IOV-IB and MVAPICH2-Native-IB]


Page 36:

Performance Evaluation with LAMMPS

• 8 VMs across 4 nodes, 1 VM per socket, 64 cores in total

• MVAPICH2-SR-IOV-IB brings 7% and 9% overheads for LJ

and CHAIN in LAMMPS, respectively, compared to

MVAPICH2-Native-IB

[Figure: LAMMPS execution time (s) for LJ-64-scaled and CHAIN-64-scaled, comparing MVAPICH2-SR-IOV-IB and MVAPICH2-Native-IB]


Page 37:

• Performance and memory scalability toward 500K-1M cores

– Dynamically Connected Transport (DCT) service with Connect-IB

• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)

• Enhanced Optimization for GPU Support and Accelerators

• Taking advantage of Collective Offload framework

– Including support for non-blocking collectives (MPI 3.0); see the sketch after this list

– Core-Direct support

• Extended RMA support (as in MPI 3.0)

• Extended topology-aware collectives

• Power-aware collectives

• Extended Support for MPI Tools Interface (as in MPI 3.0)

• Extended Checkpoint-Restart and migration support with SCR

• Low-overhead and high-performance Virtualization support

MVAPICH2 – Plans for Exascale
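As a pointer to what the non-blocking collectives item above enables, here is an MPI-3 style overlap pattern (an illustrative sketch only):

    /* iallreduce_overlap.c - illustrative overlap of computation with a non-blocking collective (sketch) */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double local = 1.0, global = 0.0;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* start the reduction, then keep computing while the network (or offload hardware) progresses it */
        MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

        double work = 0.0;
        for (int i = 0; i < 1000000; i++)      /* independent computation overlapped with the collective */
            work += i * 1e-9;

        MPI_Wait(&req, MPI_STATUS_IGNORE);     /* collective result is now available */
        if (rank == 0)
            printf("sum = %.1f (overlapped work = %.3f)\n", global, work);

        MPI_Finalize();
        return 0;
    }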


Page 38:

Web Pointers

NOWLAB Web Page

http://nowlab.cse.ohio-state.edu

MVAPICH Web Page

http://mvapich.cse.ohio-state.edu
