performance evaluation of open mpi on cray xe/xk … · operated by los alamos national security,...

21
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA U N C L A S S I F I E D Slide 1 Performance Evaluation of Open MPI on Cray XE/XK Systems Samuel K. Gutierrez – LANL Nathan T. Hjelm – LANL Manjunath Gorentla Venkata – ORNL Richard L. Graham – ORNL Hot Interconnects 2012 Aug 23, 2012 Thursday, August 23, 12

Upload: ngominh

Post on 29-Jun-2018

213 views

Category:

Documents


0 download

TRANSCRIPT

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D Slide 1

Performance Evaluation of Open MPI on Cray XE/XK Systems

Samuel K. Gutierrez – LANL

Nathan T. Hjelm – LANL

Manjunath Gorentla Venkata – ORNL

Richard L. Graham – ORNL

Hot Interconnects 2012

Aug 23, 2012

Thursday, August 23, 12

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229

A Collaborative Effort

Slide 2

Thursday, August 23, 12

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229

Outline

1. Open MPI Overview

2. Gemini Overview

3. Protocols Overview

4. Test Environment

5. Results

6. Conclusions/Future Work

Slide 3

Thursday, August 23, 12

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229

First Things First – Open MPI Overview

Open-Source Implementation of the MPI-2 Standard

Developed and Maintained By• Academia• Industry• National Laboratories

Supports a Range of High-Performance Network APIs• Verbs (Infiniband, RoCE, iWarp)• PSM (QLogic/Intel HCAs)• MXM (Mellanox HCAs)• Portals (Cray SeaStar, Infiniband)• uGNI (Cray Gemini, Cray Ares)

Slide 4

Thursday, August 23, 12

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229 Slide 5

Open MPI’s Plugin Architecture – A High-level Overview1

User ApplicationMPI API

Modular Component Architecture (MCA)

Thursday, August 23, 12

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229 Slide 6

Open MPI’s Plugin Architecture – Main Code Sections1

Operating SystemOPAL

ORTEOMPI

Open MPI Layer (OMPI)• MPI API and Support Logic

Open Run-Time Environment (ORTE)• Run-Time System

Open Portability Access Layer (OPAL)• OS-Specific/Utility Code

Thursday, August 23, 12

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229

The Gemini System Interconnect3 – An Overview

Network Used by the Cray XE and XK System Families• Titan, Cielo, Hopper

Successor to the Cray SeaStar* Network Interconnect

3D Torus Network Built of Gemini ASICs

Gemini ASIC• Provides 10 Torus Connections – 2 x (+X, -X, +Z, -Z) – 1 x (+Y, -Y)• Provides 2 NICs and a 48-port Router

Slide 7

Thursday, August 23, 12

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229

OB1 PML High-Level Protocol Overview

Eager Message Protocol• Uses BTL buffered, inline, and in-place send protocols

Remote Get Protocol• 2 Protocol Messages: RGET (ready to send + segment), FIN• Available When Registration Cache is Enabled and BTL Implements Get

RDMA Pipeline Protocol (Put)• 3 Protocol Messages: RNDV + segment, RDMA, FIN• Used When Remote Get protocol is not Available

Remote Get Fallback (New)• Essentially a Rendezvous• Fallback Initiated by the Receiver During Remote Get Protocol if BTL Get Protocol

is not Available

Rendezvous (no RDMA)

Slide 8

Thursday, August 23, 12

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229

uGNI BTL Overview Protocols

• Send— In-place Send for Small Messages Directly Using Small Message Protocol

(SMSG)— Buffered Send Using Get for Larger Eager Messages (Eager Get)

• Get— Uses FMA Or BTE— Available Only if Source And Destination Segments Are 4-Byte aligned and a

Multiple of 4-Bytes in Size• Put

— Uses Fast Memory Access (FMA) or Byte Transport Engine (BTE)— No Alignment Restrictions

Lazy Connection Establishment• Resource Utilization Directly Related to Application Communication Characteristics

Slide 9

Thursday, August 23, 12

BTL Send Called

Size LargerThan SMSG

Limit?Send Data Using SMSGSendwTagNo

Send GET_INIT With SMSGSendwTag

Yes

Send Complete

Progress Remote SMSG

RDMA_ COMPLETE

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229

uGNI BTL Eager Get Protocol Details (Send)

Slide 10

Thursday, August 23, 12

Progress Remote SMSG

Size Smaller

Than FMA Limit

Start RDMA GET

Start FMA GET Progress Local FMA/RDMA CQ

Send RDMA_COMPLETE with SmsgSendwTag

Receive Complete

No

Yes

GET_INIT

Get Complete

OPAL Progress

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229

uGNI BTL Eager Get Protocol Details (Receive)

Slide 11

Thursday, August 23, 12

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229

Vader BTL Overview

MPICH Nemesis-like Design• Lock-Free Message Queues• “Fast Boxes” – I.e. Per-Peer Receive Queues for Short Messages

Copy Backend Changes Based on Message Size• E.g. bcopy [a,b) - memcpy Otherwise • User Tunable with Good Defaults

Cross-Process Memory Mapping Allows for RDMA-Like Semantics• Copy-In/Copy-Out (CICO) Avoided• No Backing Store Required• Heavy Use of Registration Cache to Amortize Attach Latency• Exposes Both Put and Get Interfaces to PML Layer

XPMEM Support Requires Kernel Patch and User-Level Library• Already Available and Leveraged by Cray’s Native MPI Implementation

Slide 12

Thursday, August 23, 12

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229

Test Environment

Testing Platforms• Cielito - 1088 Core XE6• Cielo - 142,304 Core XE6

Microbenchmarks• NetPIPE – Measure Lat/BW Benchmark• AMG2006 – Algebraic Multi-grid Solver• LAMMPS – Classical Molecular Dynamics Code• All Microbenchmarks Were Run on Live Production System

Launcher• orterun

Slide 13

Thursday, August 23, 12

1

4

16

64

256

1024

4096

1 32 1024 32768 1.04858e+06 3.35544e+07

Late

ncy

(use

c)

Message Size (Bytes)

Netpipe Latency

Vendor MPIOpen MPI

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229 Slide 14

NetPIPE Latency on XE6 (on ASIC)

Thursday, August 23, 12

0

1000

2000

3000

4000

5000

6000

1 32 1024 32768 1.04858e+06 3.35544e+07

Uni-d

irect

iona

l Ban

dwid

th (M

B/se

c)

Message Size (Bytes)

Netpipe Bandwidth

Vendor MPIOpen MPI

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229

NetPIPE Bandwidth on XE6 (on ASIC)

Slide 15

Thursday, August 23, 12

82

84

86

88

90

92

94

96

98

1 4 16 64 256 1024 4096 16384 65536

Loop

Tim

e (1

00 S

teps

, Sec

onds

)

Cores

LAMMPS ASC Benchmark for Scaled-size Lennard-Jones Liquid

Vendor MPIOpen MPI

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229 Slide 16

Microbenchmark – LAMMPS

Thursday, August 23, 12

0

5e+09

1e+10

1.5e+10

2e+10

2.5e+10

3e+10

3.5e+10

4e+10

4.5e+10

0 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000

Syst

em S

ize *

Itera

tions

/ So

lve P

hase

Tim

e

Cores

AMG2006 ASC Benchmark (3D 7-Point Laplace Problem on a Cube)

Vendor MPIOpen MPI

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229 Slide 17

Microbenchmark – AMG2006

Thursday, August 23, 12

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229

Conclusion and Ongoing/Future Work

Conclusion• Bandwidth, Latency, and Scalability Similar to Vendor MPI Implementation

Stabilization/Optimization• Improve Launch Scalability (Over a Minute to Launch 131072 MPI Tasks)• Investigating New Protocols (Shared Message Queue-- MSGQ)• Reduce Memory Requirements

Improved Collective Performance Using uGNI Atomics

Work with Friendly Testers

Prepare for General Release in Open MPI 1.7.0

Slide 18

Thursday, August 23, 12

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229

Thanks!

Slide 19

Thursday, August 23, 12

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229

Questions?

Questions?

Comments?

Slide 20

Thursday, August 23, 12

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA

U N C L A S S I F I E D – LA-UR-12-24229

References

[1] Open MPI. 13 Feb. 2012 <open-mpi.org>.

[2] R. Alverson, et al., “The Gemini System Interconnect,” in High Performance Interconnects (HOTI), 2010 IEEE 18th Annual Symposium on, Aug. 2010, pp. 83 –87.

Slide 21

Thursday, August 23, 12