performance evaluation of open mpi on cray xe/xk … · operated by los alamos national security,...
TRANSCRIPT
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D Slide 1
Performance Evaluation of Open MPI on Cray XE/XK Systems
Samuel K. Gutierrez – LANL
Nathan T. Hjelm – LANL
Manjunath Gorentla Venkata – ORNL
Richard L. Graham – ORNL
Hot Interconnects 2012
Aug 23, 2012
Thursday, August 23, 12
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229
A Collaborative Effort
Slide 2
Thursday, August 23, 12
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229
Outline
1. Open MPI Overview
2. Gemini Overview
3. Protocols Overview
4. Test Environment
5. Results
6. Conclusions/Future Work
Slide 3
Thursday, August 23, 12
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229
First Things First – Open MPI Overview
Open-Source Implementation of the MPI-2 Standard
Developed and Maintained By• Academia• Industry• National Laboratories
Supports a Range of High-Performance Network APIs• Verbs (Infiniband, RoCE, iWarp)• PSM (QLogic/Intel HCAs)• MXM (Mellanox HCAs)• Portals (Cray SeaStar, Infiniband)• uGNI (Cray Gemini, Cray Ares)
Slide 4
Thursday, August 23, 12
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229 Slide 5
Open MPI’s Plugin Architecture – A High-level Overview1
User ApplicationMPI API
Modular Component Architecture (MCA)
Thursday, August 23, 12
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229 Slide 6
Open MPI’s Plugin Architecture – Main Code Sections1
Operating SystemOPAL
ORTEOMPI
Open MPI Layer (OMPI)• MPI API and Support Logic
Open Run-Time Environment (ORTE)• Run-Time System
Open Portability Access Layer (OPAL)• OS-Specific/Utility Code
Thursday, August 23, 12
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229
The Gemini System Interconnect3 – An Overview
Network Used by the Cray XE and XK System Families• Titan, Cielo, Hopper
Successor to the Cray SeaStar* Network Interconnect
3D Torus Network Built of Gemini ASICs
Gemini ASIC• Provides 10 Torus Connections – 2 x (+X, -X, +Z, -Z) – 1 x (+Y, -Y)• Provides 2 NICs and a 48-port Router
Slide 7
Thursday, August 23, 12
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229
OB1 PML High-Level Protocol Overview
Eager Message Protocol• Uses BTL buffered, inline, and in-place send protocols
Remote Get Protocol• 2 Protocol Messages: RGET (ready to send + segment), FIN• Available When Registration Cache is Enabled and BTL Implements Get
RDMA Pipeline Protocol (Put)• 3 Protocol Messages: RNDV + segment, RDMA, FIN• Used When Remote Get protocol is not Available
Remote Get Fallback (New)• Essentially a Rendezvous• Fallback Initiated by the Receiver During Remote Get Protocol if BTL Get Protocol
is not Available
Rendezvous (no RDMA)
Slide 8
Thursday, August 23, 12
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229
uGNI BTL Overview Protocols
• Send— In-place Send for Small Messages Directly Using Small Message Protocol
(SMSG)— Buffered Send Using Get for Larger Eager Messages (Eager Get)
• Get— Uses FMA Or BTE— Available Only if Source And Destination Segments Are 4-Byte aligned and a
Multiple of 4-Bytes in Size• Put
— Uses Fast Memory Access (FMA) or Byte Transport Engine (BTE)— No Alignment Restrictions
Lazy Connection Establishment• Resource Utilization Directly Related to Application Communication Characteristics
Slide 9
Thursday, August 23, 12
BTL Send Called
Size LargerThan SMSG
Limit?Send Data Using SMSGSendwTagNo
Send GET_INIT With SMSGSendwTag
Yes
Send Complete
Progress Remote SMSG
RDMA_ COMPLETE
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229
uGNI BTL Eager Get Protocol Details (Send)
Slide 10
Thursday, August 23, 12
Progress Remote SMSG
Size Smaller
Than FMA Limit
Start RDMA GET
Start FMA GET Progress Local FMA/RDMA CQ
Send RDMA_COMPLETE with SmsgSendwTag
Receive Complete
No
Yes
GET_INIT
Get Complete
OPAL Progress
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229
uGNI BTL Eager Get Protocol Details (Receive)
Slide 11
Thursday, August 23, 12
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229
Vader BTL Overview
MPICH Nemesis-like Design• Lock-Free Message Queues• “Fast Boxes” – I.e. Per-Peer Receive Queues for Short Messages
Copy Backend Changes Based on Message Size• E.g. bcopy [a,b) - memcpy Otherwise • User Tunable with Good Defaults
Cross-Process Memory Mapping Allows for RDMA-Like Semantics• Copy-In/Copy-Out (CICO) Avoided• No Backing Store Required• Heavy Use of Registration Cache to Amortize Attach Latency• Exposes Both Put and Get Interfaces to PML Layer
XPMEM Support Requires Kernel Patch and User-Level Library• Already Available and Leveraged by Cray’s Native MPI Implementation
Slide 12
Thursday, August 23, 12
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229
Test Environment
Testing Platforms• Cielito - 1088 Core XE6• Cielo - 142,304 Core XE6
Microbenchmarks• NetPIPE – Measure Lat/BW Benchmark• AMG2006 – Algebraic Multi-grid Solver• LAMMPS – Classical Molecular Dynamics Code• All Microbenchmarks Were Run on Live Production System
Launcher• orterun
Slide 13
Thursday, August 23, 12
1
4
16
64
256
1024
4096
1 32 1024 32768 1.04858e+06 3.35544e+07
Late
ncy
(use
c)
Message Size (Bytes)
Netpipe Latency
Vendor MPIOpen MPI
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229 Slide 14
NetPIPE Latency on XE6 (on ASIC)
Thursday, August 23, 12
0
1000
2000
3000
4000
5000
6000
1 32 1024 32768 1.04858e+06 3.35544e+07
Uni-d
irect
iona
l Ban
dwid
th (M
B/se
c)
Message Size (Bytes)
Netpipe Bandwidth
Vendor MPIOpen MPI
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229
NetPIPE Bandwidth on XE6 (on ASIC)
Slide 15
Thursday, August 23, 12
82
84
86
88
90
92
94
96
98
1 4 16 64 256 1024 4096 16384 65536
Loop
Tim
e (1
00 S
teps
, Sec
onds
)
Cores
LAMMPS ASC Benchmark for Scaled-size Lennard-Jones Liquid
Vendor MPIOpen MPI
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229 Slide 16
Microbenchmark – LAMMPS
Thursday, August 23, 12
0
5e+09
1e+10
1.5e+10
2e+10
2.5e+10
3e+10
3.5e+10
4e+10
4.5e+10
0 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000
Syst
em S
ize *
Itera
tions
/ So
lve P
hase
Tim
e
Cores
AMG2006 ASC Benchmark (3D 7-Point Laplace Problem on a Cube)
Vendor MPIOpen MPI
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229 Slide 17
Microbenchmark – AMG2006
Thursday, August 23, 12
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229
Conclusion and Ongoing/Future Work
Conclusion• Bandwidth, Latency, and Scalability Similar to Vendor MPI Implementation
Stabilization/Optimization• Improve Launch Scalability (Over a Minute to Launch 131072 MPI Tasks)• Investigating New Protocols (Shared Message Queue-- MSGQ)• Reduce Memory Requirements
Improved Collective Performance Using uGNI Atomics
Work with Friendly Testers
Prepare for General Release in Open MPI 1.7.0
Slide 18
Thursday, August 23, 12
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229
Thanks!
Slide 19
Thursday, August 23, 12
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229
Questions?
Questions?
Comments?
Slide 20
Thursday, August 23, 12
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy’s NNSA
U N C L A S S I F I E D – LA-UR-12-24229
References
[1] Open MPI. 13 Feb. 2012 <open-mpi.org>.
[2] R. Alverson, et al., “The Gemini System Interconnect,” in High Performance Interconnects (HOTI), 2010 IEEE 18th Annual Symposium on, Aug. 2010, pp. 83 –87.
Slide 21
Thursday, August 23, 12