www.openfabrics.org Resource Utilization in Large Scale InfiniBand Jobs Galen M. Shipman Los Alamos National Labs LAUR-07-2873


Page 1:

Resource Utilization in Large Scale InfiniBand Jobs

Galen M. Shipman

Los Alamos National Labs
LAUR-07-2873

Page 2:

The Problem

InfiniBand specifies that receive resources are consumed in order regardless of size

Small messages may therefore consume much larger receive buffers

At very large scale, many applications are dominated by small message transfers

Message sizes vary substantially from job to job and even rank to rank
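A minimal sketch (not from the talk; the 8 KB buffer size is an illustrative assumption) of why in-order consumption hurts: every incoming message consumes one posted receive buffer regardless of how little of it the payload fills.

```python
# Sketch (illustrative only): how much posted receive-buffer memory a
# stream of small messages wastes when every receive consumes one
# fixed-size buffer, as InfiniBand's in-order consumption requires.
BUFFER_SIZE = 8 * 1024          # hypothetical size of each posted receive buffer

def wasted_bytes(message_sizes):
    """Bytes of posted buffer space left unused by in-order consumption."""
    return sum(BUFFER_SIZE - min(s, BUFFER_SIZE) for s in message_sizes)

# A flood of 128-byte messages: each one still consumes a full 8 KB buffer.
msgs = [128] * 1000
print(wasted_bytes(msgs))       # -> 8064000 bytes wasted (over 98% of the posted memory)
```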

Page 3:

Receive Buffer Efficiency

Page 4:

Implication for SRQ

Flood of small messages may exhaust SRQ resources

Probability of RNR NAK increases
Stalls the pipeline

Performance degrades
Wasted resource utilization
Application may not complete within allotted time slot (12+ hours for some jobs)
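A toy model (an assumption for illustration, not the talk's data) of the exhaustion scenario above: an SRQ with a fixed number of posted receives is drained by a flood that arrives faster than descriptors are reposted; the first arrival that finds no posted receive draws an RNR NAK.

```python
# Toy model: SRQ descriptor drain under a small-message flood.
def first_rnr(posted_receives, arrivals_per_step, repost_per_step, steps):
    """Return the first step at which an arrival finds no posted receive
    (i.e. an RNR NAK would be generated), or None if the SRQ never drains."""
    free = posted_receives
    for step in range(steps):
        # Repost up to the SRQ depth, then consume this step's arrivals.
        free = min(posted_receives, free + repost_per_step) - arrivals_per_step
        if free < 0:
            return step          # an arrival found no posted receive -> RNR NAK
    return None

# Arrivals outpace reposting by 10 descriptors per step:
print(first_rnr(512, 50, 40, 1000))   # -> 47: the SRQ drains and the pipeline stalls
print(first_rnr(512, 40, 40, 1000))   # -> None: reposting keeps up, no RNR NAK
```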

Page 5:

Why not just tune the buffer size?

There is no “one size fits all” solution! Message size patterns differ based on:

Number of processes in the parallel job
Input deck
Identity / function in the parallel job

Need to balance optimization between:
Performance
Memory footprint

Tuning for each application run is not acceptable

Page 6:

What Do Users Want?

Optimal performance is important
But predictability at “acceptable” performance is more important

HPC users want a default/“good enough” solution
Parameter tweaking is fine for papers
Not for our end users

Parameter explosion
OMPI OpenFabrics-related driver parameters: 48
OMPI other parameters: …many…

Page 7:

What Do Others Do?

Portals
Contiguous memory region for unexpected messages (receiver-managed offset semantic)

Myrinet GM
Variable size receive buffers can be allocated
Sender specifies which size receive buffer to consume (SIZE & PRIORITY fields)

Quadrics Elan
TPORTS manages pools of buffers of various sizes
On receipt of an unexpected message a buffer is chosen from the relevant pool

Page 8:

Bucket-SRQ

Inspired by standard bucket allocation methods

Multiple “buckets” of receive descriptors are created in multiple SRQs
Each is associated with a different size buffer

A small pool of per-peer resources is also allocated
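The bucket-selection step can be sketched as follows (the bucket sizes here are illustrative assumptions, not the sizes used in the talk): each SRQ posts buffers of one size, and a message is matched to the smallest bucket whose buffers can hold it.

```python
# Sketch of the bucket idea: pick the smallest bucket that fits the message.
import bisect

BUCKET_SIZES = [128, 1024, 8 * 1024, 64 * 1024]   # hypothetical bucket sizes (bytes)

def pick_bucket(msg_size):
    """Index of the smallest bucket whose buffer size >= msg_size,
    or None if the message is too large for any bucket."""
    i = bisect.bisect_left(BUCKET_SIZES, msg_size)
    return i if i < len(BUCKET_SIZES) else None   # None -> per-peer / rendezvous path

print(pick_bucket(100))       # -> 0 (128-byte bucket)
print(pick_bucket(4000))      # -> 2 (8 KB bucket)
print(pick_bucket(10 ** 6))   # -> None (too large for any bucket)
```

A flood of small messages then drains only the small-buffer bucket, leaving the larger buffers (and their memory) untouched.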

Page 9:

Bucket-SRQ

Page 10:

Performance Implications

Good overall performance
Decreased/no RNR NAKs from draining SRQ

• Never trigger “SRQ limit reached” event

Latency penalty for SRQ ~1 usec

Large number of QPs may not be efficient
Still investigating impact of high QP count on performance

Page 11:

Results

Evaluation applications
SAGE (DOE/LANL application)
Sweep3D (DOE/LANL application)
NAS Parallel Benchmarks (benchmark)

Instrumented Open MPI
Measured receive buffer efficiency:
Size of data received / size of receive buffer
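A minimal sketch of the instrumentation metric, taking efficiency as the fraction of posted buffer space actually filled by received data (the message and buffer sizes below are illustrative assumptions):

```python
# Receive buffer efficiency: bytes received / bytes of buffer consumed.
def buffer_efficiency(deliveries):
    """deliveries: list of (bytes_received, buffer_size) pairs, one per message."""
    received = sum(r for r, _ in deliveries)
    consumed = sum(b for _, b in deliveries)
    return received / consumed

# 128-byte messages landing in 8 KB buffers are ~1.6% efficient,
# while the same messages in 128-byte bucket buffers are 100% efficient.
print(buffer_efficiency([(128, 8192)] * 100))   # -> 0.015625
print(buffer_efficiency([(128, 128)] * 100))    # -> 1.0
```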

Page 12:

SAGE: Hydrodynamics

SAGE – SAIC’s Adaptive Grid Eulerian hydrocode

Hydrodynamics code with Adaptive Mesh Refinement (AMR)

Applied to: water shock, energy coupling, hydro instability problems, etc.

Routinely run on thousands of processors.

Scaling characteristic: Weak

Data Decomposition (Default): 1-D (of a 3-D AMR spatial grid)

"Predictive Performance and Scalability Modeling of a Large-Scale Application", D.J. Kerbyson, H.J. Alme, A. Hoisie, F. Petrini, H.J. Wasserman, M. Gittings, in Proc. SC2001, Denver, 2001.
Courtesy: PAL Team - LANL

Page 13:

SAGE

Adaptive Mesh Refinement (AMR) hydro-code

3 repeated phases

Gather data (including processor boundary data)
Compute
Scatter data (send back results)

3-D spatial grid, partitioned in 1-D

Parallel characteristics
Message sizes vary, typically 10s - 100s of Kbytes
Distance between neighbors increases with scale

Courtesy: PAL Team - LANL

Page 14:

SAGE: Receive Buffer Usage

256 Processes

Page 15:

SAGE: Receive Buffer Usage

4096 Processes

Page 16:

SAGE: Receive buffer efficiency

Page 17:

SAGE: Performance

Page 18:

Sweep3D

3-D spatial grid, partitioned in 2-D

Pipelined wavefront processing
Dependency in ‘sweep’ direction

Parallel characteristics: logical neighbors in X and Y
Small message sizes: 100s of bytes (typical)
Number of processors determines pipeline length (PX + PY)

2-D example:

Courtesy: PAL Team - LANL

Page 19:

Sweep3D: Wavefront Algorithm

Characterized by a dependency in cell processing

[Diagram: wavefront processing in 1-D, 2-D, and 3-D; previously processed cells lie behind the wavefront edge]

Direction of wavefront can change; the sweep may start from any corner point

Courtesy: PAL Team - LANL

Page 20:

Sweep3D Receive Buffer Usage

256 Processes

Page 21:

Sweep3D: Receive Buffer Efficiency

Page 22:

Sweep3D: Performance

Page 23:

NPB Receive Buffer Usage

Class D 256 Processes

Page 24:

NPB Receive Buffer Efficiency

Class D 256 Processes

(IS benchmark not available for Class D)

Page 25:

NPB Performance Results

NPB Class D 256 Processes

Page 26:

Conclusions

Bucket SRQ provides
Good performance at scale
“One size fits most” solution
• Eliminates need to custom-tune each run

Minimizes receive buffer memory footprint
• No more than 25 MB was allocated for any run

Avoids RNR NAKs in communication patterns we examined

Page 27:

Future Work

Take advantage of ConnectX SRC feature to reduce the number of active QPs

Further examine our protocol at 4K+ processor count on SNL’s ThunderBird cluster