www.openfabrics.org

Resource Utilization in Large Scale InfiniBand Jobs

Galen M. Shipman

Los Alamos National Labs

LAUR-07-2873

The Problem

InfiniBand specifies that receive resources are consumed in order regardless of size

Small messages may therefore consume much larger receive buffers

At very large scale, many applications are dominated by small message transfers

Message sizes vary substantially from job to job and even rank to rank
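
The in-order consumption rule above can be illustrated with a small simulation. This is a minimal sketch, not the InfiniBand verbs API: the function name `receive_in_order`, the 4096-byte buffer size, and the message sizes are all assumptions chosen for illustration.

```python
from collections import deque


def receive_in_order(message_sizes, buffer_size, num_buffers):
    """Match each incoming message to the next posted buffer, in posting order.

    Returns (payload_bytes, consumed_buffer_bytes). Every message consumes a
    whole buffer, no matter how small the message is.
    """
    posted = deque([buffer_size] * num_buffers)  # hypothetical receive queue
    payload_bytes = 0
    consumed_bytes = 0
    for size in message_sizes:
        if size > buffer_size:
            raise ValueError("message does not fit in a posted buffer")
        if not posted:
            break  # queue is empty; on real hardware this is where RNR NAKs begin
        consumed_bytes += posted.popleft()
        payload_bytes += size
    return payload_bytes, consumed_bytes


payload, consumed = receive_in_order([64, 128, 4096, 32], buffer_size=4096, num_buffers=8)
print(payload, consumed)  # 4320 bytes of data occupy 16384 bytes of buffer
```

In this made-up trace, three of the four messages are far smaller than the buffer they consume, which is the waste the slide describes.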


Receive Buffer Efficiency


Implication for SRQ

Flood of small messages may exhaust SRQ resources

Probability of RNR NAK increases

• Stalls the pipeline

Performance degrades

• Wasted resource utilization

• Application may not complete within allotted time slot (12+ hours for some jobs)


Why not just tune the buffer size?

There is no “one size fits all” solution!

Message size patterns differ based on:

• Number of processes in the parallel job

• Input deck

• Identity / function in the parallel job

Need to balance optimization between:

• Performance

• Memory footprint

Tuning for each application run is not acceptable


What Do Users Want?

Optimal performance is important

But predictability at “acceptable” performance is more important

HPC users want a default/“good enough” solution

• Parameter tweaking is fine for papers

• Not for our end users

Parameter explosion

• OMPI OpenFabrics-related driver parameters: 48

• OMPI other parameters: …many…


What Do Others Do?

Portals

• Contiguous memory region for unexpected messages (receiver-managed offset semantic)

Myrinet GM

• Variable size receive buffers can be allocated

• Sender specifies which size receive buffer to consume (SIZE & PRIORITY fields)

Quadrics Elan

• TPORTS manages pools of buffers of various sizes

• On receipt of an unexpected message a buffer is chosen from the relevant pool


Bucket-SRQ

Inspired by standard bucket allocation methods

Multiple “buckets” of receive descriptors are created in multiple SRQs

• Each is associated with a different buffer size

A small pool of per-peer resources is also allocated
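
The bucket idea can be sketched as choosing the smallest bucket whose buffers fit a message. The slides do not give the bucket sizes or the exact selection rule, so the sizes in `BUCKET_SIZES` and the function `pick_bucket` are hypothetical; the per-peer pool is left out of this sketch.

```python
import bisect

# Hypothetical buffer sizes in bytes, one per SRQ "bucket", sorted ascending.
BUCKET_SIZES = [128, 1024, 4096, 65536]


def pick_bucket(message_size, bucket_sizes=BUCKET_SIZES):
    """Return the index of the smallest bucket whose buffers can hold the message."""
    idx = bisect.bisect_left(bucket_sizes, message_size)
    if idx == len(bucket_sizes):
        raise ValueError("message is larger than the largest bucket")
    return idx


for size in (100, 128, 129, 5000):
    bucket = pick_bucket(size)
    print(size, "->", BUCKET_SIZES[bucket])
```

With buckets like these, a 100-byte message consumes a 128-byte buffer instead of the largest buffer size, which is where the memory savings would come from.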


Bucket-SRQ


Performance Implications

Good overall performance

Decreased/no RNR NAKs from draining SRQ

• Never trigger “SRQ limit reached” event

Latency penalty for SRQ: ~1 usec

Large number of QPs may not be efficient

• Still investigating impact of high QP count on performance


Results

Evaluation applications

• SAGE (DOE/LANL application)

• Sweep3D (DOE/LANL application)

• NAS Parallel Benchmarks (benchmark)

Instrumented Open MPI

• Measured receive buffer efficiency: size of receive buffer / size of data received
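
A minimal sketch of the measurement, computing the ratio as written on this slide together with its reciprocal (the fraction of buffer space that actually held data). The function names and the sample trace are assumptions for illustration, not output from the instrumented Open MPI.

```python
def buffer_to_data_ratio(buffer_sizes, received_sizes):
    """Size of receive buffers consumed divided by size of data received."""
    total_data = sum(received_sizes)
    if total_data == 0:
        raise ValueError("no data received")
    return sum(buffer_sizes) / total_data


def fraction_of_buffer_used(buffer_sizes, received_sizes):
    """Reciprocal view: share of the consumed buffer space that held data."""
    total_buffer = sum(buffer_sizes)
    if total_buffer == 0:
        raise ValueError("no buffers consumed")
    return sum(received_sizes) / total_buffer


# Hypothetical trace: four messages, each landing in a 4096-byte buffer.
buffers = [4096, 4096, 4096, 4096]
received = [64, 128, 4096, 32]
print(buffer_to_data_ratio(buffers, received))
print(fraction_of_buffer_used(buffers, received))
```

A ratio close to 1 means the buffers were well matched to the messages; larger values mean more buffer space was consumed per byte of data.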


SAGE: Hydrodynamics

SAGE – SAIC’s Adaptive Grid Eulerian hydrocode

Hydrodynamics code with Adaptive Mesh Refinement (AMR)

Applied to: water shock, energy coupling, hydro instability problems, etc.

Routinely run on 1,000’s of processors.

Scaling characteristic: Weak

Data Decomposition (Default): 1-D (of a 3-D AMR spatial grid)

"Predictive Performance and Scalability Modeling of a Large-Scale Application", D.J. Kerbyson, H.J. Alme, A. Hoisie, F. Petrini, H.J. Wasserman, M. Gittings, in Proc. SC, Denver, 2001

Courtesy: PAL Team - LANL


SAGE

Adaptive Mesh Refinement (AMR) hydro-code

3 repeated phases

• Gather data (including processor boundary data)

• Compute

• Scatter data (send back results)

3-D spatial grid, partitioned in 1-D

Parallel characteristics

• Message sizes vary, typically 10s to 100s of Kbytes

• Distance between neighbors increases with scale

Courtesy: PAL Team - LANL


SAGE: Receive Buffer Usage

256 Processes


SAGE: Receive Buffer Usage

4096 Processes


SAGE: Receive buffer efficiency


SAGE: Performance


Sweep3D

3-D spatial grid, partitioned in 2-D

Pipelined wavefront processing

• Dependency in ‘sweep’ direction

Parallel characteristics:

• Logical neighbors in X and Y

• Small message sizes: 100s of bytes (typical)

• Number of processors determines pipeline length (PX + PY)

2-D example:



Sweep3D: Wavefront Algorithm

Characterized by a dependency in cell processing

[Figure: wavefront progression in 1-D (cells 1 to 5), 2-D, and 3-D; legend: previously processed, wavefront edge]

Direction of wavefront can change; it can start from any corner point
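
The cell-processing dependency can be sketched for the 2-D case: a cell becomes ready once its upstream neighbors in X and Y are done, so all cells on the same anti-diagonal can be processed in the same step. This is an illustrative sketch only, not Sweep3D code; `wavefront_steps` and the grid size are assumptions, and the sweep is fixed to start from corner (0, 0).

```python
def wavefront_steps(px, py):
    """Group cells of a px-by-py grid by the step at which the wavefront reaches them.

    Cell (i, j) depends on (i - 1, j) and (i, j - 1), so it is reached at step i + j.
    """
    steps = [[] for _ in range(px + py - 1)]
    for i in range(px):
        for j in range(py):
            steps[i + j].append((i, j))
    return steps


for step, cells in enumerate(wavefront_steps(3, 2)):
    print(step, cells)
```

The number of steps here is px + py - 1, so the pipeline gets longer as the processor grid grows in either dimension.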



Sweep3D Receive Buffer Usage

256 Processes


Sweep3D: Receive Buffer Efficiency


Sweep3D: Performance


NPB Receive Buffer Usage

Class D 256 Processes


NPB Receive Buffer Efficiency

Class D 256 Processes

IS Benchmark Not Available for Class D


NPB Performance Results

NPB Class D 256 Processes


Conclusions

Bucket SRQ provides:

Good performance at scale

“One size fits most” solution

• Eliminates need to custom-tune each run

Minimizes receive buffer memory footprint

• No more than 25 MB was allocated for any run

Avoids RNR NAKs in communication patterns we examined


Future Work

Take advantage of ConnectX SRC feature to reduce the number of active QPs

Further examine our protocol at 4K+ processor count on SNL’s ThunderBird cluster
