LANL HPC Acceptance Testing with GPU Emphasis: experiences testing supercomputing clusters
TRANSCRIPT
Operated by Los Alamos National Security, LLC for NNSA
LANL HPC Acceptance Testing with GPU Emphasis
Craig Idler, LANL; Phil Romero, LANL; Laura Monroe, LANL
UNCLASSIFIED / LA-UR-13-20717
Overview
• Machine summary
• Briefly discuss overall acceptance test strategy and principles
• Reliability and performance expectations
• Tools and test types used during acceptance
• Experiences
• Moonlight specifics
Luna & Moonlight
Tri-Laboratory Capacity Clusters
Luna: similar to Moonlight, but no GPUs
Machine Summary (Name, CPU arch, OS, Number of Nodes, Peak TFlop/s)
• Luna, Intel Xeon Sandy Bridge, Linux, 1540, 513
• Cielo, AMD Magny-Cours, SLES/CLE, 8894, 1370
• Roadrunner, AMD Opteron + Cell BE, Fedora 9, 3060, 1380
• Typhoon, AMD Magny-Cours, Linux, 416, 106
• Cielito, AMD Magny-Cours, SLES/CLE, 68, 10.4
• Cerrillos, AMD Opteron + Cell BE, 360, 152
• Conejo, Intel Xeon, Linux, 620, 52.8
• Lobo, AMD Opteron, Linux, 272, 38.3
• Mapache, Intel Xeon, Linux, 592, 50.4
• Moonlight, Intel Xeon + NVIDIA Tesla M2090 (cores: 4,928 CPU + 315,392 GPU; dedicated PCIe x16 link to each M2090 GPU), Linux, 308, 488
• Mustang, AMD Opteron, Linux, 1600, 353
• Pinto, Intel Xeon, Linux, 154, 51.3
Moonlight Software
• Common Computing Environment (CCE):
  – Tri-Lab Operating System Stack (TOSS 2.0)
    • Red Hat Enterprise Linux 6 + enhancements for HPC
    • SLURM
    • Open MPI, MVAPICH, OFED InfiniBand software
    • Lustre and/or Panasas clients
    • System administration and management tools
  – 3rd-party licensed software
    • Intel, PGI, and PathScale compiler suites
    • Moab scheduler
    • TotalView debugger
  – Open source & CCE tools
    • Evolution of tools and capabilities on existing TLCC clusters
Acceptance Testing Principles
• Focus on functionality and performance testing before shipment
• Added focus on performance and stability testing after shipment
• Tests indirectly verify the user view of system functions (DRM, libraries or modules, file system(s) properly mounted, etc.)
• Tests target network, CPU, memory, and MPI software-based components.
• Limited file system testing at this point, since the real production infrastructure is not yet available.
  – Albeit enough file system to be useful.
• Some power characterization is now becoming part of the testing/evaluation process.
• Start with single-node tests and move to larger applications once satisfied with test results.
Acceptance Testing Process
• Define and create a set of tests for this architecture on a small "like" testbed
• Build/provision the cluster at the factory to "production" state, although with a limited file system
• Boot and run tests at the factory
  – Full coverage across all nodes
  – High utilization (> 92%) over 3+ days
  – Verify MTBF as stated in the RFP (< 0.1% node failures/day)
• Tear down, ship to site, rebuild and re-provision the entire cluster.
• Boot and run post-ship tests in our production environment.
  – Much like the pre-ship tests, but with longer running time and some added apps
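The factory stability criteria above (utilization > 92% over 3+ days, node failure rate < 0.1%/day) can be checked mechanically at the end of a run. A minimal sketch; the function name and the example counts are hypothetical, not from the source:

```python
# Sketch: check factory stability-run results against the RFP-style thresholds.
# All numbers in the example call are hypothetical, for illustration only.

def stability_ok(node_count, node_failures, days, busy_node_hours,
                 max_fail_rate=0.001, min_utilization=0.92):
    """Return (passed, fail_rate, utilization) for a stability run."""
    fail_rate = node_failures / (node_count * days)          # failures per node-day
    utilization = busy_node_hours / (node_count * days * 24.0)
    passed = fail_rate < max_fail_rate and utilization > min_utilization
    return passed, fail_rate, utilization

# Example: a 1540-node cluster, 3-day run, 3 node failures,
# 103,000 busy node-hours logged by the test harness.
ok, rate, util = stability_ok(1540, 3, 3, 103_000)
```

Here the run passes: the failure rate is about 0.065%/day and utilization about 92.9%.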
Acceptance Testing Tools
• Gazebo – primary test framework (DOE CCE-funded project)
  – Submits tests (jobs) under Moab
  – Keeps the system busy to a desired level
  – Identifies system utilization and node coverage
  – Test/job summary tool
  – Basic job failure analytics
• Splunk – commercial product (www.splunk.com) used for specific test performance summaries, job failure analytics, fast data queries, etc.
• gnuplot – test data visualization
Common Testing Experiences
• Slow network links – often cable problems or faulty interface devices on switches or nodes, and occasionally an improper link-speed configuration issue.
• Slow nodes – single-node tests such as HPL run slower than the "norm". Often a temperature problem (thermal throttling) with the node. Can sometimes be an issue with NUMA island mapping.
• Slow apps – jobs running slower than expected. If not one of the two prior problems, then network congestion or even an inadequate file system are suspect.
• Jobs fail or are slow to start – typically seen as some kind of MPI timeout error. Usually a version incompatibility with new system software, or large-scale jobs creating heavy startup load on the network infrastructure.
• GPUs with inconsistent performance – possibilities to be discussed.
• Failing jobs – often due to infant mortality of nodes during multi-day stability runs. The acceptable level at which this can occur is established with the vendor prior to the stability runs.
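Flagging "slow nodes" amounts to outlier detection on per-node benchmark results. A minimal sketch, assuming single-node HPL Gflop/s figures collected per node; the node names and numbers are hypothetical:

```python
# Sketch: flag nodes whose single-node HPL result falls well below the norm.
# Per-node Gflop/s results are hypothetical; real data would come from the harness.
from statistics import mean, stdev

def slow_nodes(results, z_threshold=3.0):
    """Return node names whose result is > z_threshold std devs below the mean."""
    mu = mean(results.values())
    sigma = stdev(results.values())
    return sorted(name for name, gf in results.items()
                  if (mu - gf) / sigma > z_threshold)

results = {f"node{i:03d}": 330.0 for i in range(20)}
results["node007"] = 240.0   # e.g. a thermally throttled node, for illustration

print(slow_nodes(results))   # -> ['node007']
```

A node flagged this way would then be checked for thermal throttling or NUMA mapping issues, as described above.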
Gazebo Job Summary Output Example

Luna Gazebo Results for 2011-11-16
*** Job Summary ***
From: 2011-11-15 16:50:00 through 2011-11-16 06:59:59

cluster lu:
Test                               Total  Passed  Failed  Unknown  Avg. Run Time  Node Hours
=================================  =====  ======  ======  =======  =============  ==========
COMOPS.16x256()                      115     115       0        0         727.77     371.971
HPCC.2x32()                           88      88       0        0        1470.86     71.9087
HPCC.32x512()                         44      44       0        0        1522.95     595.643
MATMULT.1x16()                       220     220       0        0        1185.59     72.4527
NPB.16x256(bt lu sp)                 220     220       0        0         534.33     522.456
SPaSM-bench.16x256()                  92      90       0        2        1237.27     494.908
SPaSM-bench.64x1024()                176     174       0        2        1409.17     4359.03
STREAM.16x256()                       92      92       0        0          93.08     38.0594
STREAM.3x128()                         1       1       0        0          92.81     0.0773417
STREAM.8x128()                        91      91       0        0          92.96     18.7986
STRIDE.4x64()                         90      90       0        0        3345.21     334.521
chkout-imb.8x128()                   115     114       1        0          97.23     24.6316
hpl-gnu-mkl.2x32(HPL-32.dat)         230     229       1        0        2400.97     305.457
hpl-gnu-mkl.32x512(HPL-512.dat)      113     112       0        1        1448.80     1442.36
hpl-gnu-mkl.64x1024(HPL-1024.dat)     63      61       1        1        1879.67     2038.4
iperf.3x128()                          1       0       0        1          10.08     0
iperf.4x64()                         184     182       2        0          34.81     7.03936

First test started at: 2011-11-15 16:56:53-0700
Last test started at: 2011-11-16 08:01:54-0700
Job Totals:  Passed: 2762  Failed: 9  Unknown: 7 (generally means jobs in progress)
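Summaries like the one above are easy to post-process. A minimal sketch that tallies totals from rows of (test, total, passed, failed, unknown, avg. run time, node hours), assuming the whitespace-separated layout shown on this slide (not Gazebo's output format in general):

```python
# Sketch: tally pass/fail counts and node-hours from Gazebo-style summary rows.
# Parsing assumes the 7-column whitespace-separated layout shown above.

def tally(lines):
    passed = failed = unknown = 0
    node_hours = 0.0
    for line in lines:
        parts = line.split()
        if len(parts) != 7:
            continue  # skip headers, separators, and blank lines
        _name, _total, p, f, u, _avg, nh = parts
        passed += int(p)
        failed += int(f)
        unknown += int(u)
        node_hours += float(nh)
    return passed, failed, unknown, node_hours

rows = [
    "COMOPS.16x256() 115 115 0 0 727.77 371.971",
    "chkout-imb.8x128() 115 114 1 0 97.23 24.6316",
    "iperf.4x64() 184 182 2 0 34.81 7.03936",
]
p, f, u, nh = tally(rows)   # p=411, f=3, u=0
```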
Gazebo Failed Job Summary Analysis

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-16/lu/SPaSM-bench/SPaSM-bench__runSPasm__11145__lu.2011-11-16T03:12:07-0700
###, 2011-11-16, 03:12:07, SPaSM-bench, 256, 1, incomplete, -, Time Limit Exceeded (1 hr limit, 20 min avg.)

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-16/lu/SPaSM-bench/SPaSM-bench__runSPasm__11762__lu.2011-11-16T06:17:04-0700
###, 2011-11-16, 06:17:04, SPaSM-bench, 256, 1, incomplete, -, Time Limit Exceeded (1 hr limit, 20 min avg.)

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-15/lu/chkout-imb/chkout-imb__runImb__10379__lu.2011-11-15T23:17:33-0700
###, 2011-11-15, 23:17:33, chkout-imb, 128, 1, failing, -, multiple warnings due to SendRecv rate dropping below 1800MB/sec (1711 - 1791)

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-15/lu/hpl-gnu-mkl/hpl-gnu-mkl__runHpl__9890__lu.2011-11-15T20:47:27-0700
###, 2011-11-15, 20:47:27, hpl-gnu-mkl, 32, 1, failing, -, HPL below efficiency minimum of 50% (82% normal)

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-16/lu/hpl-gnu-mkl/hpl-gnu-mkl__runHpl__11631__lu.2011-11-16T05:38:28-0700
###, 2011-11-16, 05:38:28, hpl-gnu-mkl, 512, 1, incomplete, -, Job canceled @ 06:00:04 due to node failure

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-15/lu/hpl-gnu-mkl/hpl-gnu-mkl__runHpl__10396__lu.2011-11-15T23:44:45-0700
###, 2011-11-15, 23:44:45, hpl-gnu-mkl, 1024, 1, failing, -, HPL below efficiency minimum of 50% (79% normal)
GPGPU Testing
• Utilized four different clusters: two smaller test beds and clusters of 92 and 308 nodes.
• Used SHOC tests (all level 0), Gpuburn, GPUBandwidth, Nbody, Luxrays, and HPL versions 13 and 14 modified to ensure proper affinities.
• Performance transferring data from GPUs to CPUs varies widely even when correct affinities are utilized. This is not true when transferring data from CPUs to GPUs.
• GPUs are much more sensitive to memory usage on the host than CPUs, sometimes even producing non-linear effects on performance.
• CPUs perform in a tight performance range closely approximating a normal distribution; with the High Performance Linpack test, GPGPU performance has wider variability and less closely approximates a normal distribution.
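The spread comparison in the last bullet can be quantified with simple summary statistics. A minimal sketch over hypothetical per-node HPL samples, using the coefficient of variation as the spread measure (the Gflop/s values are invented for illustration):

```python
# Sketch: compare the spread of CPU-only vs CPU+GPU single-node HPL results.
# Both sample sets are hypothetical, for illustration only.
from statistics import mean, stdev

def coeff_of_variation(samples):
    """Relative spread: standard deviation as a fraction of the mean."""
    return stdev(samples) / mean(samples)

cpu_only = [330, 331, 329, 330, 332, 328, 330, 331]          # tight range
cpu_gpu  = [1150, 1240, 980, 1220, 1100, 1235, 1050, 1210]   # wider variability

# The CPU-only distribution is far tighter than the CPU+GPU one.
assert coeff_of_variation(cpu_gpu) > coeff_of_variation(cpu_only)
```

A normality test (e.g. a quantile-quantile comparison) would be the next step, but even the relative spread alone separates the two regimes clearly.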
SHOC Low Level Test Results
Finding Performance Anomalies from SHOC Low Level Test Results
Typical GPU to NUMA Node Transfer Bandwidth Distributions
HPL Power Usage and CPU/GPU Utilizations before and after SLURM Memory Usage Fix
Moonlight Single Node HPL Performance Distribution Utilizing Only CPUs
Moonlight Single Node HPL Performance Distribution Utilizing GPUs and CPUs
Moonlight Single Node HPL Performance Distribution by Node
3D Moonlight Single Node HPL Utilizing Both GPUs and CPUs
Moonlight Single Node HPL Performance Distribution Utilizing GPUs and CPUs After Elimination of Eight Outlier Nodes
Moonlight Single Node HPL Performance Standard Deviations Group Into Two Ranges
Moonlight Single Node HPL v14 Maximum Power to Each GPU, Colored by Performance
Moonlight Single Node HPL v14 Run Performance Distribution by Minimum Power Usage for Both GPUs
Movie of Moonlight Single Node HPL v14 Performance Distribution Changes with Minimum Power Reached on Both GPUs
Replacing IFB Boards Yields Results: fastest HPL so far (used 270 nodes, the maximum available at the time)

================================================================================
T/V                N       NB     P     Q       Time       Gflops
--------------------------------------------------------------------------------
WR10C2L4      906040     1024    15    36    2208.57    2.245e+05
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) = 0.0012534 ...... PASSED
================================================================================
Finished 1 tests with the following results:
  1 tests completed and passed residual checks,
  0 tests completed and failed residual checks,
  0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.

This result is enough to place 78th on the June 2012 list, in between the following entries (cores, Rmax TFlop/s, Rpeak TFlop/s, power kW):
• 77  Los Alamos National Laboratory, United States – Mustang - Xtreme-X 1320H-LANL, Opteron 12 Core 2.30 GHz, Infiniband QDR / 2011, Appro – 37056, 230.60, 340.92, 540.4
• 78  Universitaet Aachen/RWTH, Germany – RWTH Compute Cluster (RCC) - Bullx B500 Cluster, Xeon X56xx 3.06GHz, QDR Infiniband / 2011, Bull – 25448, 219.84, 270.54
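As a sanity check, HPL's reported Gflop/s follows from its standard operation count, (2/3)N³ + 2N², divided by wall time. Verifying the run above:

```python
# Sketch: reproduce the reported HPL Gflop/s from N and wall-clock time.
# HPL counts (2/3)*N^3 + 2*N^2 floating-point operations for an N x N solve.

def hpl_gflops(n, seconds):
    ops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return ops / seconds / 1e9

gf = hpl_gflops(906040, 2208.57)   # the run above reports 2.245e+05 Gflop/s
```

The computed value agrees with the reported 2.245e+05 Gflop/s (about 224.5 TFlop/s) to within rounding.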
Current Status: Finding and Fixing Underperforming Nodes and Best Linpack Results