LANL HPC Acceptance Testing with GPU Emphasis: experiences testing supercomputing clusters
TRANSCRIPT
Operated by Los Alamos National Security, LLC for NNSA
LANL HPC Acceptance Testing with GPU Emphasis
Craig Idler, LANL; Phil Romero, LANL; Laura Monroe, LANL
UNCLASSIFIED / LA-UR-13-20717
Overview
• Machine summary
• Briefly discuss overall acceptance test strategy and principles
• Reliability and performance expectations
• Tools and test types used during acceptance
• Experiences
• Moonlight specifics
Luna & Moonlight
Tri-Laboratory Capacity Clusters
Luna: similar to Moonlight, but no GPUs
Machine Summary (Name, CPU arch, OS, Number of Nodes, Peak TFlop/s)
• Luna, Intel Xeon Sandy Bridge, Linux, 1540, 513
• Cielo, AMD Magny-Cours, SLES/CLE, 8894, 1370
• Roadrunner, AMD Opteron + Cell BE, Fedora 9, 3060, 1380
• Typhoon, AMD Magny-Cours, Linux, 416, 106
• Cielito, AMD Magny-Cours, SLES/CLE, 68, 10.4
• Cerrillos, AMD Opteron + Cell BE, 360, 152
• Conejo, Intel Xeon, Linux, 620, 52.8
• Lobo, AMD Opteron, Linux, 272, 38.3
• Mapache, Intel Xeon, Linux, 592, 50.4
• Moonlight, Intel Xeon + NVIDIA Tesla M2090 (cores: 4,928 CPU + 315,392 GPU; dedicated PCIe x16 link to each M2090 GPU), Linux, 308, 488
• Mustang, AMD Opteron, Linux, 1600, 353
• Pinto, Intel Xeon, Linux, 154, 51.3
Moonlight Software
• Common Computing Environment (CCE):
  – Tri-Lab Operating System Stack (TOSS 2.0)
    • Red Hat Enterprise Linux 6 + enhancements for HPC
    • SLURM
    • Open MPI, MVAPICH, OFED InfiniBand software
    • Lustre and/or Panasas clients
    • System administration and management tools
  – 3rd-party licensed software
    • Intel, PGI, and PathScale compiler suites
    • Moab scheduler
    • TotalView debugger
  – Open source & CCE tools
    • Evolution of tools and capabilities on existing TLCC clusters
Acceptance Testing Principles
• Focus on functionality and performance testing before shipment
• Added focus on performance and stability testing after shipment
• Tests indirectly verify the user view of system functions (DRM, libraries or modules, file system(s) properly mounted, etc.)
• Tests target network, CPU, memory, and MPI software-based components.
• Limited file system testing at this point, since the real production infrastructure is not yet available.
  – Albeit enough file system to be useful.
• Some power characterization is now becoming part of the testing/evaluation process.
• Start with single-node tests and move to larger applications once satisfied with test results.
Acceptance Testing Process
• Define and create a set of tests for this architecture on a small "like" testbed
• Build/provision the cluster at the factory to "production" state, although with a limited file system
• Boot and run tests at the factory
  – Full coverage across all nodes
  – High utilization (> 92%) over 3+ days
  – Verify MTBF as stated in the RFP (< 0.1% node failures/day)
• Tear down, ship to site, rebuild and re-provision the entire cluster.
• Boot and run post-ship tests in our production environment.
  – Much like the pre-ship tests, but with longer running time and some added apps
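The factory stability criteria above (utilization > 92% over 3+ days, node failure rate < 0.1%/day) can be checked mechanically at the end of a run. A minimal sketch; the function name and the example counts are hypothetical, not from the source:

```python
# Sketch: check factory stability-run results against the RFP-style thresholds.
# All numbers in the example call are hypothetical, for illustration only.

def stability_ok(node_count, node_failures, days, busy_node_hours,
                 max_fail_rate=0.001, min_utilization=0.92):
    """Return (passed, fail_rate, utilization) for a stability run."""
    fail_rate = node_failures / (node_count * days)          # failures per node-day
    utilization = busy_node_hours / (node_count * days * 24.0)
    passed = fail_rate < max_fail_rate and utilization > min_utilization
    return passed, fail_rate, utilization

# Example: a 1540-node cluster, 3-day run, 3 node failures,
# 103,000 busy node-hours logged by the test harness.
ok, rate, util = stability_ok(1540, 3, 3, 103_000)
```

Here the run passes: the failure rate is about 0.065%/day and utilization about 92.9%.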
Acceptance Testing Tools
• Gazebo – primary test framework (DOE CCE-funded project)
  – Submits tests (jobs) under Moab
  – Keeps the system busy to a desired level
  – Identifies system utilization and node coverage
  – Test/job summary tool
  – Basic job failure analytics
• Splunk – commercial product (www.splunk.com) used for specific test performance summaries, job failure analytics, fast data queries, etc.
• gnuplot – test data visualization
Common Testing Experiences
• Slow network links – often cable problems or faulty interface devices on switches or nodes, and occasionally an improper link-speed configuration issue.
• Slow nodes – single-node tests such as HPL run slower than the "norm". Often a temperature problem (thermal throttling) with the node. Can sometimes be an issue with NUMA island mapping.
• Slow apps – jobs running slower than expected. If not one of the two prior problems, then network congestion or even an inadequate file system are suspect.
• Jobs fail or are slow to start – typically seen as some kind of MPI timeout error. Usually a version incompatibility with new system software, or large-scale jobs creating heavy startup load on the network infrastructure.
• GPUs with inconsistent performance – possibilities to be discussed.
• Failing jobs – often due to infant mortality of nodes during multi-day stability runs. The acceptable level at which this can occur is established with the vendor prior to the stability runs.
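Flagging "slow nodes" amounts to outlier detection on per-node benchmark results. A minimal sketch, assuming single-node HPL Gflop/s figures collected per node; the node names and numbers are hypothetical:

```python
# Sketch: flag nodes whose single-node HPL result falls well below the norm.
# Per-node Gflop/s results are hypothetical; real data would come from the harness.
from statistics import mean, stdev

def slow_nodes(results, z_threshold=3.0):
    """Return node names whose result is > z_threshold std devs below the mean."""
    mu = mean(results.values())
    sigma = stdev(results.values())
    return sorted(name for name, gf in results.items()
                  if (mu - gf) / sigma > z_threshold)

results = {f"node{i:03d}": 330.0 for i in range(20)}
results["node007"] = 240.0   # e.g. a thermally throttled node, for illustration

print(slow_nodes(results))   # -> ['node007']
```

A node flagged this way would then be checked for thermal throttling or NUMA mapping issues, as described above.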
Gazebo Job Summary Output Example

Luna Gazebo Results for 2011-11-16
*** Job Summary ***
From: 2011-11-15 16:50:00 through 2011-11-16 06:59:59

cluster lu:
Test                               Total  Passed  Failed  Unknown  Avg. Run Time  Node Hours
=================================  =====  ======  ======  =======  =============  ==========
COMOPS.16x256()                      115     115       0        0         727.77     371.971
HPCC.2x32()                           88      88       0        0        1470.86     71.9087
HPCC.32x512()                         44      44       0        0        1522.95     595.643
MATMULT.1x16()                       220     220       0        0        1185.59     72.4527
NPB.16x256(bt lu sp)                 220     220       0        0         534.33     522.456
SPaSM-bench.16x256()                  92      90       0        2        1237.27     494.908
SPaSM-bench.64x1024()                176     174       0        2        1409.17     4359.03
STREAM.16x256()                       92      92       0        0          93.08     38.0594
STREAM.3x128()                         1       1       0        0          92.81     0.0773417
STREAM.8x128()                        91      91       0        0          92.96     18.7986
STRIDE.4x64()                         90      90       0        0        3345.21     334.521
chkout-imb.8x128()                   115     114       1        0          97.23     24.6316
hpl-gnu-mkl.2x32(HPL-32.dat)         230     229       1        0        2400.97     305.457
hpl-gnu-mkl.32x512(HPL-512.dat)      113     112       0        1        1448.80     1442.36
hpl-gnu-mkl.64x1024(HPL-1024.dat)     63      61       1        1        1879.67     2038.4
iperf.3x128()                          1       0       0        1          10.08     0
iperf.4x64()                         184     182       2        0          34.81     7.03936

First test started at: 2011-11-15 16:56:53-0700
Last test started at: 2011-11-16 08:01:54-0700
Job Totals:  Passed: 2762  Failed: 9  Unknown: 7 (generally means jobs in progress)
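Summaries like the one above are easy to post-process. A minimal sketch that tallies totals from rows of (test, total, passed, failed, unknown, avg. run time, node hours), assuming the whitespace-separated layout shown on this slide (not Gazebo's output format in general):

```python
# Sketch: tally pass/fail counts and node-hours from Gazebo-style summary rows.
# Parsing assumes the 7-column whitespace-separated layout shown above.

def tally(lines):
    passed = failed = unknown = 0
    node_hours = 0.0
    for line in lines:
        parts = line.split()
        if len(parts) != 7:
            continue  # skip headers, separators, and blank lines
        _name, _total, p, f, u, _avg, nh = parts
        passed += int(p)
        failed += int(f)
        unknown += int(u)
        node_hours += float(nh)
    return passed, failed, unknown, node_hours

rows = [
    "COMOPS.16x256() 115 115 0 0 727.77 371.971",
    "chkout-imb.8x128() 115 114 1 0 97.23 24.6316",
    "iperf.4x64() 184 182 2 0 34.81 7.03936",
]
p, f, u, nh = tally(rows)   # p=411, f=3, u=0
```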
Gazebo Failed Job Summary Analysis

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-16/lu/SPaSM-bench/SPaSM-bench__runSPasm__11145__lu.2011-11-16T03:12:07-0700
###, 2011-11-16, 03:12:07, SPaSM-bench, 256, 1, incomplete, -, Time Limit Exceeded (1 hr limit, 20 min avg.)

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-16/lu/SPaSM-bench/SPaSM-bench__runSPasm__11762__lu.2011-11-16T06:17:04-0700
###, 2011-11-16, 06:17:04, SPaSM-bench, 256, 1, incomplete, -, Time Limit Exceeded (1 hr limit, 20 min avg.)

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-15/lu/chkout-imb/chkout-imb__runImb__10379__lu.2011-11-15T23:17:33-0700
###, 2011-11-15, 23:17:33, chkout-imb, 128, 1, failing, -, multiple warnings due to SendRecv rate dropping below 1800MB/sec (1711 - 1791)

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-15/lu/hpl-gnu-mkl/hpl-gnu-mkl__runHpl__9890__lu.2011-11-15T20:47:27-0700
###, 2011-11-15, 20:47:27, hpl-gnu-mkl, 32, 1, failing, -, HPL below efficiency minimum of 50% (82% normal)

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-16/lu/hpl-gnu-mkl/hpl-gnu-mkl__runHpl__11631__lu.2011-11-16T05:38:28-0700
###, 2011-11-16, 05:38:28, hpl-gnu-mkl, 512, 1, incomplete, -, Job canceled @ 06:00:04 due to node failure

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-15/lu/hpl-gnu-mkl/hpl-gnu-mkl__runHpl__10396__lu.2011-11-15T23:44:45-0700
###, 2011-11-15, 23:44:45, hpl-gnu-mkl, 1024, 1, failing, -, HPL below efficiency minimum of 50% (79% normal)
GPGPU Testing
• Utilized four different clusters: two smaller test beds and clusters of 92 and 308 nodes.
• Used SHOC tests (all level 0), Gpuburn, GPUBandwidth, Nbody, Luxrays, and HPL versions 13 and 14 modified to ensure proper affinities.
• Performance transferring data from GPUs to CPUs varies widely even when correct affinities are utilized. This is not true when transferring data from CPUs to GPUs.
• GPUs are much more sensitive to memory usage on the host than CPUs, sometimes even producing non-linear effects on performance.
• CPUs perform in a tight performance range closely approximating a normal distribution; with the High Performance Linpack test, GPGPU performance has wider variability and less closely approximates a normal distribution.
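The spread comparison in the last bullet can be quantified with simple summary statistics. A minimal sketch over hypothetical per-node HPL samples, using the coefficient of variation as the spread measure (the Gflop/s values are invented for illustration):

```python
# Sketch: compare the spread of CPU-only vs CPU+GPU single-node HPL results.
# Both sample sets are hypothetical, for illustration only.
from statistics import mean, stdev

def coeff_of_variation(samples):
    """Relative spread: standard deviation as a fraction of the mean."""
    return stdev(samples) / mean(samples)

cpu_only = [330, 331, 329, 330, 332, 328, 330, 331]          # tight range
cpu_gpu  = [1150, 1240, 980, 1220, 1100, 1235, 1050, 1210]   # wider variability

# The CPU-only distribution is far tighter than the CPU+GPU one.
assert coeff_of_variation(cpu_gpu) > coeff_of_variation(cpu_only)
```

A normality test (e.g. a quantile-quantile comparison) would be the next step, but even the relative spread alone separates the two regimes clearly.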
SHOC Low Level Test Results
Finding Performance Anomalies from SHOC Low Level Test Results
Typical GPU to NUMA Node Transfer Bandwidth Distributions
HPL Power Usage and CPU/GPU Utilizations before and after SLURM Memory Usage Fix
Moonlight Single Node HPL Performance Distribution Utilizing Only CPUs
Moonlight Single Node HPL Performance Distribution Utilizing GPUs and CPUs
Moonlight Single Node HPL Performance Distribution by Node
3D Moonlight Single Node HPL Utilizing Both GPUs and CPUs
Moonlight Single Node HPL Performance Distribution Utilizing GPUs and CPUs After Elimination of Eight Outlier Nodes
Moonlight Single Node HPL Performance Standard Deviations Group Into Two Ranges
Moonlight Single Node HPL v14 Maximum Power to Each GPU, Colored by Performance
Moonlight Single Node HPL v14 Run Performance Distribution by Minimum Power Usage for Both GPUs
Movie of Moonlight Single Node HPL v14 Performance Distribution Changes with Minimum Power Reached on Both GPUs
Replacing IFB Boards Yields Results: fastest HPL so far (used 270 nodes, the maximum available at the time)

================================================================================
T/V                N       NB     P     Q       Time       Gflops
--------------------------------------------------------------------------------
WR10C2L4      906040     1024    15    36    2208.57    2.245e+05
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) = 0.0012534 ...... PASSED
================================================================================
Finished 1 tests with the following results:
  1 tests completed and passed residual checks,
  0 tests completed and failed residual checks,
  0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.

This result is enough to place 78th on the June 2012 list, in between the following entries (cores, Rmax TFlop/s, Rpeak TFlop/s, power kW):
• 77  Los Alamos National Laboratory, United States – Mustang - Xtreme-X 1320H-LANL, Opteron 12 Core 2.30 GHz, Infiniband QDR / 2011, Appro – 37056, 230.60, 340.92, 540.4
• 78  Universitaet Aachen/RWTH, Germany – RWTH Compute Cluster (RCC) - Bullx B500 Cluster, Xeon X56xx 3.06GHz, QDR Infiniband / 2011, Bull – 25448, 219.84, 270.54
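As a sanity check, HPL's reported Gflop/s follows from its standard operation count, (2/3)N³ + 2N², divided by wall time. Verifying the run above:

```python
# Sketch: reproduce the reported HPL Gflop/s from N and wall-clock time.
# HPL counts (2/3)*N^3 + 2*N^2 floating-point operations for an N x N solve.

def hpl_gflops(n, seconds):
    ops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return ops / seconds / 1e9

gf = hpl_gflops(906040, 2208.57)   # the run above reports 2.245e+05 Gflop/s
```

The computed value agrees with the reported 2.245e+05 Gflop/s (about 224.5 TFlop/s) to within rounding.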
Current Status: Finding and Fixing Underperforming Nodes and Best Linpack Results