
Page 1:

Operated by Los Alamos National Security, LLC for NNSA

LANL HPC Acceptance Testing with GPU Emphasis
Experiences testing supercomputing clusters equipped with GPUs, how they differ from CPU-only clusters, finding and segregating …

Craig Idler, LANL; Phil Romero, LANL; Laura Monroe, LANL

UNCLASSIFIED / LA-UR-13-20717

Page 2:

Overview

• Machine summary
• Briefly discuss overall acceptance test strategy and principles
• Reliability and performance expectations
• Tools and test types used during acceptance
• Experiences
• Moonlight specifics


Page 3:

Luna & Moonlight

Tri-Laboratory Capacity Clusters

Luna: similar to Moonlight, but no GPUs

Page 4:

Machine Summary

Name        CPU architecture                  OS         Nodes  Peak TFlop/s
----------  --------------------------------  ---------  -----  ------------
Luna        Intel Xeon Sandy Bridge           Linux       1540    513
Cielo       AMD Magny-Cours                   SLES/CLE    8894   1370
Roadrunner  AMD Opteron + Cell BE             Fedora 9    3060   1380
Typhoon     AMD Magny-Cours                   Linux        416    106
Cielito     AMD Magny-Cours                   SLES/CLE      68   10.4
Cerrillos   AMD Opteron + Cell BE             -            360    152
Conejo      Intel Xeon                        Linux        620   52.8
Lobo        AMD Opteron                       Linux        272   38.3
Mapache     Intel Xeon                        Linux        592   50.4
Moonlight   Intel Xeon + NVIDIA Tesla M2090   Linux        308    488
Mustang     AMD Opteron                       Linux       1600    353
Pinto       Intel Xeon                        Linux        154   51.3

Moonlight: 4,928 CPU cores + 315,392 GPU cores, with a dedicated PCIe x16 link to each M2090 GPU.


Page 5:

Moonlight Software

• Common Computing Environment (CCE):
  – Tri-Lab Operating System Stack (TOSS 2.0)
    • Red Hat Enterprise Linux 6 + enhancements for HPC
    • SLURM
    • Open MPI, MVAPICH, OFED InfiniBand software
    • Lustre and/or Panasas clients
    • System administration and management tools
  – 3rd-party licensed software
    • Intel, PGI, and PathScale compiler suites
    • Moab scheduler
    • TotalView debugger
  – Open source & CCE tools
• Evolution of tools and capabilities on existing TLCC clusters


Page 6:

Acceptance Testing Principles

• Focus on functionality and performance testing before shipment
• Added focus on performance and stability testing after shipment
• Tests indirectly verify the user view of system functions (DRM, libraries or modules, file system(s) properly mounted, etc.)
• Tests target network, CPU, memory, and MPI software-based components.
• Limited file system testing at this point, since the real production infrastructure is not yet available; albeit enough file system to be useful.
• Some power characterization is now becoming part of the testing/evaluation process
• Start with single-node tests and move to larger applications once satisfied with the test results.


Page 7:

Acceptance Testing Process

• Define and create a set of tests for this architecture on a small "like" testbed
• Build/provision the cluster at the factory to "production" state, although with a limited file system
• Boot and run tests at the factory
  – Full coverage across all nodes
  – High utilization (> 92%) over 3+ days
  – Verify MTBF as stated in the RFP (< 0.1% node failures/day); a sketch of these two checks follows this list
• Tear down, ship to site, rebuild and re-provision the entire cluster.
• Boot and run post-ship tests in our production environment.
  – Much like the pre-ship tests, but with longer running time and some added apps
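A minimal sketch of the two quantitative factory gates above; only the thresholds (utilization above 92%, fewer than 0.1% node failures per day) come from this slide, while the function and record shapes are illustrative, not LANL's actual tooling:

# Minimal sketch of the pre-ship acceptance gates described above.
# Only the thresholds (utilization > 92%, < 0.1% node failures/day)
# come from the slide; everything else here is illustrative.

UTILIZATION_FLOOR = 0.92
DAILY_FAILURE_CEILING = 0.001     # fraction of nodes failing per day

def acceptance_gates(days, total_nodes, node_hours_used, failures_per_day):
    """Return (utilization_ok, failure_rate_ok) for a multi-day stability run."""
    available_node_hours = total_nodes * 24 * days
    utilization = node_hours_used / available_node_hours
    daily_failure_rate = sum(failures_per_day) / (total_nodes * days)
    return utilization > UTILIZATION_FLOOR, daily_failure_rate < DAILY_FAILURE_CEILING

# Example: a 3-day factory run on a 308-node (Moonlight-sized) cluster.
util_ok, rate_ok = acceptance_gates(
    days=3, total_nodes=308,
    node_hours_used=20800.0,       # summed from job accounting
    failures_per_day=[0, 0, 0])    # node failures observed each day
print(f"utilization gate: {util_ok}, failure-rate gate: {rate_ok}")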


Page 8:

Acceptance Testing Tools

• Gazebo – primary test framework (DOE CCE-funded project)
  – Submits tests (jobs) under Moab
  – Keeps the system busy to a desired level
  – Identifies system utilization and node coverage
  – Test/job summary tool
  – Basic job failure analytics
• Splunk – commercial product (www.splunk.com) used for specific test performance summaries, job failure analytics, fast data queries, etc.
• GnuPlot – test data visualization


Page 9:

Common Testing Experiences

• Slow network links – often cable problems or faulty interface devices on switches or nodes, and occasionally an improper link-speed configuration.
• Slow nodes – single-node tests such as HPL run slower than the "norm". Often a temperature problem (thermal throttling) with the node; can sometimes be an issue with NUMA island mapping. (A simple screen for such nodes is sketched after this list.)
• Slow apps – jobs running slower than expected. If neither of the two prior problems, then network congestion or even an inadequate file system is suspect.
• Jobs fail or are slow to start – typically seen as some kind of MPI timeout error. Usually a version incompatibility with new system software, or large-scale jobs creating a heavy startup load on the network infrastructure.
• GPUs with inconsistent performance – possibilities to be discussed.
• Failing jobs – often due to infant mortality of nodes during multi-day stability runs. The acceptable level at which this can occur is established with the vendor prior to the stability runs.
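One simple screen for the "slow node" case above is to compare each node's single-node HPL result against the fleet median. A minimal sketch; the node names, numbers, and the 95%-of-median threshold are illustrative choices, not the LANL criterion:

import statistics

def flag_slow_nodes(gflops_by_node, fraction_of_median=0.95):
    """Flag nodes whose single-node HPL result falls below a fraction of the fleet median.

    The median is used rather than the mean so that a few very slow
    (e.g. thermally throttling) nodes do not drag the reference point down.
    """
    median = statistics.median(gflops_by_node.values())
    floor = fraction_of_median * median
    return sorted(node for node, g in gflops_by_node.items() if g < floor)

# Example: one obviously throttling node among otherwise healthy ones.
results = {"ml001": 1385.2, "ml002": 1379.8, "ml003": 1391.0,
           "ml004": 1102.4, "ml005": 1388.6, "ml006": 1382.1}
print(flag_slow_nodes(results))   # -> ['ml004']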


Page 10:

Gazebo Job Summary Output Example

Luna Gazebo Results for 2011-11-16

*** Job Summary ***
From: 2011-11-15 16:50:00 through 2011-11-16 06:59:59

cluster lu:
Test                               Total  Passed  Failed  Unknown  Avg. Run Time  Node Hours
=================================  =====  ======  ======  =======  =============  ==========
COMOPS.16x256()                      115     115       0        0         727.77     371.971
HPCC.2x32()                           88      88       0        0        1470.86     71.9087
HPCC.32x512()                         44      44       0        0        1522.95     595.643
MATMULT.1x16()                       220     220       0        0        1185.59     72.4527
NPB.16x256(bt lu sp)                 220     220       0        0         534.33     522.456
SPaSM-bench.16x256()                  92      90       0        2        1237.27     494.908
SPaSM-bench.64x1024()                176     174       0        2        1409.17     4359.03
STREAM.16x256()                       92      92       0        0          93.08     38.0594
STREAM.3x128()                         1       1       0        0          92.81   0.0773417
STREAM.8x128()                        91      91       0        0          92.96     18.7986
STRIDE.4x64()                         90      90       0        0        3345.21     334.521
chkout-imb.8x128()                   115     114       1        0          97.23     24.6316
hpl-gnu-mkl.2x32(HPL-32.dat)         230     229       1        0        2400.97     305.457
hpl-gnu-mkl.32x512(HPL-512.dat)      113     112       0        1        1448.80     1442.36
hpl-gnu-mkl.64x1024(HPL-1024.dat)     63      61       1        1        1879.67      2038.4
iperf.3x128()                          1       0       0        1          10.08           0
iperf.4x64()                         184     182       2        0          34.81     7.03936

First test started at: 2011-11-15 16:56:53-0700
Last test started at: 2011-11-16 08:01:54-0700
Job Totals – Passed: 2762, Failed: 9, Unknown: 7 (unknown generally means jobs still in progress)

Page 11:

Gazebo Failed Job Summary Analysis

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-16/lu/SPaSM-bench/SPaSM-bench__runSPasm__11145__lu.2011-11-16T03:12:07-0700
###, 2011-11-16, 03:12:07, SPaSM-bench, 256, 1, incomplete, -, Time Limit Exceeded (1 hr limit, 20 min avg.)

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-16/lu/SPaSM-bench/SPaSM-bench__runSPasm__11762__lu.2011-11-16T06:17:04-0700
###, 2011-11-16, 06:17:04, SPaSM-bench, 256, 1, incomplete, -, Time Limit Exceeded (1 hr limit, 20 min avg.)

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-15/lu/chkout-imb/chkout-imb__runImb__10379__lu.2011-11-15T23:17:33-0700
###, 2011-11-15, 23:17:33, chkout-imb, 128, 1, failing, -, multiple warnings due to SendRecv rate dropping below 1800MB/sec (1711 - 1791)

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-15/lu/hpl-gnu-mkl/hpl-gnu-mkl__runHpl__9890__lu.2011-11-15T20:47:27-0700
###, 2011-11-15, 20:47:27, hpl-gnu-mkl, 32, 1, failing, -, HPL below efficiency minimum of 50% (82% normal)

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-16/lu/hpl-gnu-mkl/hpl-gnu-mkl__runHpl__11631__lu.2011-11-16T05:38:28-0700
###, 2011-11-16, 05:38:28, hpl-gnu-mkl, 512, 1, incomplete, -, Job canceled @ 06:00:04 due to node failure

/home/gazebo/atc-results-luna/gzshared/2011/2011-11/2011-11-15/lu/hpl-gnu-mkl/hpl-gnu-mkl__runHpl__10396__lu.2011-11-15T23:44:45-0700
###, 2011-11-15, 23:44:45, hpl-gnu-mkl, 1024, 1, failing, -, HPL below efficiency minimum of 50% (79% normal)


Page 12:

GPGPU Testing

• Utilized four different clusters: two smaller testbeds, plus clusters of 92 and 308 nodes.
• Used SHOC tests (all level 0), Gpuburn, GPUBandwidth, Nbody, Luxrays, and HPL versions 13 and 14, modified to ensure proper affinities (a sketch of affinity-pinned launching follows this list).
• Performance in transferring data from GPUs to CPUs varies widely even when correct affinities are utilized; this is not true when transferring data from CPUs to GPUs.
• GPUs are much more sensitive to memory usage on the host than CPUs are, sometimes even producing non-linear effects on performance.
• CPUs perform in a tight performance range closely approximating a normal distribution; on the High Performance Linpack test, GPGPU performance shows wider variability and less closely approximates a normal distribution.
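As a sketch of what "modified to ensure proper affinities" can look like at launch time, the snippet below binds a per-GPU test process to that GPU's local NUMA node before running it. The GPU-to-NUMA-node map and the test binary are hypothetical placeholders; on a real node the map comes from the PCIe topology (e.g. nvidia-smi topo -m), and numactl does the binding:

import os
import subprocess

# Hypothetical GPU -> NUMA-node map for a two-socket, two-GPU node.
# On real hardware, derive this from the PCIe topology (e.g. `nvidia-smi topo -m`).
GPU_NUMA_NODE = {0: 0, 1: 1}

def run_bandwidth_test(gpu, test_binary="./gpu_bandwidth"):
    """Run a (placeholder) bandwidth test bound to the GPU's local NUMA node.

    numactl restricts the process's CPUs and memory allocations to one NUMA
    node, so host<->device transfers do not cross the inter-socket link.
    """
    node = GPU_NUMA_NODE[gpu]
    cmd = ["numactl", f"--cpunodebind={node}", f"--membind={node}", test_binary]
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}  # expose one GPU
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

if __name__ == "__main__":
    for gpu in GPU_NUMA_NODE:
        result = run_bandwidth_test(gpu)
        print(f"GPU {gpu}: {result.stdout.strip()}")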

Page 13:

SHOC Low-Level Test Results

Page 14:

Finding Performance Anomalies from SHOC Low-Level Test Results

Page 15:

Typical GPU-to-NUMA-Node Transfer Bandwidth Distributions

Page 16:

HPL Power Usage and CPU/GPU Utilization Before and After the SLURM Memory Usage Fix


Page 17:

Moonlight Single Node HPL Performance Distribution Utilizing Only CPUs

Page 18:

Moonlight Single Node HPL Performance Distribution Utilizing GPUs and CPUs

Page 19:

Moonlight Single Node HPL Performance Distribution by Node

Page 20:

3D Moonlight Single Node HPL Utilizing Both GPUs and CPUs

Page 21:

Moonlight Single Node HPL Performance Distribution Utilizing GPUs and CPUs After Elimination of Eight Outlier Nodes

Page 22:

Moonlight Single Node HPL Performance Standard Deviations Group into Two Ranges

Page 23:

Moonlight Single Node HPL v14 Maximum Power to Each GPU, Colored by Performance

Page 24:

Moonlight Single Node HPL v14 Run Performance Distribution by Minimum Power Usage for Both GPUs

Page 25:

Movie of Moonlight Single Node HPL v14 Performance Distribution Changes with Minimum Power Reached on Both GPUs

Page 26:

Replacing IFB Boards Yields Results … Fastest HPL So Far (used 270 nodes, the maximum available at the time …)

================================================================================
T/V                N      NB     P     Q         Time          Gflops
--------------------------------------------------------------------------------
WR10C2L4      906040    1024    15    36      2208.57       2.245e+05
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0012534 ...... PASSED
================================================================================

Finished 1 tests with the following results:
  1 tests completed and passed residual checks,
  0 tests completed and failed residual checks,
  0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
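As a sanity check, the reported Gflops figure follows directly from N and the wall time via the standard HPL operation count 2N³/3 + 2N², and the machine summary's peak gives a rough efficiency. A minimal sketch; the linear scaling of peak with node count is an assumption:

# Cross-check the HPL result above from its own parameters.
N, time_s = 906_040, 2208.57
flops = (2.0 / 3.0) * N**3 + 2.0 * N**2    # standard HPL operation count
gflops = flops / time_s / 1e9
print(f"{gflops:.4g} Gflops")              # -> 2.245e+05, matching the report

# Rough efficiency: 270 of Moonlight's 308 nodes were used, and the machine
# summary lists 488 peak TFlop/s for all 308 nodes (linear scaling assumed).
peak_tflops = 488.0 * 270.0 / 308.0
print(f"~{100.0 * (gflops / 1000.0) / peak_tflops:.0f}% of peak")   # ~52%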

This result is enough to place 78th on the June 2012 Top500 list, between the following entries:

77. Los Alamos National Laboratory, United States. Mustang: Xtreme-X 1320H-LANL, Opteron 12-core 2.30 GHz, InfiniBand QDR / 2011, Appro. 37,056 cores, Rmax 230.60 TFlop/s, Rpeak 340.92 TFlop/s, 540.4 kW.

78. Universitaet Aachen/RWTH, Germany. RWTH Compute Cluster (RCC): Bullx B500 Cluster, Xeon X56xx 3.06 GHz, QDR InfiniBand / 2011, Bull. 25,448 cores, Rmax 219.84 TFlop/s, Rpeak 270.54 TFlop/s.

Current Status: Finding and Fixing Underperforming Nodes and Best Linpack Results