
Page 1:

Large Scale Systems Testing Facility
http://www.nmc-probe.org/
probe@newmexicoconsortium.org

NSF PRObE Status May 2013

Page 2:

• The systems research community lacked a very large dedicated resource for repeatable experiments, fault injection, and hardware control

• Research on large compute resources is often constrained by an imposed software stack and the cost of extensive dedicated allocations


Page 3:


They mostly go into the chipper…

Page 4:

• NSF funds the New Mexico Consortium (NMC) to bring LANL supercomputers back to life

• PRObE – Parallel Reconfigurable Observational Environment


Page 5:

• NSF systems research community resources
• Days to weeks of dedicated large-scale allocations
  – Physical and remote access
• Complete control of hardware and all software
• Enables fault injection, even destructive faults (see the sketch below)
• Targets parallel and data-intensive usage
• Managed by the community's own Emulab (www.emulab.net)
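Because each allocation hands the experimenter dedicated bare metal with full root for its duration, fault injection can be scripted directly against the nodes. The following is a minimal illustrative sketch only, assuming passwordless SSH from a control machine and hypothetical hostnames and service names (real node names come from the Emulab experiment page; the faults shown are examples, not a PRObE-specific API):

```python
import random
import subprocess

# Hypothetical node hostnames; the real per-experiment names are
# assigned by the Emulab portal when the allocation is created.
NODES = [f"node{i}.myexp.myproj.example.org" for i in range(8)]

def run_on(host, command, timeout=60):
    """Run a shell command on one node over SSH and return its exit code."""
    return subprocess.run(["ssh", host, command], timeout=timeout).returncode

def inject_crash_faults(n_victims=2):
    """Kill the (hypothetical) service under test on randomly chosen nodes.

    Since the hardware is dedicated and reimaged between experiments,
    harsher faults are also an option, e.g. halting a node outright
    with "sudo shutdown -h now" to emulate permanent node loss.
    """
    for host in random.sample(NODES, n_victims):
        run_on(host, "pkill -9 -f my_storage_server || true")
        print(f"injected crash fault on {host}")

if __name__ == "__main__":
    inject_crash_faults()
```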


Page 6:

• Marmot – 128 nodes / 256 cores (CMU)
  – Dual-socket, single-core AMD Opteron, 16GB/node
  – SDR InfiniBand, 2TB disk/node

• Denali – 64+ nodes / 128+ cores (NMC)
  – Dual-socket, single-core AMD Opteron, 8GB/node
  – SDR InfiniBand, 2x1TB disk/node

• Kodiak – 1024 nodes / 2048 cores (NMC)
  – Dual-socket, single-core AMD Opteron, 8GB/node
  – SDR InfiniBand, 2x1TB disk/node (1.8PiB total; see the tally below)
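As a quick sanity check, these node, core, and disk figures add up to totals quoted elsewhere in the deck. A minimal sketch of the arithmetic (Denali counted at its minimum 64 nodes / 128 cores, and 1TB taken as 10^12 bytes):

```python
# Cluster figures as listed above (Denali at its minimum size).
clusters = {
    "Marmot": {"nodes": 128,  "cores": 256},
    "Denali": {"nodes": 64,   "cores": 128},
    "Kodiak": {"nodes": 1024, "cores": 2048},
}

total_nodes = sum(c["nodes"] for c in clusters.values())  # 1216, i.e. "more than 1200 computers"
total_cores = sum(c["cores"] for c in clusters.values())  # 2432

# Kodiak raw disk capacity: 1024 nodes x 2 x 1TB, expressed in PiB.
kodiak_pib = 1024 * 2 * 1e12 / 2**50  # ~1.82 PiB, matching the "1.8PiB total"

print(total_nodes, total_cores, round(kodiak_pib, 2))
```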


Page 7:

[Figure: PRObE Kodiak Cluster network diagram (www.nmc-probe.org). Labels: four groups of "256 Nodes"; "420 Port GigE + 20Gbps" (x2) and "184 Port GigE + 20Gbps"; "288 Port IB Switch 256+32" (x4); "128 Port IB"; "320Gb/s" (x4); "256x 10Gb/s" (x4); "10Gb/s switch".]


Page 8:

• 4 x AMD 6272 = 64 cores, 128GB, 40GbE, FDR10 IB, 1TB OS disk + 2 x 3TB disks


Page 9:

• 1.2 TFLOPS double-precision, 2500 CUDA cores, 5GB @ 200 GB/s, 225W, 7B transistors


Page 10:

• More than 1200 computers available today

• Kodiak PIs: Calton Pu (GaTech), Mike Dahlin (UT Austin), Robbert van Renesse (Cornell), Mike Freedman (Princeton), Rob Ricci (Utah), Dave Andersen (CMU), Christos Christodoulou (UNM), Kevin Harms (Argonne), Michael Lang (LANL)

• Staging cluster PIs: Greg Ganger (CMU), Daniel Freedman (Technion), Jonathan Appavoo (BU), Patrick Bridges (UNM), Jun Wang (UCF)

• Publications starting to appear, e.g.:
  – Wyatt Lloyd, et al., “Stronger Semantics for Low-Latency Geo-Replicated Storage,” NSDI ’13. Uses Kodiak to show scaling.
  – Garth Gibson, et al., “PRObE: A Thousand-Node Experimental Cluster for Computer Systems Research,” USENIX ;login:, June 2013. Advertising.


Page 11:

• Website
  – http://www.nmc-probe.org/ (info & case studies)
  – http://portal.nmc-probe.org/ (Emulab portal)
  – Email: probe@newmexicoconsortium.org
  – Groups: probe-[email protected]


Page 12:

Your participation is greatly encouraged!
