
Page 1: Dynamic Provisioning of Data Intensive Computing Middleware Frameworks

Dynamic Provisioning of Data Intensive Computing Middleware Frameworks: A Case Study

Linh B. Ngo¹, Michael E. Payne¹, Flavio Villanustre², Richard Taylor², Amy W. Apon¹

¹School of Computing, Clemson University; ²LexisNexis® Risk Solutions

Page 2: Contents

1. Overview of Clemson University's Cyberinfrastructure Resource
2. Demand for Dynamic Data-Intensive Computing Middleware Frameworks
3. Dynamic Provisioning of Data-Intensive Computing Frameworks
4. Deploying Hadoop Ecosystem vs. Deploying HPCC Systems®
5. Lessons Learned

Page 3: Cyberinfrastructure Resource at Clemson University

- Condominium model
- 2,007 compute nodes (21,400 cores), including 276 GPU nodes
- Sustained 551 TFLOPS (benchmarked on GPU nodes only)
- 1,289 active users from 12 academic departments across 36 fields of research
- [Photos of the facilities]

Page 4: Cyberinfrastructure Resource at Clemson University

- Interconnects: 1G, 10G, Myrinet 10G, InfiniBand 40G, and InfiniBand 56G
- Local storage between 100-200 GB (majority of nodes) and 400-900 GB (nodes added since 2013)
- Shared 233 TB OrangeFS scratch space and more than 3 PB of archival space

Page 5: Demand for Dynamic Data-Intensive Computing Middleware Frameworks

- Genome sequencing (Hadoop MapReduce/GPGPU)
- Molecular dynamics forward flux sampling (Hadoop Streaming/LAMMPS)
- Streaming data infrastructure for a connected-vehicle system (Hadoop Distributed File System/Spark/Kafka)
- Big scholarly data (HPCC Systems)
- CS course in distributed and cluster computing (MPI/MapReduce, Hadoop/Spark/HPCC Systems® ...)

Page 6: Demand for Dynamic Data-Intensive Computing Middleware Frameworks

- Changes in the cyberinfrastructure support model for data infrastructure:
  - Beyond a traditional remote distributed file system model
  - From static, dedicated resources to dynamic resources
  - Data management processes co-located with computing processes
- Challenges for system administrators:
  - Accommodating different frameworks for different research needs
  - Complying with existing administrative policies and scheduling priorities
- What can users do?
  - Deploy dynamic data-intensive computing frameworks within the limits of user privileges and without the intervention of administrators

Page 7: Dynamic Provisioning of Data-Intensive Computing Frameworks: Installation

- Where to install:
  1. Home directory: persistent, but limited in storage
  2. Shared distributed storage: fast, semi-persistent, "unlimited" storage
  3. Local storage on compute nodes: fast, but non-persistent and requires reinstallation
- How to handle dependencies:
  1. Ideally in the home directory or shared distributed storage (for persistence)
  2. Dynamic loading mechanisms via environment paths
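The environment-path mechanism in the last point can be sketched as a few lines in a job script. The directory layout below is a hypothetical example, not Palmetto's actual paths:

```shell
# Sketch: load user-installed frameworks and dependencies purely through
# environment variables, within normal user privileges. All paths below
# are illustrative assumptions.
SOFT_ROOT="$HOME/software"            # persistent install root in the home directory

export JAVA_HOME="$SOFT_ROOT/jdk"     # JDK unpacked by the user, no root needed
export HADOOP_HOME="$SOFT_ROOT/hadoop"
export PATH="$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH"

# Shared libraries built in user space are picked up the same way.
export LD_LIBRARY_PATH="$SOFT_ROOT/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Because everything is resolved through the user's environment, the same job script works on any allocation without administrator involvement.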

Page 8: Dynamic Provisioning of Data-Intensive Computing Frameworks: Deployment

[Figure: deployment/configuration scripts on user.palmetto.clemson.edu read PBS_NODEFILE and provision the target deployment directories on the local disks of the allocated nodes (steps 1-4)]
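A minimal sketch of the per-node directory preparation that such deployment scripts perform. The base path is a hypothetical stand-in for a node's local scratch disk; on the cluster this would run once per allocated node (e.g. via pbsdsh or ssh over $PBS_NODEFILE):

```shell
# Sketch: wipe any leftovers from a previous deployment and recreate the
# target directories on local disk. The base path is an illustrative
# assumption, not the real Palmetto layout.
TARGET="${TMPDIR:-/tmp}/${USER:-$(id -un)}-dic-deploy"

rm -rf "$TARGET"                          # clean up a previous deployment
mkdir -p "$TARGET/logs" "$TARGET/pids" "$TARGET/data"
```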

Page 9: Deploying Hadoop Ecosystem vs. Deploying HPCC Systems®: Overview

- Hadoop Ecosystem: open-source alternatives based on the conceptual architecture of the data-intensive computing infrastructure developed by Google
- HPCC Systems®: a comprehensive data-intensive computing system targeting enterprise users, developed in the early 2000s and open-sourced in 2011

Page 10: Deploying Hadoop Ecosystem vs. Deploying HPCC Systems®: Installation: Hadoop

- Self-contained, pre-compiled jar files
- No installation needed; relies on shell scripts to launch component daemons
- Dependencies: JDK

Page 11: Deploying Hadoop Ecosystem vs. Deploying HPCC Systems®: Installation: HPCC Systems

- Standard configure/make/make install:
  - Assumes an industrial production environment (with administrative privileges)
  - Modifications to avoid hard-coded system installation paths
  - Modifications of template XML configuration files to avoid the default HPCC Systems-specific user creation and administrative checks
- Dependencies:
  - Not on Palmetto: ICU, Xalan, Xerces, APR ...
  - On Palmetto but not the correct version: Binutils

Page 12: Deploying Hadoop Ecosystem vs. Deploying HPCC Systems: Deployment: Hadoop

- Determine component placement
- Clean up target directories from previous deployments
- Create target directories (log, storage, pid ...)
- Synchronize the order of component start-up

[Figure: the 1st node in PBS_NODEFILE runs the NameNode, ResourceManager, and Spark master; each remaining node (2nd through nth) runs a DataNode, NodeManager, and Spark executor]

- Additional components (HBase, Hive, Kafka ...) can be added to this deployment model
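The component-placement step above reduces to plain text processing over the ordered node list that PBS provides. In this sketch a fabricated node file stands in for the real $PBS_NODEFILE of an allocated job (and note that a real node file may repeat a node once per core, in which case it would first need deduplication):

```shell
# Sketch: derive Hadoop/Spark component placement from the ordered node
# list. The node names and file location are fabricated for illustration.
PBS_NODEFILE="${TMPDIR:-/tmp}/demo_nodefile"
printf '%s\n' node0341 node0342 node0343 node0344 > "$PBS_NODEFILE"

# 1st node: NameNode, ResourceManager, Spark master.
MASTER=$(head -n 1 "$PBS_NODEFILE")

# Remaining nodes: DataNode, NodeManager, Spark executor.
tail -n +2 "$PBS_NODEFILE" > workers.txt

echo "master: $MASTER"
echo "workers: $(wc -l < workers.txt)"
```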

Page 13: Deploying Hadoop Ecosystem vs. Deploying HPCC Systems: Deployment: HPCC Systems

- Determine node allocation and internal IP addresses
- HPCC Systems is configured via its own deployment programs (configmgr, configgen, hpcc-init)

[Figure: HPCC Systems component roles mapped onto the 1st through nth nodes in PBS_NODEFILE]

Page 14: Deploying Hadoop Ecosystem vs. Deploying HPCC Systems: Deployment: HPCC Systems

- Node memory constraints:
  - HPCC Systems reserves 75% of available memory for Thor by default
  - Palmetto does not allow unlimited memory reservation
  - As a result, thor_master cannot launch new jobs via fork()
  - Resolved by lowering the memory reservation

[Figure: HPCC Systems component roles mapped onto the 1st through nth nodes in PBS_NODEFILE]
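The fix on this slide amounts to choosing a Thor reservation below what the scheduler actually grants. A sketch of that calculation follows, assuming a Linux node; the 40% fraction is an arbitrary assumption for illustration, not the value the authors used:

```shell
# Sketch: compute a reduced memory reservation instead of HPCC Systems'
# default 75% of physical RAM, so that thor_master can still fork().
# The 40% fraction below is an assumed example value.
LIMIT_KB=$(ulimit -v 2>/dev/null || true)   # per-process limit the site enforces, in KB
if [ -z "$LIMIT_KB" ] || [ "$LIMIT_KB" = "unlimited" ]; then
    # No explicit limit: fall back to the node's physical memory.
    LIMIT_KB=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
fi
THOR_MEM_MB=$(( LIMIT_KB * 40 / 100 / 1024 ))
echo "Thor memory reservation: ${THOR_MEM_MB} MB"
```

The resulting figure would then be written into the generated HPCC Systems configuration in place of the default reservation.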

Page 15: Lessons Learned

- A common approach can be adapted for both the Hadoop Ecosystem and HPCC Systems
- Limitations on non-administrative accounts can impact deployment and performance via system resource constraints, e.g., the inability to utilize all available memory on an allocated node (HPCC Systems)
- Dynamic deployment via non-administrative accounts gives users the initiative to experiment with and utilize new large-scale frameworks without additional burden on administrators

Page 16: Lessons Learned

- Experience in deploying as users is, in turn, highly applicable to deployment with administrative privileges
  - E.g., the CloudLab cloud computing experimental testbed offers non-persistent, ephemeral, short-term (15-hour) allocations; script-based installation and deployment are needed, even with administrative rights, to automate experiment deployment
- Experience in deploying as administrators helps in debugging user-based deployments:
  - The memory allocation issue in HPCC Systems was identified and resolved by changing system limits with administrative commands

Page 17: Questions?

Linh B. Ngo¹, Michael E. Payne¹, Flavio Villanustre², Richard Taylor², Amy W. Apon¹

{lngo,mpayne3,aapon}@clemson.edu (¹School of Computing, Clemson University)
{flavio.villanustre,richard.taylor}@lexisnexis.com (²LexisNexis® Risk Solutions)

More information about HPCC Systems can be found at http://hpccsystems.com