
Page 1: Dynamic Provisioning of Data Intensive Computing Middleware Frameworks

Dynamic Provisioning of Data Intensive Computing Middleware Frameworks: A Case Study

Linh B. Ngo¹, Michael E. Payne¹, Flavio Villanustre², Richard Taylor², Amy W. Apon¹

¹School of Computing, Clemson University; ²LexisNexis® Risk Solutions

Page 2: Contents

1. Overview of Clemson University's Cyberinfrastructure Resource
2. Demand for Dynamic Data-Intensive Computing Middleware Frameworks
3. Dynamic Provisioning of Data-Intensive Computing Frameworks
4. Deploying Hadoop Ecosystem vs. Deploying HPCC Systems®
5. Lessons Learned

Page 3: Cyberinfrastructure Resource at Clemson University

- Condominium model
- 2,007 compute nodes (21,400 cores), including 276 GPU nodes
- Sustained 551 TFLOPS (benchmarked on GPU nodes only)
- 1,289 active users from 12 academic departments across 36 fields of research
- [Photos of the facilities]

Page 4: Cyberinfrastructure Resource at Clemson University

- Interconnects: 1G, 10G, Myrinet 10G, InfiniBand 40G, and InfiniBand 56G
- Local storage between 100-200 GB (majority of nodes) and 400-900 GB (nodes added since 2013)
- Shared 233 TB OrangeFS scratch space and more than 3 PB of archival space

Page 5: Demand for Dynamic Data-Intensive Computing Middleware Frameworks

- Genome sequencing (Hadoop MapReduce/GPGPU)
- Molecular dynamics forward flux sampling (Hadoop Streaming/LAMMPS)
- Streaming data infrastructure for a connected-vehicle system (Hadoop Distributed File System/Spark/Kafka)
- Big scholarly data (HPCC Systems)
- CS course in distributed and cluster computing (MPI/MapReduce, Hadoop/Spark/HPCC Systems® ...)

Page 6: Demand for Dynamic Data-Intensive Computing Middleware Frameworks

- Changes in the cyberinfrastructure support model for data infrastructure:
  - Beyond a traditional remote distributed file system model
  - From static, dedicated resources to dynamic resources
  - Data management processes co-located with computing processes
- Challenges for system administrators:
  - Accommodating different frameworks for different research needs
  - Complying with existing administrative policies and scheduling priorities
- What can users do?
  - Deploy dynamic data-intensive computing frameworks within the limits of user privileges and without the intervention of administrators

Page 7: Dynamic Provisioning of Data-Intensive Computing Frameworks: Installation

- Where to install:
  1. Home directory: persistent, but limited in storage
  2. Shared distributed storage: fast, semi-persistent, "unlimited" storage
  3. Local storage on compute nodes: fast, but non-persistent and requires reinstallation
- How to handle dependencies:
  1. Ideally in the home directory or shared distributed storage (for persistence)
  2. Dynamic loading mechanisms via environment paths
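The environment-path mechanism in the last point can be sketched as a few lines in a job script. The directory layout below is a hypothetical example, not Palmetto's actual paths:

```shell
# Sketch: load user-installed frameworks and dependencies purely through
# environment variables, within normal user privileges. All paths below
# are illustrative assumptions.
SOFT_ROOT="$HOME/software"            # persistent install root in the home directory

export JAVA_HOME="$SOFT_ROOT/jdk"     # JDK unpacked by the user, no root needed
export HADOOP_HOME="$SOFT_ROOT/hadoop"
export PATH="$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH"

# Shared libraries built in user space are picked up the same way.
export LD_LIBRARY_PATH="$SOFT_ROOT/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Because everything is resolved through the user's environment, the same job script works on any allocation without administrator involvement.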

Page 8: Dynamic Provisioning of Data-Intensive Computing Frameworks: Deployment

[Figure: deployment/configuration scripts on user.palmetto.clemson.edu read PBS_NODEFILE and provision the target deployment directories on the local disks of the allocated nodes (steps 1-4)]
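A minimal sketch of the per-node directory preparation that such deployment scripts perform. The base path is a hypothetical stand-in for a node's local scratch disk; on the cluster this would run once per allocated node (e.g. via pbsdsh or ssh over $PBS_NODEFILE):

```shell
# Sketch: wipe any leftovers from a previous deployment and recreate the
# target directories on local disk. The base path is an illustrative
# assumption, not the real Palmetto layout.
TARGET="${TMPDIR:-/tmp}/${USER:-$(id -un)}-dic-deploy"

rm -rf "$TARGET"                          # clean up a previous deployment
mkdir -p "$TARGET/logs" "$TARGET/pids" "$TARGET/data"
```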

Page 9: Deploying Hadoop Ecosystem vs. Deploying HPCC Systems®: Overview

- Hadoop Ecosystem: open-source alternatives based on the conceptual architecture of the data-intensive computing infrastructure developed by Google
- HPCC Systems®: a comprehensive data-intensive computing system targeting enterprise users, developed in the early 2000s and open-sourced in 2011

Page 10: Deploying Hadoop Ecosystem vs. Deploying HPCC Systems®: Installation: Hadoop

- Self-contained, pre-compiled jar files
- No installation needed; relies on shell scripts to launch component daemons
- Dependencies: JDK

Page 11: Deploying Hadoop Ecosystem vs. Deploying HPCC Systems®: Installation: HPCC Systems

- Standard configure/make/make install:
  - Assumes an industrial production environment (with administrative privileges)
  - Modifications to avoid hard-coded system installation paths
  - Modifications of template XML configuration files to avoid the default HPCC Systems-specific user creation and administrative checks
- Dependencies:
  - Not on Palmetto: ICU, Xalan, Xerces, APR ...
  - On Palmetto but not the correct version: Binutils

Page 12: Deploying Hadoop Ecosystem vs. Deploying HPCC Systems: Deployment: Hadoop

- Determine component placement
- Clean up target directories from previous deployments
- Create target directories (log, storage, pid ...)
- Synchronize the order of component start-up

[Figure: the 1st node in PBS_NODEFILE runs the NameNode, ResourceManager, and Spark master; each remaining node (2nd through nth) runs a DataNode, NodeManager, and Spark executor]

- Additional components (HBase, Hive, Kafka ...) can be added to this deployment model
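The component-placement step above reduces to plain text processing over the ordered node list that PBS provides. In this sketch a fabricated node file stands in for the real $PBS_NODEFILE of an allocated job (and note that a real node file may repeat a node once per core, in which case it would first need deduplication):

```shell
# Sketch: derive Hadoop/Spark component placement from the ordered node
# list. The node names and file location are fabricated for illustration.
PBS_NODEFILE="${TMPDIR:-/tmp}/demo_nodefile"
printf '%s\n' node0341 node0342 node0343 node0344 > "$PBS_NODEFILE"

# 1st node: NameNode, ResourceManager, Spark master.
MASTER=$(head -n 1 "$PBS_NODEFILE")

# Remaining nodes: DataNode, NodeManager, Spark executor.
tail -n +2 "$PBS_NODEFILE" > workers.txt

echo "master: $MASTER"
echo "workers: $(wc -l < workers.txt)"
```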

Page 13: Deploying Hadoop Ecosystem vs. Deploying HPCC Systems: Deployment: HPCC Systems

- Determine node allocation and internal IP addresses
- HPCC Systems is configured via its own deployment programs (configmgr, configgen, hpcc-init)

[Figure: HPCC Systems component roles mapped onto the 1st through nth nodes in PBS_NODEFILE]

Page 14: Deploying Hadoop Ecosystem vs. Deploying HPCC Systems: Deployment: HPCC Systems

- Node memory constraints:
  - HPCC Systems reserves 75% of available memory for Thor by default
  - Palmetto does not allow unlimited memory reservation
  - As a result, thor_master cannot launch new jobs via fork()
  - Resolved by lowering the memory reservation

[Figure: HPCC Systems component roles mapped onto the 1st through nth nodes in PBS_NODEFILE]
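The fix on this slide amounts to choosing a Thor reservation below what the scheduler actually grants. A sketch of that calculation follows, assuming a Linux node; the 40% fraction is an arbitrary assumption for illustration, not the value the authors used:

```shell
# Sketch: compute a reduced memory reservation instead of HPCC Systems'
# default 75% of physical RAM, so that thor_master can still fork().
# The 40% fraction below is an assumed example value.
LIMIT_KB=$(ulimit -v 2>/dev/null || true)   # per-process limit the site enforces, in KB
if [ -z "$LIMIT_KB" ] || [ "$LIMIT_KB" = "unlimited" ]; then
    # No explicit limit: fall back to the node's physical memory.
    LIMIT_KB=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
fi
THOR_MEM_MB=$(( LIMIT_KB * 40 / 100 / 1024 ))
echo "Thor memory reservation: ${THOR_MEM_MB} MB"
```

The resulting figure would then be written into the generated HPCC Systems configuration in place of the default reservation.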

Page 15: Lessons Learned

- A common approach can be adapted for both the Hadoop Ecosystem and HPCC Systems
- Limitations on non-administrative accounts can impact deployment and performance via system resource constraints, e.g., the inability to utilize all available memory on an allocated node (HPCC Systems)
- Dynamic deployment via non-administrative accounts gives users the initiative to experiment with and utilize new large-scale frameworks without additional burden on administrators

Page 16: Lessons Learned

- Experience in deploying as users is, in turn, highly applicable to deployment with administrative privileges
  - E.g., the CloudLab cloud computing experimental testbed offers non-persistent, ephemeral, short-term (15-hour) allocations; script-based installation and deployment are needed, even with administrative rights, to automate experiment deployment
- Experience in deploying as administrators helps in debugging user-based deployments:
  - The memory allocation issue in HPCC Systems was identified and resolved by changing system limits with administrative commands

Page 17: Questions?

Linh B. Ngo¹, Michael E. Payne¹, Flavio Villanustre², Richard Taylor², Amy W. Apon¹

{lngo,mpayne3,aapon}@clemson.edu (¹School of Computing, Clemson University)
{flavio.villanustre,richard.taylor}@lexisnexis.com (²LexisNexis® Risk Solutions)

More information about HPCC Systems can be found at http://hpccsystems.com