Job Dispatch and Termination Performance: Agent Teamwork vs. Globus/OpenPBS

Evaluation of Agent Teamwork

High Performance Distributed Computing Middleware

Solomon Lane, Agent Teamwork Research Assistant
October 2006 – March 2007

Job Dispatch and Termination Performance

Agent Teamwork vs. Globus/OpenPBS

Framework Execution Performance

Agent Teamwork vs. MPIJava

Terminology

Grid vs. Cluster: A computing grid is commonly distinguished from a computing cluster by the geographic distance between its members. A cluster is a group of computers in the same room or building, connected to the same physical network, while the members of a grid can be located anywhere and may be connected over several different networks.

Platform: I define a high performance distributed computing (HPDC) platform as software that provides infrastructure and scheduling services. Infrastructure services include authentication and authorization, job submission, and file transfer for job deployment. Scheduling services include dynamic resource identification and allocation, scheduling policies, and coordination of job execution.

Framework: I define a framework as a related set of software libraries used to write software in a particular programming model. The Single Program Multiple Data (SPMD) programming model is commonly used to achieve data-level parallelism in HPDC. MPIJava is a Java implementation of the Message Passing Interface standard, which provides a framework for programming in the SPMD model.
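The SPMD model described above can be illustrated with a minimal sketch. It uses Python threads in place of MPI processes, so it is not MPIJava or Agent Teamwork code; the names `partition`, `program`, and `run_spmd` are hypothetical and chosen only for illustration. Every "rank" runs the same function and differs only in which slice of the data it owns.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(n_items, size, rank):
    """Return the [start, stop) range of items owned by `rank` (block decomposition)."""
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def program(rank, size, data):
    """The single program that every rank executes; only `rank` differs."""
    start, stop = partition(len(data), size, rank)
    return sum(x * x for x in data[start:stop])


def run_spmd(size, data):
    """Launch `size` copies of `program` and combine their partial results."""
    with ThreadPoolExecutor(max_workers=size) as pool:
        partials = list(pool.map(lambda rank: program(rank, size, data), range(size)))
    # Summing the partials stands in for a reduce operation onto one rank.
    return sum(partials)


if __name__ == "__main__":
    print(run_spmd(4, list(range(10))))
```

In a real MPI framework the ranks would be separate processes on different nodes, and the final sum would be a collective reduce rather than a local call.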

Agent Teamwork: AgentTeamwork is a mobile-agent-based job coordination system that targets a mixture of computing nodes, some directly connected to the public Internet, and others simply clustered in a private IP domain but not managed by a commodity job scheduler. [1]

Globus Toolkit: The Globus Toolkit is an open source software toolkit used for building Grid systems and applications. [2]

OpenPBS: OpenPBS is the original version of the Portable Batch System. It is a flexible batch queueing system developed for NASA in the early to mid-1990s. [3] The purpose of the OpenPBS system is to provide additional controls over initiating or scheduling execution of batch jobs, and to allow routing of those jobs between different hosts. [4]

Message Passing Interface (MPI): MPI is a library specification for message-passing, proposed as a standard by a broadly based committee of vendors, implementors, and users. MPI was designed for high performance on both massively parallel machines and on workstation clusters. [5]

MPICH-G2: A grid-enabled implementation of the MPI v1.1 standard. Using services from the Globus Toolkit (e.g., job startup, security), MPICH-G2 allows you to couple multiple machines, potentially of different architectures, to run MPI applications. [6]

MPIJava: mpiJava is an object-oriented Java interface to the standard Message Passing Interface (MPI). [7]

[1] Munehiro Fukuda, Fault-Tolerant Job Execution over Multi-Clusters using Mobile Agents, gca07.pdf
[2] http://www.globus.org/
[3] http://www.openpbs.org/about.html
[4] Overview of the OpenPBS, http://www.openpbs.org/overview.html
[5] What is MPI, http://www-unix.mcs.anl.gov/mpi/
[6] What is MPICH-G2, http://www3.niu.edu/mpi/
[7] http://www.hpjava.org/mpiJava.html

The Clusters

Overview

Technology

AgentTeamwork

My goal as a research assistant was to evaluate Agent Teamwork's "Job Dispatch & Termination" and "Framework" performance against a contemporary alternative.

Job Dispatch & Termination Evaluation: I built a reference platform to compare Agent Teamwork against by integrating the Globus Toolkit with the OpenPBS scheduler and the MPICH-G2 MPI framework.

Framework Function Evaluation: To evaluate the framework performance, I wrote three benchmark programs in both the Agent Teamwork MPI framework and the MPIJava framework and compared their runtimes.

Reference Platform Overview

Results: These graphs compare job dispatch and termination time when submitting a test program to different numbers of cluster nodes in either a depth-first or a breadth-first distribution. Agent Teamwork's job dispatch and termination performance was comparable with the reference platform in the depth-first distribution, and Agent Teamwork outperformed the reference platform with a large number of nodes in the breadth-first distribution.
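The text does not define the two distributions. The sketch below assumes that depth-first means filling one cluster completely before using the next, and that breadth-first means alternating between clusters; both readings are assumptions, as are the node names and the helper functions `depth_first` and `breadth_first`.

```python
from itertools import zip_longest


def depth_first(clusters, n):
    """Assumed depth-first order: exhaust each cluster before moving to the next."""
    flat = [node for nodes in clusters.values() for node in nodes]
    return flat[:n]


def breadth_first(clusters, n):
    """Assumed breadth-first order: take one node from each cluster in turn."""
    interleaved = [
        node
        for group in zip_longest(*clusters.values())
        for node in group
        if node is not None
    ]
    return interleaved[:n]


# Hypothetical node names; only the 32-node cluster sizes come from the text.
clusters = {
    "medusa": [f"medusa-{i}" for i in range(32)],
    "phoebe": [f"phoebe-{i}" for i in range(32)],
}

if __name__ == "__main__":
    print(depth_first(clusters, 4))
    print(breadth_first(clusters, 4))
```

Under these assumptions a 4-node job stays inside one cluster in the depth-first case and spans both clusters in the breadth-first case, which is the situation where cross-cluster dispatch cost would show up.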

1. To run a job, you generate a job definition file using the Resource Specification Language (RSL) and submit it along with your user certificate using globusrun.

2. The GRAM client submits the job to a gatekeeper on the cluster head, which uses the GSI to authenticate and authorize the job submission. It then starts a job manager, which issues a callback to the GRAM client to connect standard error and standard output back to the client. The job manager then submits the job details to the PBS Server.

3. The PBS Scheduler selects appropriate nodes from the cluster and transfers the executable to the PBS mom on the cluster nodes. The PBS mom launches the application. Applications are written in the MPICH-G2 framework, which uses the grid infrastructure to coordinate the parallel execution.
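As a rough illustration of the job definition in the first step, the sketch below builds a single-request RSL string. The helper names `rsl_value` and `build_rsl`, the executable path, and the argument values are hypothetical. The attributes `executable`, `count`, `jobtype`, `directory`, and `arguments` are used as I understand them from the GRAM RSL documentation. This is not the job definition file used in the evaluation, and a job spanning both clusters would need a more elaborate specification.

```python
def rsl_value(value):
    """Quote a value RSL-style: wrap it in double quotes and double any embedded quote."""
    text = str(value)
    return '"' + text.replace('"', '""') + '"'


def build_rsl(executable, count, jobtype="mpi", directory=None, arguments=()):
    """Build a single-request RSL string of the form &(attr=value)(attr=value)..."""
    relations = [
        ("executable", rsl_value(executable)),
        ("count", str(int(count))),
        ("jobtype", jobtype),
    ]
    if directory is not None:
        relations.append(("directory", rsl_value(directory)))
    if arguments:
        # Multiple argument values are written as a space-separated sequence.
        relations.append(("arguments", " ".join(rsl_value(a) for a in arguments)))
    return "&" + "".join(f"({name}={value})" for name, value in relations)


if __name__ == "__main__":
    # Hypothetical executable path and node count.
    print(build_rsl("/home/user/wave2d", 8))
```

The resulting string would be saved to a file and handed to globusrun together with the user's credentials, as described in the first step.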

Framework Results: Currently, two of the Agent Teamwork versions of the benchmark programs cannot be run across the clusters due to outstanding bugs in the framework. One of the benchmark programs, Wave2D, was able to run on a limited number of nodes. The graphs to the right show these partial results, which indicate that the Agent Teamwork version is at least one order of magnitude slower than MPIJava. Framework debugging is still ongoing, however.

The following tables describe the hardware that was used. There were a total of 66 machines divided into two clusters, each with one head node and 32 computing nodes.

Medusa Cluster: a 32-node cluster for research use

  Head node:
    Specification                                 Outbound
    1.8GHz Xeon x2, 512MB memory, and 70GB HD     100Mbps

  Computing nodes:
    #nodes   Specification                              Inbound
    24       3.2GHz Xeon, 512MB memory, and 36GB HD     1Gbps
    8        2.8GHz Xeon, 512MB memory, and 60GB HD     2Gbps

Phoebe Cluster: a 32-node cluster for instructional use

  Head node:
    Specification                                 Outbound
    1.5 memory, and 40GB HD                       100Mbps

  Computing nodes:
    #nodes   Specification                              Inbound
    16       1.5GHz Xeon, 512MB memory, and 30GB HD     100Mbps
    16       1.5GHz Xeon, 512MB memory, and 30GB HD     1Gbps
