Performance Tuning for High Performance
Computing Systems
By
Himanshu Pahuja
Thesis submitted in partial fulfilment of the requirements for the
Degree of Master of Information Technology (Honours)
in the Caulfield School of Information Technology at
Monash University, Australia
August, 2011
DECLARATION
I declare that the thesis contains no material that has been
accepted for the award of any degree in any University and
that, to the best of my knowledge, this thesis contains no
material previously published or written by any other person
except where due reference is made in the text.
Signed
Date
Caulfield School of Information Technology
Monash University
Australia
Copyright Notices
Notice 1
Under the Copyright Act 1968, this thesis must be used only under the normal conditions of scholarly fair dealing. In particular no results or conclusions should be extracted from it, nor should it be copied or closely paraphrased in whole or in part without the written consent of the author. Proper written acknowledgement should be made for any assistance obtained from this thesis.
Notice 2
I certify that I have made all reasonable efforts to secure copyright permissions for third-party content included in this thesis and have not knowingly added copyright content to my work without the owner's permission.
ACKNOWLEDGEMENT
I thank my supervisor Dr. Jefferson Tan for his continuous
guidance, innovative ideas and motivation in this research.
I thank my friends who have supported me throughout the
year. Special mentions go to Anjali and Chandni.
I also acknowledge my fellow Honours students (Hasn, Haihao
and Sepehr) for making the last year extremely enjoyable.
Finally, I dedicate this thesis to my parents,
Mr Rakesh Kumar Pahuja and Mrs Madhu Pahuja, in appreciation
of their love and support.
Performance Tuning for High Performance Computing Systems
ABSTRACT
A distributed system is an integration of loosely coupled software components with underlying hardware resources, which can be distributed over the standard Internet framework. High Performance Computing once meant utilizing supercomputers that could churn out enormous computing power to process massively complex computational tasks; it is now evolving across distributed systems, gaining the ability to utilize geographically distributed computing resources.
We often do not realize that today we unknowingly depend on one distributed high performance computing system or another. As the world progresses towards paradigms such as cloud computing, installing software and buying high-end computing systems will become things of the past, with everything accessible through the Internet. Such dependence will ensure the rapid growth of these systems, and the number of resources under them will increase drastically. We are, however, not ready for such a development, mainly because the domain of resource utilization still lacks adequate control for both the user and the system developer. This thesis therefore explores the specific domain of resource utilization across a computing grid and highlights a key aspect of resource allocation.
We also describe a performance tuning application that validates our proposed hypothesis by quantifying the resource savings that can be achieved when such an optimization is incorporated.
Table of Contents
1 Chapter 1 – Introduction
1.1 PREAMBLE ………………………………………………………………………..…………………… 1
1.2 OBJECTIVES ………………………………………………………………………..………………….. 2
1.3 OUTLINE …………..…………………………………………………………..…………………… 4
2 Chapter 2 – High Performance Computing
2.1 HIGH PERFORMANCE COMPUTING SYSTEM ………………………………………… 5
2.1.1 NETWORKING INFRASTRUCTURE ……………………………. 5
2.1.2 DATA PLACEMENT AND ACCESS TECHNIQUES ………………. 6
2.1.3 JOB SCHEDULER ……………………………………………………. 7
2.1.4 RESOURCE MANAGER ……………………………………………………… 9
2.1.5 RESOURCE ALLOCATOR ……………………………………………………… 12
2.1.6 APPLICATION TUNING …………………………………………………….. 14
2.1.7 SOME OPTIMIZED HPC SYSTEMS …………………………… 15
2.2 PERFORMANCE EVALUATION OF HPC …………………………………………………….. 16
2.2.1 FAULTS ..……………………………………………………………….. 17
3 Chapter 3 – Performance Tuning Implementation
3.1 ONTOLOGY ATTACHMENT ………………………………………………………………… 18
3.1.1 RESOURCE FIRST ……………………………………………………. 19
3.1.2 JOB FIRST …………………………………………………………………. 20
3.1.3 METRIC FIRST …………………………………………………………………. 21
3.2 EXPERIMENTAL SETUP …………………………………………………………………………….. 22
3.2.1 STATISTICAL DATA COLLECTION ………………………………………. 22
3.2.1.1 HPC SYSTEM INTERFACES …………………………. 22
3.2.1.1.1 NIMROD …………………………. 22
3.2.1.1.2 MONASH SUN GRID ……………. 24
3.2.1.2 SELECTED METRICS ……………………………………… 27
3.2.1.3 DATA COLLECTION MECHANISM ….………… 27
3.2.1.3.1 CONSTANT MONITORING 28
3.2.1.3.2 JOB STATISTICS COLLECTION 28
3.2.1.3.3 JOB STATISTICS COMPILER 29
3.2.2 STATISTICAL DATA ANALYSIS ……………………….……………. 30
3.2.2.1 DATA SET …………………………………………………. 30
3.2.2.2 RESULT SET …………………………………………………. 32
3.2.3 DATA ANALYSER AND PERFORMANCE TUNER …………… 35
3.2.3.1 COMPONENTS ……………………………………. 35
3.2.3.1.1 DATA SOURCE ………….. 35
3.2.3.1.2 ONTOLOGY BUILDER 36
3.2.3.1.3 SCENARIO BUILDER …….……. 39
4 Chapter 4 – Optimizer Integration
4.1 OPTIMIZER INTEGRATION ……………………………………………………………………… 44
4.2 ADVANTAGES …………………………………………………………………………………… 46
4.3 DRAWBACKS …………………………………………………………………………………… 47
5 Chapter 5 – Summary
5.1 RESEARCH SUMMARY & CONTRIBUTIONS ………………….…………………………. 48
5.2 FUTURE WORK …………..…………………………………………………………………………………… 50
5.3 REFERENCES ……………………………………………………………………………………………….. 51
6 Appendix
6.1 DATA SET ……………………………………………………………………………………………….. 54
6.2 SOURCE CODE ……………………………………………………………………………………………….. 54
6.3 SCRIPTS ……………………………………………………………………………………………….. 54
List of Figures
Figure 1 RESOURCE FIRST MODEL .................................................................................................... 19
Figure 2 JOB FIRST MODEL ............................................................................................................... 20
Figure 3 METRIC FIRST MODEL ........................................................................................................ 21
Figure 4 MONASH SUN GRID UTILIZATION ....................................................................................... 26
Figure 5 MONASH SUN GRID USAGE STATISTICS ...................................................................... 30
Figure 6 ONTOLOGY BUILDER .......................................................................................................... 35
Figure 7 Data organization process flow .......................................................................................... 37
Figure 8 PERFORMANCE TUNER - SCENARIO SIMULATOR ................................................................ 39
Figure 9 JOB PROCESSING - LEVEL 1 DATA FLOW DIAGRAM ..................................................... 44
Figure 10 OPTIMIZATION MODULE ON NIMROD (Nimrod) ............................................................... 46
List of Tables
Table 1 NIMROD TOOLS (Nimrod) .................................................................................................... 22
Table 2 MSG Execute Nodes (Monash Sun Grid) .............................................................................. 24
Table 3 RESULT SET 1 ....................................................................................................................... 32
Table 4 RESULT SET 2 ....................................................................................................................... 33
Table 5 RESULT SET 3 ....................................................................................................................... 34
Table 6 SIMULATION RESULTS ......................................................................................................... 40
Table 7 JOB TYPE: SHELL SCRIPTS: RESULT SET ................................................................................. 42
Table 8 JOB TYPE: GLOBUS JOB: RESULT SET .................................................................................... 43
1 | P a g e
1.1. PREAMBLE
The first high performance computer surfaced in the 1960s and was known as the CDC
6600 series, designed by Seymour Cray. The same organization continued to evolve the
design and introduced the T3E system series in 1996. Today, Cray’s Jaguar is ranked as
the second best High Performance Computer, the best being NUDT’s Tianhe-1A.
These five decades of development have dramatically revamped the design and performance of high performance systems.
High Performance Computing (HPC) or the computing paradigm wherein
supercomputers and computer clusters are used to solve advanced computational
problems, has been around for a while now and there have been quite a number of
investigations in the domain as well. These have particularly led to the development of paradigms such as High Throughput Computing (HTC), a term describing the use of many computing resources over long periods of time to accomplish a computational task; Grid Computing, referring to the combination of computer resources from multiple administrative domains to reach a common goal; and Cloud Computing, Internet-based computing whereby shared resources, software, and information are provided to computers and other devices on demand, much like the electricity grid. These are branches of the same tree: they focus on different requirements yet share a common principle. The key idea is seamless
integration of computing power that is spread across a city, a country or the entire
globe. Different objectives branched out to different paradigms and products such as
Computational Grids for scientific research and Cloud Computing for fulfilling computing
requirements through shared resources. The architecture of a high performance distributed system, particularly of a Grid system, has three core components: a Task Manager to receive jobs from authenticated users, a Task Scheduler to schedule jobs and a Resource Manager to manage resources and their allocation. The domain
of Authentication and Task Management has been explored well and solutions such as
Shibboleth which is an Internet2 Middleware Initiative project that has created an
architecture and open-source implementation for federated identity-
based authentication and authorization infrastructure based on Security Assertion
Markup Language, have served fairly well. The next link, Task Allocation or Execution as we may call it, has not been managed nearly as well. Most of the attention has gone to security and authentication aspects of resource sharing. The focus is now shifting towards Intelligent Task Schedulers (ITS).
The evaluation metrics for an HPC system are therefore specifically designed to provide a fair analysis of its benchmarks. These metrics help in measuring efficiency; they also help us analyse shortcomings and highlight the key indices that can be useful for improving the associated aspects of computing.
In this context, HPC environments can be defined as performance-oriented computing architectures that provide extreme computational power with the help of shared resources over a network. The performance focus of such environments naturally invites performance evaluation. An evaluation schema should present a statistical view of the scenario concerned, and the metrics that constitute it should provide the information needed to identify areas prone to inefficiency or areas that can be optimized to improve performance thresholds. It is this field that lacks specifications and standards. The major use of such a dataset would be to enable performance tuning in various high performance computing scenarios to achieve desired results.
This performance tweaking for various performance-oriented computing scenarios is the subject of this research, hence the title “Performance Tuning for High Performance Computing Systems”.
1.2. OBJECTIVES
The existing research in this domain has not considered linking statistical information
from resource usage with the job requirements. This crucial aspect can open a plethora
of opportunities that can be utilized in either tuning or optimizing the performance of
HPC systems.
The advent of optimized computing requirements has made resource sharing more of a
necessity than a choice. Systems such as Nimrod and Condor utilize the computing
power distributed across varying geographies and assist in executing complex tasks. The
progress in development of such systems has been quite remarkable, but paradigms
such as HPC still lack that perfect Task Scheduler which can analyse the state of the
distributed system and allocate jobs with a sense of intelligence. This is perhaps one of
the key challenges that Distributed Computing faces today.
There have been remarkable efforts to bridge this gap between Task Scheduling and
Task Execution, such as the Grid Harvest Service (Wu & Sun, 2006) which defines a
performance evaluation and task scheduling system. It is based on a prediction model
that can schedule three types of tasks i.e. Single Task, Meta-Task and Parallel Processing.
A lot of analysis happens across the distributed system before a resource is allocated to
a job. The state of the distributed system is measured in terms of Resource Usage
History. This aspect provides the job scheduler with an opportunity to keep track of efficient resources and improve overall performance. Ontology-based grid resource management systems (Amarnath, Somasundaram, Ellappan, & Buyya, 2009) present the benefits of engaging the job scheduler with an ontology based on resource characteristics, so as to better match job requirements with available resources. This, however, overlooks the relation that exists between types of jobs and their resource counterparts.
The limitations in the existing research provide the motivation for this work and also assist in framing its objectives.
The objectives have therefore been divided into three core components.
1. Resource Usage Statistics.
1.1. It is very important to extract the most accurate information possible without significantly degrading the performance of any component of the distributed system. Therefore, beyond simply having a Statistics Builder, it is crucial to decide on the placement of such a component within the layers of the system itself.
2. Statistical Analysis
2.1. The statistics gathered under the first objective will be utilized to identify relations, if any, between the jobs submitted by users and the resources they are executed on. This will help in defining the affinity factor of a job.
3. Simulation Application
3.1. Lastly, a heuristics-based simulation application to utilize the statistics and demonstrate a sample job scheduling operation using the performance tuning options.
1.3. OUTLINE
The thesis has been organized into the following chapters.
Chapter 2 provides a comprehensive literature review on HPC Systems’ composition.
We present an analysis and critique the various issues related to performance
dependency on various components of a HPC System.
Chapter 3 begins by describing the concept of Performance Tuning, the methods
implemented for Statistic Builder and an analysis of the collected data set. This
chapter also demonstrates the sample job scheduling application.
Chapter 4 discusses the placement of such an Optimization Module within the layers of a HPC system and presents an analysis of the proposed mechanism.
Chapter 5 summarizes and concludes the thesis with an outline of the future work.
High Performance Computing scenarios such as Grids are composed of well-defined entities.
These include an authentication service, job schedulers and resource managers or resource
allocators. They have been defined in order to manage and accomplish the task of utilizing
the geographically distributed computing power. These entities need a robust networking
infrastructure to communicate, besides meeting recommended physical specifications. This is quite a standard requirement for all computing and processing entities alike, such as resource-managing servers and worker nodes, for the entire architecture to attain an accepted level of performance.
The review presented ahead highlights the purpose of these constituting entities and their
current utilization in the domain of high performance computing.
2.1 HIGH PERFORMANCE COMPUTING SYSTEM
An HPC System can be defined as a distinct collection of computing resources grouped
together to achieve a common goal. They interact with the help of a computer network
which may or may not be a part of the Distributed system but is very crucial for the
performance of the entire system. We therefore begin our literature analysis by highlighting the importance of network infrastructure.
2.1.1 NETWORKING INFRASTRUCTURE
Networking infrastructure forms the backbone of distributed computing, as it enables the
communication amongst various computing entities. It comprises the selection of protocols,
connection modes and other specifications. If these characteristics are not chosen carefully, the number of faults can increase. Faults may arise from frequent packet drops, congestion, inefficient routing protocols, or simply disconnection between services.
Much research has proposed efficient and robust mechanisms to counter typical faults that hinder performance. The ever-growing need for greater computational power and recent advances in the field of VLSI have led to rapid development of new-generation cluster and grid computing systems (Azad, Khaoua, & Zomaya, 2005). The performance of any such system is
highly influenced by the interconnection network employed for inter-nodal communication.
All the interconnected computing resources of a distributed system require a basic set of network configuration to maintain connectivity. Efficient approaches to dynamically
reconfigure the nodes that belong to a distributed system are vital for the performance of
such systems.
Routing algorithms are an integral part of this basic configuration and play an important role in the performance of the entire system, because an efficient algorithm will have a lower rate of packet drops and fewer out-of-order packet deliveries. Adaptive routing (Azad,
Khaoua, & Zomaya, 2005) assists in enforcing in-order packet delivery in system area
networks. It selects the route of a packet dynamically, after considering the state of the
network at real time. This is crucial for massively parallel computers and system area
networks as certain MPI libraries will not accept packets out of order.
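As a much-simplified illustration (not the mechanism of the cited work), an adaptive router chooses the next hop from the current network state; here the state is reduced to a hypothetical per-link queue depth:

```python
def next_hop(links):
    """Pick the outgoing link with the smallest observed queue depth.
    `links` maps a neighbour name (our invented state representation) to its
    current queue length; real adaptive routers consult richer real-time
    network state before forwarding a packet."""
    return min(links, key=links.get)
```

A router calling `next_hop({"a": 5, "b": 2, "c": 9})` would forward via link `b`, the least-congested choice at that instant.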
This efficiency in the supporting network infrastructure drastically improves the throughput
of a distributed computing system. Phoebus is one such model capable of dynamically
allocating network resources and using segment specific transport protocols between
gateways (Kissel, Swany, & Brown, 2010). Most existing networks bind end-to-end communication to a single transport protocol, whereas Phoebus utilizes session layer protocols and is thus able to break an end-to-end connection into a chain of connections, each of which can be carried over the best available network segment. This mechanism is not a way around congestion but an effective methodology for bulk data transfer.
2.1.2 DATA PLACEMENT AND ACCESS TECHNIQUES
Sharing data is essential in any distributed computing environment, yet it may slow down the parallel execution of the applications involved. Data placement therefore becomes important, as it defines the workflows associated with data fetching and recording mechanisms. This aspect has a very prominent effect on the performance of the entire system (Azad, Khaoua, & Zomaya, 2005). Moreover, data placement jobs, commonly known as storage requests, cannot be treated as computational jobs.
Data placement impacts performance because jobs whose data is either locally available or accessible without consuming many resources execute more efficiently than jobs that must spend more resources to access the required data. This derives from the effect of the spatial layout of jobs on I/O hotspots (regions where data is accessed frequently and which are usually busy with requests). This layout of jobs determines performance and cannot be improved merely by adding I/O nodes.
The data access aspect has further been addressed with two options (Azad, Khaoua, & Zomaya, 2005). One is a Distributed Virtual Shared Memory system based on the InfiniBand architecture, a next-generation interconnection technique.
This data representation scheme incarnates a data entity into a set of objects that are then
distributed in the cluster. A runtime support system manages the incarnated objects and
data access is possible only with the help of an appropriate interface. This technique, though, has a few limitations when it comes to generalizing the concept for all applications.
Such a data management strategy has been implemented in ARTS (Buttner, Nolte, & Schroder-Preikschat, 1999) (Adaptable RunTime System), a middleware defined and implemented for the PEACE (W. Schroder-Preikschat, 1994) family of parallel
operating systems. The high performance middleware layer is applicable to performance
sensitive virtual shared memory systems. The reason behind the development of this
middleware was the lack of performance in existing common operating systems, the
consumption of processing power in complex system call mechanisms and the huge
resource utilization by the operating system alone. The middleware provides quite basic
services for distributed processing in a global object space. This global environment consists
of active and passive objects which themselves reside in distributed domains or virtual
groups. Passive objects are under the influence of Active objects and hence can only be
manipulated by them. These two entities come in handy in implementing the shared memory paradigm. Since passive objects act like children of active objects, they are well suited to concurrent invocation of procedures with the help of temporary resources.
Resource sharing is further implemented for processes using the same hierarchy of objects. For resources to be shared, each process creates a clone of its dual (active and passive) object instances and transfers these clones to other processes. Only the user segment of the process is copied to the cloned process, whereas the system-level segment is shared by all clones. This effective resource sharing protects system-level information from being modified by remote entities while providing easy access to user-level information. It thereby enables clients to customize specific aspects of system services dynamically and removes the requirement of maintaining client-specific data at the server.
2.1.3 JOB SCHEDULER
Adding intelligence to the Task Scheduler and the Resource Allocator may imply the use of adaptation and reservation techniques at the application level. These approaches have their own limitations and overheads, but these have been overcome with efficient service designs (Foster, Roy, & Sander, A quality of service architecture that combines resource
reservation and application adaptation, 2000). This adaptation to resource requests is perhaps the key to improving the performance of distributed systems. Anomaly detection is another important area that can be worked on to impact overall performance. Added functionality can be defined to diagnose and resolve such erratic behaviours (Yang, Liu, M. Schopf, & Foster, 2007).
At times the scheduler may need to transport processes from an allocated processing entity
to another entity. Pre-emptive process transfer has therefore been found to be much better than remote execution, even though the memory transfer is expensive
(Harchol-Balter & Downey, 1997).
Adaptation techniques that are applied to the Resource Managers can be very well applied
to schedulers as well. The application of heuristics to schedulers has also been found to be effective in improving the performance of the system (Casanova, Legrand, Zagorodnov, & Berman, 2000). These improvements may not belong to a particular entity but do contribute
heavily towards the entire workflow. The FALKON (Raicu, Zhao, Dumitrescu, Foster, &
Wilde, 2007) framework presents some of the key features such as Multi Level scheduling
which can be very useful to enhance the performance. The schedules generated by such
mechanisms are mostly dependent on task deadlines. A schedule is, however, believed to be optimal if it achieves the smallest possible value of the ‘system hazard’, defined as the maximum normalized task flow-time, which is a better performance measure than task deadlines (Peng & Shin, 1993).
Often, parallel applications are migrated to distributed computing systems with heterogeneous resources. This may require a level of load balancing with respect to the various components of the application itself (Korkhov, Krzhizhanovskaya, & Sloot, 2008). An adaptive load-balancing technique for parallel applications therefore assists in improving the performance of both the application and the distributed system. This is primarily because the performance of parallel applications hosted across heterogeneous resources is highly dependent on the quality of workload distribution. This workload distribution can be achieved through functional decomposition or domain decomposition (Korkhov, Krzhizhanovskaya, & Sloot, 2008), but involves additional processing. It can be combined with application tuning fundamentals to improve the efficiency of the job scheduler.
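As a toy illustration of domain-decomposition-style workload distribution (the benchmark speeds and the proportional rule are our own simplification, not the cited adaptive method), work units can be split in proportion to measured node performance:

```python
def distribute(total_units, speeds):
    """Split work units among heterogeneous nodes in proportion to their
    hypothetical benchmark speeds, a simple static form of workload
    distribution. Returns one share per node, summing to total_units."""
    capacity = sum(speeds)
    shares = [total_units * s // capacity for s in speeds]
    # integer division leaves a remainder; hand it to the fastest nodes first
    remainder = total_units - sum(shares)
    for i in sorted(range(len(speeds)), key=lambda i: -speeds[i])[:remainder]:
        shares[i] += 1
    return shares
```

For example, `distribute(100, [1, 1, 2])` gives the twice-as-fast node half the work. An adaptive scheme would re-measure speeds at run time and redistribute, at the cost of the additional processing noted above.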
Job schedulers for parallel machines are often built to optimize metrics that may not deliver
the best possible application performance. A comparison of various job scheduling and processor allocation algorithms (Li, 2005) highlights a feature that would empower the job scheduler to dynamically manage resource allocations: the ability to allocate and deallocate resources between applications at run time.
This brings the job scheduler much closer to the resource manager and allocator, thereby
bridging the gap between the two entities. IScheduling (Weissman, Abburi, & England, 2003) defines an application-aware job scheduler that can dynamically control resource allocation between active and queued applications, even during execution. This allows the iScheduler to apply policies and rules that give system performance priority over application performance. It is based on three main ideas.
2.1.3.1 Cost prediction - Considers the costs of allocation and deallocation of resources to
respective applications in terms of resource consumption, wastage, achieving task deadline
etc.
2.1.3.2 Adaptivity - Responds to the dynamic environment in an efficient manner. This may
include reacting to application or hardware faults or other dynamic run-time events.
2.1.3.3 Admission Control - Controls the submission of an application to a queue or directly
to processing sub-system. This is crucial as it needs to evaluate the current situation in
terms of resource availability and requirements of the entering application.
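The three ideas can be sketched together in a deliberately naive scheduler. This is our own illustrative stand-in, not the iScheduler implementation; the job dictionary keys and the CPU-seconds cost model are invented for the example.

```python
class ISchedulerSketch:
    """Illustrative sketch of the three iScheduler ideas: cost prediction,
    adaptivity, and admission control (not the authors' implementation)."""
    def __init__(self, free_cpus):
        self.free_cpus = free_cpus
        self.queue = []

    def predicted_cost(self, job):
        # cost prediction: naive allocation cost estimate in CPU-seconds
        return job["cpus"] * job["est_runtime"]

    def admit(self, job):
        # admission control: run only if current resources cover the request,
        # otherwise hold the job in the queue
        if job["cpus"] <= self.free_cpus:
            self.free_cpus -= job["cpus"]
            return "run"
        self.queue.append(job)
        return "queued"

    def on_node_failure(self, lost_cpus):
        # adaptivity: react to a dynamic run-time event (a hardware fault)
        # by shrinking the pool available to future admissions
        self.free_cpus = max(0, self.free_cpus - lost_cpus)
```

A real iScheduler would additionally weigh deallocation costs and system-wide policies before pre-empting a running application, which this sketch omits.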
An alternate perspective holds that a job scheduler that considers the priorities and deadlines associated with tasks usually does not require knowledge of the global state. This idea is efficient at maintaining the performance thresholds of the system, but only insofar as critical applications meet their deadlines (Liu & Baskiyar, 2009). Non-critical applications will suffer under such a policy, and the overall performance of the system will not be the best achievable.
A job scheduler would be much more efficient if supplied with fault-aware policies (Al-Azzoni & Down, 2008). Its decisions will then be more effective, aware of both the current and possible future states of the environment in terms of jobs in queue, jobs in execution and, not least, persistent faults.
Giving the job scheduler the flexibility to perform dynamic resource scheduling or load balancing is beneficial only when it is actually required; restricting it in this way prevents the mechanism from becoming a burdensome cost factor. Balancing processor load without regard for communication costs can degrade performance when network communication becomes a dominant factor.
A further point of discussion concerns the overheads introduced by these entity-specific features, or improvisations as we call them. A mechanism that accounts for these overheads while also minimizing execution time is of utmost importance when considering such modifications (Ito, Ohsaki, & Imase, 2005).
2.1.4 RESOURCE MANAGER
A resource manager plays an important role within a distributed system: it keeps track of the status of the various resources that constitute the system, and it is critical to task allocation and scheduling as well. If a resource becomes unavailable without informing the Task Manager, the Task Manager would still consider the resource available and allocate it to a task. The task may then return incomplete or failed and require another phase of processing, bringing down the throughput of the entire system.
In grid computing, the term resource management refers to the operations used to control
how capabilities provided by the resources and services of a grid are made available to
other entities such as users, applications, or peer services (Foster & Kesselman, The Grid:
Blueprint for a New Computing, 2004). Resources can be computational, storage, network
or blocks of code, with each having a specific attribute defining its processor utilization, disk
space, memory or network bandwidth (Foster & Kesselman, The Grid: Blueprint for a New Computing, 2004). Resource management is a critical component of any distributed
computing architecture, specifically for a grid. The main responsibilities of a resource
manager include resource selection for jobs in queue, resource monitoring and task
migration in case of faults or resource failures.
Resources can be managed in two ways. In the first, the user bears the manual responsibility of searching for desired resources using an information service and submitting the selection to the local resource manager; the second uses a designated resource manager. The first method is not only an added responsibility for the user but is inefficient as well: the user may not be aware of the status or capabilities of the chosen resources, so this effectively random selection may lead to performance degradation through the consumption of unproductive resources.
Such resource managers are available in Condor-G (Frey, Foster, Livny, Tannenbaum, &
Tuecke, 2001), PBS (Papakhian, 1998), and Legion (Grimshaw & Wulf, 1996). Condor-G combines components of Globus and Condor to let users draw on multi-domain resources (Lee, Chung, Chin, Lee, & Lee, 2005) with the help of a matchmaking mechanism named Gang-Matching.
The existing approaches employed in Condor and Globus do not focus on best resource selection and hence deliver constrained performance. They address fault tolerance issues but do not provide a generic workflow for detecting and rectifying failures; Globus in particular lacks a fault tolerance mechanism.
The efficient resource manager (Lee, Chung, Chin, Lee, & Lee, 2005) utilizes a genetic
algorithm to provide the set of optimal resources for job execution. The user describes a
resource type, a resource condition, and the number of resources using Resource
Specification Language. Resource manager utilizes this information to discover resources
that satisfy these requirements. The Monitoring and Discovery service provides information
about resources available in the grid and generates the list of prospective resources that fit
the job requirements (Leea, Chungb, China, Leea, & Leea, 2005).
The components of this resource manager are:
2.1.4.1 RSL Parser – RSL is the specification language used by the Globus Toolkit to
describe task configuration and service requirements (Foster & Kesselman, The Grid:
Blueprint for a New Computing, 2004). To execute a job with the resource manager, the
user describes a resource type, a resource condition, and the number of resources in RSL.
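For illustration, a GRAM resource request in RSL might look like the following; the executable path, processor count, memory figure and arguments are hypothetical values, not drawn from the cited work:

```
& (executable = "/home/user/simulate")
  (count = 4)
  (maxMemory = 2048)
  (arguments = "-input" "data.txt")
```

Each parenthesised clause constrains one attribute, and the leading "&" requires all clauses to hold at once.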
2.1.4.2 Resource Search Agent – discovers resources that are accessible and able to satisfy
the requirements posed by the user. Being based on the Globus Toolkit architecture, it
uses the Monitoring and Discovery Service to achieve this.
2.1.4.3 Resource Selection Agent – takes the list of resources supplied by the search agent
and determines the optimal set. It executes the genetic algorithm (Leea, Chungb, China,
Leea, & Leea, 2005) to find the optimal resources for efficient job execution; here,
optimal resources are those that minimize the longest execution time of the jobs running
on the grid.
2.1.4.4 Execution Time Predictor – predicts the execution time for a given job. A Running
Time Advisor (RTA) (Dinda, 2002) is used to predict the running time of a computation-bound
task.
2.1.4.5 Resource Allocation Request Agent – converts the user's RSL into ground RSL for
sending resource allocation requests to GRAM. It further uses the Dynamically Updated
Request Online Coallocator (DUROC) (Foster & Kesselman, The Grid: Blueprint for a New
Computing, 2004) for co-allocating resources.
2.1.4.6 Genetic Algorithm – used by the resource selection agent, it iteratively evaluates
candidate combinations of the input resources against a defined fitness threshold. New
combinations are generated by a single-point crossover across the combinations deemed fit,
and mutation is then applied when selecting the members of the new generation.
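The crossover-and-mutation cycle described above can be sketched as follows. The load-based fitness function, the population sizes and the resource pool are illustrative assumptions, not details from the cited work:

```python
import random

def fitness(combo):
    # Toy fitness: stand-in for the (negated) longest predicted execution
    # time on this resource combination -- lightly loaded combos score higher.
    return -max(r["load"] for r in combo)

def crossover(a, b):
    # Single-point crossover across two candidate resource lists.
    point = random.randint(1, len(a) - 1)
    return a[:point] + b[point:]

def mutate(combo, pool, rate=0.1):
    # Occasionally swap a member for a random pool resource (duplicates are
    # tolerated here to keep the sketch short).
    return [random.choice(pool) if random.random() < rate else r for r in combo]

def select_resources(pool, combo_size, generations=50, pop_size=20, seed=0):
    random.seed(seed)
    population = [random.sample(pool, combo_size) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]   # combinations deemed fit
        children = [mutate(crossover(random.choice(survivors),
                                     random.choice(survivors)), pool)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)

pool = [{"id": i, "load": load}
        for i, load in enumerate([0.9, 0.2, 0.5, 0.1, 0.7, 0.3])]
best = select_resources(pool, combo_size=3)
print([r["id"] for r in best])
```

Because the fittest survivors are retained each generation, the best combination found can only improve or stay the same as the search proceeds.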
Handling failures is another aspect a resource manager should consider. An efficient
resource manager handles the following types of failures (Leea, Chungb, China, Leea,
& Leea, 2005).
2.1.4.7 Process failure
2.1.4.7.1 A process stops (process stop failure)
2.1.4.7.2 Starvation of a process (process QoS failure)
2.1.4.8 Processor failure
2.1.4.8.1 A processor crash (processor stop failure)
2.1.4.8.2 A decrease of processor throughput due to burst jobs
(processor QoS failure)
2.1.4.9 Network failure
2.1.4.9.1 A network disconnection and partition (network disconnection
failure)
2.1.4.9.2 A decrease of network bandwidth due to communication
traffic (network QoS failure)
There are a number of techniques (Ravindran, Devarasetty, & Shirazi, 2002) that
continuously monitor an application at run time to verify the status of its real-time
requirements and detect anomalies or failures. They can even allocate resources
dynamically, duplicating sub-tasks of the application to achieve load sharing. These
techniques aim to minimize the missed-deadline ratios of the associated tasks; a
predictive resource allocation algorithm determines the number of sub-task replicas
required to suit the application to the current workload in the system.
The design of such resource managers highlights three requirements. A resource manager
needs the support and intelligence to evaluate the fitness of available resources and
their suitability for a given job. It should also know for how long these optimal
resources should be allocated to a job, and be able to adjust the allocation at run time.
Finally, a resource manager should, at the very minimum, be tolerant of the failures
defined above.
2.1.5 RESOURCE ALLOCATOR
Many works do not emphasize the development of adaptive resource allocation algorithms
but rather focus on the challenges of system implementation, often using naive resource
allocation mechanisms as placeholders. Other works target resource allocation in its full
complexity, formulating a precise optimization problem and a good algorithm to solve it,
but do not examine the implications of incorporating such individually efficient entities
into a distributed system.
Resource allocation in parallel supercomputers requires the user to specify two things
beforehand (Cirne, 2002):
How many processors are to be allocated to the job.
For how long the job requires the processors to be allocated.
A number of such requests are supplied to the supercomputer for executing a job; the
supercomputer may choose among the available options and execute the job. The application
scheduler estimates the turnaround for each request based on the current state and decides
upon a particular request. Jobs receive a dedicated partition to run for a pre-established
amount of time. Arriving jobs may not find enough resources to execute immediately and
hence have to wait in a queue controlled by the supercomputer scheduler (Cirne, 2002).
The moldable job is another paradigm employed to improve the performance of an HPC
system. A moldable job can run on partitions of different sizes, even though the partition
size cannot be changed during execution. Since moldable jobs can use multiple partition
sizes, multiple requests can be used to submit a given moldable job.
In reality, most jobs execute for less time than they request, so there is always a delta
during which the allocated processors are not executing any job. Common practice in
supercomputer scheduling exhibits some key features:
Out-of-order job start-up.
Allocation recycling
o If requested time is not used, it is allocated to another job.
Aging
o Avoids starvation of large jobs.
To address these issues, a conservative backfilling mechanism is ideal. Conservative
backfilling uses allocation lists to backfill vacant slots with the most suitable jobs,
without delaying any job that already holds a reservation.
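A minimal sketch of conservative backfilling, under the assumption of integer job times and a single queue; the candidate-start-time scan and the example jobs are illustrative:

```python
def free_procs(reservations, total, t):
    # Processors not reserved at integer time t; reservations are
    # (start, end, procs) triples with the job occupying [start, end).
    return total - sum(p for s, e, p in reservations if s <= t < e)

def earliest_start(reservations, total, need, runtime):
    # Scan candidate start times: now, plus every already-booked completion.
    # Existing reservations are never moved, so a backfilled job can only
    # fill a hole -- it cannot delay a job that already holds a reservation.
    for t in sorted({0} | {e for _, e, _ in reservations}):
        if all(free_procs(reservations, total, x) >= need
               for x in range(t, t + runtime)):
            return t
    raise ValueError("job needs more processors than the machine has")

def schedule(jobs, total):
    # jobs: (name, procs_needed, requested_runtime) in queue order.
    reservations, starts = [], {}
    for name, need, runtime in jobs:
        t = earliest_start(reservations, total, need, runtime)
        reservations.append((t, t + runtime, need))
        starts[name] = t
    return starts

# A fills 3 of 4 processors until t=4, B must wait for all 4, and the
# narrow job C backfills the single free processor without delaying B.
print(schedule([("A", 3, 4), ("B", 4, 2), ("C", 1, 3)], total=4))
```

In this example job C starts immediately in the hole beside A, while B's reservation at time 4 is untouched, which is precisely the conservative guarantee.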
Making appropriate decisions for allocating hardware resources is always a key challenge,
especially for services with conflicting demands. Here, appropriate implies meeting QoS
requirements, ensuring fairness among services, maximizing platform utilization, and
maximizing service-defined utility functions (Stillwell, Schanzenbach, Vivien, & Casanova,
2010).
Resource allocation often involves sharing servers among application instances, and it is
important for the resource provider to maximize revenue for such a service. This requires
the resource allocator to employ a number of profiling techniques to obtain statistical
bounds on resource usage and thereby minimize resource needs, an idea that involves a
number of challenges (Urgaonkar, Shenoy, & Roscoe, Resource Overbooking and Application
Profiling in Shared Hosting Platforms, 2002). Resource allocation systems control how
multiple services share the platform: each service consists of one or more VM instances,
and the system ensures that requests to the services are dispatched to appropriate
servers. A mechanism to handle such scenarios defines two kinds of resource needs
(Urgaonkar, Shenoy, Chandra, Goyal, & Wood, 2008):
2.1.5.1 Fluid resource needs – specify the maximum fraction of a resource that the service
could use if alone on the server; the service cannot benefit from a larger fraction but
can operate with a smaller one at the cost of reduced performance.
2.1.5.2 Rigid resource needs – imply that a specific fraction of a resource is required;
the requesting service will not benefit from a larger fraction and cannot operate with a
smaller one.
Determining the actual values of these resource fractions is the key issue. To reason
about the resource fractions allocated to services, and to optimize the allocation, a
metric named ‘yield’ has been defined (Mi, Casale, Cherkasova, & Smirni, 2008). A service
is allocated resource fractions based on its yield value, and the remaining resources are
allocated to services beyond their QoS requirements. The yield metric enables the resource
allocator to strike a balance between the resource demands of two services without
affecting the performance of the entire system, and to maximize the efficiency of jobs by
giving them a good deal in the form of resources. This type of resource allocation has
been formulated to maximize the minimum yield over all services, thereby making the least
happy service as happy as possible. The design was evaluated with a number of algorithms,
among which the ‘chose pack vector algorithm’ proved most promising, as it runs fastest
while remaining very effective.
Another metric, ‘stretch’, applies to time-bound computational jobs rather than continuous
services, and is defined as a job's flow time divided by the time the job would take on
dedicated resources (Bender, Chakrabarti, & Muthukrishnan, 1998).
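The yield and stretch metrics can be illustrated with a small sketch. The equal-scaling allocator below is a simplified stand-in for the algorithms evaluated in the cited work, and the demand figures are invented:

```python
def yields(demands, allocations):
    # Yield of a service: fraction received relative to fraction demanded.
    return [min(a / d, 1.0) for d, a in zip(demands, allocations)]

def max_min_yield(demands, capacity):
    # Scale every demand by the same factor so the total fits the capacity.
    # An equal yield y = capacity / total_demand (capped at 1) maximises the
    # minimum yield: raising any service above y would push another below it.
    y = min(1.0, capacity / sum(demands))
    return [y * d for d in demands]

def stretch(release, finish, work):
    # Flow time relative to the job's uncontended running time.
    return (finish - release) / work

demands = [0.6, 0.9, 0.5]            # resource fractions wanted by 3 services
alloc = max_min_yield(demands, capacity=1.0)
print(yields(demands, alloc))        # every service ends up equally happy
print(stretch(release=0, finish=30, work=10))
```

With total demand of 2.0 against a capacity of 1.0, every service receives half its demand, so the minimum yield of 0.5 cannot be improved for any service without hurting another.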
2.1.6 APPLICATION TUNING
System efficiency can also be increased by exploring application tuning. Tuning refers to
the ability of an application to trade off resource requirements against other dimensions
such as time, resource specification, and quality of results (Chang, Karamcheti, & Kedem,
2000). This gives the resource manager flexibility when selecting resources for submitted
jobs.
Application tuning, though, requires a generic language or ontology with which the scope
of tuning can be defined. The language and scheduler extensions defined for the MILAN
(Baratloo, Dasgupta, Karamcheti, & Kedem, 1999) meta-computing environment improve
resource utilization through application tuning. The middleware layers constituting MILAN
enable reliable, efficient and predictable execution of applications on a dynamically
changing group of nodes.
Resource allocation in MILAN uses a two-level strategy. The application conveys its
requirements to the resource broker, which monitors resource availability in the entire
system and dynamically divides resources amongst the queued jobs. An application manager
controls the second level of the strategy, further dividing the resource allocations
amongst the individual tasks of the assigned computation or job. Though quite efficient,
this flow of resource allocation lacks intelligent selection, because choices are made on
the basis of availability rather than suitability.
Application tuning requires the application to be analysed to a certain depth in order to
identify its various paths of execution, which in turn assist in determining application
profiles. These profiles enable the tuning mechanism to make trade-off decisions and help
the resource manager select the most appropriate resource for the task, appropriate here
meaning a resource which is available and whose allocation will not hamper the performance
of the system.
Application tuning primarily incorporates two types of modifications: the first applies
variations to the behaviours constituting the application, and the second studies the
effects of resource variations on the outputs generated.
MATE (Monitoring, Analysis and Tuning Environment) (Caymes-Scutari, Morajko, Margalef, &
Luque, 2010) is another model based on the same paradigm of application tuning to achieve
better performance. This model, however, evolves with each execution cycle and is able to
make progressively better decisions regarding the tuning of parallel applications.
2.1.7 SOME OPTIMIZED HPC SYSTEMS
When empowered and brought together, the entities defined above can help establish a
performance-oriented and efficient distributed system. The idea can be summarized
effectively by comparison with the mechanism defined for FireGrid (Han, et al., 2010).
FireGrid is an efficient response system supporting large-scale emergencies: the
advantages of distributed computing are used to impart efficiency and accuracy to the
decision support system used for responding to them. The mechanism consists of two tasks.
The first task is data gathering, wherein information reflecting the state of the
environment is collected in the form of metrics, in any of the ways described earlier that
suit the scenario. The second task is data interpretation, a critical entity in the system
that analyses the gathered information into a form the system can use to generate a set of
possible suggestions.
A logical representation of the information is defined with the help of an ontology
specific to the system. This ontology not only enables the system to model information
efficiently but also lets the model be used accurately by the processing entities; it thus
provides common ground for understanding between the generator of the information and its
consumer. FireGrid was tested with the K-CRISP simulation model, which generated a large
number of independent scenarios to be processed by the system, and the results were
strongly in favour of the system.
The provision of efficient system support for concurrent applications has always been an
issue, with two potential hindrances (Cruz & Park, 1999). The first is the need for an
effective mechanism to maintain relationships among data or task objects with respect to
their semantics; the second is a scheduling algorithm able to match coupled processes to
their costs. DUNES (Cruz & Park, 1999) is a system support environment designed to
overcome these hindrances, with the following characteristics.
2.1.7.1 Dependency Maintenance
DUNES ensures that functional dependencies of the form process-to-process and
process-to-file are maintained by the system in the presence of dynamic scheduling. If a
process migrates to another host, its relations are kept intact with preserved semantics.
2.1.7.2 Performance features
It moderates overhead costs by employing active and passive end-point caching, and applies
communication-sensitive load balancing at the algorithm level, which affects dynamic
scheduling and is itself linked to communication costs.
2.2 PERFORMANCE EVALUATION OF HIGH PERFORMANCE COMPUTING
SYSTEMS
Over the past few years of research in this domain, harnessing the collective power of
distributed computing became the ultimate goal, coupled with a number of specific
requirements. Today, harnessing distributed computing potential alone is no longer a
complex task, and the focus has shifted to perfecting the harnessing mechanism itself.
Performance evaluation is therefore necessary to judge this perfection.
In order to fine-tune performance, it must first be evaluated. This is a challenge in
itself given the distributed nature of high performance computing scenarios such as
computational grids; Generalized Stochastic Petri Nets are a viable solution to this
problem (Bruneo, Scarpa, & Puliafito, 2010).
A workflow is established to record time stamps for various events, and the gathered data
set is used to evaluate performance in terms of throughput, resource utilization, response
time and so on. Jobs are primarily segregated by type, which also reflects the queue type.
A chain of logging events is triggered upon each variation in the status of a job; these
events record the times at which the job transitions from one state to another, for
instance from waiting to ready, or from running to finished. The recorded time stamps
enable calculation of the duration each job spent in each state of its life cycle, and
aggregating this calculation to the system level provides a fair evaluation of performance
over the period.
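The time-stamp workflow above can be sketched as follows; the state names and log format are illustrative assumptions:

```python
from collections import defaultdict

def state_durations(transitions):
    # transitions: timestamped (time, new_state) pairs emitted by the
    # logging events on each change of job status, in chronological order.
    # Each state's duration runs from its entry until the next transition.
    durations = defaultdict(float)
    for (t0, state), (t1, _) in zip(transitions, transitions[1:]):
        durations[state] += t1 - t0
    return dict(durations)

log = [(0.0, "waiting"), (4.0, "ready"), (5.0, "running"), (65.0, "finished")]
print(state_durations(log))   # time spent waiting, ready and running
```

Summing these per-job durations across all jobs in a period gives the system-level view of queue waiting time versus productive execution time.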
The defined model also highlights the necessity of such an evaluation for adhering to the
service level agreements held with clients. Though quite effective, the performance
evaluation model has not been used to support the entities constituting the distributed
system; if this important data set were shared with those entities, performance would
improve drastically.
Petri Nets can also be applied to evaluate the performance of scheduling algorithms used
in real-time distributed systems; S-Nets (Balaji, Patnaik, Jenkins, & Goel, 1992) is an
example of such a scenario.
2.2.1 FAULTS
Faults tend to become a hindrance to the performance of any computing scenario. Their
detection is an overhead and should be compensated for by their resolution. An
algorithm-based fault tolerant model (Bosilca, Delmas, Dongarra, & Langou, 2009) for high
performance computing is worth the associated computational overhead only if faults can be
recovered within the scope of the job itself. The fault tolerance system described there
is capable of detecting and correcting errors during the computation itself. This is
achieved by utilizing additional processors to store redundant information in the form of
checksums, which represent an efficient and memory-effective way of supporting multiple
failures (Bosilca, Delmas, Dongarra, & Langou, 2009); the redundant information is fetched
when required with the help of processor matrices.
This approach is very practical but not scalable enough to incorporate large amounts of
data sharing, as a good number of processors would be devoted to keeping redundant
information, which could turn out to be expensive in terms of performance and throughput.
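A much-simplified illustration of checksum-based recovery for a single failure; the cited system operates on floating-point matrix data with more elaborate encodings, while this sketch uses a plain additive checksum:

```python
def checksum(blocks):
    # The extra "checksum processor" stores the column-wise sum of the
    # data blocks held by the working processors.
    return [sum(col) for col in zip(*blocks)]

def recover(blocks, lost, check):
    # Rebuild the lost processor's block: checksum minus the survivors.
    survivors = [b for i, b in enumerate(blocks) if i != lost]
    return [c - sum(col) for c, col in zip(check, zip(*survivors))]

data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # blocks on three processors
check = checksum(data)                         # held on a fourth processor
print(recover(data, lost=1, check=check))      # reconstructs [3.0, 4.0]
```

One redundant processor tolerates one failure here; tolerating k simultaneous failures requires k checksum processors, which is the scalability cost noted above.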
Tuning is a process used for overcoming bottlenecks in a system and optimizing its
performance, and it involves trade-offs between different aspects of the system. Consider
a car with two basic evaluation metrics, top speed and acceleration: performance tuning
would aim to maximize either top speed or acceleration, and there is a trade-off between
the two, since increasing the top speed brings the acceleration down and vice versa.
The case is similar for High Performance Computing systems. The metrics within the scope
of this thesis are CPU usage, MEMORY usage and VIRTUAL MEMORY usage. It is a fair
assumption that if the performance tuner optimizes CPU usage, there will be a potential
increase in MEMORY usage, and vice versa.
Optimized utilization of resources is not the only reason to incorporate Performance
Tuning in high performance computing systems. Because most of these systems are
pay-per-use, the organizations providing them as services need stringent control over the
allocation of resources while remaining flexible enough to provide those resources at an
agreed level of service. This not only yields efficient resource utilization but also
enables system owners to charge differently for various levels of service: by categorizing
resources into quality levels, the cost-controlling department can charge more for
resources at higher quality levels than for those at lower ones.
3.1 ONTOLOGY ATTACHMENT
Performance Tuning does not strictly require an ontology describing the underlying
resources and their corresponding metrics, but a system-describing ontology enables
efficient organization of resources into the desired quality levels and makes it easier
for the job scheduler to query the ontology and retrieve the list of suitable resources.
A system ontology can therefore be built to organize the available resources into the
desired hierarchies or levels. Selecting the relation and order of the hierarchies is an
important aspect of building such ontologies, and there are several options.
3.1.1 Resource First
This model selects the resources as head nodes. The next level of the
hierarchy contains job groups; further down are the system aspects,
followed by quality level designations, as illustrated in the figure
below.
FIGURE 1 RESOURCE FIRST MODEL
Though this model is the simplest, it calls for complex querying when a
set of optimized resources is required for a particular job. The job
group is the first piece of information determined upon receiving a job
for execution; if this model were used for resource categorization, the
query would first look for suitable resources and then navigate through
the matching job groups to find matches for the required quality levels
of the corresponding aspects. A depth scan would thus be required for
each resource node.
3.1.2 Job First
This model starts with job-first grouping: job groups sit at the top of
the tree, followed by the various system aspects; quality levels are
defined for each aspect, and finally resources are associated with those
quality levels. Such a model has been illustrated in the figure below.
FIGURE 2 JOB FIRST MODEL
This model avoids the per-resource-node scanning of the previous model
and is hence simpler and more efficient. It also requires minimal
redundancy of nodes, as each job group has a fixed number of aspects as
its children, consistent across any number of job groups.
3.1.3 Metric First
This model starts with metric-first grouping: metrics sit at the head of
the tree, followed by the respective job groups, which are then descended
by quality levels linked to resource nodes. The illustration below
reflects one such model.
FIGURE 3 METRIC FIRST MODEL
This model is much the same as the previous one, differing only in the
amount of node redundancy, which is considerably higher here. This is
because, for any high performance computing system, the job groups will
always outnumber the aspects or metrics that require tuning, leading to
the repetition of those job groups under each aspect.
We have chosen the Job First model for our ontology because its tree structure fits the
optimization process, which starts with job characterization and progresses towards
finding the optimum set of resources. With minimal data redundancy, this model is the most
efficient for a breadth-first search on the resulting tree.
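A breadth-first query over a Job First tree might look as follows; the dictionary encoding, node names and quality levels are illustrative, not the actual ontology used:

```python
from collections import deque

# Job First hierarchy: job group -> aspect (metric) -> quality level -> nodes.
ontology = {
    "job_group_1": {
        "CPU":    {"level_5": ["node3", "node4"]},
        "MEMORY": {"level_2": ["node1", "node2"],
                   "level_4": ["node6"]},
    },
}

def find_resources(tree, job_group, metric, level):
    # Breadth-first search from the job group down to the resource nodes
    # sitting under the requested metric and quality level.
    queue = deque([(job_group, tree.get(job_group, {}))])
    while queue:
        label, children = queue.popleft()
        if label == level:            # quality level reached: leaves are nodes
            return children
        if not isinstance(children, dict):
            continue                  # a node list under a non-matching level
        for key, sub in children.items():
            if label == job_group and key != metric:
                continue              # prune aspects other than the one queried
            queue.append((key, sub))
    return []

print(find_resources(ontology, "job_group_1", "MEMORY", "level_2"))
```

Because the query enters at the job group, only one job group's subtree is ever searched, which is the advantage this model holds over the Resource First layout.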
3.2 EXPERIMENTAL SETUP
Performance Tuning requires scoping of the system to highlight specific metrics which can
be tuned. This further requires gathering system usage statistics and their analysis.
Therefore the objectives for the first phase of experimentation are as follows:
3.2.1 Statistical-Data Collection
3.2.2 Statistical-Data Analysis
3.2.1 Statistical-Data Collection
For the purpose of this study, we had access to two distributed system interfaces, Nimrod
and Monash Sun Grid.
3.2.1.1 HPC SYSTEM INTERFACES
3.2.1.1.1 Nimrod
Nimrod is a specialized parametric modelling system developed to provide a means of
exploring the behaviour of computationally intensive experiments. It uses a simple job
definition language to express a parametric experiment and provides a mechanism that
automates the formulation, execution, monitoring and collection of results from the
various sub-experiments generated. The system includes a distributed scheduling module to
manage the scheduling of experiments onto the available resource nodes, and provides five
distinct tools that can be utilized over the underlying grid structures.
Tool Purpose
Nimrod/G Hosts the parameter sweep service and grid/cloud execution tools, including scheduling across the available computational resources.
Nimrod/O Optimization tool providing a framework for optimizing a target output value of an application. Used along with Nimrod/G; allows parallelism in the search algorithm.
Nimrod/OI Interactive interface for Nimrod/O; some applications require the user to select the best output, which is fed back into Nimrod/O to produce further suggestions.
Nimrod/E Offers experimental design techniques for understanding the effects of parameter switching on the output. Runs with Nimrod/G and allows scaling of experiments on grid or cloud resources.
Nimrod/K Built on the Kepler workflow engine; integrates the Nimrod tools and enhances them with dynamic parallelism in workflows.
TABLE 1 NIMROD TOOLS (NIMROD)
3.2.1.1.1.1 Access Methods
Nimrod can be accessed in two basic ways. Users can log in to the Nimrod portal at
https://messagelab.monash.edu.au/NimrodMeRC/. Alternatively, Nimrod can be accessed using
‘ssh’ or another shell login utility such as ‘PuTTY’, connecting to
nimrod.messagelab.monash.edu.au.
3.2.1.1.1.2 Job Characteristics
Nimrod/G works on the basis of plan files describing the experiment, written in a simple
declarative language. A plan file consists of two sections: the parameters involved in the
experiment are defined in the first, while the tasks required to complete the execution of
one instance of the job are defined in the second.
3.2.1.1.1.2.1 Parameters
This is the list of values, constant or dynamic, whose scope and
type can be defined in the following format:
parameter <name> <type> [<domain>]
Name defines the identity of the parameter and should be unique for each
parameter. Nimrod creates an environment variable for each defined parameter and uses it
during the execution of the job.
Type describes the attribute of the parameter. Four self-describing types are
allowed, namely float, integer, text and files.
Domain defines the range of values that the variable can take during the execution
of the experiment: a single value, a range of values, a random value, multiple random
selections from a list (known as selectanyof), or a single random selection from a
given list (known as selectoneof).
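An illustrative plan file in this style; the syntax follows Nimrod/G plan-file conventions, but the parameter names, domains and task body are hypothetical:

```
parameter temp float range from 270.0 to 310.0 step 10.0
parameter seed integer random from 1 to 10000
parameter method text select anyof "fast" "accurate"

task main
    copy model.in node:.
    node:execute ./simulate -t $temp -s $seed -m $method
    copy node:results.out results.$jobname
endtask
```

Nimrod expands the parameter combinations into sub-experiments, substituting each parameter's environment variable into the task commands.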
The above information about Nimrod High Performance System has been taken from
https://messagelab.monash.edu.au/NimrodG (Nimrod)
3.2.1.1.2 Monash Sun Grid (MSG)
Monash Sun Grid is a high performance computing system well suited to data-intensive tasks. It was commissioned in April 2005 to meet the high performance computing needs of researchers within Monash.
It supports the GNU C, C++ and Fortran 95 compilers as well as the Intel Fortran 95 compiler, and uses Sun N1 Grid Engine 6.1 as its batching system. Globus Toolkit 4.0.5 provides the Grid services.
The head node has failover capability and each node has a dual gigabit Ethernet connection. Applications such as OpenMPI, CCP4, NetCDF, FLTK, NAMD and CTSIM are available as well.
The pools of nodes within the Monash Sun Grid run Scientific Linux 5.1, Microsoft Windows XP, Sun Solaris v10 and Java.
Cluster            CPU     Nodes   CPU cores   RAM per core   Total RAM   Total GFlop/s   Node IDs
Interactive Logon  X4600   2       32          2 GB           64 GB       294             msgln1, msgln2
MSG-II             X2100   32      64          2 GB           128 GB      333             gn33-gn67
MSG-III            X6250   130     1040        2-3 GB         2360 GB     10,400          gn101-gn190, gn201-gn240
MSG-IIIe           X6220   24      96          10 GB          960 GB      500             gn68-gn91
MSG-IV             X4600   3       96          3 GB           288 GB      883             gn98-gn100
MSG-GPU            -       10      80          6-12 GB        576 GB      -               nn101-nn110
Total              -       201     1408        -              4376 GB     12,410          -
TABLE 2 MSG EXECUTE NODES (MONASH SUN GRID)
Monash Sun Grid incorporates four primary classes of compute nodes, defined as follows:
Standard Cores – MSG-II and MSG-III This class of resources has 784 CPU cores and,
with 2 GB RAM per core, access to 1568 GB of RAM.
Interactive Login This class has 32 CPU cores with 2 GB of RAM each.
Hi-RAM Cores – MSG-IIIe There are 96 CPU cores with 10 GB RAM per core, or
40 GB RAM per node, in this class. Each of the MSG-IIIe nodes also has a similar
quantity of virtual memory swap space, providing the flexibility to be configured
in a number of ways to cater to specific memory requirements while still
maintaining minimal swap load.
Tightly-Coupled – MSG-IV (Brecca-in-a-box) This class of resources is specifically
for communication-intensive MPI applications and larger-scale shared memory
applications that require up to 32 processes. The group includes 96 CPU cores with
3 GB RAM per core, giving a whopping 96 GB of RAM per node. The nodes are supplied
with high-bandwidth HyperTransport and InfiniBand CPU/memory interconnects: the 32
CPU cores within each Sun X4600 chassis communicate with the RAM via the AMD
HyperTransport bus SMP architecture, with InfiniBand interconnecting the three
nodes.
MSG was also equipped with GPU (Graphics Processing Unit) processing capability through
five nVidia Tesla S1070 quad-GPU arrays, with additional CPU cores added to support the
GPU processing. This allocation provides a total of 4800 GPU ALUs (Arithmetic Logic Units)
yielding 2 Tflop/s in double precision and 20 Tflop/s in single precision, along with an
additional computing capacity of 5 Gflop/s per watt for suitable applications that can
adapt to parallel processing on the GPU architecture.
The above information about Monash Sun Grid has been taken from
http://www.monash.edu.au/eresearch/services/mcg/msg.html (Monash Sun Grid)
3.2.1.1.2.1 Access Methods
The Monash Sun Grid provides two nodes for the submission of jobs. Users can log in
at msgln1.its.monash.edu.au or msgln2.its.monash.edu.au using ‘ssh’ or another shell
login utility such as ‘PuTTY’. Since the Grid runs the Globus Toolkit, users can
also use the Globus client tools to connect and execute jobs on the Monash Sun Grid.
Jobs can be in the form of shell scripts, MPI programs, Gaussian job scripts, NAMD
programs and other supported GNU compiler source files, and the grid can also accept
Globus jobs. (Monash Sun Grid Usage Guide)
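A hypothetical submission script of the kind accepted by the Sun N1 Grid Engine batching system mentioned above; the "#$" directives are standard Grid Engine syntax, but the job name, resource limit and program are illustrative:

```sh
#!/bin/bash
#$ -N sample_job          # job name shown in the queue listing
#$ -cwd                   # run in the directory the job was submitted from
#$ -l h_vmem=2G           # request 2 GB of virtual memory per slot
#$ -o sample_job.out      # standard output file
#$ -e sample_job.err      # standard error file

./my_program input.dat
```

Such a script would be submitted with ‘qsub sample_job.sh’ and its progress monitored with ‘qstat’.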
3.2.1.1.2.2 Resource Usage
Monash Sun Grid utilization can be monitored with the Ganglia monitoring system
installed therein, which can be accessed at
http://msg.its.monash.edu.au/ganglia/ (Monash Sun Grid Resource Utilization).
Ganglia enables a user to analyse not only system utilization but node utilization
as well.
FIGURE 4 MONASH SUN GRID UTILIZATION
The Ganglia interface allows for selecting a specific resource node and a particular
metric to evaluate its performance.
Of the two High Performance Computing systems mentioned, we focused our experimentation
on the Monash Sun Grid for two main reasons.
1. The Queue Manager on the Head Node of MSG allows a user to see the list of jobs
submitted by all the other users. This allows for an extensive exploration of usage
statistics on the system.
2. The number of resources linked with the Monash Sun Grid is far greater than what
is available through Nimrod. This further adds depth to the data set gathered.
3.2.1.2 Selected Metrics
The critical metrics for a computational job on an HPC system are CPU usage (or wall time),
I/O usage and memory usage. Tuning can certainly be extended to other metrics, but these
are the ones that matter the most.
CPU USAGE
o Often known as wall time, this is the actual number of ticks spent by the CPU
executing the job or a particular task. A distributed system has a time-sharing
characteristic, so CPU usage cannot be a mere difference
between the job submission time and the job completion time. The various
phases in the life cycle of a job, such as queuing, scheduling, execution and
result compilation, induce a number of delays, so the difference between
start and finish times cannot provide the actual CPU usage.
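The distinction can be sketched with a small shell example; the timestamps and the twenty-minute CPU figure below are hypothetical, chosen only to show how queue and scheduling delays separate elapsed time from CPU time:

```shell
# Hypothetical job record: submitted 09:00:00, finished 09:30:00, while the
# accounting entry reports cpu=0:00:20:00 (20 minutes of actual CPU time).
submit=$(date -d "2011-04-12 09:00:00" +%s)   # epoch seconds (GNU date)
finish=$(date -d "2011-04-12 09:30:00" +%s)
elapsed=$(( finish - submit ))   # 1800 s between submission and completion
cpu=1200                         # 1200 s actually consumed on the CPU
overhead=$(( elapsed - cpu ))    # 600 s lost to queuing, scheduling, compilation
echo "elapsed=$elapsed cpu=$cpu overhead=$overhead"
```

The 600-second overhead is exactly the component that a naive finish-minus-start measurement would wrongly attribute to CPU usage.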
I/O USAGE
o I/O or Input/Output measures the amount of data transferred between
processes, jobs, resource nodes or complete systems. It is
important to consider the I/O usage of a job executed on a distributed
system, as it may highlight weak links that lead to unexplained delays
and overheads. At the same time, it may also highlight strong links
between nodes or systems that can be used as alternative routes for
information if required.
MEMORY USAGE
o This is quite a self-descriptive metric: it measures the memory utilized by the
executing process or job. A process with high memory usage can help in
detecting memory leaks or excess memory requirements of a job, so that
either the leaks can be rectified or, in the latter case, more resources can be
devoted.
3.2.1.3 Data Collection Mechanism
The main entities that were queried were the Queue Manager and the Resource
Manager of the Monash Sun Grid engine. The Queue Manager provides the key
statistics regarding the jobs being executed, along with their usage statistics. The
Resource Manager, on the other hand, provides the characteristics of all the resource nodes
constituting the system. This set of resource data is crucial for linking resource types to
their affinity towards certain kinds of jobs.
We have therefore defined a group of scripts to gather the required statistics. These scripts
run on the head node of the Monash Sun Grid and simulate a system monitor to gather
information. The scripts perform the following tasks.
3.2.1.3.1 Constant Monitoring
i. This script starts off by recording the resource allocations along with
the available information.
ii. It then recursively calls the subsequent scripts which handle the task
of gathering job statistics.
SCRIPT
#echo "START" > MasterJobID.txt
#echo "START" > JobUsage.txt
echo "START"
while true;
do
    # List active jobs (compute nodes are prefixed "gn") and extract job IDs.
    qstat | grep "gn" > Test.file
    cut -d' ' -f1 Test.file > TempJobList.txt
    ./AutoFilter.sh TempJobList.txt   # record per-job statistics
    ./CompileJobUsage.sh              # compile records for finished jobs
    sleep 60                          # pause between polls of the queue manager
done
3.2.1.3.2 Job Statistics Collection
i. This script records the Job Name, Job Script and the Usage definition
for all jobs in respective files.
SCRIPT
# $1 is the job-ID list produced by the monitoring script.
cat "$1" | while read line;
do
    # Skip jobs that are already recorded in the master list.
    chkFlag=`cat MasterJobID.txt | grep "$line" | wc -l`
    if [ "$chkFlag" == "0" ];
    then
        qstat -j $line | grep "cpu" > tempJobState.txt
        # Register the job only once it reports real usage figures.
        chkFlag=`cat tempJobState.txt | grep "N/A" | wc -l`
        if [ "$chkFlag" == "0" ];
        then
            echo "Adding Job : " $line
            echo `qstat | grep "$line"` >> JobResource.txt
            echo $line >> MasterJobID.txt
        fi
    fi
    # Refresh the per-job name, script and usage snapshots.
    echo `qstat -j $line | grep "job_name"` > "$line"_Name
    echo `qstat -j $line | grep "script_file"` > "$line"_Script
    echo `qstat -j $line | grep "cpu"` > "$line"_Usage
done
3.2.1.3.3 Job Statistics Compiler
i. In case the Constant Monitoring script detects that a job has finished
execution and is no longer active, this script is executed to record the last
available job information into a common <output-file> and to delete the
temporary files created by the collection script.
SCRIPT
cat MasterJobID.txt | while read jobid;
do
    if [ "$jobid" != "START" ];
    then
        # A job missing from the qstat listing has finished executing.
        active=`qstat | grep "$jobid" | wc -l`
        if [ "$active" == "0" ];
        then
            # Compile each finished job only once.
            chkFlag=`cat JobUsage.txt | grep "$jobid" | wc -l`
            if [ "$chkFlag" == "0" ];
            then
                jn=`cat "$jobid"_Name | grep "job_name"`
                js=`cat "$jobid"_Script | grep "script_file"`
                ju=`cat "$jobid"_Usage | grep "cpu"`
                jr=`cat JobResource.txt | grep "$jobid"`
                # Append the last recorded statistics to the common output file.
                echo $jobid $ju $jn $jr $js >> JobUsage.txt
                echo "Completed Job : " $jobid
                rm "$jobid"_Usage "$jobid"_Name "$jobid"_Script
            fi
        fi
    fi
done
This mechanism ensures that the total job usage is compiled, rather than the job usage at
any particular time.
3.2.2 Statistical-Data Analysis
The scripts were executed over a period of four weeks in two phases. This provided us with an
extensive data set reflecting the usage of the Monash Sun Grid. The entire data set has been
included in the Appendix, though a sample screenshot of the data set is displayed in the
figure below.
FIGURE 5 - MONASH SUN GRID USAGE STATISTICS
3.2.2.1 DATA SET
The data set has sixteen columns. The first ten columns are extracted directly from the
common <output-file> generated by the monitoring scripts, while the remaining six require
some further extraction due to their format. We now describe these data columns along
with their format specifications.
Column 1 JOBID
This column defines the job identification numbers for all the jobs for which the data
was collected and analysed.
Column 2 CPURAW
This column defines the CPU Usage of the particular job in the format
“cpu=days:hh:mm:ss”, where ‘days’ is the number of days, ‘hh’ is the number of
hours, ‘mm’ is the number of minutes, and ‘ss’ is the number of seconds for which
the job consumed the computing resource.
Column 3 MEMRAW
This column defines the Memory usage of the particular job in the format
“mem=XXXX”, where XXXX is the amount of memory used.
Column 4 IORAW
This column defines the Input/Output reads of the particular job in the format
“io=XXXX”, where XXXX is the amount of data reads.
Column 5 VMEMRAW
This column defines the Virtual Memory usage of the particular job in the format
“vmem=XXXX”, where XXXX is the amount of memory used.
Column 6 DATE
This column defines the Date on which the particular job was submitted.
Column 7 TIME
This column defines the time at which the particular job was submitted.
Column 8 QUEUE
This column defines the Grid Node on which the particular job was executed.
Column 9 USER
This column defines the User who submitted that particular job.
Column 10 RAWSCRIPTNAME
This column defines the script name that constitutes the execution files for that
particular job.
Column 11 SCRIPT
At times, different users may submit jobs with the same script name. Therefore this
column is used to identify distinct job scripts. It has the format “<script
name>$BY$<user>”.
Column 12 HOST
This column defines the name of the node on which the job was executed. This is
extracted from QUEUE column which also has Queue names appended to it.
Column 13 CPU
This column filters the CPURAW value to express the CPU USAGE in units of seconds
only.
Column 14 MEM
This column extracts the MEMORY USAGE from the MEMRAW column without the
keyword “mem” included.
Column 15 IO
This column extracts the IO USAGE from the IORAW column without the keyword
“io” included.
Column 16 VMEM
This column extracts the VIRTUAL MEMORY USAGE from the VMEMRAW column
without the keyword “vmem” included.
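The extraction of the derived columns can be sketched in shell; the sample record below is made up (its memory and virtual memory values are borrowed from the result sets, the io value is invented), and the "days:hh:mm:ss" conversion assumes the four-field cpu format described above:

```shell
# Hypothetical raw record in the format produced by the monitoring scripts.
raw='cpu=2:15:02:44 mem=109861.31814 io=12.5 vmem=303276.032'

# Column 13 (CPU): convert "cpu=days:hh:mm:ss" into plain seconds.
cpu=$(echo "$raw" | grep -o 'cpu=[^ ]*' | cut -d= -f2 |
      awk -F: '{ print (($1*24 + $2)*60 + $3)*60 + $4 }')

# Columns 14-16 (MEM, IO, VMEM): strip the "mem=", "io=" and "vmem=" keywords.
# The leading space in ' mem=' avoids also matching the tail of "vmem=".
mem=$(echo "$raw"  | grep -o ' mem=[^ ]*' | cut -d= -f2)
io=$(echo "$raw"   | grep -o 'io=[^ ]*'   | cut -d= -f2)
vmem=$(echo "$raw" | grep -o 'vmem=[^ ]*' | cut -d= -f2)
echo "$cpu $mem $io $vmem"
```

For the sample record, 2 days, 15 hours, 2 minutes and 44 seconds become 226964 seconds, which is the form the analysis uses in Column 13.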
3.2.2.2 RESULT SET
The purpose of the data set was to help us establish that jobs do in fact have an affinity
towards specific computing resources. The results are quite positive and portray a relation
between jobs and their preferred resources: it is evident that a job performs
better on a particular resource node than on others. This is exactly the aspect that
can be utilized to enable performance tuning in high performance computing scenarios.
Our result set included 1303 rows of job statistics and provided us with around 50 cases
in which jobs performed markedly better on a particular computing node than on others.
SET 1
JOBID DATE TIME USER SCRIPT HOST CPU MEM VMEM
2679218 4 12 11 9:10:02 PM jpolak cde_model1.sh$BY$jpolak gn182 515302 109861.31814 303276.032
2679219 4 12 11 9:10:02 PM jpolak cde_model1.sh$BY$jpolak gn161 515467 110809.26607 313732.096
2679220 4 12 11 9:10:02 PM jpolak cde_model1.sh$BY$jpolak gn161 516050 110524.72170 282084.352
2679221 4 12 11 9:10:02 PM jpolak cde_model1.sh$BY$jpolak gn161 515448 108173.95007 313728
2679222 4 12 11 9:10:02 PM jpolak cde_model1.sh$BY$jpolak gn161 515833 97209.97078 264599.552
2679223 4 12 11 9:10:02 PM jpolak cde_model1.sh$BY$jpolak gn170 515851 96863.42362 271595.52
2679224 4 12 11 9:10:02 PM jpolak cde_model1.sh$BY$jpolak gn170 515734 96452.65522 273940.48
2679225 4 12 11 9:10:02 PM jpolak cde_model1.sh$BY$jpolak gn170 516040 96762.50898 235572.224
2679226 4 12 11 9:55:32 PM jpolak cde_model1.sh$BY$jpolak gn170 513449 89658.96677 252139.52
2679227 4 12 11 9:55:32 PM jpolak cde_model1.sh$BY$jpolak gn170 513383 89727.82762 252636.16
2679228 4 12 11 9:55:32 PM jpolak cde_model1.sh$BY$jpolak gn170 512862 89527.26008 254004.224
2679229 4 12 11 9:55:32 PM jpolak cde_model1.sh$BY$jpolak gn170 513583 89757.73359 257567.744
2679237 04/13/2011 12:41:02 AM jpolak cde_model1.sh$BY$jpolak gn168 503194 69308.27186 213967.872
2681398 04/13/2011 12:41:02 AM jpolak cde_model1.sh$BY$jpolak gn168 503440 69298.74529 217583.616
2692080 04/13/2011 12:41:02 AM jpolak cde_model1.sh$BY$jpolak gn168 503270 77741.96705 234760.192
TABLE 3 RESULT SET 1
The above data set reflects that node “gn168” is the best in terms of CPU, MEMORY and
VIRTUAL MEMORY USAGE, while the rest appear comparatively expensive. This
saving would compound if multiple instances of the job were executed on
node “gn168” rather than on the least favourable resource, “gn161”.
SET 2
JOBID DATE TIME USER SCRIPT HOST CPU MEM VMEM
2728451 04/15/2011 9:41:17 PM thlam9 run7.spce.vbigl.msg.runscript$BY$thlam9 gn214 255886 401401.84710 1713373.184
2728452 04/15/2011 9:41:17 PM thlam9 run7.spce.vbigl.msg.runscript$BY$thlam9 gn214 255734 401164.73006 1713373.184
2728453 04/15/2011 9:42:32 PM thlam9 run7.spce.vbigl.msg.runscript$BY$thlam9 gn221 255794 401257.09791 1713373.184
2728454 04/15/2011 9:42:32 PM thlam9 run7.spce.vbigl.msg.runscript$BY$thlam9 gn221 255668 401060.76496 1713373.184
2728455 04/15/2011 9:42:32 PM thlam9 run7.spce.vbigl.msg.runscript$BY$thlam9 gn221 255703 401113.32532 1713373.184
2728456 04/15/2011 9:42:32 PM thlam9 run7.spce.vbigl.msg.runscript$BY$thlam9 gn221 255660 401047.49554 1713373.184
2728457 04/15/2011 9:50:02 PM thlam9 run7.spce.vbigl.msg.runscript$BY$thlam9 gn221 255450 400716.92944 1713373.184
2728458 04/15/2011 9:50:02 PM thlam9 run7.spce.vbigl.msg.runscript$BY$thlam9 gn221 255383 400612.70464 1713373.184
2728459 04/15/2011 9:50:02 PM thlam9 run7.spce.vbigl.msg.runscript$BY$thlam9 gn221 255035 400067.93326 1713373.184
2728460 04/15/2011 9:50:02 PM thlam9 run7.spce.vbigl.msg.runscript$BY$thlam9 gn221 255109 400184.14265 1713373.184
2728461 04/15/2011 9:56:02 PM thlam9 run7.spce.vbigl.msg.runscript$BY$thlam9 gn201 254093 398588.60487 1713373.184
TABLE 4 RESULT SET 2
The above data set also reflects that node “gn201” is the best in terms of CPU and MEMORY,
though VIRTUAL MEMORY USAGE is unaffected across the list of nodes. The statistics for
running this job on “gn221” are consistent throughout the run, reflecting that there is
indeed a relation between jobs and their preferred nodes.
SET 3
JOBID DATE TIME USER SCRIPT HOST CPU MEM VMEM
2760043 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn101 39556 206555.74927 5887754.24
2760030 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn111 39532 206557.41986 5567938.56
2760046 04/18/2011 9:59:40 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn111 39516 206353.04557 5900337.152
2760044 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn115 39539 204434.98526 5911871.488
2760045 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn115 39570 204111.50006 5932843.008
2760032 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn116 39515 206940.71583 5972688.896
2760037 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn116 39562 203595.11326 5485101.056
2760033 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn117 39553 203536.63614 5487198.208
2760038 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn117 39591 202904.55594 5462032.384
2760028 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn118 39540 205474.90884 6003097.6
2760029 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn118 39562 207728.24978 5597298.688
2760034 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn119 39602 205134.13222 5522849.792
2760040 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn119 39564 202970.06108 5907677.184
2760031 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn120 39486 204448.42910 1289748.48
2760036 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn120 39519 202535.95917 5924454.4
2760025 04/18/2011 9:57:55 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn122 39068 205594.85627 5825888.256
2760026 04/18/2011 9:57:55 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn122 39005 207345.13678 5858394.112
2760035 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn123 39525 207787.63909 5942280.192
2760041 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn123 39547 206593.04932 5906628.608
2760039 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn124 39569 203983.64176 5915017.216
2760042 04/18/2011 9:59:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn124 39562 204882.95263 5937037.312
2760024 04/18/2011 9:57:55 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn125 39631 210772.66986 6043992.064
2760027 04/18/2011 9:58:10 AM nbuchman TomoPIV-batch.job$BY$nbuchman gn125 39634 209806.33674 6019874.816
TABLE 5 RESULT SET 3
This is yet another data set that reflects the same result from a different view. Node
“gn122” is apparently the best in terms of CPU usage, while not so well off in terms of
memory and virtual memory usage. “gn117” is the best in terms of memory usage, while
“gn120” averages the best for virtual memory usage. This further supports our hypothesis
that a resource node that is best for one aspect or metric may not necessarily be the best for
all other metrics as well.
3.2.3 DATA ANALYSER AND PERFORMANCE TUNER
This application utilizes the experimental data as a training data set to model use
case scenarios. It also demonstrates the advantages of associating an ontology with the
proposed model.
FIGURE 6 ONTOLOGY BUILDER
3.2.3.1 Components
3.2.3.1.1 Data Source
The experiment results act as the data source for this performance tuner application.
The main reason behind this setup is that the statistics from any high
performance computing system can easily be extracted and used as training data
sets for the optimization module, which most probably fits best beside the job
scheduler.
The data sets were analysed in the form of Excel spreadsheets, which were
linked with a User Data Source Name (DSN) to provide data access for the Java
application.
The interface allows for easy integration with any database, as the application would
require only minor modifications in terms of data access definitions.
The schema for this database is quite simple with only one table.
RESULTS (JOBID, CPURAW, MEMRAW, IORAW, VMEMRAW, DATE, TIME,
QUEUE, USER, SCRIPT, HOST, CPU, MEM, IO, VMEM)
3.2.3.1.2 Ontology Builder
The objective of this module is to organize the data set in the form of an ontology that
can be utilized by the Scenario Builder to select resources. The module requires two
primary inputs.
GROUP FACTOR
This integer value defines the number of quality levels into which the
resources should be divided. These quality levels are linked with a
metric of judgement, i.e. CPU USAGE, MEMORY USAGE, I/O or
VIRTUAL MEMORY USAGE, and not just with the resource itself. So a
resource can be at Quality Level 1 for CPU USAGE, indicating that it is
the most favourable for that particular job, and can also be at Quality
Level 5 for MEMORY USAGE, indicating that it is the least favourable
for that job.
PROCESS FLOW
Due to the limited scope of this thesis, we have considered each job
as a different job group.
The sorting and grouping of the data set is done through the following
process.
1. Select a Job Group.
2. Get all the resources on which this job set has been executed.
3. For each selected metric:
i. Define the most favourable value – LEAST VALUE
The most favourable value for a metric such as CPU
usage will be the least of the lot.
ii. Define the least favourable value – MAXIMUM VALUE
The least favourable value for a metric such as
Memory usage will be the maximum of the lot.
iii. Divide the range of available values, that is, the difference
between the MAXIMUM VALUE obtained in step ii and
the LEAST VALUE obtained in step i, by the grouping
factor. This gives us the slot width.
iv. This slot width is then used to define group delimiters.
For n groups, we have n limiting values which further
define group bounds.
4. A group or quality level is then allocated for each of the
resources available for this job group.
5. The same process is then repeated for all the job groups.
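For a single metric, the slot-width computation and quality-level allocation above can be sketched with awk; the five host/CPU-seconds pairs are illustrative values drawn loosely from Result Set 1, and a group factor of 5 is assumed:

```shell
# Assign a quality level (1 = most favourable) to each resource for one metric.
n=5   # group factor: number of quality levels
levels=$(printf '%s\n' 'gn168 503194' 'gn170 513449' 'gn161 515467' \
                       'gn182 515302' 'gn170 516040' |
awk -v n="$n" '
  { host[NR] = $1; val[NR] = $2 }
  END {
    min = max = val[1]
    for (i = 2; i <= NR; i++) {      # steps i/ii: least and maximum values
      if (val[i] < min) min = val[i]
      if (val[i] > max) max = val[i]
    }
    slot = (max - min) / n           # step iii: slot width
    for (i = 1; i <= NR; i++) {      # step iv: map each value to a level
      level = (slot > 0) ? int((val[i] - min) / slot) + 1 : 1
      if (level > n) level = n       # the maximum value falls in the top slot
      print host[i], val[i], level
    }
  }')
echo "$levels"
```

With these numbers, gn168 (the cheapest in CPU seconds) lands at Quality Level 1 while the most expensive hosts land at Level 5, mirroring the grouping described in the process flow.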
FIGURE 7 DATA ORGANIZATION PROCESS FLOW
At the end of this process, we have a data set wherein each
resource related to a job has a quality level associated with it. This is
then used by the Ontology Builder to build up the required ontology.
DATA SET
The data source mentioned in the previous section is another prerequisite
for this module. The data set provides the various statistics
which will be used for grouping resources.
[Figure 7 depicts the flow: Get Job Groups → Get Resource List → Metric Scan → Define Group Bounds → Allot Groups to Resources]
PROCESS FLOW
The organized data set is then scanned through the job sets to build
up a Job First tree of the data set. The job groups form the head nodes,
followed by the selected metrics. The metrics are in turn followed by
the quality levels, which connect with the resources at the
bottom of the tree. The process for building the tree is therefore as
follows.
1. Start with a root node : JOB GROUPS
2. For Each Job Group
i. Add Job Group node to root.
ii. For Each Metric : Add Metric node to Job Group Node
A. For Each Quality Level defined for this metric under this job
group : Add Quality Level Node to Metric Node
a. For Each Resource under this Quality Level : Add
Resource Node to Quality Level
3. Repeat until end of data set
The end of this process provides us with a Job First model of the experimental data that can
easily be queried for the most suitable resource for a particular job, with the help
of a depth-first search on the required job group.
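One lightweight way to sketch this Job First model in shell is to flatten each root-to-leaf path into a '/'-separated line; the depth-first lookup for the most suitable resource then reduces to taking the first match at the lowest quality level. The job names, metric names and hosts below are taken from the earlier result sets, but the tree itself and its level assignments are illustrative:

```shell
# Each line is one path through the tree: jobgroup/metric/qualitylevel/resource.
tree='cde_model1.sh/CPU/1/gn168
cde_model1.sh/CPU/3/gn170
cde_model1.sh/CPU/5/gn161
cde_model1.sh/MEM/1/gn170
TomoPIV-batch.job/CPU/1/gn122'

# Query: best CPU resource for job group cde_model1.sh, i.e. the matching
# path with the lowest quality level.
best=$(echo "$tree" | grep '^cde_model1.sh/CPU/' | sort -t/ -k3,3n |
       head -n 1 | cut -d/ -f4)
echo "$best"
```

The thesis describes the model as an ontology traversed depth first; sorting the flattened paths by level is just a shell-friendly stand-in for that traversal.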
3.2.3.1.3 SCENARIO BUILDER
The objective of the Scenario Builder is to compare standard resource allocations with
optimized resource allocations. To run a particular simulation, the user must select the
job definition and define the required quality levels for each of the available metrics. The
Scenario Builder demonstrates two simulations in parallel.
FIGURE 8 PERFORMANCE TUNER - SCENARIO SIMULATOR
SIMULATION ONE: STANDARD SCENARIO
The standard scenario selects random resources from the list of available resources for the
selected job definition and allocates them as per the number of instances specified by the
user. The time of execution on each resource is picked up from the training data set.
SIMULATION TWO: OPTIMIZED SCENARIO
The optimized scenario considers resources that match the selection criteria defined in the
tuner interface. The selection criterion is essentially the definition of the quality levels that a
resource must conform to before selection. So if a user demands a Quality Level 1
resource, only the best quality resources will be selected for that job.
RESULTS
TABLE 6 SIMULATION RESULTS
We have selected three typical cases from the simulation results.
i. SET OF ROWS IN SHADES OF BLUE: The simulation run for the job named job1.sh
provides the positive result that, for ten instances of this job, a user who opts
for the best CPU resource rather than the second best saves around four hours
of CPU resource. Simultaneously, the case also reflects that while opting for the
best CPU resource, the job tends to use more memory; in this particular case, it
uses almost four gigabytes more memory when forced to execute on the best
CPU resource, illustrating the idea of a trade-off.
ii. SET OF ROWS IN SHADES OF GREEN: This simulation run for the job named aof.sh
covers five simulations that reinforce the points stated in the previous result
set. In all, runs were simulated for ten and fifteen instances each.
The first run for ten instances requires a CPU resource of quality level three and
a top quality level MEMORY resource. If the same run is simulated with a
requirement of a best quality CPU resource, not only does the CPU usage improve,
the MEMORY usage drops considerably as well, highlighting that a performance
tuning option for one metric can automatically enable optimization for
another linked metric. This does not mean that the job required less
memory in the optimized scenario, but highlights that optimized scheduling can
impact the overall memory demands for a more ideal set of jobs running on a
given node.
There are two possibilities for such a case. Consider the CPU requirement to be
the major requirement, and the remaining linked metrics (MEMORY in this case)
to be minor requirements.
POSSIBILITY 1. The major requirement (CPU) is optimized
but the minor requirements increase, reflecting that
the selected resource helps optimize the
major requirement only at an increased cost in the minor
requirement (MEMORY).
POSSIBILITY 2. Both the major requirement (CPU) and
the minor requirement are optimized,
reflecting that the chosen resource is the most optimized
resource for this particular job.
Therefore, if quality requirements are waived for one of these metrics, the
remaining metric achieves maximum optimization, but the metric that has been
left out degrades considerably. So performance tuning options can not only help in
achieving optimized resource utilization, but can also help meta-schedulers
realize the minimum resource requirements for a particular job and make the
necessary allocations. Such simulation results can very well supplement scenario
evaluations considering metrics such as yield and stretch, discussed earlier in
chapter two.
iii. SET OF ROWS IN SHADES OF RED: This simulation run for the job named ‘script.job’
covers two sets of simulations that highlight one of the drawbacks of this
system. At times, the system may encounter jobs or job groups for which no
optimizations are available; in such cases, the optimization scenario
will not differ much from the standard scenario, rendering the
processing effort useless.
JOB CLASSIFICATION
The following result sets provide an overview of the optimizations that can be achieved with
a generic class of jobs, and not merely with instances of the same job. It should be noted that
suitable classification of jobs into practical categories is beyond the scope of this thesis; the
results below are provided to reflect the validity of the hypothesis. The
categories chosen below are for illustration purposes only and are not intended for use in practice.
SET A
JOBID DATE TIME USER SCRIPT HOST CPU MEM
2770018 04/29/2011 17:34:43 hanshang simple.sh$BY$hanshang gn131 229240 40.98309
2770019 04/29/2011 17:34:58 hanshang simple.sh$BY$hanshang gn132 229301 40.85292
2770022 04/29/2011 17:35:58 hanshang simple.sh$BY$hanshang gn133 226738 40.68694
2770025 04/29/2011 17:36:43 hanshang simple.sh$BY$hanshang gn114 221918 39.99076
2770029 04/29/2011 17:37:58 hanshang simple.sh$BY$hanshang gn133 219600 39.74483
2770030 04/29/2011 17:37:58 hanshang simple.sh$BY$hanshang gn131 219600 39.87579
2770032 04/29/2011 17:38:43 hanshang simple.sh$BY$hanshang gn131 217220 39.45967
2770033 04/29/2011 17:38:58 hanshang simple.sh$BY$hanshang gn132 214657 39.21137
2770034 04/29/2011 17:39:13 hanshang simple.sh$BY$hanshang gn112 214840 39.28338
2770037 04/29/2011 17:39:58 hanshang simple.sh$BY$hanshang gn133 211667 38.95169
2770039 04/29/2011 17:40:43 hanshang simple.sh$BY$hanshang gn131 209593 38.63499
2770045 04/29/2011 17:42:13 hanshang simple.sh$BY$hanshang gn134 205627 38.07742
2770046 04/29/2011 17:42:28 hanshang simple.sh$BY$hanshang gn103 205077 38.33665
2770048 04/29/2011 17:42:43 hanshang simple.sh$BY$hanshang gn120 203003 37.90418
2770050 04/29/2011 17:43:13 hanshang simple.sh$BY$hanshang gn112 200379 37.54389
2770051 04/29/2011 17:43:28 hanshang simple.sh$BY$hanshang gn114 200257 37.48748
2770057 04/29/2011 17:45:13 hanshang simple.sh$BY$hanshang gn133 195498 36.97983
2770058 04/29/2011 17:45:28 hanshang simple.sh$BY$hanshang gn114 192630 36.70765
2770060 04/29/2011 17:45:58 hanshang simple.sh$BY$hanshang gn133 193118 36.82586
2770061 04/29/2011 17:46:13 hanshang simple.sh$BY$hanshang gn102 190922 36.70255
2770063 04/29/2011 17:46:43 hanshang simple.sh$BY$hanshang gn131 190311 36.47955
2776910 5 12 11 14:38:16 chethiya job.sh$BY$chethiya gn134 298983 20797.57387
2776934 5 12 11 14:44:46 chethiya job.sh$BY$chethiya gn114 275857 19934.18349
2776941 5 12 11 15:00:01 chethiya job.sh$BY$chethiya gn131 221491 17932.88125
2778580 5 12 11 18:05:10 chethiya job.sh$BY$chethiya gn132 341572 13660.78710
2778640 5 12 11 18:42:55 chethiya job.sh$BY$chethiya gn131 213132 10749.31248
2778662 5 12 11 18:59:55 chethiya job.sh$BY$chethiya gn112 284705 9438.42297
2778666 5 12 11 19:13:55 chethiya job.sh$BY$chethiya gn113 248888 11581.21538
2785603 05/21/2011 16:50:23 dduke relation.s$BY$dduke gn131 447071 3993.58094
2785604 05/21/2011 16:50:23 dduke relation.s$BY$dduke gn131 346332 3577.28498
2785605 05/21/2011 16:50:23 dduke relation.s$BY$dduke gn131 447132 4011.73189
2785606 05/21/2011 16:50:23 dduke relation.s$BY$dduke gn131 447315 4022.15742
TABLE 7 JOB TYPE: SHELL SCRIPTS: RESULT SET
The result set above presents the SHELL SCRIPT category of jobs (jobs executed in the
form of shell scripts) that were executed on the Monash Sun Grid. The rows highlighted
with the same colour represent instances of the same job, so in all there are four
distinct jobs in the above result set. It is interesting to see that all these jobs performed
best on compute node ‘gn131’. This leads us to believe that compute node ‘gn131’
is quite efficient for jobs executed as shell scripts.
SET B
JOBID DATE TIME USER SCRIPT HOST CPU MEM
2773417 5 06 11 23:51:35 ngjason b1000.job$BY$ngjason gn127 150894 567.87507
2773604 5 07 11 18:12:06 ngjason b1000.job$BY$ngjason gn145 155074 28215.55447
2774775 5 09 11 15:46:43 ngjason b1000.job$BY$ngjason gn147 369147 67624.85288
2776761 5 12 11 10:45:31 ngjason b1000.job$BY$ngjason gn145 1555442 284961.10921
2787125 05/24/2011 16:53:24 ngjason b1000.job$BY$ngjason gn137 505234 92634.83492
2773424 5 07 11 0:06:35 ngjason b1350.job$BY$ngjason gn127 110562 370.56744
2773611 5 07 11 18:12:36 ngjason b1350.job$BY$ngjason gn127 155062 23100.63647
2774782 5 09 11 15:49:58 ngjason b1350.job$BY$ngjason gn142 367987 55125.59084
2776768 5 12 11 10:54:31 ngjason b1350.job$BY$ngjason gn227 1949260 292087.84702
2787132 05/24/2011 16:53:39 ngjason b1350.job$BY$ngjason gn110 914248 137262.72011
2773368 5 06 11 18:36:35 ngjason b950.job$BY$ngjason gn154 239858 44901.63704
2775565 5 10 11 10:23:14 ngjason b950.job$BY$ngjason gn152 997871 7180.14299
2775946 5 10 11 23:26:45 ngjason b950.job$BY$ngjason gn127 88795 16677.22595
2776742 5 12 11 9:04:01 ngjason b950.job$BY$ngjason gn127 130109 24447.40140
2776760 5 12 11 10:45:31 ngjason b950.job$BY$ngjason gn110 128346 24119.58562
2783461 05/18/2011 11:34:35 ngjason b950.job$BY$ngjason gn145 1447498 271924.91249
2787124 05/24/2011 16:53:09 ngjason b950.job$BY$ngjason gn147 914242 171869.26382
2771310 5 03 11 16:18:20 ngjason mpling1.jo$BY$ngjason gn127 59308 41.91964
2771377 5 03 11 17:03:20 ngjason mpling1.jo$BY$ngjason gn137 1694074 3537.63521
2771672 5 04 11 12:43:31 ngjason mpling1.jo$BY$ngjason gn145 105376 191.35816
2771757 5 04 11 15:01:54 ngjason mpling1.jo$BY$ngjason gn157 155076 7654.87326
2773033 5 06 11 11:06:16 ngjason mpling1.jo$BY$ngjason gn147 175057 362.68064
2775735 5 10 11 14:18:29 ngjason mpling1.jo$BY$ngjason gn153 181647 370.43113
2775824 5 10 11 17:11:48 ngjason mpling1.jo$BY$ngjason gn127 96193 4919.99947
2776564 5 11 11 21:27:00 ngjason mpling1.jo$BY$ngjason gn137 229789 410.03012
2776793 5 12 11 11:20:31 ngjason mpling1.jo$BY$ngjason gn227 126683 6500.25702
2788590 05/27/2011 9:33:19 ngjason mpling1.jo$BY$ngjason gn137 2042298 4205.49090
2771311 5 03 11 16:18:20 ngjason mpling2.jo$BY$ngjason gn127 59125 40.66588
2771892 5 04 11 17:30:59 ngjason mpling2.jo$BY$ngjason gn127 146149 6971.80118
2773074 5 06 11 12:06:02 ngjason mpling2.jo$BY$ngjason gn147 91342 182.23390
2775756 5 10 11 14:46:27 ngjason mpling2.jo$BY$ngjason gn137 166798 8282.17395
2776888 5 12 11 13:37:01 ngjason mpling2.jo$BY$ngjason gn137 118155 5871.95409
2787617 05/25/2011 12:20:40 ngjason mpling2.jo$BY$ngjason gn110 162417 8432.47423
2788829 05/27/2011 17:02:20 ngjason mpling2.jo$BY$ngjason gn147 1334013 2732.48920
TABLE 8 JOB TYPE: GLOBUS JOB: RESULT SET
The result set above presents the ‘JOB’ category of jobs (GT4 jobs). This result set
follows the same format as SET A. There are five distinct jobs in this result set, and
the most efficient compute node in this case is ‘gn127’, indicating that ‘gn127’ is
best for GT4 job types; this also supplements our validation from the previous result set.
Therefore, a comparison between the scenarios above provides a perspective on
the amount of resource saving that can be achieved with performance tuning options.
4.1 Optimizer Integration
Performance tuning is most suitable just before jobs are scheduled on a high
performance computing system. By placing an optimization module before the scheduling
process, the system can judge the requirements of the job and the user before allocating
the most suitable resource. The various stages of job processing in an HPC system are defined
below, along with the possible integration of an Optimizer.
FIGURE 9 JOB PROCESSING - LEVEL 1 DATA FLOW DIAGRAM
The data flow defined in the diagram is sequential and reflective of the major actions
executed in a distributed computing system.
1) The users submit the jobs to desired HPC through an appropriate interface.
a) After receiving the job, the underlying security mechanism verifies the
authenticity of the user and the associated resource certificates.
2) After successful validation, the job reaches the job scheduler. The job scheduler then
queries the resource manager for a status update on the required resources. The
required resources are mostly determined by the user, as a user has access to only a
specific number and type of resources.
3) In a standard scenario, once the scheduler has the updated status of the required
resources, it starts allocating those resources to the job immediately. Whereas in an
optimized scenario, this availability list will be passed on to the optimization module
along with the required quality level as defined in the previous section.
4) The optimization module then replies back with a list of resources sorted on the
basis of quality levels, so as to make sure that the allocation of jobs happens only
across the resources that are at par with the user’s requirement.
5) The job scheduler then resumes with its responsibilities of resource allocation with
the optimized selection of resources.
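Steps 3 to 5 above can be sketched as follows. This is a minimal illustrative sketch, not the thesis's implementation: the names `Resource`, `quality_level`, and `optimize`, and the numeric quality scale, are all assumptions introduced here. The scheduler hands the availability list and the user's required quality level to the optimizer, which returns only the eligible resources, best first.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    quality_level: int  # assumed scale: 1 (lowest) to 5 (highest)

def optimize(available: list[Resource], required_level: int) -> list[Resource]:
    """Return only resources meeting the required level, sorted best first."""
    eligible = [r for r in available if r.quality_level >= required_level]
    return sorted(eligible, key=lambda r: r.quality_level, reverse=True)

# The scheduler then allocates from the head of the returned list.
available = [Resource("gn110", 2), Resource("gn127", 5), Resource("gn147", 3)]
ranked = optimize(available, required_level=3)
print([r.name for r in ranked])  # gn110 is filtered out; gn127 ranks first
```

The scheduler's own allocation logic is untouched; the optimizer only reorders and filters the availability list it already receives.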
Considering the existing scenario of Nimrod, the optimization module would be best placed beside the Job Scheduler. This is necessary because the role of the optimization module is most critical before the jobs are actually sent to resource queues. Another requirement for the optimization module is the training data, which assists the optimizer in making decisions regarding the quality level of resources. The optimization module requires periodic updates from the Globus actuator to maintain its training data. The Globus actuator generally handles the responsibility of maintaining job states across the distributed resources, and of initiating additional tasks to support the process. The Globus actuator of Nimrod has the additional responsibility of running a Prepare task that collects information about the resource node. This includes various usage statistics such as CPU usage, IO usage, memory usage, and the number of jobs executed along with job details. All this information is exactly what is required to build up the training data needed by the optimizer module.
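As a rough sketch, each training-data record could pair the usage statistics collected by the Prepare task with the node and job they belong to. The field names and the dictionary-based store below are assumptions for illustration, not Nimrod's actual data format.

```python
# One hypothetical training-data record per completed job, built from the
# usage statistics the Prepare task collects on a resource node.
training_record = {
    "node": "gn127",        # resource node the job ran on
    "job_type": "JOB",      # job category, e.g. a GT4/Globus job
    "cpu_usage": 5871.95,   # CPU time consumed by the job
    "io_usage": 118155,     # IO consumed by the job
    "memory_usage": 2048.0, # memory consumed by the job
}

def update_training_data(store: dict, record: dict) -> None:
    """Append the record to the per-node history the optimizer learns from."""
    store.setdefault(record["node"], []).append(record)

store: dict = {}
update_training_data(store, training_record)
```

Periodic updates from the actuator would then amount to calling `update_training_data` for each newly completed job, keeping the optimizer's view of each node current.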
FIGURE 10 OPTIMIZATION MODULE ON NIMROD (NIMROD)
4.2 Advantages
The Optimizer module therefore has a few distinct advantages.
Performance Tuning
The support of an optimizer enables a high performance computing system to offer
performance tuning options. The system can allow for trade-offs between desired
metrics, improving one metric by cutting down on another.
For instance, a user may want to complete an experiment with the least CPU usage at
the cost of increased memory consumption, or vice versa. Table 6 SIMULATION
RESULTS on page 39 clearly verifies this point.
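The trade-off described above can be sketched as a weighted score over per-node usage statistics. The weights, field names, and figures below are invented for illustration; the thesis's tuner uses quality levels rather than this exact scheme.

```python
# Lower score is better; the user biases the weights toward the metric
# they most want to minimise.
def score(node_stats: dict, cpu_weight: float, mem_weight: float) -> float:
    return cpu_weight * node_stats["cpu_usage"] + mem_weight * node_stats["memory_usage"]

nodes = {
    "gn110": {"cpu_usage": 8432.5, "memory_usage": 1024.0},
    "gn127": {"cpu_usage": 2732.5, "memory_usage": 4096.0},
}

# To minimise CPU at the cost of memory, weight CPU heavily.
best_for_cpu = min(nodes, key=lambda n: score(nodes[n], cpu_weight=1.0, mem_weight=0.01))
print(best_for_cpu)
```

Swapping the weights would instead favour the node that spends more CPU but less memory, which is exactly the trade-off a user selects through the tuner's quality levels.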
Efficient resource utilization
By providing more control over the selection of resources and the amount of
consumption, the optimization module allows system managers to define
system rules for efficient utilization of resources. This has been
validated not only for single-parameter optimization but for multiple parameters as well.
Commercial Benefits
Most high performance computing systems employ pay-per-use mechanisms
for charging users. This cost is often covered by research grants, but with the
advent of paradigms such as cloud computing, having a quality aspect associated
with usage will provide additional leverage when charging customers. Consumption
of high-quality resources will attract higher usage rates than lower-quality
resources. This will not only provide commercial benefits for the resource owners,
but also act as an incentive for them to keep their resources maintained at higher
quality levels, thereby giving them the ability to offer assured service and
quality agreements.
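A minimal sketch of such quality-based charging is shown below. The rate table and per-CPU-second billing unit are invented for illustration; real grid or cloud pricing schemes are considerably more involved.

```python
# Hypothetical rates: usage on a higher-quality resource is billed at a
# higher rate per CPU-second.
RATE_PER_QUALITY_LEVEL = {1: 0.01, 2: 0.02, 3: 0.04, 4: 0.08, 5: 0.16}

def charge(cpu_seconds: float, quality_level: int) -> float:
    """Bill the consumed CPU time at the rate for the resource's quality level."""
    return cpu_seconds * RATE_PER_QUALITY_LEVEL[quality_level]

# One CPU-hour on a top-quality node versus the lowest-quality node.
print(charge(3600, 5), charge(3600, 1))
```

Under such a scheme, the same job costs more on a better node, which is the incentive for owners to keep their resources at higher quality levels.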
4.3 Drawbacks
Even though the system is quite promising in optimizing resource utilization and
enabling efficient job execution, it has a couple of drawbacks, better described as overheads.
Processing Overheads
Since the optimizer builds up system ontologies to query for resources with
acceptable quality levels, the process incurs some processing overhead. This
processing might induce some delays, but considering the amount of saving the
module indicates, this overhead is very much worth the effort.
Training Data Maintenance
The training data forms a critical component of the optimizer and requires constant
maintenance. It needs to be updated regularly to keep the optimizer up
to date with the requirements of the entire computing system. Every job that is
executed on a resource can bring about opportunities for
performance tuning on that particular resource, and this is the key idea behind
tapping the computing power intelligently.
5.1 Research Summary & Contributions
High Performance Computing environments are highly dynamic, with resources having
varied availability constraints and jobs having varied complexities. This is primarily due to
dependence on a number of distributed components, including networking, data access
mechanisms, job scheduling procedures, and resource managers, though not all of these
components can provide leverage for optimizing system usage. We focused on the resource
allocation functionality, which forms the key responsibility of a job scheduler. Fundamentally,
standard procedures of resource allocation are not only inefficient but also incorporate a
considerable amount of wastage that could otherwise be avoided. From the comprehensive
experimentation on resource utilization carried out for this thesis, we can establish
a key relation between jobs and their corresponding resources. Therefore, this thesis
uncovers an aspect of high performance computing systems that can be exploited for
efficient resource utilization, enabling performance tuning options in a high
performance computing system.
In this research, we have presented two things: a basic mechanism for extracting resource
utilization statistics in a high performance computing system, and a performance
optimization and tuning component that utilizes these statistics. The performance tuning
component has been developed to demonstrate the effective utilization of the gathered
statistics. Our proposed mechanism is adapted from the common paradigm of machine
learning.
The performance optimization and tuner application illustrates the feasibility of our
proposed concept. The various sub-components of this application have been defined in
detail along with the underlying algorithms. Further, we have defined and justified its
placement in a high-performance distributed system.
The Performance Tuner based Scenario Simulator component also provides a clear
comparison between resource utilization under the traditional approach of allocation on
the basis of availability and the optimized approach of allocation on the basis of suitability.
The contributions of our research are therefore as follows.
We have analysed resource utilization across the Monash Sun Grid for a period of six
months and gathered usage statistics during that period. We also explored a
number of ways of gathering these statistics before narrowing down to the simplest
and most effective methodology. Although traditional approaches monitor
system utilization, they do not make use of such usage statistics to support the job
scheduler in resource allocation decisions.
The usage statistics gathered from the grid strongly support our hypothesis of a
relationship between jobs and resources, indicating that resources are in fact
inclined towards certain types of jobs. This research has therefore enabled us
to state that there is a strong affinity between jobs and resources.
We have also explored the various data organization models that can be utilized to
build a supporting ontology. The resulting ontology neatly organizes the
gathered usage statistics and helps in efficiently modelling a data structure
that integrates the system description with its usage. The ontology is
capable of describing the various resources used by the system as a
unit, linking them with the jobs these resources executed and, most importantly, their
associated performance or quality levels as defined earlier. The ontology
further forms the critical requirement for the performance tuning and scenario
illustrating application.
We have developed a heuristics-based performance tuner simulator that illustrates
the utilization of such an ontology to enable the job scheduler to take effective
decisions. This application not only builds the required ontology from the training
data set but also uses it to present a comparison between the traditional
approach of resource allocation and the approach suggested in this research. The
application allows the user to define the required performance levels and select the
job to be run for the simulation.
Lastly, we have also discussed the placement of such an optimization tool in a real-world
distributed system. This model is not limited to the conceptual phase; it is
practical and feasible.
In summary, this research identifies the key relation between jobs and the underlying resources
on which these jobs are executed. It also provides a sample application that can be scaled
up for deployment in real-life high performance computing systems.
We now outline the future research scope for this project.
5.2 Future Work
Two extensions of this project have been identified as potential areas for extended
research.
The current implementation of the optimization framework does not integrate data
collection as part of the performance tuner. It would, however, be interesting to
explore the various options for achieving this integration. This could possibly use the
capabilities of Mobile Agents or Web Services to enable the collection of usage
statistics through a push or pull method.
Secondly, the implementation section highlights the importance of job groups. The
job groupings will have to be done on the basis of a pre-submission analysis of the
job. This Artificial Intelligence based job analyser would evaluate the job in terms of its
requirements for a particular system metric. For instance, it would predict that the job
requires the highest quality level.
5.3 Bibliography
Al-Azzoni, I., & Down, D. G. (2008, August). Dynamic scheduling for heterogeneous Desktop Grids.
9th IEEE/ACM International Conference on Grid Computing, 2008 , 136 - 143.
Amarnath, B. R., Somasundaram, T. S., Ellappan, M., & Buyya, R. (2009). Ontology-based Grid
resource management. Softw. Pract. Exper., 39, 1419-1438.
Azad, H., Khaoua, M., & Zomaya, A. (2005). Design and performance of networks for super-, cluster-
and grid computing. Journal of parallel and distributed computing, 65, 1119-1122.
Balaji, S., Patnaik, L. M., Jenkins, L., & Goel, P. S. (1992). S-Nets: A Petri Net Based Model for
Performance. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING(15), 225-237.
Baratloo, A., Dasgupta, P., Karamcheti, V., & Kedem, Z. (1999). Metacomputing with MILAN.
Proceedings of the 8th Heterogenous Computing Workshop (p. 169). IEEE Computer Society .
Bender, M., Chakrabarti, S., & Muthukrishnan, S. (1998). Flow and stretch metrics for scheduling
continuous job streams. Proceedings of symposium on discrete algorithms, (pp. 270-279).
Bosilca, G., Delmas, R., Dongarra, J., & Langou, J. (2009). Algorithm-based fault tolerance applied to
high performance computing. Journal of Parallel and Distributed Computing Systems(69),
410–416.
Bruneo, D., Scarpa, M., & Puliafito, A. (2010). Performance Evaluation of gLite Grids through GSPNs.
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 21, 1611 - 1625.
Buttner, L., Nolte, J., & Schrode-Preikschat, W. (1999, June). Arts of Peace-A High-Performance
middleware layer for Parallel Distributed Computing. Journal of Parallel and Distributed
Computing, 59, 155-179.
Casanova, H., Legrand, A., Zagorodnov, D., & Berman, F. (2000). Heuristics for scheduling parameter
sweep applications in Grid environments. Proceedings of the Ninth Heterogeneous
Computing Workshop, (pp. 349-363).
Caymes-Scutari, P., Morajko, A., Margalef, T., & Luque, E. (2010). Scalable, dynamic monitoring,
analysis and tuning environment for parallel applications. Journal of Parallel and Distributed
Computing( 70 ), 330-337.
Chang, F., Karamcheti, V., & Kedem, Z. (2000). Exploiting Application Tunability for Efficient,
Predictable Resource Management in Parallel. Journal of Parallel and Distributed Computing(
60), 1420-1445.
Cirne, W. (2002). Using moldability to improve Performance of Supercomputer Jobs. Journal of
Parallel and Distributed Computing(62), 1571-1601.
Cruz, J., & Park, K. (1999). Toward performance-driven system support for distributed computing in
clustered environments. Journal of Parallel and Distributed Computing(59), 132-154.
Dinda, P. (2002). Online Prediction of the Running Time of Tasks. Cluster Computing, 5(3), 225-236.
Foster, I., & Kesselman, C. (1997). Globus: a metacomputing infrastructure. International Journal of
Supercomputer Applications(11), 2.
Foster, I., & Kesselman, C. (2004). The Grid: Blueprint for a New Computing. Los Altos, California:
Morgan Kaufmann Publishers.
Foster, I., Roy, A., & Sander, V. (2000). A quality of service architecture that combines resource
reservation and application adaptation. Proceedings of the Eighth International Workshop on
Quality of Service 2000, (pp. 181 - 188).
Frey, J., Foster, I., Livny, M., Tannenbaum, T., & Tuecke, S. (2001). Condor-G: A Computation,
Management Agent for Multi-Institutional Grids. Madison: University of Wisconsin.
Grimshaw, A., & Wulf, W. (1996). Legion—a view from 50,000 feet. Proceedings of 5th IEEE
Symposium on High Performance Distributed Computing, 1996, (pp. 89 - 99 ).
Han, L., Potter, S., Beckett, G., Pringle, G., Welch, S., Koo, S.-H., et al. (2010, July). FireGrid: An e-
infrastructure for next-generation emergency response support. Journal of Parallel
Distributed Computing, 70, 1128-1141.
Harchol-Balter, M., & Downey, A. (1997). Exploiting process lifetime distributions for dynamic load
balancing. ACM Trans. Comput. Syst., 253-285.
Ito, T., Ohsaki, H., & Imase, M. (2005). On parameter tuning of data transfer protocol GridFTP for
wide-area grid computing. 2nd International Conference on Broadband Networks, (Vol. 2,
pp. 1338-1344).
Kissel, E., Swany, M., & Brown, A. (2010, August). Phoebus: A system for high throughput data
movement. Journal of Parallel and Distributed Computing, In-press.
Korkhov, V. V., Krzhizhanovskaya, V. V., & Sloot, P. (2008). A Grid-based Virtual Reactor: Parallel
performance and adaptive load balancing. JOURNAL OF PARALLEL AND DISTRIBUTED
COMPUTING(68), 596-608.
Lee, H., Chung, K., Chin, S., Lee, J., & Lee, D. (2005). A resource management and fault
tolerance services in grid computing. Journal of Parallel and Distributed Computing(65), 1305-1317.
Li, K. (2005). Job Scheduling and processor allocation for grid computing on metacomputers. Journal
of Parallel and Distributed Computing(65), 1406-1418.
Liu, C., & Baskiyar, S. (2009). A general distributed scalable grid scheduler for independent tasks.
Journal of Parallel and Distributed Computing(69), 307-314.
Mi, N., Casale, G., Cherkasova, L., & Smirni, E. (2008). Burstiness in multi-tier applications symptoms,
causes and new models. Proceedings of 9th ACM/IFIP/USENIX international conference on
middleware, (pp. 265-286).
Monash Sun Grid. (n.d.). Retrieved June 1, 2011, from
http://www.monash.edu.au/eresearch/services/mcg/msg.html
Monash Sun Grid Resource Utilization. (n.d.). Retrieved June 1, 2011, from
http://msg.its.monash.edu.au/ganglia/
Monash Sun Grid Usage Guide. (n.d.). Retrieved June 1, 2011, from https://confluence-
vre.its.monash.edu.au/display/mcgwiki/MSG+User+Guide
Nimrod. (n.d.). (Monash University) Retrieved June 1, 2011, from Nimrod:
https://messagelab.monash.edu.au/NimrodG
Papakhian, M. (1998). Comparing job-management systems: the user’s perspective. IEEE
Computational Science & Engineering, 4-9.
Peng, D. T., & Shin, K. G. (1993). A New performance measure for scheduling independent real time
tasks. Journal of Parallel and Distributed Computing(19), 11-26.
Raicu, I., Zhao, Y., Dumitrescu, C., Foster, I., & Wilde, M. (2007). Falkon: A Fast and LightWeight Task
Execution Framework. Proceedings of the 2007 ACM/IEEE Conference on Supercomputing,
2007. SC '07. , (pp. 1 - 12 ).
Ravindran, B., Devarasetty, R. K., & Shirazi, B. (2002). Adaptive Resource Management Algorithms for
Periodic Tasks in Dynamic Real-Time Distributed Systems. Journal of Parallel and Distributed
Computing(62), 1527–1547.
Stillwell, M., Schanzenbach, D., Vivien, F., & Casanova, H. (2010). Resource Allocation algorithms for
virtualized service hosting platforms. Journal of parallel and distributed computing(70), 962-
974.
Urgaonkar, B., Shenoy, P., & Roscoe, R. (2002). Resource Overbooking and Application profiling in
shared hosting platforms. Sigops operating system review(36), pp. 239-254.
Urgaonkar, B., Shenoy, P., Chandra, A., Goyal, P., & Wood, T. (2008). Agile dynamic provisioning of
multi tier internet applications. ACM Trans. Auton. Adapt. Syst.(1), 1-39.
Schroder-Preikschat, W. (1994). The Logical Design of Parallel Operating Systems. NJ.
Weissman, J. B., Abburi, Ã. L., & England, D. (2003). Integrated scheduling: the best of both worlds.
Journal of Parallel and Distributed Computing(63), 649–668.
Wu, M., & Sun, X.-H. (2006). Grid harvest service: A performance system of grid computing. J. Parallel
Distrib. Comput., 1322-1337.
Yang, L., Liu, C., Schopf, J. M., & Foster, I. (2007). Anomaly detection and diagnosis in grid
environments. Proceedings of the 2007 ACM/IEEE conference on Supercomputing, (pp. 1-9).