A general-purpose distributed computing environment with application to DNA analysis

Richard Allen,1 Thomas Keane,1 Thomas J. Naughton,1 and John Waldron2

1 Department of Computer Science, National University of Ireland, Maynooth, Ireland.
2 Department of Computer Science, Trinity College, Dublin 2, Ireland.

Corresponding author: [email protected]

Date: 2 May 2002

Technical Report: NUIM-CS-TR-2002-03

Key words: distributed computing, MIMD emulation, Java, DNA analysis

Abstract

This report describes the full design and development of a general-purpose programmable distributed environment. The aim of the system is to provide developers with a quick and easy platform for implementing distributed computations in the context of a MIMD architecture (multiple instruction, multiple data). The model underlying the system is a combination of the client-server model and the pipeline processor model. The design and implementation of the system is based on an early version of the Java Distributed Computation Library by Fritsche, Power, and Waldron. The distinguishing feature of the system is its ability to dynamically change the algorithm sent to clients. We have demonstrated the functionality of our system by solving a problem from the field of DNA analysis. Our system was evaluated over a local area network of approximately two hundred computers. This report contains a description of the JDCL and its modifications, a user manual, and the code for a sample distributed application.



Introductory Note

This technical report contains extracts from the documentation of a final year B.Sc. (Computer Science and Software Engineering) project, carried out jointly by Richard Allen and Thomas Keane during the academic year 2001/2002.

The report is intended primarily to supplement a forthcoming publication,1 by supplying further details that, because of space restrictions, could not be included in that paper. It contains full descriptions of the design of the JDCL and its refinements, an explanation of our implementation, a user manual for the system (including installation and configuration instructions, and information on programming the system), and the code for a complete sample distributed application that uses the system. This simple, but general, application can be used as a template to design more complicated applications, by showing exactly how only two classes need be extended to fully program a distributed system.

The authors wish to acknowledge the Department of Computer Science, NUI Maynooth, for the use of its computer laboratories, technicians Michael Monaghan, Patrick Marshall, and James Cotter for their assistance, Prof. Ronan Reilly for background information on distributed computing with Java, and Dr. James McInerney from the BioInformatics and Pharmacogenomics Laboratory, Department of Biology, NUI Maynooth, for his biology expertise.

1 Thomas Keane, Richard Allen, Thomas J. Naughton, James McInerney, and John Waldron, “Distributed computing for DNA analysis,” in Principles and Practice of Programming in Java, Dublin, Ireland, June 2002. To appear.


Table of Contents

Introductory Note

Chapter 1. Introduction
1.1 Aims
1.2 Overview of Technical Area
1.3 Overview of Technical Report

Chapter 2. System Overview
2.1 Requirements
2.2 Requirements Analysis
2.3 The Java Distributed Computation Library (JDCL)
2.4 System Specification

Chapter 3. Theoretical Foundations: The Computational Model
3.1 Introduction
3.2 Details of Theoretical Model

Chapter 4. Design
4.1 Introduction
4.2 Limitations of / Problems with JDCL
4.3 Final Design of Project
4.4 Communications Protocol

Chapter 5. Implementation
5.1 Development Language
5.2 Deployment of Application
5.3 Our Deployment of the Client Software

Chapter 6. Evaluation
6.1 The Ordered Delta-Max Problem
6.2 DNA Analysis

Chapter 7. Application
7.1 What is DNA?
7.2 Why Analyse DNA?
7.3 What Do We Do with Our System?
7.4 Search Strategy
7.5 Results Interface

Chapter 8. Discussion
8.1 Achievement of Aims
8.2 Future Development

References

Appendices
A Design of the Java Distributed Computation Library
B User Manual
C Example JDCL Code: the Delta-Max Application


Chapter 1. Introduction

The original project description was as follows:

A programmable Distributed Computing Environment is to be built. It will consist of a server capable of accepting an algorithm and a data set, and capable of partitioning the computational load into variable sized segments. Client software to request a segment (algorithm and data) must also be written. The server will combine the results from numerous clients and produce the overall result.

Our main aim was to build this “programmable distributed computing environment.” Within this task, we set ourselves several smaller aims that we wanted to achieve during this project.

1.1 Aims

Distributed computations are usually characterised by many spatially separated independent machines working together to perform a task. Any environment to manage a computation of this scale should be relatively easy to use from a user’s point of view. This means that the interface with the user must be easily understood and practical to use. Our primary aim was to keep the user’s interaction with the system to a minimum and to remove many of the complexities of setting up and administering a distributed computation.

Another major aim was to develop client software that integrated seamlessly with the client machines. The nature of any distributed computation is that many client machines are necessary to perform one overall computation. These computations can take a number of weeks or months; therefore the software that is run on the client machines should cause only the minimum of interruption to the donor of a client machine. We wanted to use up their ‘spare’ clock cycles, but only in a way that would not affect the donor of a client machine. In this way, we hoped to gain acceptance for our client software within the Department and within the University.

We proposed to evaluate our system with at least one major application (along with various other test problems). DNA analysis was chosen.

1.1.1 Motivation

The motivation behind this project hinged on one major point: most current distributed computation systems are designed with one application in mind. That is, there are only a few systems that can be programmed by different users to perform arbitrary distributed computations. Good examples of such single-purpose systems are the Great Internet Mersenne Prime Search (GIMPS) [1] and SETI@Home [2]. Each of these systems was designed to perform one specific task. GIMPS involves finding large prime numbers and SETI@Home analyses data from a radio telescope.

We wanted our system to have the ability to dynamically change the task (algorithm) that the client machines were running. The main advantage would be that once a person had obtained a copy of our client software, the server could be re-configured to solve different problems without the need to change the client software, i.e. the client would download the new algorithm and data from the server. Again, this feature is lacking in most of the current distributed systems.


To construct our system, we adopted the Java Distributed Computation Library (JDCL) [3]. This is a system comparable to PVM [4, 5] or MPI [6], which are used for more tightly coupled parallel computation.

1.2 Overview of Technical Area

Distributed computing refers to the concept of parallelising a computation over stand-alone machines connected by a local area network (LAN) or wide area network (WAN) [7]. Organisations with large projects to complete have attracted Internet users from across the world to donate their computer’s time. SETI@Home [2], by far the most popular with users, analyses data from a radio telescope, searching for potential signals from extraterrestrial life. RC5-64, a project from Distributed.net [8], is testing 72 × 10^15 keys to unlock a 64-bit encryption code. The Great Internet Mersenne Prime Search [1] sorts through billions of numbers for large primes.

Distributed computing’s primary advantage over traditional supercomputers is its frugality: it can potentially make use of every spare moment that a computer’s processor is idle. The latest microprocessor could sit unused most of the time while the monitor shows a screen saver or while the keyboard records a user’s typing. These basic functions use very little processing power, while the rest goes to waste. Distributed computing can take full advantage of a computer’s capabilities by keeping it busy.

It is also economical. If enough donors sign up, these linked computers (often referred to as virtual parallel machines) can surpass the fastest supercomputer by as much as four times for a fraction of the supercomputer’s cost. By fully utilising an organisation’s IT resources, existing investments can be turned into a distributed computing network. What scientist or engineer with a large, overwhelming project could resist the concept of more power for less money? The current applications of distributed computing range from financial services [9] and medical testing [9] to the search for extraterrestrial life [2].

1.2.1 Other Programmable Distributed Environments

JavaSpaces [10] is a new distributed object system being proposed by Sun. It provides a distributed, persistent object system that is roughly modelled after earlier shared memory systems, such as LINDA [11]. The distributed application paradigm supported by JavaSpaces is one in which remote agents interact with each other indirectly through shared data object spaces. This paradigm goes against some of the principles that our system is based on. In our system, no client should be aware of another client, and we are not trying to model a shared memory as JavaSpaces does.

Joone [12] is a neural net framework to create, train and test neural nets. It is based on JavaSpaces technology [10]. The purpose of the application is to provide users with a platform to develop powerful and reliable AI applications. The idea is that by using Joone, one can write new modules to implement new algorithms and architectures.

1.3 Overview of Technical Report

The technical report can be summarised as follows. Chapter 2 provides a system overview, including requirements and final specification of the system. Chapter 3 explains the computational theory of the system. Chapter 4 includes the complete design for the system. The system is modularised into 4 main parts. They are server specific, client specific, and common components of the system, and the communications protocol. The purpose, attributes, and methods of each module are detailed in this chapter. Chapter 4 also contains the design we used to represent pipeline processing as client-server computing (the computational theory of which was given in Chapter 3). Chapter 5 gives details of the actual implementation of the design. There is a discussion about the programming language chosen, the possible system deployment options, and the option we chose for our application. Chapter 6 contains details of some of the evaluation we performed of the system. We describe the problems that were used to confirm the generality and power of the system. Chapter 7 introduces the main application that we used the system for – DNA analysis. There is an introduction to the problem and then an explanation of how we applied our system to solving it. Chapter 8 contains a discussion of possible future enhancements of the system. Appendices A and B contain a description of the original JDCL, and a user manual for the current system, respectively. Appendix C contains the code for the Delta-Max problem.


Chapter 2. System Overview

In this chapter we outline the requirements for this project, explain the starting point of the development (the JDCL) and detail our specification for an improved system.

2.1 Requirements

The requirements were as follows.

Requirement 1: Platform and Network
The overall system should be platform independent and network independent.

Requirement 2: Communication
There should be a well-defined protocol for communication between the server and clients – this protocol must be adhered to for all communications.

Requirement 3: Logs/Records
Both the client and server should maintain full logs of all events as they happen in the system. These should consist of both error and event logs. These should be available for inspection on the machine on which the software is running.

Requirement 4: Target User Audience
In order to program the distributed environment with a given problem, basic programming skills are required. It is acceptable to assume that the user will have such skills. Furthermore we will let the user specify how the data is to be split up into work units for the clients.

Requirement 5: Server

5.1 One of the server’s main goals is to split up the data set into units (according to a scheme specified by the user) and distribute them with an algorithm to the client machines upon request.

5.2 The server should be able to handle simultaneous requests from multiple clients.

5.3 The server needs to be able to resend a unit (to a different client, if necessary) if a client fails to send results.

5.4 The main goal of the server is to collect all results from clients and output them to a formatted file.

5.5 The server should have an easy-to-modify structure for the user to tailor to their particular needs.
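Requirement 5.1 can be illustrated with a minimal sketch. It assumes that the data set is an in-memory list and that the user-specified scheme is simply a fixed unit size. The class and method names (`UnitSplitter`, `split`) are hypothetical and are not taken from the JDCL.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the JDCL: splits a data set into
// fixed-size work units. The unit size stands in for the user-specified scheme.
final class UnitSplitter {
    private UnitSplitter() {}

    static <T> List<List<T>> split(List<T> data, int unitSize) {
        if (unitSize <= 0) {
            throw new IllegalArgumentException("unitSize must be positive");
        }
        List<List<T>> units = new ArrayList<>();
        for (int start = 0; start < data.size(); start += unitSize) {
            int end = Math.min(start + unitSize, data.size());
            // Copy the slice so that each unit is independent of the source list.
            units.add(new ArrayList<>(data.subList(start, end)));
        }
        return units;
    }
}
```

Splitting seven items with a unit size of three would give three units, the last of which holds the single remaining item. A real scheme could of course produce variable sized units, as the project description allows.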

Requirement 6: Client

6.1 There should be two main options for starting the client software. The user should be able to explicitly start the client program and there should also be the capability to automatically start the client when the machine is booted.

6.2 The essence of the client is that it should be able to accept any arbitrary algorithm and execute it on a given data set – returning the results to the server when finished.

6.3 One of the goals of the client software is to only run a segment (process data) when there are sufficient resources free on the client machine, e.g. CPU clock cycles. The aim here is for the client not to interfere with any work that the donor of the client machine may be doing.

6.4 There should be sufficient security mechanisms built into the client so that the downloaded task cannot subvert the client machine or client software.

6.5 There should be a suitable user interface to inform the donor as to the state of the program.

2.2 Requirements Analysis

After we had decided on the requirements of the system we then examined other existing systems to see which one would best meet the requirements. The first step was to determine what types of distributed computing models are currently being used.

We found that there are two main distributed computing models:
• Client-Server
• Peer-to-Peer

2.2.1 The Client-Server Model

In this model, the distributed application is divided into two parts, one part residing on each of the two computers that will be communicating during the distributed computation. The client side of the application resides on the machine that initiates the distributed request and receives the benefit of the service. The server side of the application resides on the machine that receives and executes the distributed request. In this model, two different sets of code are produced – one that runs as a client, the other as a server.

This approach to distributed computing leads to a number of typical features.
• Centralised - The server controls all the operations on the network.
• Communication - The clients initiate communication with the server and cannot communicate with each other.
• Resource sharing - The server utilises the computing power of the clients to solve a problem.

Advantages of Client-Server
• Security - Since the clients can only contact the server, it is very difficult for anyone to compromise the network.
• Network Maintenance and Support - In the Client-Server model it is not too important if a number of the client machines fail. The only machine that must be maintained is the server.
• Bandwidth - The Client-Server model uses very little bandwidth. The clients connect to the server to get some information and then disconnect while they process it.

Disadvantages
The Client-Server model is very good for distributed computing but there are some problems:
• Robustness - If any individual client machine on the network becomes unavailable there is little or no consequence for the computation. However, loss of the server can be catastrophic.
• Scalability - In a Client-Server environment, each additional client adds load on the server, to the point where the network will collapse unless more server capacity is added.
• Version control - Distributing newer versions of the client application is very difficult, as the server does not have control of the clients.

2.2.2 The Peer-to-Peer Model

Peer-to-Peer was first implemented when two computers were connected together in such a way that each could gain access to the other’s resources [13]. It differentiates itself from other distributed computing models by the principle that participating computers have equal status and resources are distributed rather than centralised.

This approach to distributed computing leads to a number of typical features.
• De-centralised - The network does not rely on individual central resources for its operation.
• Equality - Participants can act as both clients and servers depending on need.
• Resource sharing - Disk space and / or processing power can be shared.

Advantages of Peer-to-Peer
• Robustness - If any individual machine on a Peer-to-Peer network becomes unavailable there is little or no consequence for the operation of the network.
• Scalability - As machines are added, the processing power and disk space available to the network increases.
• Self-Optimising - Resources get distributed around the network, as demand requires.
• Data Redundancy - Resources are unlikely to be lost / damaged as they can be replicated all over the network, and the network remains available even when individual components become unavailable.

Disadvantages
At first glance Peer-to-Peer would appear to be superior to the Client-Server model, but there are major problems.
• Security - With no central point of access or management it is difficult to prevent penetration of the network.
• Version control - Proliferation of files and applications, which can be updated without central management, makes it very difficult to maintain consistency.
• Network Maintenance and Support - As has been seen, the Peer-to-Peer model has certain self-managing properties, but these are not perfect or sufficient. When problems occur in a chaotic model, it is very difficult to track down causes and resolve them.
• Bandwidth - The ‘conversational’ nature of Peer-to-Peer networks means that machines must talk to each other regularly in order to establish where resources can be located and who is active [13]. This means that each machine may have to handle massive volumes of data passing on its way to other machines on the network.

During the course of our investigations we encountered a previously implemented Client-Server distributed system called the Java Distributed Computation Library [3].

2.3 The Java Distributed Computation Library (JDCL)

The JDCL [3] was written as an easy-to-use platform for developers who want to quickly implement a distributed computation in the context of a Single Program, Multiple Data (SPMD) architecture. The JDCL aims to solve some of the major drawbacks of current distributed computing systems, namely:

1. They do not have an easy platform for development
2. Client behaviour cannot be changed dynamically
3. They do not promote easy deployment of software

From the aims above it would seem that the JDCL could be a good attempt at a programmable distributed environment. From the detailed description of the system in appendix A, it can be seen that the JDCL does indeed provide much of the functionality that we are looking to incorporate into our system.

The overall modular design of the JDCL makes it a very suitable starting point from which to base our system. It was decided that instead of designing a system from scratch we would concentrate our efforts on improving on the design and implementation of the JDCL in order to create our system. Even though we planned to use the JDCL as a starting point for our system, as can be seen from section 4.2, many issues and problems had to be addressed during our system development.

2.4 System Specification

Communication: All communication is to take place through the specified system communications protocol. The essence of this is that all communication is to be initiated by the client software. All communication will take place in sessions.

Logs/Records: All events that take place (from software initialisation to network connections being established) will be recorded on the local machine on which the program is running. This will be done via physical files. This will provide users/administrators of the system with vital information on the state of the program.

Initialisation Variables: In order to start the program there will be several system variables that will have to be set by the user of the system. These will cover issues such as the server IP (internet protocol) address, server port, where to print event information to, and what level of detail and what format is required in the log information.
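As an illustration of the initialisation variables described above, the following minimal sketch reads them from text in the standard `java.util.Properties` format. The class name `ClientSettings`, the key names (`server.address`, `server.port`, `log.file`, `log.level`), and the default values are all assumptions made for this example; the report does not specify how the real system names or stores these variables.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical settings holder; key names and defaults are illustrative only.
final class ClientSettings {
    final String serverAddress;
    final int serverPort;
    final String logFile;
    final String logLevel;

    private ClientSettings(Properties p) {
        serverAddress = p.getProperty("server.address", "127.0.0.1");
        serverPort = Integer.parseInt(p.getProperty("server.port", "4000"));
        logFile = p.getProperty("log.file", "client.log");
        logLevel = p.getProperty("log.level", "INFO");
    }

    // Parses settings from text in java.util.Properties format.
    static ClientSettings parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return new ClientSettings(p);
    }
}
```

Any key that is missing from the input falls back to its default, so a user would only need to set the values that differ from the assumed defaults.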

Server:
• When the server initialises it will load in a number of user defined classes, which contain the task to be done, how the data is to be split and how to format the results.
• The server will be able to communicate with multiple clients simultaneously.
• The server must maintain a list of all the units that are currently being processed by the clients. If a client fails to return a result within a specified time or request an extension the server must redistribute the unit to another client.
• The server waits for the clients to initiate communication and then handles their requests for services. These services include updating a task, handling the results from a unit and sending out a new work unit.
• The server should have the ability to send the results from one process out to a client for further computation.
• The server will have the ability to log error messages from the client.
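The bookkeeping of outstanding units described in the server list above could be sketched as follows. This is not the JDCL implementation: the class `UnitTracker`, its method names, and the use of integer unit identifiers and millisecond timestamps are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: tracks which work units are out with clients
// and when each one is due back.
final class UnitTracker {
    private final long timeoutMillis;
    private final Map<Integer, Long> deadlines = new HashMap<>();

    UnitTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Records that a unit was handed to a client at time 'now'.
    synchronized void issued(int unitId, long now) {
        deadlines.put(unitId, now + timeoutMillis);
    }

    // A client asked for an extension: move the deadline forward from 'now'.
    synchronized boolean extend(int unitId, long now) {
        if (!deadlines.containsKey(unitId)) {
            return false; // unknown or already completed unit
        }
        deadlines.put(unitId, now + timeoutMillis);
        return true;
    }

    // A result arrived, so the unit no longer needs tracking.
    synchronized void completed(int unitId) {
        deadlines.remove(unitId);
    }

    // Removes and returns the units whose deadline has passed; the caller
    // would queue these for redistribution to another client.
    synchronized List<Integer> takeExpired(long now) {
        List<Integer> expired = new ArrayList<>();
        Iterator<Map.Entry<Integer, Long>> it = deadlines.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Long> entry = it.next();
            if (entry.getValue() <= now) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

The methods are synchronized because the server is expected to serve several clients at once. Passing the current time in as a parameter, rather than reading the system clock inside the class, is a design choice made here only to keep the sketch easy to test.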


Client:
• The client application will be in the form of an executable file. There will be a few options as to how to run the software. The two main choices will be to either start the application explicitly or to have it start automatically (via the OS) and run in the background.
• The client software should run as a low priority thread so it doesn’t interfere with any other work that the user is doing on the client machine.
• As mentioned above, the client will initiate all communications – this will mean that when the software is started up the client will contact the server and request both an algorithm and a data set. Once the client has finished processing the data “unit” – it will send the results back to the server (again the communication is initiated by the client).
• There will be an overall security policy included in the client software. This can be split into two main sections:
  1) Client Software and external Client Machine:
     There will be a Security Manager put in place to limit the interactions between the client software and the donor machine. This will include issues such as disk access, network access etc.
  2) Client Software and Downloadable Task:
     The downloaded task will be loaded into the client software in such a way that it can only communicate with the client software through a well-defined interface. This interface will incorporate a monitor to ensure that the downloaded task cannot subvert the client software in any way.
• The client user interface will provide the user with various pieces of information regarding the running of the current Task – e.g. the current unit progress. This will be done via a graphical output.
• Once the client is started at a machine, there should never be any need to restart or replace the client software.
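The low priority thread mentioned in the client list above can be sketched with the standard `java.lang.Thread` API. The class name `LowPriorityRunner` and the thread name are hypothetical; how the actual client schedules its work is not shown in this part of the report.

```java
// Hypothetical sketch: runs a work unit on a thread with the lowest priority.
final class LowPriorityRunner {
    private LowPriorityRunner() {}

    static Thread start(Runnable work) {
        Thread worker = new Thread(work, "unit-worker");
        // Thread priority is only a scheduling hint; its effect depends on
        // the JVM and the operating system.
        worker.setPriority(Thread.MIN_PRIORITY);
        // A daemon thread does not keep the JVM alive once the client exits.
        worker.setDaemon(true);
        worker.start();
        return worker;
    }
}
```

Because priority is only a hint, a low priority thread on its own may not be enough to guarantee that the donor never notices the client, which is presumably why requirement 6.3 speaks of free resources more generally.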


Chapter 3. Theoretical Foundations: The Computational Model

3.1 Introduction

When designing an information processing system, Marr [14] has proposed three levels at which the system must be understood:

1. Computational Theory
2. Representation and Algorithm
3. Hardware implementation

At the top level (level 1) is the abstract computational theory of the system, in which a set of basic (abstract) concepts and a specification for the system are defined. The system is explained in terms of what it does and this characterisation comes in a mathematical form or a formal algebra. At the middle level is the choice of representation for the data and the algorithm for its manipulation. This level describes how the results predicted by the computational theory would be obtained. At the bottom level are the details of how the algorithm is realised physically.

Although the three levels are needed for an understanding of the system, it is the top level that is the most important. The reason for this is that an algorithm is more easily understood by understanding the nature of the problem or solution than by examining the code itself or the hardware on which it is implemented.

In this chapter, the computational theories of Client-Server distributed computing and of pipeline processing are presented. Client-Server computing is a classic example of the SIMD (single instruction, multiple data stream) paradigm [15]. Pipeline processing is an example of the MISD (multiple instruction, single data stream) approach [15]. We show that a system that can be characterised as a pipeline computation can be expressed in terms of a Client-Server computation. As such, an algorithm exists to simulate MIMD processing within the client-server framework. The design of our system to represent pipeline processing as client-server computing is given in chapter 4.

3.2 Details of Theoretical Model

This section deals with the two main computational models that are used in our system. The first, the Client-Server model, is the model on which we based the design of our system. The second, pipeline processing, can be used within the server to control the distribution of the units.

3.2.1 Client-Server Model

The Client-Server model is a concept for describing communications between computing processes that are classified as service consumers (clients) and service providers (servers). The basic features of a Client-Server model are:

1. Each Client-Server communication is established when one module (client) initiates a service request and the other (server) chooses to respond to the service request. For a given service request, clients and servers do not reverse roles (i.e., a client stays a client and a server stays a server).

2. Information exchange between clients and servers is strictly through messages (i.e., no information is exchanged through global variables). The service request and additional information is placed into a message that is sent to the server. The server’s response is similarly another message that is sent back to the client. This is an extremely crucial feature of the Client-Server model.

The following additional features, although not required, are typical of a Client-Server model:

3. Messages exchanged are typically interactive. In other words, the Client-Server model does not support an off-line process.

4. Clients and servers typically reside on separate machines connected through a network. Conceptually, though, they can run on the same machine.

The implication of the last two features is that Client-Server service requests are real-time(synchronous) messages that are exchanged through network services. This feature increasesthe appeal of the Client-Server model (i.e., flexibility, scalability) but introduces severaltechnical issues such as portability, interoperability, security, and performance [16]. Figure3.1 shows the interrelationships between distributed computing and Client-Server models.Conceptually, the Client-Server model is a special case of distributed-computing model.

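Fig. 3.1: Interrelationships between Computing Models [16]. [Figure labels: Computing Models; Distributed Computing Model; Terminal-Host Model; File Transfer Model; Client-Server Model; Peer-to-Peer Model.]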

A Distributed Computing System (DCS) is a collection of autonomous computers interconnected through a communication network [16]. Technically, the computers do not share main memory so that the information cannot be transferred through global variables. The information between the computers is exchanged only through messages over a network.

The restriction of no shared memory and information exchange through messages is of key importance because it distinguishes between DCSs and shared memory multiprocessor computing systems. This definition requires that the DCS computers be connected through a network that is responsible for the information exchange between computers. The definition also requires that the computers have to work together and cooperate with each other to complete the overall computation.

Formal Description of Client-Server computing
Consider an input X, and a computation on that input C(X) that returns some result r. We could say that

$$ r = C(X). \qquad (1.1) $$

In client-server computing, the server will have the ability to partition the input data into n segments

$$ X = \sum_{i=0}^{n-1} x_i \qquad (1.2) $$

such that the following transform

$$ x_i \Rightarrow C(x_i) = r_i \qquad (1.3) $$

can be applied to each data segment x_i. Each segment is sent to one of a set of clients, which performs the transformation, and returns the corresponding r_i. The server will have the ability to reconstruct the original result by combining these partial results

$$ r = C(X) = \bigoplus_{i=0}^{n-1} C(x_i) \qquad (1.4) $$

where $\bigoplus$ denotes the combination operation.
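The partition, transform, and combine steps of Eqs. 1.2 to 1.4 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class name PartitionDemo, the choice of "sum of squares" as the computation C, and addition as the combination operation. In the real system the segments would be sent to clients over the network rather than processed in a local loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of Eqs. 1.1-1.4; none of these names come from the report.
class PartitionDemo {
    // Server side: split the input X into n contiguous segments x_0 .. x_{n-1} (Eq. 1.2).
    static List<int[]> partition(int[] input, int n) {
        List<int[]> segments = new ArrayList<>();
        int size = (input.length + n - 1) / n;   // ceiling division
        for (int i = 0; i < n; i++) {
            int from = Math.min(i * size, input.length);
            int to = Math.min(from + size, input.length);
            segments.add(Arrays.copyOfRange(input, from, to));
        }
        return segments;
    }

    // Client side: the computation C applied to one segment (Eq. 1.3).
    // C is assumed here to be "sum of squares".
    static long compute(int[] segment) {
        long r = 0;
        for (int v : segment) r += (long) v * v;
        return r;
    }

    // Server side: combine the partial results r_i (Eq. 1.4), here by addition.
    static long combine(List<Long> partials) {
        long r = 0;
        for (long p : partials) r += p;
        return r;
    }

    // Partition, transform each segment, and combine the partial results.
    static long distributed(int[] input, int n) {
        List<Long> partials = new ArrayList<>();
        for (int[] segment : partition(input, n)) partials.add(compute(segment));
        return combine(partials);
    }
}
```

For this particular C, distributed(x, n) equals compute(x) for any n of at least 1, which is the property that Eq. 1.4 requires of the combination operation.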

3.2.2 Pipeline Processing
In pipeline processing, the problem is divided into a series of tasks that have to be completed one after another (see Fig. 3.2). Each of these tasks is called a pipeline stage and is executed by a separate process or processor. Each stage will contribute to the overall problem and pass on information that is needed for subsequent stages.

Fig. 3.2: Pipelined processes, where P0 through P4 are the processes.

The problem is divided into separate functions that must be performed on the data, but in this case the functions are performed in succession. The input data is often broken up and passed sequentially through the system.

Given that the problem can be divided into a series of sequential tasks, the pipelined approach can provide increased speed under the following three types of computation:

1. If more than one instance of the complete problem is to be executed
2. If a series of data items must be processed, each requiring multiple operations
3. If some of the data can be passed forward to the next process before the current process has completed working on all of the data.

These three types of computation are performed frequently on distributed computing systems [15]. This can be seen from the example given in section 6.1. In the example there are multiple units, which must first be sorted. When two or more units are sorted they can then be merged until all the data is sorted. Then the data is split into units and the largest differences between two numbers are computed. The three stages of this computation use all three types of computation that a pipeline processor is designed for.

Formal Description of Pipeline Processing
In pipeline processing, the ability will exist to decompose a computation into m smaller transformations that each act on the result of the previous transformation,

$$ r = C(X) = c_{m-1}(c_{m-2}(\cdots c_1(c_0(X)) \cdots)), \qquad (1.5) $$

where X is the input. A recursive definition of this concept could be written as follows,

$$ r_j = \begin{cases} c_0(X) & \text{if } j = 0; \\ c_j(r_{j-1}) & \text{if } j > 0, \end{cases} \qquad (1.6) $$

where r = r_{m-1} can be regarded as the seed to the recursion and defines the final result. The first clause in Eq. 1.6 is the terminating condition (passing the input to the first transformation) and the second clause describes how the result of any one transformation depends on the preceding transformation. We use the following compact notation to represent the recursive definition of Eq. 1.6,

$$ r = C(X) = \prod_{j=0}^{m-1} c_j(X), \qquad (1.7) $$
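As an illustration of the recursion in Eq. 1.6, the sketch below folds an input through a list of transformations in order. The class name PipelineDemo and the three example stages are assumptions chosen for illustration only; they are not taken from the system described in this report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration of Eqs. 1.5-1.7.
class PipelineDemo {
    // Eq. 1.6 as a loop: r_0 = c_0(X), r_j = c_j(r_{j-1}); the final result is r_{m-1}.
    static <T> T run(List<UnaryOperator<T>> stages, T input) {
        T r = input;
        for (UnaryOperator<T> c : stages) {
            r = c.apply(r);
        }
        return r;
    }

    // Example with m = 3 assumed stages: add one, double, subtract three.
    static int example(int x) {
        List<UnaryOperator<Integer>> stages = new ArrayList<>();
        stages.add(v -> v + 1);   // c_0
        stages.add(v -> v * 2);   // c_1
        stages.add(v -> v - 3);   // c_2
        return run(stages, x);
    }
}
```

With these assumed stages, example(5) evaluates c_2(c_1(c_0(5))) = (5 + 1) * 2 - 3 = 9.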

where $\prod$ denotes the operation to appropriately deal with intermediate results. Equation 1.7 describes passing the complete input X to transformation c_0(), and passing the result to transformation c_1(), and so on. In pipeline processing, the ability will also exist to partition the input into n segments as described in Eq. 1.2. In this case, each of the n segments could be passed in turn through all m transformations and the partial results combined as in Eq. 1.4. Equation 1.7 could then be written as

$$ r = C(X) = \bigoplus_{i=0}^{n-1} \prod_{j=0}^{m-1} c_j(x_i). \qquad (1.8) $$

Fig. 3.3. An illustration of MISD pipeline processing. [Figure: the data segments x_0, x_1, ..., x_{n-1} pass through the processors p_0, p_1, ..., p_{m-1} to produce C(X).]

The advantages of this include the ability to arbitrarily change the granularity of the data throughput (some transformations may have restrictions on the size or representation of their arguments) and also to permit parallelisation of the computation. The computation depicted in Fig. 3.3 could be described by rewriting Eq. 1.8 as

$$ r = C(X) = \bigoplus_{i=0}^{n+m-2} \left( \prod_{j=0}^{i} c_j(x_{i-j}) \right), \qquad (1.9) $$

where processor p_j is responsible for transformation c_j().

3.2.3 Combining Models
It is possible to combine both the client-server and pipeline models. This is important if we want to allow our clients to effect arbitrary transforms rather than all performing the same c_j(). In this case, the server divides the computation as well as the data. It distributes to the clients a description of a transformation c_j() as well as a data segment x_i. Since the partitioning shown in Eq. 1.2 is possible, there will not be any interdependencies between different parts of the data stream. Equation 1.7 could be rewritten as

$$ r = C(X) = \prod_{j=0}^{m-1} \bigoplus_{i=0}^{n-1} c_j(x_i), \qquad (1.10) $$

which describes transforming all of the data segments with c_j() before applying c_{j+1}(), and so on. Since Eqs. 1.9 and 1.10 describe the same computation, this shows that the order in which each c_j(x_i) is effected is unimportant, and so the highly structured pipeline processor model could be simulated by the loosely-coupled client-server model. The algorithm for this simulation is given as part of the system design in chapter 4.
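The equivalence of the three orderings can be checked on a small case. The sketch below is only an illustration under stated assumptions: each data segment is a single int, the combination operation is addition, and the class name OrderDemo and the two example stages are hypothetical. In every ordering a segment still passes through its own stages in the order c_0, c_1, and so on; only the interleaving between segments changes.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical sketch: stages[j] plays the role of c_j() and x[i] the role of x_i.
class OrderDemo {
    // Eq. 1.8: push each segment through all m stages, then combine.
    static int segmentBySegment(int[] x, IntUnaryOperator[] stages) {
        int r = 0;
        for (int i = 0; i < x.length; i++) {
            int v = x[i];
            for (IntUnaryOperator c : stages) v = c.applyAsInt(v);
            r += v;
        }
        return r;
    }

    // Eq. 1.9: at time step i, processor p_j works on segment x_{i-j}.
    static int timeStepped(int[] x, IntUnaryOperator[] stages) {
        int n = x.length, m = stages.length;
        int[] v = x.clone();
        for (int i = 0; i <= n + m - 2; i++) {
            for (int j = 0; j <= i && j < m; j++) {
                int k = i - j;                       // segment currently at stage j
                if (k < n) v[k] = stages[j].applyAsInt(v[k]);
            }
        }
        return sum(v);
    }

    // Eq. 1.10: apply c_j to every segment before moving on to c_{j+1}.
    static int stageByStage(int[] x, IntUnaryOperator[] stages) {
        int[] v = x.clone();
        for (IntUnaryOperator c : stages) {
            for (int i = 0; i < v.length; i++) v[i] = c.applyAsInt(v[i]);
        }
        return sum(v);
    }

    private static int sum(int[] v) {
        int r = 0;
        for (int value : v) r += value;
        return r;
    }

    // Runs all three orderings on the same assumed input and stages.
    static int[] demo() {
        int[] x = {1, 2, 3};
        IntUnaryOperator[] stages = { v -> v + 1, v -> v * 2 };
        return new int[] {
            segmentBySegment(x, stages), timeStepped(x, stages), stageByStage(x, stages)
        };
    }
}
```

For the assumed input (1, 2, 3) and stages "add one" then "double", all three orderings give 4 + 6 + 8 = 18.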


Chapter 4 Design

4.1 Introduction

Several major software engineering principles such as modularity, abstraction, generality and incrementality [17] were followed during the design of this project. We have provided a detailed UML style diagram of the dependencies within the system. These diagrams and our textual description of each module’s functionality should give a good overview of our design of the system. They should also make it easier to understand the textual description of each component.

4.1.1 System Division
The system modules have been divided up into three main groups:

• Server Specific Modules
• Client Specific Modules
• Common Modules

This is the way that the system was designed and should help the reader identify different parts of the overall system. The common modules are parts of the overall system that are similar on both the client and server. The main areas covered by these components are the messages used to communicate, the mechanisms used to perform the network communication (including network socket type), the module used to manage the LogFiles and the timer mechanisms in both pieces of software. As the client and server are distinct pieces of software, they also have their own individual design sections in this chapter.

4.1.2 Communications Protocol
All communication within our system follows a certain protocol. As with most client-server systems (see chapter 2), all communication is initiated by the client. All communication is divided up into sessions (details included in section 4.4).

4.2 Limitations of / Problems with JDCL
While working with the JDCL several problems were identified through careful analysis and testing of the code. These are broken into three sections: server specific, client specific, and common issues. Although each problem in itself might seem small, identification of the cause of these problems required significant effort. Combined, the problems severely limited the operation and extensibility of the system, and in section 4.3 it will be seen how a carefully integrated solution was required to elegantly refine and add to the system to bring it to its current level of functionality.

4.2.1 Server Specific Issues
The server specific issues concern coping with client failure and the feedback of results from clients.

a) Client Failure
One of the main aims of the JDCL was that it would be able to handle the case where a client failed to return results, for whatever reason. Although the code to do this was in the JDCL it was not functioning properly. There were two reasons for this:


1. The server maintains two TaskLists. The first, PendingTasks, contains all the units that are currently being processed by the clients. Each unit on the PendingTasks list has a processing time associated with it. If the client fails to return a result or request an extension within this time limit the unit is moved to the second list, called ExpiredTasks, so that the server can redistribute the unit. The cause of the problem was that when the unit, for which the client failed to return a result, expired it was not moved to the ExpiredTasks list so the server did not know that it needed to redistribute the unit.

2. When the server has no more units to send, and a client requests a unit, the server is supposed to send an empty (or blank) unit. This indicates to the client that there is nothing more to process so it can shut down. The problem here was that these blank units were being added to the PendingTasks list. Since the clients shut down when they receive a blank unit, these units were being left on the PendingTasks list. This resulted in the server waiting indefinitely for results that would not be returned.
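The intended behaviour behind both points can be sketched as follows. This is a minimal illustration under assumptions, not the JDCL code: the class name UnitTracker, the use of int unit IDs, the value -1 for a blank unit, and passing the current time as a parameter are all hypothetical choices. Expired units are moved to the expired list so that they can be redistributed, and blank units are never recorded as pending.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch; unit IDs are ints and BLANK_UNIT marks "nothing left to process".
class UnitTracker {
    static final int BLANK_UNIT = -1;

    private final Map<Integer, Long> pendingTasks = new HashMap<>(); // unit ID -> deadline (ms)
    private final Deque<Integer> expiredTasks = new ArrayDeque<>();
    private final Deque<Integer> freshUnits = new ArrayDeque<>();
    private final long processingTimeMs;

    UnitTracker(int unitCount, long processingTimeMs) {
        for (int id = 0; id < unitCount; id++) freshUnits.add(id);
        this.processingTimeMs = processingTimeMs;
    }

    // Moves every unit whose deadline has passed from the pending list to the expired list.
    void timeoutAction(long now) {
        Iterator<Map.Entry<Integer, Long>> it = pendingTasks.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Long> e = it.next();
            if (e.getValue() <= now) {
                expiredTasks.add(e.getKey());
                it.remove();
            }
        }
    }

    // Expired units are redistributed first; a blank unit is never tracked as pending.
    int nextUnit(long now) {
        Integer id = expiredTasks.poll();
        if (id == null) id = freshUnits.poll();
        if (id == null) return BLANK_UNIT;
        pendingTasks.put(id, now + processingTimeMs);
        return id;
    }

    // Accepts a result only for a unit that is still pending.
    boolean handleResult(int id) {
        return pendingTasks.remove(id) != null;
    }

    // True once no unit is waiting to be sent, processed, or redistributed.
    boolean isComplete() {
        return freshUnits.isEmpty() && pendingTasks.isEmpty() && expiredTasks.isEmpty();
    }
}
```

Because blank units never enter the pending map, isComplete() can become true once every real unit has returned a result, so the server does not wait indefinitely.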

b) Feedback of Results

The JDCL class diagram in appendix B shows the class layout for the JDCL. The ServerEngine creates the ParameterGenerator (PG) and ResultHandler (RH) and manages all the data sent to and from them. This design works well in the case where there is one task and the data only needs to be distributed once. However it is extremely complicated to get the PG and RH to communicate with each other, which is necessary for more complicated computations to be performed.

Such computations could include some or all of the following:

• the clients having the ability to perform multiple tasks,
• the results of a computation being combined and redistributed for further calculation, and
• if there are two tasks to be performed the PG might try to send out the data for the second task before the first task is fully completed.

The problem is best illustrated with an example. Consider the following problem.
“Given a list of unsorted numbers find the two numbers that are beside each other in an ordering of the list and that have the largest numerical difference”
E.g. given the list: (12, 30, 7, 25, 1, 3) the answer would be (12, 25).

This example has two tasks to be performed: first sort the list and then calculate the largest difference. In sorting the list, the PG will break up the list into small parts and distribute them to the clients. When the sorted sub-lists are received it would be more efficient, for a large number of units, to send the sorted sub-lists to the clients for merging rather than have the server do it. When the file is finally sorted then the second task should begin.

In calculating the largest difference the PG must now distribute small units, with some overlap, to the clients for them to calculate the largest difference. The RH must keep the results of this task and return the pair with the largest difference to the user.
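The second task, including the overlap between units, can be sketched in a few lines. This is a single-machine illustration under assumptions: the class name LargestGap and the parameter unitSize are hypothetical, the sort is done locally rather than by clients, and consecutive units are assumed to share exactly one element so that no pair of neighbours is missed at a unit boundary.

```java
import java.util.Arrays;

// Hypothetical sketch of the example problem; unitSize must be at least 2
// and the input must contain at least one number.
class LargestGap {
    // Work done by one client: the neighbouring pair with the largest difference
    // in sorted[from..to] (inclusive bounds), returned as {smaller, larger}.
    static int[] largestGap(int[] sorted, int from, int to) {
        int[] best = {sorted[from], sorted[from]};
        for (int i = from; i < to; i++) {
            if (sorted[i + 1] - sorted[i] > best[1] - best[0]) {
                best = new int[] {sorted[i], sorted[i + 1]};
            }
        }
        return best;
    }

    // Work done by the server: sort, cut into units that overlap by one element,
    // and keep the best pair reported for any unit.
    static int[] solve(int[] numbers, int unitSize) {
        int[] sorted = numbers.clone();
        Arrays.sort(sorted);
        int[] best = {sorted[0], sorted[0]};
        for (int from = 0; from < sorted.length - 1; from += unitSize - 1) {
            int to = Math.min(from + unitSize - 1, sorted.length - 1);
            int[] candidate = largestGap(sorted, from, to);
            if (candidate[1] - candidate[0] > best[1] - best[0]) best = candidate;
        }
        return best;
    }
}
```

For the list (12, 30, 7, 25, 1, 3) with unitSize 3, the sorted list (1, 3, 7, 12, 25, 30) is cut into the units (1, 3, 7), (7, 12, 25), and (25, 30), and the result is (12, 25).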

From this example it can be seen that it is necessary for the PG and RH to communicate. There are two lines of communication:

• the RH has to pass the results to the PG so that they can be redistributed (this could happen several times during the sorting stage), and
• the PG and RH must also “let each other know” what stage of the computation is currently being done. In particular, is the PG sending out the initial unsorted units, is it sending out the larger units, or is it sending out the units for the second task to be calculated?

4.2.2 Client Specific Issues
The client specific issues include a lack of security features, distribution limitations, the lack of a user interface, and the single problem limitation of clients.

a) Lack of Security Features
Essentially the JDCL was designed to be run in a very controlled environment where all users and systems are trusted. At the core of this distributed environment is a Java class file, which is downloaded and dynamically loaded into the client application. There are obvious security implications for the client machine. There must be protection mechanisms put in place to ensure that the class file is restricted in what it can do. This is a huge deficiency in the JDCL client code.

b) Unwieldy Client Initialisation
As it stands the JDCL client software consists of an initialisation file that contains approximately 15 different options. Although this might offer a high level of configurability, the initialisation file is too unwieldy for most users. We want the user interactions with the client application to be as minimal as possible. Realistically there are only two settings that need to be configured: the server IP address and server port. In a situation where the client is being distributed within an organisation, the system administrator can predetermine these settings. In these sorts of situations, no interaction is needed between the user of the client machine and the client application, and therefore this large initialisation file is not needed.

c) Impractical to Distribute
The JDCL client consists of approximately 10 classes and one initialisation file. It states in the JDCL documentation that the client application was to be distributed as a directory of files. This is not very practical because loss of any one of the files in this directory would lead to a non-functioning client. There are several other options for a more practical distribution method, e.g. a Java Jar file.

d) Lack of User Interface
The JDCL client is essentially a command line application. All that the user interface consists of is internal system messages flashing on a terminal window. This does not give the user any sort of useful information such as how many units the client has processed, what unit number is currently being processed and so on.
We propose to remove the terminal interface and replace it with a GUI (graphical user interface) that would provide the user with some useful information. This interface does not have to be mandatory - it should be possible to ‘switch off’ this interface in situations where it is not required or is a distraction.

e) Single Problem Limitation of Client
The way that the JDCL client is designed means that it is limited to accepting only a single algorithm in its lifetime. This is a fairly narrow view, because anyone who goes to the trouble of setting up a distributed computation would most likely want the ability to send another task to a client after it has finished the first. From looking at the design documents of the JDCL it may have been the intention to have this capability but it has not been effected. We propose to design clients that have the ability to accept an arbitrary number of different tasks over time from the server. Once the client software has been installed on the system, it should never need to be replaced, altered or restarted. This raises the issue of error handling. The client must have the facility to handle as many types of error as possible to avoid terminating unnecessarily.

4.2.3 Common Issues for Server and Client
As there is a lot of overlap between the server and client software, we noticed several issues that affected both pieces of software. Some of these issues are noted below.

a) Log Files
While programming the JDCL we observed that when the server and client were running, the logs were not constantly being written to. Rather, the output was buffered and written to the files when the buffer was full. While the logs are not critical to the client (unless we are debugging) they are vital on the server to keep track of exactly what is happening during the computation and can be useful afterwards to work out the life cycles of the units and the time taken for the computation. If there was a power loss to the server or the application was unexpectedly terminated, without up-to-date log files the user would be unable to determine how far the computation had progressed and subsequently would have to start the computation again.
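One way to avoid losing buffered log entries is to flush after every entry. The sketch below illustrates this with the standard java.io.PrintWriter autoflush option; the class name FlushingLog and its methods are hypothetical and are not the LogFile class of this system. Flushing hands each entry to the operating system immediately; guarding against a power loss would additionally require forcing the data to disk, which this sketch does not do.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical sketch of a log that flushes every entry.
class FlushingLog implements AutoCloseable {
    private final PrintWriter out;

    FlushingLog(String path) throws IOException {
        // FileWriter in append mode keeps earlier entries; the second PrintWriter
        // argument enables autoflush, so each println is flushed straight away.
        this.out = new PrintWriter(new FileWriter(path, true), true);
    }

    // Writes one timestamped entry; it is flushed before this method returns.
    void log(String message) {
        out.println(System.currentTimeMillis() + " " + message);
    }

    @Override
    public void close() {
        out.close();
    }
}
```

With this approach an entry is visible in the file as soon as log() returns, even if the application is terminated before close() is called.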

b) Unnecessary Code
While we were studying the code for the JDCL we came across a lot of unnecessary code. This included redundant messages being printed to the screen and large portions of the code commented out without explanation.

c) Processing Error
Since the user of the system has to program the task it is conceivable that an error could occur during its execution. This could be caused by either a programming error by the user or an unexpected value in the data. If this were to happen then the client application would crash and the server would redistribute the unit, eventually killing all the clients. There is nothing in either the client or the server to handle this.
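On the client side, one possible safeguard is to run the user-supplied task inside a guard that turns a failure into an error report instead of a crash. The sketch below is an illustration only: SafeRunner, Outcome, and the use of java.util.function.Function to stand in for a task are assumptions, not part of the JDCL or of the final system. The server would still need its own rule for what to do with a unit that repeatedly fails.

```java
import java.util.function.Function;

// Hypothetical sketch; a Function stands in for the user-programmed task.
class SafeRunner {
    // Outcome of running one unit: either a result or an error description.
    static final class Outcome {
        final Object result;
        final String error;

        Outcome(Object result, String error) {
            this.result = result;
            this.error = error;
        }

        boolean failed() {
            return error != null;
        }
    }

    // Runs the task on one unit and converts a runtime failure into an
    // error report that could be sent back to the server.
    static Outcome runUnit(Function<Object, Object> task, Object unit) {
        try {
            return new Outcome(task.apply(unit), null);
        } catch (RuntimeException e) {
            return new Outcome(null, e.getClass().getName() + ": " + e.getMessage());
        }
    }
}
```

A task that divides by zero, for example, yields an Outcome whose failed() method returns true, and the client remains free to request further work.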

4.3 Final Design of Project
This section details the design of our system, which is a combination of project specification and the limitations of the JDCL (listed in section 4.2).

4.3.1 Server Design
In this section, we explain the design of the server by describing the responsibilities, attributes, and functionality of the main modules in its design. For consistency, and to aid the reader in understanding the final implementation-level code, we use actual variable and method identifiers in our explanations. The relationship between modules in the server design is shown in Fig. 4.1.

Fig 4.1: Final server design. [Figure: class diagram relating the server modules Task, Message, MessageHandler, ConnectionManager, TaskList, Scheduler, LogFile, ServerMessageHandler, ServerEngine, SchedulerThread, Bucket, and DataHandler. The key distinguishes aggregation ("has a" relationship) from generalisation (extends, implements).]

ServerEngine
The ServerEngine manages various server-side data structures and classes. It is responsible for:

• Reading in runtime parameters from a user-defined initialisation file.
• Creating the LogFiles.
• Loading the user-defined classes (i.e. the DataHandler and Task).
• Running the ConnectionManager and passing it the necessary information so that it can listen for connections.

The ServerEngine has the following attributes and methods.

Attributes:
TaskList pendingTasks, expiredTasks
DataHandler dataHandler
LogFile errorLog, systemLog
SchedulerThread timeoutThread

Constructor: The constructor reads the parameters from a user-defined initialisation file, creates the log files and reads in the user-defined classes.

Methods/Functions:
− TimeoutAction: Checks the PendingTasks list for expired units and moves them if necessary.
− GetNextParameterSet: When a client requests a unit this method first checks the ExpiredTasks list. If there is a unit on the list it is returned. Otherwise it gets one from the DataHandler.
− GetTaskByteCodes: Returns the user-defined algorithm.
− GetExtension: Checks if the unit is on the PendingTasks list. If so then calls ResetTimeOutStatus().
− HandleResults: Checks if the results are in the PendingTasks list. If so then it sends them to the DataHandler.
− RunServer: Starts the SchedulerThread and ConnectionManager. Periodically calls TimeoutAction() and resets the SchedulerThread.

ConnectionManager
The ConnectionManager is responsible for listening on the server socket for new connections and creating a new ServerMessageHandler to handle each connection. It is also responsible for closing the server socket when a computation is complete. The ConnectionManager also keeps a list of all the active ServerMessageHandlers.

Attributes:
ServerSocket listenSocket
Vector activeHandlerList

Constructor: Initialises the attributes of the ConnectionManager.

Methods/Functions:
− RemoveHandler: Removes a ServerMessageHandler from the activeHandlerList.
− AddHandler: Adds a ServerMessageHandler to the activeHandlerList.
− Size: Returns the number of ServerMessageHandlers on the list.
− Close: This method is only called if the server is shutting down. It sends a terminate communications message to all the clients on the list and then closes the socket.
− Run: Listens to the server socket and creates a ServerMessageHandler to handle each new connection.

ServerMessageHandler
The ServerMessageHandler class is responsible for communication with the client. It does this by waiting for the client to request a service, and then sending the appropriate response. All communication is achieved by the transmission of Message objects.

Attributes:
ConnectionManager connManager

Constructor: Sets up the socket.

Methods/Functions:
− SendTask: Sends the algorithm to the client.
− SendParameterSet: Gets a unit from the DataHandler and sends it to the client.
− SendExtensionGrant: Sends an "extension granted" message to the client.
− SendError: Sends an error message to the client.
− SetConnectionManager: Sets the connection manager.
− Run: Checks the message type received from the client and takes the appropriate action.
− Shutdown: Closes the socket.
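The request and response pattern handled by Run can be pictured as a dispatch on the message type. The sketch below is purely illustrative: the report does not list the fields or type codes of the Message class, so the enum MessageType and its constants are hypothetical, and the method simply returns the name of the handler described above rather than sending anything.

```java
// Hypothetical message types; the actual Message class may encode these differently.
class DispatchDemo {
    enum MessageType { REQUEST_TASK, REQUEST_UNIT, REQUEST_EXTENSION, RESULT }

    // Maps a client request to the name of the server method that would answer it.
    static String dispatch(MessageType type) {
        if (type == null) return "SendError";       // unreadable or unknown request
        switch (type) {
            case REQUEST_TASK:      return "SendTask";
            case REQUEST_UNIT:      return "SendParameterSet";
            case REQUEST_EXTENSION: return "SendExtensionGrant";
            case RESULT:            return "HandleResults";
            default:                return "SendError";
        }
    }
}
```

Since every exchange starts with a client request, a single dispatch of this kind per received message is enough to drive the server side of a session.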

TaskList
The TaskList stores each unit and the time allocated for its completion. The ServerEngine has two TaskLists:

1. PendingTasks keeps track of all units that (a) have been sent to clients but for which no results have been received as yet, and (b) that have not exceeded their allocated time.

2. ExpiredTasks keeps track of all units that have exceeded their allocated time.

When a client requests a unit it will be given one from the ExpiredTasks list if there are any in it. If the ExpiredTasks list is empty then a new unit is created and added to the PendingTasks list.

Attributes:
Hashtable tasklist

Constructor: Initialises the hash table.

Methods/Functions:
− DoesTaskExist: Checks if a unit is on a list.
− GetTimeoutForTask: Returns the timeout for a particular task.
− SetTimeoutForTask: Sets the timeout for a task.
− GetParameters: Returns the data from a unit.
− There are also methods for adding and removing units from the list.

DataHandler
The DataHandler class manages all the data that is sent to and received from the clients. It uses the Bucket class to simulate a pipeline processor machine. The class has a Bucket array, each element of which represents one processor of a pipeline processor machine. The user specifies in what order the buckets are to be polled. The class is abstract and must be extended by the user. This is to ensure that the user's DataHandler conforms to a standard interface so that it can interact with the ServerEngine correctly.

Attributes:
LogFile errorLog, systemLog
Bucket[] stage

Constructor: Initialises the log files.

Methods/Functions:
− Init: This is where the bucket array is created (if it’s needed) and user variables are initialised.
− GetNextParameterSet: This method should return a data set to the ServerEngine.
− HandleResults: This is where the user specifies what is to be done with the results.

Bucket
The Bucket class is used to simulate the intermediate storage that is required at each stage of the pipeline. This class stores the units which have not been sent to clients or which have to be resent for further processing. There are two distinct forms of data in the Bucket class. The data that arrives to be added to the bucket is referred to as a job. Each entry in the bucket can comprise several of these ‘jobs’ (the exact number of jobs per bucket entry is specified in the constructor). The information is stored in each bucket by using the Java Vector class. A Vector can store any (Java) data type.

Attributes:
String name
Vector jobs

Constructor: Sets the number of objects that are to be in each unit and gives the bucket a name.

Methods/Functions:
− AddJob: Adds a unit to the vector.
− RemoveJob: Removes a unit from the vector.
− IsBucketEmpty: Tests if there are any units left.
− IsPartialUnitAvailable: This is called when there are no complete units in the Bucket. It checks if there is a partial (or incomplete) unit in the bucket.
− AddData: Puts the data into a temporary array. If the array is full (i.e. there are maxJobs data objects in the temporary array) then it calls the addJob method to add the full unit to the Vector.
− GetUnit: Returns a unit from the Vector.
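The grouping of incoming data objects into units can be sketched as follows. This is a minimal illustration of the behaviour described above, not the actual Bucket class: the name BucketSketch, the use of a list for the temporary storage, and the decision that getUnit falls back to a partial unit are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

// Hypothetical sketch of the Bucket behaviour; details are assumptions.
class BucketSketch {
    private final String name;
    private final int maxJobs;                                 // data objects per unit
    private final Vector<List<Object>> jobs = new Vector<>();  // complete units
    private List<Object> partial = new ArrayList<>();          // unit being filled

    BucketSketch(String name, int maxJobs) {
        this.name = name;
        this.maxJobs = maxJobs;
    }

    // Collects data objects; once maxJobs have arrived they are stored as one unit.
    synchronized void addData(Object data) {
        partial.add(data);
        if (partial.size() == maxJobs) {
            jobs.add(partial);
            partial = new ArrayList<>();
        }
    }

    // True when no complete unit is waiting.
    synchronized boolean isBucketEmpty() {
        return jobs.isEmpty();
    }

    // True when some data has arrived that does not yet fill a unit.
    synchronized boolean isPartialUnitAvailable() {
        return !partial.isEmpty();
    }

    // Returns the next complete unit, else the partial unit, else null.
    synchronized List<Object> getUnit() {
        if (!jobs.isEmpty()) return jobs.remove(0);
        if (partial.isEmpty()) return null;
        List<Object> unit = partial;
        partial = new ArrayList<>();
        return unit;
    }

    String getName() {
        return name;
    }
}
```

With maxJobs set to 2, adding three data objects produces one complete unit of two objects and leaves a partial unit holding the third, which is the situation that IsPartialUnitAvailable is meant to detect.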

Server Initialisation File
The user modifies the server initialisation file every time they want to run a computation. It will have three main sections.

The first section contains server specific information:
• Port number – This is the port that the server will listen on for new client connections.
• Server Timeout – This is how often the SchedulerThread flags the ServerEngine to check the PendingTasks list for expired units.

The second section contains details of the user-defined classes:
• The name of the two user-defined classes.
• The processing time allowed for each unit.

The final section contains the logging information for all three log files:
• Whether to display the information to screen only, file only, or both.
• The level of detail for the files.

4.3.2 Client Design
An explanation of the purpose and functionality of each module in the client is given in this section. The design of the client is shown in Fig. 4.2.

Fig. 4.2: Overall client design. [Figure: class diagram relating the client modules ClientSecurityManager, Scheduler, ByteArrayClassLoader, Task, Message, SchedulerThread, ClientMessageHandler, ClientFrame, ClientEngine, LogFile, and MessageHandler. The key distinguishes aggregation ("has a" relationship) from generalisation (extends, implements).]

An explanation of the purpose and functionality of each module follows:

ClientEngine
The ClientEngine class is responsible for initialising and starting the client software. This part of the software runs at all times.

The main tasks involved here will be:
• Taking in the external parameters (e.g. server details)
• Setting up the GUI
• Starting the security manager
• Creating and initialising the LogFile
• Beginning the communication between the client and the server

Attributes:
LogFile errorLog, systemLog
String server_ip
int server_port
ClientFrame CF
ClientMessageHandler CMH

Constructor: This sets up all of the above variables - they will already have been read into the system by the main() method of this class. The server details are read from the initialisation file and passed to the ClientEngine.

Methods/Functions:
− Shutdown: This method closes the logs and shuts down the overall system gracefully.
− RunClient: This method is called after a ClientEngine object is created (via the constructor). It begins the work of the client system by starting the ClientMessageHandler.
− Main: The client software begins here. The main purpose here is to read in the server details and perform some initialisations on the client.

ClientSecurityManager
The ClientSecurityManager is a piece of software that will monitor all interactions that the client has with the system (the donor's system, i.e. the system that the client is running on) and the network. Essentially, the purpose of this is to ensure that any subversive Task that is downloaded cannot corrupt/attack the donor's machine in any way.
The main items to be monitored here will be:

• Any network connections made to the outside world by the client software
• Any reading or writing to files
• Any deletion of files on the client system

The security module of the system should be initialised at the very beginning of the program and run for the entire lifetime of the client. The security policy that is being implemented in the client is that if there is a security violation then the client software will completely reset and restart.

Attributes:
String server_ip
int server_port

Constructor: This receives details of the server IP address and the port number. It also receives references to the systemLog and the errorLog so that it can write to these in the event of a security breach.

Methods/Functions:
− CheckOp: There are various CheckOp methods that are implemented in the security manager. The purpose of these is to check the validity of certain operations that the client is trying to perform. They range from checking file reading and writing, checking any network connections created, and various other operations that might pose a security risk.
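The kind of decision that the CheckOp methods make can be illustrated with a plain policy class. This is a sketch under assumptions: the name ClientPolicy and the specific rules (network access only to the configured server, no file access at all for a downloaded Task) are hypothetical. In Java such checks have traditionally been installed by subclassing java.lang.SecurityManager and overriding methods such as checkConnect, checkRead, checkWrite, and checkDelete; that class is deprecated in recent Java releases, so the sketch shows only the policy logic and is not wired into the JVM.

```java
// Hypothetical sketch of the policy only; it is not installed as a security manager.
class ClientPolicy {
    private final String serverIp;
    private final int serverPort;

    ClientPolicy(String serverIp, int serverPort) {
        this.serverIp = serverIp;
        this.serverPort = serverPort;
    }

    // Network access is allowed only to the configured server (assumed rule).
    void checkConnect(String host, int port) {
        if (!serverIp.equals(host) || port != serverPort) {
            throw new SecurityException("connection to " + host + ":" + port + " denied");
        }
    }

    // A downloaded Task may not read, write, or delete any file (assumed rule).
    void checkFileAccess(String path) {
        throw new SecurityException("file access to " + path + " denied");
    }
}
```

A SecurityException raised by one of these checks is the point at which the client would log the violation and then reset and restart, as described above.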

ClientFrame
This will be the GUI (graphical user interface) part of the client software. The purpose of this is to display some basic information on the client progress, e.g. number of units processed, client system state, and so on. The GUI is for information purposes only and requires no interaction from the user. It will display some basic statistics on the client’s progress and have a read-out of some of the system events. The GUI will be an optional part of the client – it can be enabled/disabled in the initialisation file.

Attributes:
int currentUnit
int totalUnitsProceessed

Constructor: This will set up the GUI window and its components.

Methods/Functions:
− SetCurrent: The purpose of this method is to change the ID (identification number) of the unit that is currently executing on the client machine (all work/data units will have a unique number).
− Increment: This increments the total number of units processed by the client.
− AppendT: This method appends new text (i.e. system event) to the GUI window.

TaskLoader

The distinguishing feature between this system and other pre-existing distributed systems is that arbitrary tasks/algorithms can be securely loaded into the client system dynamically as the client executes. This module performs the loading of the task into the client system. A cache of already loaded classes is also maintained so that each distinct task is loaded only once.

Attributes:
HashTable ht

Constructor: This initialises the hashtable to be empty and then attempts to load the given task into the system by calling some of the other methods.

Methods/Functions:
− InsertNameOfTask: This method puts an entry into the hashtable in order to update the cache of already loaded tasks.
− LoadTask: This method will load the class into the client software.
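The caching behaviour can be sketched as follows. This is an illustration under stated assumptions rather than the system's code: the class name CachingTaskLoader and the loadCount counter are hypothetical, and an ordinary ClassLoader stands in for the step that receives the task definition from the server.

```java
import java.util.Hashtable;

// Sketch of a task cache keyed by task name (names are illustrative).
class CachingTaskLoader {
    private final Hashtable<String, Class<?>> ht = new Hashtable<>();
    private final ClassLoader source;   // stand-in for loading a class sent by the server
    private int loadCount = 0;          // counts real loads, to show the cache working

    CachingTaskLoader(ClassLoader source) {
        this.source = source;
    }

    // Returns the class for the named task, loading it only on the first request.
    synchronized Class<?> loadTask(String taskName) {
        Class<?> cached = ht.get(taskName);
        if (cached != null) {
            return cached;
        }
        try {
            Class<?> loaded = source.loadClass(taskName);
            loadCount++;
            insertNameOfTask(taskName, loaded);
            return loaded;
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unknown task: " + taskName, e);
        }
    }

    // Updates the cache of already loaded tasks.
    private void insertNameOfTask(String taskName, Class<?> taskClass) {
        ht.put(taskName, taskClass);
    }

    int loadCount() {
        return loadCount;
    }
}
```

Requesting the same task name twice returns the same Class object and performs only one load, which is the property the hashtable is there to provide.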

ClientMessageHandler

The communications protocol is implemented and managed in the ClientMessageHandler module. All messages that are received from the server are passed to this module, where the appropriate response/action is taken.

This module is also responsible for starting the task and receiving the data from the MessageHandler. The scheduling/timekeeping of the task will be implemented here: a timing thread will be started when the task begins executing and will set a flag when the current task has run out of time.

Attributes:
TaskLoader loader
SchedulerThread st
ClientFrame cf
int currentExecutingID
int TimeoutStatus

Constructor: The main function here is to give values to variables in this class. The TaskLoader is initialised by calling its constructor. Variables such as SchedulerThread and currentExecutingID are not initialised until needed.


Methods/Functions:
− Get/Set/Reset TimeoutStatus: The main function of these is to get/set a value for the variable TimeoutStatus. This defines what state the client is currently in.
− TimeoutAction: This method performs some action (defined in the communications protocol) when a timeout for a task occurs.
− Isterminating/isError: These two methods check a message to see if it is a terminating message or error message (received from the server).
− GetTask: Contacts the server and requests a task definition.
− LoadTask: This method is passed the task definition. It gets the TaskLoader to load the task into the system.
− GetUnit: Contacts the server and requests a data unit.
− Run: This method does the main work of this class: it manages and implements the communications protocol and manages the running task.
− Shutdown: This method is called when the ClientMessageHandler is shutting down, i.e. when the client has finished all processing.

Initialisation File

There are certain external values that must be passed into the client. These cover areas such as the server IP address and the server port. As the GUI component of the client is optional, the designers feel that the option to hide the GUI should be included here also. Since these cannot be coded into the client (as they may vary from computation to computation), it was decided that these should be stored in an external file. The client will read this file and extract the appropriate information.
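Reading such a file can be sketched with java.util.Properties. The file format and the key names server_ip, server_port and show_gui are assumptions made for this example; the report does not specify them.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Sketch of the three client options; the key names are hypothetical.
class ClientConfig {
    final String serverIp;
    final int serverPort;
    final boolean showGui;

    private ClientConfig(String serverIp, int serverPort, boolean showGui) {
        this.serverIp = serverIp;
        this.serverPort = serverPort;
        this.showGui = showGui;
    }

    // Parses "key=value" lines, as they might appear in the initialisation file.
    static ClientConfig parse(String fileContents) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContents));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String ip = props.getProperty("server_ip");
        String port = props.getProperty("server_port");
        if (ip == null || port == null) {
            throw new IllegalArgumentException("server_ip and server_port are required");
        }
        // The GUI is optional, so it defaults to hidden when the key is absent.
        boolean gui = Boolean.parseBoolean(props.getProperty("show_gui", "false").trim());
        return new ClientConfig(ip.trim(), Integer.parseInt(port.trim()), gui);
    }
}
```

Keeping these values outside the JAR means the same client build can be pointed at a different server for each computation.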

4.3.3 Common System Modules

As with standard client-server systems, there is some overlap between the client and server. In our system design, this occurred in the areas listed in this section.

Log Files

Each system (client and server) will keep two distinct logs: system logs and error logs. The system logs will record system events as they happen. In the event of some catastrophic event (e.g. power loss), it may be possible that these logs could be used to restart the client at a similar position to where it left off.

Attributes:
File outputFile;
boolean logToScreen, logToFile, attachTime, attachDate;

Constructor: The main task here is to set up the logging facilities. Optional details are whether to print everything to the command line, whether to attach time/date information, and whether to log everything to the file.

Methods/Functions:
− Write: This method writes the passed-in text to the appropriate places: screen, file and/or GUI (client only). The text will be formatted with the date/time and the specified level of detail.
− Close: The main purpose of this method is to close the files that the logs were being written to.
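A minimal sketch of the Write and Close behaviour is given below. The class name LogSketch, the use of a Writer in place of the output file, and the date and time patterns are all assumptions for illustration; screen and GUI output are omitted.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.text.SimpleDateFormat;
import java.util.Date;

// Illustrative logger: writes one line per event, with optional date/time prefix.
class LogSketch {
    private final Writer out;            // stands in for the log file
    private final boolean attachDate;
    private final boolean attachTime;

    LogSketch(Writer out, boolean attachDate, boolean attachTime) {
        this.out = out;
        this.attachDate = attachDate;
        this.attachTime = attachTime;
    }

    // Formats the text with the requested prefixes and appends it to the log.
    void write(String text) {
        StringBuilder line = new StringBuilder();
        Date now = new Date();
        if (attachDate) {
            line.append(new SimpleDateFormat("yyyy-MM-dd").format(now)).append(' ');
        }
        if (attachTime) {
            line.append(new SimpleDateFormat("HH:mm:ss").format(now)).append(' ');
        }
        line.append(text).append('\n');
        try {
            out.write(line.toString());
            out.flush();   // flush each event so the log survives an abrupt stop
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Closes the underlying log destination.
    void close() {
        try {
            out.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Flushing after every event is a design choice made here because the logs are meant to help restart a client after a failure such as power loss.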


MessageHandler

The main purpose of this module is to send and receive messages between client and server. It will perform the opening and closing of network sockets.

Attributes:
String destAddress
int destPort

Constructor: The attributes are initialised.

Methods/Functions:
− OpenSocket: A socket is opened to perform a communication session.
− CloseSocket: This method closes the socket after the communications session between the client and server is over.
− SendMessage: This method sends the message.
− RecvMessage: This method waits for a message.
− TerminateCommunication: This method sends a terminating message, which is used to signify that the communications session is over.
− There are also two abstract methods, Run and Shutdown. The Run method details the types of messages that can be received and what action to take for each. The Shutdown method closes the socket.

Message

This is a common module on the server and client. It is the basic unit of communication between them. It should be extendable so that a message can contain items such as data, algorithms, information on client status, and so on. The types of message that can be sent by either side are varied: requests, terminations of communication, error messages, etc. This class is serialised so that Messages can be sent over the network and reconstructed at the other side. It is important that this class be identical on both the server and client (any changes will result in all communication breaking down).

Attributes:
String MessageType
String ContentType
Object[] content

Constructor: The constructor initialises the attributes listed above; however, since there are a number of message types, not all of the attributes have to contain information.

Methods/Functions:
− There are methods to get and set each of the above attributes in a message object.
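A serialisable message of this shape, and a round trip through Java object serialisation, might look as follows. The field names are lower-cased versions of the attributes above, and the helper class MessageCodec is hypothetical; in the real system the bytes would travel over a socket rather than through a byte array.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch of the basic unit of communication between client and server.
class Message implements Serializable {
    // A fixed version number: both sides must agree on the class layout.
    private static final long serialVersionUID = 1L;

    private final String messageType;
    private final String contentType;
    private final Object[] content;     // may be null for messages that carry no data

    Message(String messageType, String contentType, Object[] content) {
        this.messageType = messageType;
        this.contentType = contentType;
        this.content = content;
    }

    String getMessageType() { return messageType; }
    String getContentType() { return contentType; }
    Object[] getContent() { return content; }
}

// Hypothetical helper showing how a Message is flattened and reconstructed.
class MessageCodec {
    static byte[] toBytes(Message message) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(message);
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    static Message fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Message) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Deserialisation fails if the class on the receiving side is incompatible with the one that wrote the bytes, which is why the Message class has to be kept identical on client and server.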

Task/Algorithm

This is the interface between the client system and the downloaded task that is executed. Every task should conform to a specific interface to be able to interact with the data units that are downloaded from server to client.

Attributes:
Object[] parameterList, returnList
Scheduler owner


Constructor: Sets the owner of the task.

Methods/Functions:
− SetParameters: This method sets the parameterList variable with an object. This would be the data that the task is to operate on.
− GetResults: The client system would call this method to obtain the post-processed results to be sent back to the server.
− EndProcessing: This method is called at the end of any task Run method to indicate to the client system that the task has finished executing and that GetResults can be called to return the data to the server.
− Run: This is where the actual work of a downloaded task, created by the user, would be defined. This method should always end with a call to EndProcessing.
− ExceptionInProcessing: This is called if an exception occurs during processing. It takes in the error description, as a string, and returns it to the server.
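The contract between the client and a downloaded task can be sketched as an abstract class plus one example task. Everything here is illustrative: TaskOwner is a reduced stand-in for the Scheduler owner, RecordingOwner exists only to make the example self-contained, and SumTask is an invented task that adds up the integers in its data unit.

```java
// Reduced stand-in for the Scheduler owner: only the callbacks this sketch needs.
interface TaskOwner {
    void taskComplete();
    void taskFailed(String description);
}

// Base class that every downloaded task would extend in this sketch.
abstract class Task implements Runnable {
    protected Object[] parameterList;
    protected Object[] returnList;
    private final TaskOwner owner;

    Task(TaskOwner owner) {
        this.owner = owner;
    }

    void setParameters(Object[] parameters) { this.parameterList = parameters; }

    Object[] getResults() { return returnList; }

    // Tells the client that results are ready to be collected.
    protected void endProcessing() { owner.taskComplete(); }

    // Passes an error description back so it can be reported to the server.
    protected void exceptionInProcessing(String description) { owner.taskFailed(description); }
}

// Example user-written task: sums the Integer values in the data unit.
class SumTask extends Task {
    SumTask(TaskOwner owner) {
        super(owner);
    }

    @Override
    public void run() {
        try {
            long sum = 0;
            for (Object value : parameterList) {
                sum += (Integer) value;
            }
            returnList = new Object[] { sum };
            endProcessing();                       // normal runs finish with this call
        } catch (RuntimeException e) {
            exceptionInProcessing(e.toString());   // e.g. a non-Integer element in the unit
        }
    }
}

// Minimal owner used to observe what the task reported.
class RecordingOwner implements TaskOwner {
    boolean complete;
    String failure;

    public void taskComplete() { complete = true; }
    public void taskFailed(String description) { failure = description; }
}
```

With a RecordingOwner, running a SumTask on the unit {1, 2, 3} marks the owner complete and leaves the sum in the results, while a unit containing a non-integer is reported through the failure callback instead.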

Scheduler

The Scheduler is an interface class implemented on both the client and server. It specifies the action to be taken when a timeout occurs. The following methods are implemented in the classes that implement this interface.

Attributes:
int S_NULLEVENT = 0
int S_TIMEOUT = 1
int S_TASKCOMPLETE = 2
int S_EXCEPTION = 3

Methods/Functions:
− SetTimeoutStatus: Sets the timeout for a unit.
− GetTimeoutStatus: Checks whether a timeout has occurred.
− ResetTimeoutStatus: Resets the timeout if an extension has been granted.
− SetException: This method is only implemented on the client. It is called if an exception occurs during the processing of a unit.
− TimeoutAction: Specifies what action is to be taken if a timeout occurs.

SchedulerThread

The SchedulerThread is used to provide scheduling services to classes that need to periodically perform some task. Initially, its owner passes the length of the timeout to the SchedulerThread. By setting a flag, the SchedulerThread then indicates that a timeout has occurred, thus alerting its owner.

Attributes:
long secs_to_sleep
Scheduler owner

Constructor: Sets the attributes.

Methods/Functions:


− SetTimeout: This method sets a value for the length of time this thread should “sleep” for. This is the time allowed for the client to process a data set.

− Run: This method has only one statement in it, a wait, that waits the specified amount of time (contained in sleep). It then notifies its owner when the time has elapsed.
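The relationship between the Scheduler interface and the SchedulerThread can be sketched as below. Several details are assumptions made for the example: the timeout is given in milliseconds rather than seconds so that the sketch runs quickly, Thread.sleep is used for the wait, interrupting the thread is used to cancel the timer, and DemoOwner is a hypothetical owner.

```java
// Status codes and callbacks, following the attribute and method lists above.
interface Scheduler {
    int S_NULLEVENT = 0;
    int S_TIMEOUT = 1;
    int S_TASKCOMPLETE = 2;
    int S_EXCEPTION = 3;

    void setTimeoutStatus(int status);
    int getTimeoutStatus();
    void resetTimeoutStatus();
    void timeoutAction();
}

// Timer thread: sleeps for the allowed time, then flags a timeout on its owner.
class SchedulerThread extends Thread {
    private final long millisToSleep;
    private final Scheduler owner;

    SchedulerThread(long millisToSleep, Scheduler owner) {
        this.millisToSleep = millisToSleep;
        this.owner = owner;
        setDaemon(true);   // a pending timer should not keep the client process alive
    }

    @Override
    public void run() {
        try {
            Thread.sleep(millisToSleep);
            owner.setTimeoutStatus(Scheduler.S_TIMEOUT);
            owner.timeoutAction();
        } catch (InterruptedException e) {
            // Cancelled before the time elapsed, so no timeout is reported.
        }
    }
}

// Hypothetical owner that simply records the status it is given.
class DemoOwner implements Scheduler {
    private volatile int status = S_NULLEVENT;

    public void setTimeoutStatus(int newStatus) { status = newStatus; }
    public int getTimeoutStatus() { return status; }
    public void resetTimeoutStatus() { status = S_NULLEVENT; }
    public void timeoutAction() { /* e.g. request an extension from the server */ }

    // Starts a timer, optionally cancels it, and returns the resulting status.
    static int runTimer(long millis, boolean cancel) {
        DemoOwner owner = new DemoOwner();
        SchedulerThread timer = new SchedulerThread(millis, owner);
        timer.start();
        if (cancel) {
            timer.interrupt();
        }
        try {
            timer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return owner.getTimeoutStatus();
    }
}
```

A timer that is allowed to run to completion leaves the owner in the S_TIMEOUT state, while a cancelled timer leaves it at S_NULLEVENT.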

4.4 Communications Protocol

4.4.1 Message Types

The messages that are sent between client and server can be divided into three main types:

• Request Messages
• Response Messages
• Special Messages

Request Messages

Request Task Definition:
This is the message that is initially sent out by the client when it is started up. Essentially, it asks the server to send over the current task definition.

Request Parameters:
This message is sent by the client to ask the server for a unit of data on which to execute the task.

Request TermCommunications:
This is a message sent by either client or server to indicate that the current communications session is over.

Request ExtensionGrant:
This message is sent by the client when a task times out before it has finished processing the data. The client is asking the server for an extension to the task timeout so that it can complete the work unit. The server checks periodically for expired tasks (the checking interval is user definable in the initialisation file).

Response Messages

Response Results:
This message is sent by the client and includes the results of processing a data unit. It will also contain information that will help the server identify the results that are being sent back (a unique identifier created by the server when the original data was sent out).

Response ExtensionGrant:
The server sends this message to the client in response to the client’s request for an extension to the currently executing task.

Error Messages

Error:
This message indicates to the server that some error has occurred in the task executing on the client. An exception in the client results in the client stopping the current task and notifying the server via this message.


4.4.2 Communications Sessions

The basis of all communications in this protocol is that the client instigates all communications. The communications protocol is specified in terms of “communication sessions.” A session consists of the client opening a socket to the server, the communication taking place, the client sending a terminating message and the socket being closed (by the client). In exceptional circumstances, the server may also send a terminating message (this only happens if the server is shutting down and it has client connections still open).

Session 1: Task Request and Data Request
Time  Sender  Message
0     Client  Sends Request Task Definition
1     Server  Sends current Task Definition
2     Client  Sends Request Parameters
3     Server  Sends Data Unit or NULL
4     Client  Sends Request TermCommunications

Session 2: Send Results and Request Data
Time  Sender  Message
0     Client  Sends Response Results
1     Client  Sends Request Parameters
2     Server  Sends Data Unit or NULL
3     Client  Sends Request TermCommunications

Session 3: Request Data
Time  Sender  Message
0     Client  Sends Request Parameters
1     Server  Sends Data Unit or NULL
2     Client  Sends Request TermCommunications

Session 4: Extension Request
Time  Sender  Message
0     Client  Sends Request Extension
1     Server  Response ExtensionGrant or Special Error
2     Client  TermCommunications

Session 5: Error Report
Time  Sender  Message
0     Client  Sends Error Message
1     Client  Sends Request Parameters
2     Server  Sends Data Unit or NULL
3     Client  TermCommunications
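The shape of a session can be illustrated with a small in-memory simulation of Session 1. No sockets are involved: the class SessionSketch, the string constants, and the serverReply stub are assumptions made for the example, chosen only to mirror the message names in the tables above.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory walk-through of Session 1 (task request followed by data request).
class SessionSketch {
    static final String REQ_TASK = "Request Task Definition";
    static final String REQ_PARAMS = "Request Parameters";
    static final String REQ_TERM = "Request TermCommunications";

    // Stand-in server: returns a reply, or null when no reply is expected.
    static String serverReply(String request, boolean hasWork) {
        switch (request) {
            case REQ_TASK:
                return "Task Definition";
            case REQ_PARAMS:
                return hasWork ? "Data Unit" : "NULL";
            default:
                return null;   // the terminating message closes the session
        }
    }

    // The client drives every exchange and ends the session itself.
    static List<String> session1(boolean hasWork) {
        List<String> transcript = new ArrayList<>();
        for (String request : new String[] { REQ_TASK, REQ_PARAMS, REQ_TERM }) {
            transcript.add("Client: " + request);
            String reply = serverReply(request, hasWork);
            if (reply != null) {
                transcript.add("Server: " + reply);
            }
        }
        return transcript;
    }
}
```

The transcript produced by session1 has the same five steps as the Session 1 table, with the server only ever speaking in response to a client message.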


Chapter 5
Implementation

5.1 Development Language

The development language for the system is Java. The main reason for choosing Java over any other development language is that it is both platform and network independent. This is extremely important for any distributed system because it is expected that the clients will be running simultaneously on numerous different platforms. Essentially, an application in Java runs on any operating system that supports the Java Runtime Environment (JRE). This means we will not have to create distinct Windows, Macintosh, and Unix versions of our system.

The designers of the Java platform believed in the importance of networking and designed the Java platform to be network-centric [18]. From our point of view, Java makes it very easy to work with resources across a network and to create network-based applications. This is especially important in our distributed system, as it is based around using the network to gain extra computational power.

Another key benefit of Java is its security features [18]. Both the language and the platform were designed from the ground up with security in mind. The Java platform allows users to download untrusted code over a network (i.e. a task) and run it in a secure environment in which it cannot do any harm: it cannot infect the host system with a virus, cannot read or write files on the hard drive, and so forth.

The network-centric design of the Java platform also means that a Java application can dynamically extend itself by loading new classes over a network. An application that takes advantage of these features ceases to be a monolithic block of code. This means that we can have a client application running on a machine that can dynamically download a task (algorithm) to run. Using Java meant that we did not have to send uncompiled code and compile or interpret it at the client. This saves client resources, reduces the size of the clients, and significantly reduces the size of the project.

The fact that the JDCL was available for us to modify was an added reason to choose Java, but not an influential one. Existing distributed communication systems (such as PVM [5] and MPI [6]) exist as models for language-independent distributed computing environments.

5.1.1 How the System is Implemented with Java

Server

The server will be a Java application that is run from the command line by the administrator of the system. The compiled task to be run by the system must be included in the directory that the server is run from. There are several configurable runtime options, such as what to do with the LogFile details, which machine port the server is using, task timeout details, and various other options (full details of how to run the server are included in the user manual).

Client

The full client application is bundled into one executable JAR file (~20 kbytes). There is one initialisation file with three options that must be configured (namely the server IP address, the server port, and whether or not to display the GUI). The client application requires that each client machine has JRE 1.2 or greater installed. The client is meant to be an almost autonomous piece of software that runs with no intervention required from the user of the machine. The client is meant to run on a target system using spare clock cycles of the machine. Hence the user at the system should not see any degradation in performance while the client is running.

5.2 Deployment of Application

With any distributed application like this, there will be several different client deployment options. As we were testing the system, we examined several of these options.

5.2.1 Client Installation Options

When it comes to distributing and running the client software, there are a few different installation options available:

1. The donor of the client machine can manually run the client software when they wish by double-clicking on the JAR file. This leaves full control over when to run the client software with the donor of the machine. However, this could lead to issues such as users ‘forgetting’ to run the software, resulting in a lack of client machines for the distributed computation.

2. A better solution (which we used while analysing a DNA sequence) is to have the client run as an operating system (OS) level service on the user machine. For most modern operating systems (e.g. NT4, Windows 2000, Solaris), it is possible to have programs set up to run as services. This means that even if there is nobody logged on at the machine, the client software can be running in the background with no interaction with the screen. There are obvious issues, such as needing administrator rights to set this up, but in the long run this is far more effective for distributed computations. This sort of solution would work best in a networked environment such as an office block, i.e. where the client software can be deployed via a ghost image.

3. A third alternative would be to have the client software scheduled to run at off-peak times, e.g. at night in an organisation. As far as the authors are aware, there are facilities to set this up in Windows 2000, NT4 and UNIX. This solution would be more suitable to a situation where the donors of the machines are very sensitive to the idea of having the client running while they are working on the PC.

5.2.2 Data Migration (Problem Specific Issue)

There are two main options for data access in possible applications of this distributed system: storing the problem data locally with each client, and/or migrating the source data over the network to each client. These options are explained further as follows.

1. The data can be sent out via the network with the data units that are sent to the clients. This is most suitable for computations that do not have a lot of source data. A good example of this would be finding factors of numbers. The full source data could be contained in just two integers, i.e. each client searches a range of numbers using the two numbers as boundaries. From those two numbers, the client has all its ‘data’ to process.

2. It may be impractical to send all the data over the network, so the data can be distributed with the client software (e.g. in a file). If the data is very large and unchanging for the lifetime of the client, it could be bundled with the client software. An example of this is illustrated in our application of the system (chapter 7). In this situation, one must be careful that the security policy of the client is not violated (i.e. by having the data as a file on the client machine). This option would go against the ‘programmability’ property we desired for our system, and hence would only be employed where there were obvious efficiency benefits.

5.3 Our Deployment of the Client Software

From the beginning of the project, one of our aims was to implement a major application with the system (see chapter 7). This raised the question of how we would deploy our software in a practical environment.

5.3.1 Department of Computer Science

There are three computer laboratories in the Department of Computer Science, NUI Maynooth. We were given the opportunity (with the assistance of Computer Science technicians J. Cotter, M. Monaghan, and P. Marshall) to use two of these labs (approximately 200 PCs) for the purpose of running our software. One lab housed 100 Dell Optiplex GX1 machines (with Pentium 600MHz processor, 128Mbytes memory, and 6 Gbytes storage) and the other housed 80 Dell Optiplex GX110 machines (with Pentium 1000MHz processor, 256Mbytes memory, and 20 Gbytes storage). As all of the PCs run Windows NT4, we chose to install the client as a low priority background NT service. This meant that we could have our client software running 24 hours a day with no discernible impact on users at these machines.

5.3.2 Turning our Java Application into an NT Service

The problem now became how to turn our Java application from a program that was explicitly started by the donor into a program that would run as an operating system level service. We wanted our software to start up automatically on the PCs when they were booted. While we were investigating how to accomplish this, we came across a third party application that automates this process: JVMI2 [19]. As the designers of the JVMI2 application state, they were seeking to address the same problem that we had come across:

“As users of Java app’s ourselves, we still saw a need for a single tool for NT platforms that could run java from the command line but also offer the option to self-install as an NT service (or services.) Having some experience running Java-powered services on NT boxes, we decided to produce an easy to use service installer, specifically for Java applications, that required no programming or compiling. The result is JVMI2.” (Quote from [19].)

We chose to use this tool as it offered us an easy interface for setting up our software as a low priority background NT service.

5.3.3 Success of our Deployment

Once we had secured a permanent server to run our server software from, we set about installing our client software throughout the Department. After a few initial problems installing and configuring JVMI2, we had approximately 200 PCs ready to act as clients for our system. Our server resided on a Dell Optiplex GX110 machine (with Pentium 1000MHz processor, 256Mbytes memory, and 20 Gbytes storage).

As we would not be doing distributed computations all the time, the client software was set up to wait dormant on the PCs until the server had been given a problem to distribute. The policy implemented in the client software was that whenever the client could not connect to the server, it simply waited and retried the connection every hour or so. While it was dormant, the client did not take up any resources on the client machine.
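The dormant-client policy amounts to a retry loop around the connection attempt. The sketch below is an assumption-laden illustration: the class ReconnectSketch is hypothetical, the connection attempt and the pause are passed in as functions so that the loop can be exercised without a network or a real delay, and a bounded number of attempts is used where a real client would keep trying indefinitely.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;

// Sketch of "wait and retry until the server is available".
class ReconnectSketch {
    // Returns the attempt number that succeeded, or -1 if every attempt failed.
    static int connectWithRetry(BooleanSupplier tryConnect, LongConsumer pause,
                                long retryMillis, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryConnect.getAsBoolean()) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                pause.accept(retryMillis);   // in a real client, a Thread.sleep of about an hour
            }
        }
        return -1;
    }
}
```

Because the client only sleeps between attempts, it consumes essentially no processor time while the server has no problem to distribute.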


By the end of the project, we had the client software running constantly for about three months. We estimate that we consistently performed over three Pentium-years of processing each week. In that time, we did not receive one complaint about the software interfering with people’s work at the PCs. The client software ran as a low priority background process, and hence only executed when the CPU was not in use.


Chapter 6
Evaluation

6.1 The Ordered Delta-Max Problem

As part of the testing phase of our project (verification and validation) [20, 21, 22, 23], we attempted to solve a sufficiently difficult problem with the system. The problem had to be general enough so that the resulting application would test all of the capabilities of our system. We chose the following problem:

“Given a list of unsorted numbers, find the two numbers that are beside each other in an ordering of the list and that have the largest numerical difference.”

We used this example, on a list of 100,000 numbers, because it fully tests the ability of the server to send the results of one computation as the data for another. It also shows the option of having a number of different computation options in the client’s downloaded task. The problem was also used to test the error handling on the client because it was easy to generate an exception during the processing.

The algorithm for the computation is as follows (the code is included in appendix D):

• Create an array with three buckets for the three stages of the computation:
  1. Sequential sort (on individual units)
  2. Merge Sort (all of the individually sorted units)
  3. Compute the Maximum Difference (over the completely sorted list)

• Break up the data set and fill bucket 1 with unsorted units.

• When a client requests a unit, start at bucket 3 and work up the array of buckets until a unit is found. This is how the pipeline processor is simulated: data is constantly moving toward the end of the computation.

• If there are no full units, start at the start of the array (Sequential sort) and move up the array, taking the first partial unit available.

• When the results of the units are received, unless they are from the last bucket, add them to the next bucket in the list.

• Results from the last bucket are for the Maximum Difference stage, so simply store them in a temporary variable, comparing them with the other results as they arrive to find the Maximum Difference.
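The three stages can be sketched in a single process as follows. This is not the code from appendix D: the class DeltaMaxSketch and its method names are hypothetical, the bucket bookkeeping and the distribution of units to clients are left out, and the stages are simply run one after another to show what each one computes.

```java
import java.util.Arrays;

// Single-process sketch of the three stages of the ordered delta-max computation.
class DeltaMaxSketch {
    // Stage 1: sort one unit on its own.
    static int[] sortUnit(int[] unit) {
        int[] copy = unit.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Stage 2: merge two individually sorted units into one sorted unit.
    static int[] merge(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        }
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }

    // Stage 3: largest difference between neighbours in the sorted list.
    static long maxDifference(int[] sorted) {
        long best = 0;
        for (int i = 1; i < sorted.length; i++) {
            best = Math.max(best, (long) sorted[i] - sorted[i - 1]);
        }
        return best;
    }

    // Splits the data into units, then pushes every unit through the stages in order.
    static long solve(int[] data, int unitSize) {
        int[] merged = new int[0];
        for (int start = 0; start < data.length; start += unitSize) {
            int end = Math.min(data.length, start + unitSize);
            merged = merge(merged, sortUnit(Arrays.copyOfRange(data, start, end)));
        }
        return maxDifference(merged);
    }
}
```

For example, the list 10, 3, 7, 1, 20, 18 sorts to 1, 3, 7, 10, 18, 20, and the largest gap between neighbours is 8 (between 10 and 18). In the distributed version, each stage's output becomes the next stage's data unit, which is what exercises the server's ability to feed results back in as data.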

6.2 DNA Analysis

As a further test for our system, we analysed a small portion of DNA. We wanted to understand how the system would react to an increased number of clients. To do this, we used the algorithm detailed in section 7.4, but we limited the number of units that are distributed to the clients.

Each client has two wait times coded into it. The first is the number of seconds a client must wait, before retrying, if it cannot connect to the server. The second is the amount of time it must wait after receiving a null (blank) unit before asking for another. We ran two tests, each with a different null wait time.

While we were initially examining the JDCL system, we observed that Windows NT 4.0 Workstation only allows a new connection to a socket every four seconds, and a maximum queue length of 5 connection requests per network socket. (A server installation on Solaris suffers no such restrictions.) We decided to use this delay to simulate a large amount of data being transferred between client and server, or to simulate far more clients trying to connect to the server than our testbed of 90.

For the first test we set the connection retry time to 10 seconds and the null wait time to 5 minutes. We then configured the server to send out only 100 units, each taking approximately 6 minutes to compute. As can be seen from Figs. 6.1 and 6.2, we ran the test with different numbers of clients. The increase in time as the number of clients approached 90 was caused by the fact that the null wait time is less than the unit processing time. All the clients that had received null units were trying to get new units at the same time that some of the clients were trying to return results. The server could not cope with the number of requests and so the overall computation time suffered.

100 tasks, task length ~6mins, retry wait 10s, null wait 5mins

No. clients  Start1  Stop1  Mins1     Start2  Stop2  Mins2    Average
1            20:30   07:32  11:01:47                 0:00:00  11:01:47
5            22:01   00:15  2:13:14   15:05   17:18  2:13:18  2:13:16
10           19:27   20:34  1:07:37   16:23   17:31  1:07:35  1:07:36
20           18:46   19:22  0:35:56   13:15   13:50  0:35:52  0:35:54
30           16:17   16:44  0:27:48   22:54   23:22  0:27:52  0:27:50
40           17:55   18:19  0:23:10   12:47   13:11  0:24:02  0:23:36
50           17:54   18:15  0:21:35   15:50   16:11  0:21:07  0:21:21
60           21:20   21:41  0:21:00   13:19   13:40  0:21:31  0:21:16
70           18:18   18:39  0:20:11   21:55   22:17  0:21:58  0:21:04
80           18:58   19:20  0:22:29   22:23   22:45  0:21:50  0:22:10
90           15:39   16:09  0:29:43   20:49   21:14  0:25:02  0:26:38

Fig. 6.1: Data for graph in Fig. 6.2.
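The averages in Fig. 6.1 can be turned into speedup and efficiency figures relative to the single-client run. The helper below is a sketch written for this purpose; the class SpeedupSketch and its methods are hypothetical, and the only inputs are the h:mm:ss values from the Average column.

```java
// Sketch: speedup and efficiency from the h:mm:ss averages in Fig. 6.1.
class SpeedupSketch {
    // Converts an h:mm:ss duration to seconds.
    static int toSeconds(String hms) {
        String[] parts = hms.split(":");
        return Integer.parseInt(parts[0]) * 3600
             + Integer.parseInt(parts[1]) * 60
             + Integer.parseInt(parts[2]);
    }

    // Speedup = time with one client / time with n clients.
    static double speedup(String oneClient, String nClients) {
        return (double) toSeconds(oneClient) / toSeconds(nClients);
    }

    // Efficiency = speedup / number of clients.
    static double efficiency(String oneClient, String nClients, int n) {
        return speedup(oneClient, nClients) / n;
    }
}
```

For instance, speedup("11:01:47", "1:07:36") for 10 clients is roughly 9.8, while speedup("11:01:47", "0:21:04") for 70 clients is roughly 31.4. With only 100 units of about 6 minutes each, no run can finish faster than the time to process a single unit, so the flattening of the curve beyond about 50 clients is expected.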

[Plot: Time (minutes) against Number of processors. 100 tasks @ ~6 mins, retry wait 10s, null wait 5 mins.]

Fig. 6.2: First test using DNA analysis.


[Plot: Time (minutes) against Number of processors. 100 tasks @ ~6 mins, retry wait 10s, null wait 15 mins.]

Fig. 6.3: Second test using DNA analysis.

For the second test we increased the null wait time to 15 minutes. From Fig. 6.3 we can see that as the number of PCs approaches 90 the time did not increase. This shows that when configuring the system, suitable wait times should be chosen; otherwise, unnecessary delays could be caused.


Chapter 7
Application

7.1 What is DNA?

Packed tightly into every one of our body’s cells is a complete copy of the human "genome": all the genes that make up the master blueprint for building a man or woman. The thousands of genes contained in the nucleus of each cell are parcelled among the 46 sausage-shaped genetic structures known as chromosomes. Each chromosome is made up of a tightly folded string of DNA [24].

DNA is the material that governs inheritance of eye colour, hair colour, stature, bone density and many other human and animal traits, not to mention number of legs, shape of bones, structure of the respiratory system and structure of the brain. DNA is a long string of molecules called nucleotides (each composed of between 13 and 16 atoms). DNA is actually a double-stranded string containing (essentially) the same information in each strand.

There are only four different nucleotides: adenine, guanine, cytosine and thymine. These are usually referred to using the four letters A, G, C, and T. The sequence of nucleotides (letters) can code for many properties of the body’s cells. The cells can read this code. Some DNA sequences encode important information for the cell. Such DNA is called, not surprisingly, "coding DNA." Our cells also contain much DNA that doesn’t encode anything that we know about. If the DNA doesn’t encode anything, it is called non-coding DNA or sometimes "junk DNA". Both coding and non-coding DNA may vary from one individual to another. These DNA variations can be used to identify people, or at least to distinguish one person from another.

7.2 Why Analyse DNA?Gene mutations probably play a role in many of today' s most common diseases, such as heartdisease, diabetes, immune system disorders, and cancer [25, 26]. These diseases are believedto result from complex interactions between genes and environmental factors. When genes fordiseases have been identified, scientists can study how specific environmental factors, such asfood, drugs, or pollutants interact with those genes.

Once a gene is located on a chromosome and its DNA sequence worked out, scientists can then determine which protein the gene is responsible for making and find out what it does in the body. This is the first step in understanding the mechanism of a genetic disease and eventually conquering it. One day, it may be possible to treat genetic diseases by correcting errors in the gene itself, replacing its abnormal protein with a normal one, or by switching the faulty gene off. There are many moral implications at this stage of the research, but before this happens there may be many other benefits of passively examining the structure of DNA [27, 28]:

• Improved diagnosis of disease
• Earlier detection of genetic predispositions to disease
• Rapid detection and treatment of pathogens in medicine
• Assessment of health damage and risks caused by exposure to mutagenic chemicals and cancer-causing toxins
• Comparison of breakpoints in the evolution of mutations with ages of populations and historical events (for example, examining the descendants of survivors of catastrophic diseases).

7.3 What Do We Do with Our System?

As mentioned in section 7.1, a DNA sequence can be treated as a long sequence of characters. This sequence contains coding DNA and junk DNA. Through reading about the subject and discussing the area with Dr. James McInerney of the Department of Biology, NUI Maynooth, we learned that the coding DNA is often repeated a number of times in a single DNA sequence. Hence to find the frequency and positions of the coding DNA we used our system to perform pattern matching on a sequence of DNA.

To analyse the DNA we had to perform four separate searches of the DNA because slightly different strings of DNA can have the exact same functionality. These were:

1. Exact matches.
2. Insertions.
3. Deletions.
4. Mutations.

7.3.1 Exact Matches

This was the simplest case. In this situation, we generate sub-strings from the DNA and search for other exact matches of these sub-strings. The strings had to match character for character. Dr. McInerney had told us that repeated strings of length 14 characters, or greater, would be significant statistically.

7.3.2 Insertions

This is a little bit more complicated than exact matching. In this case, we were searching for variations on the sub-string. These variations were very specific. For a particular sub-string, we were searching the DNA for this sub-string with extra characters inserted in the sub-string (up to a maximum number of insertions).

For example, if we started with the sub-string AGCTAC, and assuming there is a limit of one insertion allowed, there are five positions within the sub-string at which a character can be inserted. Some of the possible matches are:

AGCTTAC, AGCCTAC, AGCTAAC

For our search we allowed for six character insertions into the substring.

7.3.3 Deletions

This search is similar to insertions but this time we were searching for matches of the sub-string with characters deleted from it (again up to a maximum number of deletions).

For example, if we started with the sub-string AGCTAC, and assuming there is a limit of one deletion allowed, there are four interior positions at which a character can be deleted. Some of the possible matches are:

ACTAC, AGCAC, AGTAC

Again we allowed for six character deletions in possible matches.
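The insertion and deletion searches described above can both be phrased as subsequence tests. The following is a minimal sketch of that idea, not the code used in our experiments: the class name IndelMatchSketch and its method names are hypothetical, and the limit on insertions or deletions is passed in as a parameter.

```java
/** Sketch only: insertion and deletion matching phrased as subsequence tests. */
class IndelMatchSketch {

    /** True if every character of shorter appears in longer, in the same order. */
    static boolean isSubsequence(String shorter, String longer) {
        int matched = 0;
        for (int i = 0; i < longer.length() && matched < shorter.length(); i++) {
            if (longer.charAt(i) == shorter.charAt(matched)) {
                matched++;
            }
        }
        return matched == shorter.length();
    }

    /** True if candidate is pattern with between 1 and maxInsertions extra characters. */
    static boolean isInsertionMatch(String pattern, String candidate, int maxInsertions) {
        int extra = candidate.length() - pattern.length();
        return extra >= 1 && extra <= maxInsertions && isSubsequence(pattern, candidate);
    }

    /** True if candidate is pattern with between 1 and maxDeletions characters removed. */
    static boolean isDeletionMatch(String pattern, String candidate, int maxDeletions) {
        int missing = pattern.length() - candidate.length();
        return missing >= 1 && missing <= maxDeletions && isSubsequence(candidate, pattern);
    }

    public static void main(String[] args) {
        System.out.println(isInsertionMatch("AGCTAC", "AGCTTAC", 1));
        System.out.println(isDeletionMatch("AGCTAC", "AGCAC", 1));
    }
}
```

With a limit of six, as in our searches, the same tests would be called with 6 as the last argument. Exact matches are deliberately excluded here because they are covered by the search of section 7.3.1.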

7.3.4 Mutations

This is the most computationally difficult case. Here we are searching for the sub-string with individual characters changed for one of the other possible characters. The string length stays the same but the number of characters altered could differ (up to a maximum number of mutations - defined by a percentage).

For example, if we started with the sub-string AGCTAC, and assuming there is a limit of one mutation allowed, there are six positions at which a character can be changed. Some of the possible matches are:

AGGTAC, CGCTAC, AGCAAC

For this search, we let the number of characters that could differ be equal to 10% of the string length. The results for searching with mutations were not available at the time of writing this report.
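The mutation test amounts to counting the positions at which two strings of equal length differ. The sketch below illustrates this under stated assumptions: the class and method names are hypothetical, and rounding the 10% limit down to a whole number of characters is our assumption for illustration, since the rounding rule is not specified above.

```java
/** Sketch only: mutation matching as a bounded count of differing positions. */
class MutationMatchSketch {

    /** Largest number of mutations allowed for a string of the given length (rounded down). */
    static int maxMutations(int length, int percent) {
        return (length * percent) / 100;
    }

    /** True if candidate has the same length as pattern and differs in at most maxMutations places. */
    static boolean isMutationMatch(String pattern, String candidate, int maxMutations) {
        if (pattern.length() != candidate.length()) {
            return false;
        }
        int differences = 0;
        for (int i = 0; i < pattern.length(); i++) {
            if (pattern.charAt(i) != candidate.charAt(i)) {
                differences++;
                if (differences > maxMutations) {
                    return false; // stop early once the limit is exceeded
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(maxMutations(14, 10));
        System.out.println(isMutationMatch("AGCTAC", "AGGTAC", 1));
    }
}
```

Note that this sketch also accepts an exact match (zero differences); a caller interested only in mutated copies would exclude that case.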

7.3.5 Recording Results

In each case, we record the exact number of successful matches we find in a database combined with the index of the first occurrence of that substring. The positions of the matches in the DNA can be easily found by a linear search of the DNA string.

7.4 Search Strategy

The algorithm we used for searching the data was as follows.

• The server sends two numbers to a client. The first is an index into an array that contains the data. The second number is the range of indices (counted from the first) at which the client must start a search.

• The client takes the first n characters from the first index, where n is the smallest string length of interest (14 in our experiments), to be the initial search string. It then searches through the rest of the data for repetitions of the string. If a match is found, the index at the start of the match is added to a list. When the end of the data is reached, the length of the list of indices is equal to the number of occurrences of the search string. The client stores the start index of the search string, the string length, and the number of occurrences of the string. The client also stores a separate list L of the indices at which repeated versions of this string length occurred.

• The string length is then incremented to n+1. Again the first n+1 characters from the start index are taken as the search string. The client only searches L for occurrences of the new string. If a match is not found at a particular index then the index is removed from L.

• The client continues incrementing the string length and searching the list until the length of L is one (i.e. the only occurrence of the string is at the initial search index). This means that the client has finished searching for strings at the first index. The client then starts searching at the next index.

• The client continues doing this until it has searched all the indices in the range that it was given. Then it returns all stored results to the server.

• The server combines the results from multiple clients in a single repository.
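The client-side steps above can be sketched as follows. This is an illustrative reconstruction rather than the code we ran: the class name RepeatSearchSketch and its methods are hypothetical, overlapping occurrences are counted, the list L includes the start index itself, and a result is recorded only while the string occurs at least twice. Each result is held as an int array of start index, string length, and number of occurrences.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: exact-match search that narrows a list L as the string grows. */
class RepeatSearchSketch {

    /** Searches for repeats of the strings that begin at start, from minLength upwards. */
    static List<int[]> searchFrom(String dna, int start, int minLength) {
        List<int[]> results = new ArrayList<>();
        if (start + minLength > dna.length()) {
            return results;
        }
        // L holds every index (from start onwards) where the current string occurs.
        String search = dna.substring(start, start + minLength);
        List<Integer> L = new ArrayList<>();
        for (int i = dna.indexOf(search, start); i >= 0; i = dna.indexOf(search, i + 1)) {
            L.add(i);
        }
        int length = minLength;
        while (L.size() > 1) {
            results.add(new int[] {start, length, L.size()});
            length++;
            if (start + length > dna.length()) {
                break; // the search string cannot grow any further
            }
            final int len = length;
            // Earlier characters already match, so only the new last character is compared.
            L.removeIf(i -> i + len > dna.length()
                    || dna.charAt(i + len - 1) != dna.charAt(start + len - 1));
        }
        return results;
    }

    /** Processes one data unit: every start index in [firstIndex, firstIndex + range). */
    static List<int[]> searchRange(String dna, int firstIndex, int range, int minLength) {
        List<int[]> all = new ArrayList<>();
        for (int s = firstIndex; s < firstIndex + range; s++) {
            all.addAll(searchFrom(dna, s, minLength));
        }
        return all;
    }

    public static void main(String[] args) {
        for (int[] r : searchFrom("ABCABCAB", 0, 2)) {
            System.out.println("start=" + r[0] + " length=" + r[1] + " occurrences=" + r[2]);
        }
    }
}
```

Because L only ever shrinks, each increase in string length costs one character comparison per surviving index instead of a fresh scan of the whole sequence.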

We ran the client software over a number of days in a lab in the Department of Computer Science, which contained approximately 100 PCs. The computation for exact matches took 28 hours while we estimate that it would have taken 130 hours on a single PC. The computation for insertions took 41 hours while we estimate that it would have taken 1790 hours on a single PC. The computation for deletions took 35 hours while we estimate that it would have taken 1670 hours on a single PC. For this set of problems, our system achieved an average speedup of 32 times, and a maximum speedup of 47 times (in the case of deletions). The speedup for exact matches is not as high as for insertions and deletions because the unit size was set too small, resulting in a lot of excess network traffic at the server.
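The speedup figures follow directly from the running times quoted above, taking speedup as estimated single-PC time divided by distributed time. A small sketch of the arithmetic is given below; the class and method names are hypothetical.

```java
/** Sketch only: speedup = estimated single-PC hours / distributed hours. */
class SpeedupSketch {

    static double speedup(double singlePcHours, double distributedHours) {
        return singlePcHours / distributedHours;
    }

    static double average(double... values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        double exact = speedup(130, 28);       // exact matches
        double insertions = speedup(1790, 41); // insertions
        double deletions = speedup(1670, 35);  // deletions
        System.out.println("exact:      " + exact);
        System.out.println("insertions: " + insertions);
        System.out.println("deletions:  " + deletions);
        System.out.println("average:    " + average(exact, insertions, deletions));
    }
}
```

The three ratios are approximately 4.6, 43.7 and 47.7, whose mean is approximately 32.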

7.5 Results Interface

To search the results we imported the output of our system into a database application. We then used the SQL Database Query Language to extract the information we required. Some of the queries we used and the outputs are listed below.

1. Find the maximum string length (of a string that occurs at least twice) and display the number of occurrences of this string in the DNA file. The query we used is:

SELECT *
FROM Exact
WHERE StringLength =
    (SELECT MAX(StringLength) FROM Exact)

And it returned the following:

StringLength    Index     Num_Occurrences
1526            400211    2

2. Find all the strings of length greater than 14 with a number of occurrences greater than 150. The query we used is:

SELECT *
FROM Exact
WHERE StringLength > 14 AND Num_Occurrences > 150

And it returned the following:

StringLength    Index     Num_Occurrences
15              132158    151

We also ran this query for other combinations of string length and number of occurrences. Our results required 145 Mbytes for exact matching, 487 Mbytes for insertions, and 523 Mbytes for deletions. Both of the above queries took up to 5 minutes to complete.

Chapter 8
Discussion

8.1 Achievement of Aims

From the start of the project, one of the major aims was to produce a system that could solve various different problems amenable to distributed computing. One major component of this was to develop an interface that is easy to understand and use. Essentially we had to reduce the complication of setting up a distributed computation. We managed to engineer an interface with the server such that the user of the system is only required to extend two classes in order to set up a distributed computation. Consequently a user of the system is only required to have a basic knowledge of the theory of distributed computing and basic Java programming skills to operate the system effectively.

A major issue that we had to overcome was how to get the client software to integrate seamlessly with the client machines. This was a practical issue for us because we needed to gain the use of the PC labs in the college to run our application of the system. As mentioned in chapter 5, we decided to run the client software as an operating system level service. This proved very successful for us. We had approximately 200 PCs running the client software 24 hours a day. There was no discernible impact on users at the PCs because the software was run at low priority. This was a major success for our system.

We proposed to run one major application of our system - DNA analysis. As detailed in chapter 7, this proved to be a satisfactory test of our system. By running this application on our system, we not only managed to participate in an interesting area of research, but we also got the opportunity to get some data on the actual speed-up that our system offers compared to using a single PC to tackle the problem.

A paper has been submitted to the inaugural conference on the “Principles and Practice of Programming in Java” [29] describing our enhancements to the JDCL.

8.2 Future Development

As it stands, we have succeeded in developing a working general-purpose programmable distributed environment. However, there are a few areas that we see as possible future work for anyone that might be continuing from where we are now with the system. These areas can be broken down into two main topics.

8.2.1 System Enhancements

These enhancements are areas of the system that could be improved and upgraded to offer new or better functionality.

1. A remote interface to the server could be developed. The purpose of this would be to offer real-time details of the server’s status. This could be done via a basic GUI application (similar to the client GUI). The advantage of keeping this interface off the server is that it should not result in any performance loss on the server (i.e. if it is only contacting the server a fixed number of times per interval).

2. The size of data units that the server sends out could be dynamically modified according to how long it is taking current units to come back with results to the server (similar in mechanism to the sliding window of the TCP network protocol).

3. It has been proposed that for the Microsoft Windows operating systems, there could be a Tray Icon displayed when the client is running in the background of a client PC. This icon could offer options such as to pause the software. The idea here is that the donor of the machine may wish to temporarily switch off the software while they are doing some intensive processing at the PC, or are short of memory.

4. As it stands the code is relatively robust to errors and exceptions. Before the system could be used in a practical environment, it would have to be made even more robust to errors and exceptions. There should be few conditions where the client software should shut down. As this is a complex system, this would be a difficult objective to achieve (and complete robustness may not be achievable even in principle).

5. At the moment we are installing the client software on Windows NT machines as a service. To do this we are using a third party piece of software. It would be possible to install and run the client application as a service without the use of this third party software. This should be investigated.

6. Currently all of the communication is performed without any type of compression of messages. Java has the ability to compress objects that it sends over a stream. This could allow for more information to be sent back and forth without generating extra network traffic.
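As an illustration of enhancement 6, Java's standard java.util.zip stream classes can be layered underneath object serialisation. The sketch below is not part of the current system: the class name CompressedMessageSketch and its methods are hypothetical, and a byte array stands in for the TCP socket stream so that the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Sketch only: serialising an object through a GZIP-compressed stream. */
class CompressedMessageSketch {

    /** Serialises obj and returns the compressed bytes. */
    static byte[] writeCompressed(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the object stream also finishes and closes the GZIP stream.
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decompresses the bytes and deserialises the object they contain. */
    static Object readCompressed(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(data)))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder dna = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            dna.append("ACGT");
        }
        byte[] packed = writeCompressed(dna.toString());
        System.out.println("compressed bytes: " + packed.length);
        System.out.println("round trip ok: " + dna.toString().equals(readCompressed(packed)));
    }
}
```

Over a long-lived socket the compressed stream would also have to be flushed after each message, and whether compression pays off depends on how repetitive the data units are. Both points would need to be measured on the real system.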

8.2.2 DNA Analysis

The whole area of DNA analysis is a huge area of research in Bioinformatics. The research has uncovered some very interesting findings about the tuberculosis bacterium, which to our knowledge have not been published before. These included finding an exact match repeated string of nucleotides of length 1526, beginning at index 400211 in the sequence. A SQL interface to the processed data has been used to answer, in at most a few minutes, complicated string matching queries that would take weeks on a standalone PC. The findings are currently being interpreted by Dr. J. McInerney in the Bioinformatics and Pharmacogenomics Laboratory, Department of Biology, NUI Maynooth. A publication is in preparation [30].

From the problems that we were given to solve, we see enormous potential for research in the area of Bioinformatics. There is a whole theory behind developing more efficient distributed algorithms to perform DNA analysis. The pinnacle of this research would be to develop algorithms to perform this type of analysis on the Human Genome. Through examining the results we produced from this project, it is clear that this is a realistic objective for the future.

We hope that our tool will prove very useful in DNA research, and with the proper assistance (and biological knowledge) there is huge potential to investigate further how information is encoded in DNA.

References

[1] G. Woltman, "Great Internet Mersenne Prime Search," 2002. http://www.mersenne.org

[2] SETI@Home - Search for Extraterrestrial Intelligence at Home, 2002. http://setiathome.ssl.berkeley.edu

[3] Karsten Fritsche, James Power, and John Waldron, "A Java Distributed Computation Library," Proceedings of the Second International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT2001), Tamkang University, Taipei, Taiwan, pp. 236-243, July 2001. Additional information in: J. Waldron, K. Fritsche, J. Power, "User Manual of A Java Distributed Computation Library," Technical Report TCD-CS-2001-01, Department of Computer Science, Trinity College Dublin, January 2001; J. Waldron, K. Fritsche, J. Power, "Functional Specification of A Java Distributed Computation Library," Technical Report TCD-CS-2001-02, Department of Computer Science, Trinity College Dublin, January 2001.

[4] V. S. Sunderam, "PVM: a framework for parallel distributed computing," Concurrency: Practice and Experience, Volume 2(4), Pages 315-340 (December 1990)

[5] G. A. Geist and V. S. Sunderam, "Network based concurrent computing on the PVM system," Concurrency: Practice and Experience, Volume 4(4), Pages 293-311 (June 1992)

[6] Message Passing Interface Forum, "MPI: A Message-Passing Interface Standard," International Journal of Supercomputer Applications and High Performance Computing, Volume 8(3/4) (1994)

[7] A. Silberschatz and P. Galvin, Operating System Concepts, 5th Edition, John Wiley & Sons (1999)

[8] Distributed Computing Technologies Inc., 2002. http://www.distributed.net

[9] United Devices Inc., 2002. United Devices™ is the leading provider of edge distributed computing software and services. http://www.ud.com

[10] Sun MicroSystems, 2002, JavaSpaces Technology. http://java.sun.com/products/javaspaces/

[11] Nicholas Carriero and David Gelernter, The S/Net’s LINDA Kernel, ACM Press, New York (1986)

[12] VA Linux Systems, Inc, 2002. Joone - Java Object Oriented Neural Engine. http://sourceforge.net

[13] Andy Oram, Peer-to-Peer: Harnessing the Power of Disruptive Technologies, O’Reilly and Associates (March 2001)

[14] David Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information, W.H. Freeman and Company (1982)

[15] Barry Wilkinson and Michael Allen, Parallel Programming, Prentice Hall (1999)

[16] Network Computing, Network Design Manual: Client-Server Fundamentals (Feb 8 1999) http://www.networkcomputing.com

[17] T. Dowling, "Lecture Notes for SE112 Software Engineering II", Department of Computer Science, Maynooth (1999)

[18] D. Flanagan, Java in a Nutshell (2nd Edition), O’Reilly (1997)

[19] Giel, B., 2002. KC Multimedia and Design Group. http://www.kcmultimedia.com

[20] John D. McGregor, David A. Sykes, A Practical Guide to Testing Object-oriented Software, Addison Wesley (1999)

[21] Stephen J. Andriole, editor, Software Validation, Verification, Testing, and Documentation, Petrocelli Books, Princeton, NJ (1986)

[22] William E. Perry, A standard for testing application software, Auerbach Publishers, Boston (1992)

[23] IEEE, Standard Glossary of Software Engineering Terminology (2002)

[24] Human Genome Project, Introduction: Primer on Molecular Genetics http://www.ornl.gov

[25] Gerardo Jimenez-Sanchez, et al., "Human Disease Genes," Nature 409, 853-855 (15 February 2001)

[26] P. Andrew Futreal, et al., "Cancer and Genomics," Nature 409, 850-852 (15 February 2001)

[27] Svante Pääbo, "The Human Genome and Our View of Ourselves," Science, 1219-1220 (16 Feb 2001)

[28] James M. Jeffords and Tom Daschle, "Political Issues in the Genome Era," Science, 1249-1251 (16 Feb 2001)

[29] Thomas Keane, Richard Allen, Thomas J. Naughton, James McInerney, and John Waldron, "Distributed computing for DNA analysis," in Principles and Practice of Programming in Java, John Waldron and James Power, Eds., Dublin, Ireland, June 2002. To appear.

[30] R. Allen, T. Keane, T. Naughton, and J. McInerney, "Nucleotide Analysis of Tuberculosis Bacterium", BioInformatics, in preparation.

Appendix A
Design of the Java Distributed Computation Library

The Java DCL was written as an easy-to-use platform for developers who want to quickly implement a distributed computation in the context of a Single Program, Multiple Data architecture. This summary is taken largely from [3].

Design of JDCL Server

The Server consists of the following parts:

1. Server Engine – the server engine manages all the server side data-structures, classes and log files. It retrieves all its parameters from a developer defined initialisation file and is responsible for loading the developer-defined classes.

2. Communication – the communication classes handle all the communication between the server and clients.

3. Developer defined classes – these classes are application specific classes, which are used to define what computation is to be done, how the data is to be processed and how the results are output.

[Figure: class diagram of the server, showing the classes TaskList, ParameterGenerator, LogFile, ResultHandler, ServerMessageHandler, Task, Message, ConnectionManager, Engine, MessageHandler, Scheduler, SchedulerThread, and ServerEngine, with a key distinguishing aggregation ("has a" relationship) from generalisation.]

Fig. A.1: Server Class Layout

Each of these three parts contains the following classes, which are connected as shown in Fig. A.1:

1. Server Engine
• ServerEngine – this is the main class of the server.
• Engine – a super class of ServerEngine.
• SchedulerThread – provides scheduling services to classes that need to periodically perform some task. The SchedulerThread notifies the server when an action needs to be performed.
• Scheduler – the Scheduler class provides the methods that are executed when the SchedulerThread indicates that something needs to be done.
• LogFile – provides logging facilities to the system for recording errors, results and informational messages.
• TaskList – the server uses this class to keep track of pending and expired tasks. The pending task list contains a task, its unique ID and a computation time. If a client doesn’t return the results for a task or request an extension within its computation time the task is moved to the expired tasks list.

2. Communication
• ConnectionManager – the ConnectionManager listens on the server socket for incoming connections and creates a ServerMessageHandler to deal with each connection. It keeps a list of all the ServerMessageHandlers, which it adds to and deletes from as connections are opened and closed.
• ServerMessageHandler – the ServerMessageHandler creates the messages that are sent to the client and performs any tasks that the client requests (e.g. granting a time extension).
• MessageHandler – this class is responsible for communication with the client through message objects. It waits for the client to request a service and then sends the appropriate response.
• Message – all communication uses message objects, which are sent via TCP sockets. Each message consists of a message type and a content type. It can also contain data, a computation timeout and a parameter set ID.

3. User Defined Tasks
• Task – this class contains the actual computation that is to be done. Its overall function is to process parameter sets and produce results.
• ParameterGenerator – the purpose of the ParameterGenerator is to generate successive parameter sets, which are sent to the clients for processing.
• ResultHandler – the ResultHandler receives the results computed by the clients and outputs them to a formatted file.

Design of JDCL Client

The JDCL client can be split up into two main components. These are the ClientEngine and the ClientMessageHandler (see Fig. A.2). All of the rest of the modules support the functionality of these main components.

1. ClientEngine
The ClientEngine is the main entry point for the application. It performs the initialisation of the client software. It takes in the external parameters and sets up the internal structures. The ClientEngine contains the following classes:
• LogFile - The client application maintains logs of all events that take place while the software is running (event logs and error logs).
• ClientMessageHandler - This performs all of the managerial tasks of the client software, from implementing the communications protocol and getting the data from the server to managing the running task.

2. ClientMessageHandler
Once the application has been initialised, this module takes over the overall running of the application. It does this via the following components.
• ByteArrayClassLoader - This is the module that is used to load in the external class that is downloaded over the network. The downloaded class is of type Task.
• SchedulerThread - The application uses this to keep track of timeouts for the current running unit. Each unit is given a certain amount of time to run by the server and after this has expired, the server will resend the unit to another client.
• Task - All algorithms that are downloaded to the client software extend this class. This class is the same at both the server and client. Essentially the client creates an instance of the class that is downloaded from the server.
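The role of the ByteArrayClassLoader can be illustrated with Java's standard ClassLoader.defineClass method, which turns an array of class-file bytes into a loadable class. The sketch below is a hypothetical reimplementation for illustration only (the class name ByteArrayClassLoaderSketch and the method loadFromBytes are our own); it is not the JDCL source.

```java
/** Sketch only: defining a class from bytes that arrived over the network. */
class ByteArrayClassLoaderSketch extends ClassLoader {

    ByteArrayClassLoaderSketch(ClassLoader parent) {
        super(parent);
    }

    /**
     * Defines a class from raw class-file bytes.
     * Throws ClassFormatError if the bytes are not a valid class file.
     */
    Class<?> loadFromBytes(String className, byte[] classBytes) {
        return defineClass(className, classBytes, 0, classBytes.length);
    }

    public static void main(String[] args) {
        ByteArrayClassLoaderSketch loader =
                new ByteArrayClassLoaderSketch(ClassLoader.getSystemClassLoader());
        try {
            // Deliberately invalid bytes: a real client would pass the downloaded Task class.
            loader.loadFromBytes("Bogus", new byte[] {1, 2, 3, 4});
        } catch (ClassFormatError e) {
            System.out.println("rejected invalid class bytes");
        }
    }
}
```

Given valid bytes, the client could then create the task with the returned class's getDeclaredConstructor().newInstance() and cast the result to Task. Supplying the application class loader as parent lets the downloaded class resolve the Task super class that is already present at the client.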

[Figure: class diagram of the client, showing the classes MessageHandler, Engine, LogFile, Scheduler, ClientEngine, ByteArrayClassLoader, Task, Message, ClientMessageHandler, and SchedulerThread, with a key distinguishing aggregation ("has a" relationship) from generalisation.]

Fig. A.2: JDCL Client Design

Appendix B
User Manual

The user manual accompanies the General-Purpose Distributed Computing Environment by Richard Allen and Thomas Keane. It includes technical requirements of any system hosting the environment, installation instructions, an explanation of initialisation files, and information on programming the system. As the environment can be tailored to a researcher’s needs, information regarding programming the system is at a suitably technical level.

1 Technical Requirements

1.1 Server

In order to run the server, the Java Runtime Environment (JRE) 1.2 or greater must be installed.

To program the server with a problem, the Java Development Kit (JDK) 1.2 or greater is required.

1.2 Client

Any machine that is to run the client software is required to have the JRE 1.2 or greater installed.

2 Installation

2.1 Server

The server consists simply of a single directory of Java files. This is copied onto the server and run from the command line with the standard Java start-up command:

java ServerEngine <initialisation-file>

2.2 Client

The client consists of a single executable JAR file and an initialisation file (client.ini). The application can be started by simply double-clicking on the JAR file or run from the command line with the standard Java JAR file command:

java -jar client.jar

There are other (more practical) options for deploying the client software, including making it a background OS service (for more details see chapter 5 of this report). These would be aimed at system administrators who would be deploying the system on a large scale.

3 Programming the System

To program the system with a given problem, there are two Java classes that must be extended. These are the DataHandler class and the Task class. Each of these parent classes is part of the system. The Task class exists at both the server and client.

The developer must extend these two classes, overriding and implementing certain methods in order to set up a distributed computation.

3.1 DataHandler

The main purpose of the extended DataHandler is to manage all of the data relating to the current problem. The DataHandler can be split into three main parts.

The first section of the DataHandler is the init() method. The main function of this is to initialise whatever data structures are necessary for the overall computation. This can involve things like setting up readers of files, initialising arrays, giving variables values etc. The idea here is that any data structures set up here are available to both the getNextParameterSet() and resultsHandler() methods.

The next section is the getNextParameterSet() method. This is where the pre-processed data units are generated to be sent to the clients. This method is called every time a client requests data to process. The return type is an Object array. Since all Java classes inherit from the Object class, this method can return any data type supported by Java. The task running at the client receives this Object array as its data. Therefore, it is usual for the elements of this Object array to be explicitly cast at the client (via Java’s safe typecasting mechanisms).

The final section of the DataHandler is the resultsHandler() method. This method is called every time valid results are received by the server. The parameters to this method are an Object array and an ID. So just as with the getNextParameterSet() method, any Java data type is supported here (via Java’s type casting mechanisms). The ID that is passed to this method is really only for information purposes. It is the original data unit ID that the corresponding parameters were sent out in (generated internally by the server).
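To make the three parts concrete, the following sketch shows what an extended DataHandler might look like for a search over index ranges. It is an illustration under stated assumptions: the abstract DataHandler class shown here is a hypothetical stand-in for the class supplied by the system, the exact method signatures (in particular the type of the ID, and returning null when no data remains) are assumed, and RangeDataHandler is an invented example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the system's DataHandler; signatures are assumptions. */
abstract class DataHandler {
    abstract void init();
    abstract Object[] getNextParameterSet();
    abstract void resultsHandler(Object[] results, long unitId);
}

/** Sketch only: hands out (start index, range) pairs and collects the results. */
class RangeDataHandler extends DataHandler {
    private String dna;
    private int nextIndex;
    private final int unitSize = 4; // number of start indices per data unit
    private final List<Object> collected = new ArrayList<>();

    @Override
    void init() {
        // A real handler would read the sequence from a file here.
        dna = "AGCTACAGCTAC";
        nextIndex = 0;
    }

    @Override
    Object[] getNextParameterSet() {
        if (nextIndex >= dna.length()) {
            return null; // assumption: null signals that no data units remain
        }
        int range = Math.min(unitSize, dna.length() - nextIndex);
        Object[] unit = {Integer.valueOf(nextIndex), Integer.valueOf(range)};
        nextIndex += range;
        return unit;
    }

    @Override
    void resultsHandler(Object[] results, long unitId) {
        // The unit ID is informational only, as described above.
        collected.addAll(Arrays.asList(results));
    }

    int resultCount() {
        return collected.size();
    }
}
```

The fields set up in init() are shared by the other two methods, which is the reason for initialising them there.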

3.2 Task

The extended task class describes how the data received by the clients is to be processed (i.e. the algorithm to execute). There is only one method that must be overridden in the task - the run() method. There are no explicit parameters passed to this method; however, the pre-processed data that is sent by the server in a data unit is available through the parameterList variable. This is of type Object array (matching the type of the data generated by the getNextParameterSet() method in the DataHandler). Again the elements of this array can be cast to whatever type they were originally in the DataHandler. The general format of the run() method is as follows:

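public void run() {
    try {
        // pre-processed data sent by the server in this data unit
        Object[] array = parameterList;

        <actual work of task here>

        // n is the number of results this task returns
        returnList = new Object[n];
        returnList[x] = <result>;   // results of computation here
    } catch ( Exception e ) {
        // report the failure back to the server
        exceptionInProcessing( e );
    }
    endProcessing();
}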

The whole method is encompassed in one large try-catch statement. The purpose of this statement is to handle any exceptions the run() method can generate. All exceptions will be caught and this information will be fed back to the server [via the exceptionInProcessing() call].

The post-processed data is returned via the returnList Object array. Just as with the other Object arrays, the elements of this can be of any type supported by Java. At the bottom of the run() method is the endProcessing() call. This tells the client that the task has finished processing the data and that the post-processed data is contained in the returnList variable. The post-processed data is then sent back to the server by the client.
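A concrete task following this format might look like the sketch below. The abstract Task class here is a hypothetical stand-in that mimics only the members named in this section (parameterList, returnList, exceptionInProcessing() and endProcessing()); the real class supplied by the system will differ, and CountOccurrencesTask is an invented example.

```java
/** Hypothetical stand-in for the system's Task class; only the named members are mimicked. */
abstract class Task {
    protected Object[] parameterList; // data unit received from the server
    protected Object[] returnList;    // results to send back
    protected Exception failure;      // stand-in for reporting to the server
    protected boolean finished;

    protected void exceptionInProcessing(Exception e) {
        failure = e;
    }

    protected void endProcessing() {
        finished = true;
    }

    public abstract void run();
}

/** Sketch only: counts (possibly overlapping) occurrences of a pattern in a string. */
class CountOccurrencesTask extends Task {
    @Override
    public void run() {
        try {
            // Cast the elements back to the types the DataHandler put in.
            String data = (String) parameterList[0];
            String pattern = (String) parameterList[1];

            int count = 0;
            for (int i = data.indexOf(pattern); i >= 0; i = data.indexOf(pattern, i + 1)) {
                count++;
            }

            returnList = new Object[] {Integer.valueOf(count)};
        } catch (Exception e) {
            exceptionInProcessing(e);
        }
        endProcessing();
    }
}
```

A wrongly typed data unit would raise a ClassCastException inside the try block, which is exactly the kind of failure the catch clause reports back.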

As will be explained next, in the case of a pipeline process, each task can consist of several sub-tasks (where each sub-task performs some processing on data from a certain stage of the overall computation).

3.3 Bucket Class

There is one final (optional) dimension to programming the system - the simulated pipeline processor. The idea behind this feature is that the developer can set up the server to act like a pipeline processor (Fig. B.1) with several different intermediary stages in the distributed computation.

Input Data -> P0 -> P1 -> P2 -> P3 -> P4 -> Output (results of computation)

Fig. B.1: Pipelined processes, where P0 through P4 are the processes.

The pipeline is simulated by using the bucket class to represent the storage required at each stage of the process. There are two distinct forms of data in the bucket class. The data that arrives to be added to the bucket is referred to as a job. Each entry in the bucket can comprise several of these ‘jobs’ (the exact number of jobs per bucket entry is specified in the constructor; e.g. in a mergesort several sub-sorted units would make up one unit to be merged). The information is stored in each bucket by using Java Vectors. A Vector can store any data type, so in keeping with the DataHandler each entry in a bucket is an Object array.

The developer specifically creates each bucket during the init() method of the DataHandler so that it is available to both the getNextParameterSet() and resultsHandler() methods. There are several methods included in the bucket class that can be used to manipulate each bucket. Here are the public methods that should be called from the DataHandler to manipulate a bucket:

public boolean isBucketEmpty ()
This method checks if the bucket is empty (i.e. checks only if there are any full Vector entries ready to be sent out).

public boolean isPartialUnitAvailable()
This method checks if there is a partial unit (i.e. a unit that comprises several jobs but does not yet have enough jobs to make up a full bucket entry).

public String getName()
This method returns the name of the bucket.

public void addData ( Object data )
This method is called when the DataHandler wants to add data (a job) to the bucket. According to the way the bucket was constructed, several of these jobs can make up one bucket entry.

[Fig. B.1 diagram: Input Data -> P0 -> P1 -> P2 -> P3 -> P4 -> Output (Results of computation)]


public Object[] getUnit ()
This method returns a full bucket entry. Again this can be made up of several of the data units that are passed to the bucket by the addData method.
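To make the relationship between jobs and bucket entries concrete, the methods listed above can be mimicked by a small stand-alone class. This is a sketch under stated assumptions, not the JDCL bucket class: the name BucketSketch and its internal fields are invented, and placing the bucket name in element 0 of each returned unit is an inference from the sample task in Appendix C (which reads the stage name from parameterList[0]), not something the text states.

```java
import java.util.Vector;

// Hypothetical re-creation of the behaviour described for the bucket class.
class BucketSketch {
    private final int jobsPerUnit;   // jobs needed to form one full bucket entry
    private final String name;       // name of the pipeline stage
    private final Vector<Object[]> fullUnits = new Vector<>();  // entries ready to send
    private final Vector<Object> pending = new Vector<>();      // jobs of the partial unit

    BucketSketch(int jobsPerUnit, String name) {
        this.jobsPerUnit = jobsPerUnit;
        this.name = name;
    }

    public String getName() { return name; }

    // True when no full entry is ready; pending jobs are not counted.
    public boolean isBucketEmpty() { return fullUnits.isEmpty(); }

    // True when some jobs are waiting but do not yet form a full entry.
    public boolean isPartialUnitAvailable() { return !pending.isEmpty(); }

    // Adds one job; once enough jobs have arrived they become a full entry.
    public void addData(Object data) {
        pending.add(data);
        if (pending.size() == jobsPerUnit) {
            Object[] unit = new Object[jobsPerUnit + 1];
            unit[0] = name;   // assumption: the stage name travels in element 0
            for (int i = 0; i < jobsPerUnit; i++) {
                unit[i + 1] = pending.get(i);
            }
            pending.clear();
            fullUnits.add(unit);
        }
    }

    // Removes and returns the oldest full entry, or null if none is ready.
    public Object[] getUnit() {
        return fullUnits.isEmpty() ? null : fullUnits.remove(0);
    }
}
```

With a bucket constructed as new BucketSketch(3, "MergeSort"), two calls to addData() leave a partial unit; a third call (even with null) completes the entry, which is how the padding loop in the sample DataHandler code produces a sendable unit.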

In order to simulate a pipeline process, it is suggested that the DataHandler should be coded so that it checks the last bucket (final stage of computation) first for a full unit and works backwards to the first bucket (where the data is initially stored). In this way, the pipeline will begin producing final results as soon as possible. Sample DataHandler code to do this is as follows:

    // start at the final stage and return a full unit
    for ( int i = ( numStages - 1 ); i >= 0; i-- ) {
        if ( !( stage[ i ].isBucketEmpty() ) ) {
            return stage[ i ].getUnit();
        }
    }

In the case where there are no full units available in any of the buckets, it is suggested that the DataHandler should begin with the first bucket and check for a partial unit (and move on to subsequent buckets if there is no partial unit available in the first bucket). Sample DataHandler code to do this is as follows:

    // if no full units available then start
    // at the first stage with a partial unit and return a
    // partial unit
    for ( int i = 0; i < numStages; i++ ) {
        if ( stage[ i ].isPartialUnitAvailable() ) {
            // keep adding 'null' as data until there is
            // a 'full unit' available
            while ( stage[ i ].isPartialUnitAvailable() ) {
                stage[ i ].addData( null );
            }
            // return a partial unit to be processed
            return stage[ i ].getUnit();
        }
    }
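The two fragments above can be combined and exercised without the rest of the system. The following is a minimal sketch under stated assumptions: StageSketch is a hypothetical, much-reduced stand-in for the bucket class (it omits the bucket name), and PipelineScheduler.next() is an invented helper that plays the role the text assigns to the DataHandler.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal stage store standing in for the JDCL bucket class.
class StageSketch {
    private final int jobsPerUnit;
    private final List<Object[]> full = new ArrayList<>();   // complete units
    private final List<Object> pending = new ArrayList<>();  // jobs of the partial unit

    StageSketch(int jobsPerUnit) { this.jobsPerUnit = jobsPerUnit; }

    boolean isBucketEmpty() { return full.isEmpty(); }

    boolean isPartialUnitAvailable() { return !pending.isEmpty(); }

    void addData(Object job) {
        pending.add(job);
        if (pending.size() == jobsPerUnit) {
            full.add(pending.toArray());
            pending.clear();
        }
    }

    Object[] getUnit() { return full.remove(0); }
}

class PipelineScheduler {
    // Returns the next unit to hand out, or null when every stage is drained.
    static Object[] next(StageSketch[] stage) {
        // 1. prefer full units, starting at the final stage
        for (int i = stage.length - 1; i >= 0; i--) {
            if (!stage[i].isBucketEmpty()) {
                return stage[i].getUnit();
            }
        }
        // 2. otherwise pad the earliest partial unit with nulls and send it
        for (int i = 0; i < stage.length; i++) {
            if (stage[i].isPartialUnitAvailable()) {
                while (stage[i].isPartialUnitAvailable()) {
                    stage[i].addData(null);
                }
                return stage[i].getUnit();
            }
        }
        return null;
    }
}
```

The ordering matters: a full unit in an earlier stage is still preferred over a partial unit in a later stage, and null padding is used only when no stage has a full unit left.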

4 Initialisation Files

Both the server and client have certain values that have to be explicitly programmed into the system. These include such details as what server port to use, the server IP address etc. These values are entered into the system via external initialisation files. The files have a certain format. All entries consist of pairs of values, separated by an equals sign, e.g.

server.ip=127.0.0.1

All comments begin with a hash (#) symbol.
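The format just described (key=value pairs with # comments) happens to be a subset of what the standard java.util.Properties class reads, so a file of this kind can be parsed with it. The sketch below is illustrative only: the report does not say how the system itself parses these files, the class name InitFileSketch is invented, and server.port is a hypothetical key name (only server.ip appears in the text).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class InitFileSketch {
    // Parses initialisation-file text into key/value pairs.
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            // load() skips blank lines and lines starting with '#'
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        String text = "# address of the server machine\n"
                    + "server.ip=127.0.0.1\n"
                    + "# hypothetical key name, for illustration only\n"
                    + "server.port=4000\n";
        Properties p = parse(text);
        System.out.println(p.getProperty("server.ip"));
        System.out.println(Integer.parseInt(p.getProperty("server.port")));
    }
}
```

For a file on disk, the same load() call accepts a FileReader in place of the StringReader used here.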

4.1 Server

The server initialisation file can be broken into three main areas. These are server details, task details, and log files.

Included in the server details are the server port and the server timeout. The server port is the port on the server machine that the server listens on for new client connections. The server timeout is the amount of time (in milliseconds) between the server checking its lists to find out what work units have expired (timed out) and have not requested extensions. A work unit would expire when a client application is abruptly stopped or terminated. This could happen if a client machine is powered off abruptly or rebooted.
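The periodic expiry check can be pictured as a sweep over a table of deadlines. The following sketch is an assumption-laden illustration rather than the server's actual bookkeeping: the class WorkUnitTimeouts, its method names, and the use of integer unit IDs are all invented, and what the server does with an expired unit (for example handing it out again) is left to the caller.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical bookkeeping for outstanding work units.
class WorkUnitTimeouts {
    // unit ID -> time (in ms) at which the unit is considered expired
    private final Map<Integer, Long> deadline = new HashMap<>();

    // Records that a unit was sent out at time 'now' with 'allowance' ms to finish.
    void issued(int id, long now, long allowance) {
        deadline.put(id, now + allowance);
    }

    // A client that is still working asks for more time on a unit.
    void extend(int id, long extra) {
        deadline.computeIfPresent(id, (k, v) -> v + extra);
    }

    // A result arrived, so the unit no longer needs tracking.
    void completed(int id) {
        deadline.remove(id);
    }

    // Intended to be called once per server-timeout period: removes and
    // returns the IDs of all units whose deadline has passed.
    List<Integer> sweep(long now) {
        List<Integer> expired = new ArrayList<>();
        Iterator<Map.Entry<Integer, Long>> it = deadline.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Long> e = it.next();
            if (e.getValue() <= now) {
                expired.add(e.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

A unit held by a client that was powered off never receives an extension or a result, so it is picked up by the first sweep after its deadline.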

The task details contain information about what task is to be run by the server. Firstly the name of the Java class file that contains that task is specified (without the .class extension). The next entry is the time (in milliseconds) that is being given to each data unit to execute at each client. Clients can, of course, request extensions to this time as they execute a given task. The last entry is the name of the Java class file that contains the DataHandler for the given task (without the .class extension).

The last area covered in the server initialisation file is details about the log files. The server keeps two main logs (system logs and error logs). For each of these, there are four main options to be specified. These cover areas such as write to file, echo to screen, format of the log and what level of detail to include.

4.2 Client

The client initialisation file is much shorter than the server initialisation file. It consists of three entries - server IP address, server port, and whether or not to display the GUI. The server IP address tells the client what machine to connect to. The server port is the port on the server machine that the server is listening on for new connections. The last option depends on the way in which the client is being deployed. If it is intended that the client should run as a background service, then it would be best to turn off the GUI. If it is intended that the user of the machine will explicitly start the client application, then they might want to have access to the GUI to monitor the client's progress.


Appendix C
Example JDCL Code: the Delta-Max Application

DataHandler

import java.io.*;
import java.math.*;
import java.util.Random;

public class SortingDataHandler extends DataHandler {

    int data[];
    int marker;
    int granularity;

    // used to tell when finished
    int totalUnits;
    int currentUnits;

    // to store results of difference
    int[] finalResults;

    // streams for the resultHandler to write to
    File outputFile;
    FileOutputStream fos;
    PrintStream ps;

    String outputFileName = "output.txt";

    // vars for using the buckets
    int numStages;

    public void init() {

        marker = 0;
        granularity = 1000;

        try {
            // read the data into an array
            BufferedReader in = new BufferedReader ( new FileReader ("random.txt") );

            String line = in.readLine();
            data = new int[ Integer.parseInt(line) ];

            for (int x = 0; x < data.length; x++) {
                line = in.readLine();
                data[x] = Integer.parseInt(line);
            }

            in.close();

            System.out.println ("DATA: length="+data.length);

            // create a print stream to the output file
            outputFile = new File( outputFileName );
            fos = new FileOutputStream(outputFile);
            ps = new PrintStream(fos, true);
        }
        catch (Exception e) {
            errorLog.write ("Exception caught while paramgen is init-ing: " + e.toString(),2);
            System.exit(1);
        }

        // create three buckets
        numStages = 3;

        stage = new bucket[ numStages ];
        stage[ 0 ] = new bucket( 1, "SequentialSort" );
        stage[ 1 ] = new bucket( 3, "MergeSort" );
        stage[ 2 ] = new bucket( 1, "Difference" );

        // fill the first bucket with units
        int[] temp = getParameters();
        while ( temp != null ) {
            stage[ 0 ].addData( temp );
            temp = getParameters();
        }

        finalResults = new int[ 3 ];
        finalResults[ 0 ] = 0;

    }

    public Object[] getNextParameterSet () {

        Object[] temp = new Object[2];

        // start at the last stage and return a full unit
        for ( int i = ( numStages - 1 ); i >= 0; i-- ) {
            if ( !( stage[ i ].isBucketEmpty() ) ) {
                return stage[ i ].getUnit();
            }
        }

        // if no full units available then start at the first
        // stage and return a partial unit
        for ( int i = 0; i < numStages; i++ ) {
            if ( stage[ i ].isPartialUnitAvailable() ) {
                while ( stage[ i ].isPartialUnitAvailable() ) {
                    stage[ i ].addData( null );
                }
                return stage[ i ].getUnit();
            }
        }

        // if no units left
        return null;
    }

    public boolean handleResults (Object[] results,int ID) {

        String s = ( String )results[ 0 ];
        int[] d = ( int[] )results[ 1 ];

        // if not finished the merging
        if ( d.length < data.length ) {

            // if the unit is returned from the sequential sort
            // stage
            if ( s.equals( "SequentialSort" ) ) {

                stage[ 1 ].addData( d );

            } else {
                // if the unit is returned from the merge sort stage
                if ( s.equals( "MergeSort" ) ) {
                    stage[ 1 ].addData( d );
                } else {

                    // the unit is returned from the difference stage
                    if ( d[ 0 ] > finalResults[ 0 ] ) {
                        finalResults[ 0 ] = d[ 0 ];
                        finalResults[ 1 ] = d[ 1 ];
                        finalResults[ 2 ] = d[ 2 ];
                    }

                    currentUnits++;

                    if ( currentUnits == totalUnits ) {
                        handle_results( finalResults );
                        return true;
                    }
                }
            }
        } else {
            int mark = 0;

            // finished merging => add units to "difference" stage
            for ( int i = 0; i < ( d.length / granularity ); i++ ) {
                if ( ( d.length - ( mark + granularity + 1 ) ) > 0 ) {
                    int[] temp = new int[ granularity + 1 ];

                    for ( int j = 0; j < ( granularity + 1 ); j++ ) {
                        temp[ j ] = d[ mark + j ];
                    }
                    mark += granularity + 1;

                    stage[ 2 ].addData( temp );
                } else {
                    int[] temp = new int[ d.length - mark ];

                    for ( int j = 0; j < ( d.length - mark ); j++ ) {
                        temp[ j ] = d[ mark + j ];
                    }

                    stage[ 2 ].addData( temp );
                }
            }

            totalUnits = d.length / granularity;
            currentUnits = 0;
        }
        return false;

    }

    private void handle_results ( int[] output ) {

        for ( int i = 0; i < output.length; i++ ) {
            ps.println( output[i] );
        }
    }

    private int[] getParameters() {

        if ( marker < data.length ) {
            int[] temp = new int[ granularity ];
            for ( int i = 0; i < granularity; i++ ) {
                temp[ i ] = data[ marker + i ];
            }
            marker += granularity;

            return temp;
        }
        return null;
    }

}


Task

public class SortingTask extends Task {

    public void run() {
        String stage = ( String )parameterList[ 0 ];

        returnList = new Object[2];
        returnList[ 0 ] = stage;

        if ( stage.equals( "SequentialSort" ) ) {
            int[] data = ( int[] )parameterList[ 1 ];
            SequentialSort( data );
            returnList[ 1 ] = data;
        } else {
            if ( stage.equals( "MergeSort" ) ) {
                int[] d1 = ( int[] )parameterList[ 1 ];
                int[] d2 = ( int[] )parameterList[ 2 ];
                int[] d3 = ( int[] )parameterList[ 3 ];

                // if the second element = null => only one element in array
                if ( d2 == null ) {
                    returnList[ 1 ] = d1;
                } else {
                    // if the third element = null, merge the first two
                    if ( d3 == null ) {
                        int[] data = new int[ d1.length + d2.length ];
                        MergeSort( d1, d2, data );
                        returnList[ 1 ] = data;
                    } else {
                        // merge all three
                        int[] temp = new int[ d1.length + d2.length ];
                        MergeSort( d1, d2, temp );
                        int[] data = new int[d1.length+d2.length+d3.length];
                        MergeSort( temp, d3, data );
                        returnList[ 1 ] = data;
                    }
                }
            } else {
                if ( stage.equals( "Difference" ) ) {
                    returnList[ 1 ] = Difference( (int[])parameterList[ 1 ] );
                }
            }
        }
        endProcessing();
    }

    public void SequentialSort ( int[] data ) {

        for (int i=0; i<data.length; i++) {
            for (int j=0; j<i; j++) {
                if (data[i]<data[j]) {
                    int temp = data[i];
                    data[i]=data[j];
                    data[j]=temp;
                }
            }
        }
    }

    public void MergeSort ( int[] d1, int[] d2, int[] data ) {

        int counter1 = 0;
        int counter2 = 0;

        boolean finished = false;

        while ( !( finished ) ) {
            // if reached end of both arrays
            if ( ( counter1 == d1.length ) && ( counter2 == d2.length ) ) {
                finished = true;
            } else {
                // if reached end of first array
                if ( counter1 == d1.length ) {
                    while ( counter2 != d2.length ) {
                        data[ counter1 + counter2 ] = d2[ counter2 ];
                        counter2++;
                    }
                    finished = true;
                } else {
                    // if reached end of second array
                    if ( counter2 == d2.length ) {
                        while ( counter1 != d1.length ) {
                            data[ counter1 + counter2 ] = d1[ counter1 ];
                            counter1++;
                        }
                    } else {
                        if ( d1[ counter1 ] <= d2[ counter2 ] ) {
                            data[ counter1 + counter2 ] = d1[ counter1 ];
                            counter1++;
                        } else {
                            if ( d2[ counter2 ] < d1[ counter1 ] ) {
                                data[ counter1 + counter2 ] = d2[ counter2 ];
                                counter2++;
                            }
                        }
                    }
                }
            }
        }
    }

    public int[] Difference ( int[] data ) {

        // the first element in the array is the difference between the
        // second and third
        int[] nums = new int[ 3 ];
        nums[ 0 ] = 0;

        for ( int i = 0; i < ( data.length - 1 ); i++ ) {
            if ( ( data[ i + 1 ] - data[ i ] ) > nums[ 0 ] ) {
                nums[ 0 ] = data[ i + 1 ] - data[ i ];
                nums[ 1 ] = data[ i ];
                nums[ 2 ] = data[ i + 1 ];
            }
        }
        return nums;
    }
}
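
When testing the distributed version on small inputs, it can help to compare its output against a sequential computation of the same quantity. The sketch below is not part of the JDCL sample: the class name DeltaMaxReference is invented, and java.util.Arrays.sort is swapped in for the three pipeline stages. It returns the same three values as the Difference() method above (largest gap, lower value, upper value).

```java
import java.util.Arrays;

class DeltaMaxReference {
    // Returns { largest gap, lower value, upper value } taken over all
    // adjacent pairs of the sorted input. The input array is not modified.
    static int[] deltaMax(int[] input) {
        int[] data = input.clone();
        Arrays.sort(data);

        int[] nums = new int[3];   // nums[0] starts at 0, as in Difference()
        for (int i = 0; i < data.length - 1; i++) {
            int gap = data[i + 1] - data[i];
            if (gap > nums[0]) {
                nums[0] = gap;
                nums[1] = data[i];
                nums[2] = data[i + 1];
            }
        }
        return nums;
    }
}
```

Note that this reference examines every adjacent pair of the sorted data. As far as can be told from the listing, the units passed to the "Difference" stage do not overlap, so a pair that straddles two units is not compared there; the two results can therefore differ when the largest gap falls on such a boundary.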