
CONCURRENCY: PRACTICE AND EXPERIENCE, VOL. 3(6), 559-572 (DECEMBER 1991)

1040-3108/91/060559-14$07.00 © 1991 by John Wiley & Sons, Ltd.

Received 23 October 1990; revised 4 June 1991

Portable tools for Fortran parallel programming

SWARN P. KUMAR AND IVOR R. PHILIPS
Boeing Computer Services, PO Box 24346, M/S 7L-22, Seattle, WA 98124-4346, USA

SUMMARY
This paper presents a survey of some of the tools, techniques, and constructs for the development of portable, multitasked Fortran programs. The study mainly focuses on existing software tools that implement different approaches to achieving portability of multitasked Fortran programs for local and shared memory multiprocessor computers. However, some proposed approaches are also included. It appears that while each approach enjoys some advantages and suffers some disadvantages, at present, the development and use of portable multitasking tools is in its infancy, and thus no one system is clearly superior. Indeed, we expect that, for the foreseeable future, these and perhaps other techniques will all be actively pursued.

1. INTRODUCTION

The emergence of a wide variety of multiprocessor computer systems (e.g. CRAY X/Y-MP[5], Alliant FX/Series[2], Sequent Balance and Symmetry Systems[17], Intel iPSC[11], BBN Butterfly[4], Connection Machine[10]) is accompanied by the need for suitable programming tools and models to make efficient use of these systems. Because the technology of parallel computation is so new, most commercial systems have been delivered with very little software support other than the addition of a few extensions to the standard programming languages Fortran and C.

Two of the main issues in parallel computation are:

Complexity: parallel programs are much more difficult to design and debug than their scalar counterparts, owing to the complex synchronization and data scoping issues involved. Users' needs range from taking an existing scalar code and, with no code changes, having it run efficiently on a parallel architecture, to designing parallel code by explicitly employing multi-tasking (and vector) concepts.

Transportability: a wide variety of parallel architectures are now available commercially, ranging from mini-supercomputers to full supercomputers. From a user's viewpoint, it is often desirable to have a multi-tasked code developed on one computer type run efficiently, with no coding changes, on a different computer type.

There is need for as much automatic parallelism as possible. A large class of users, generally those who are primarily scientists or engineers, wish to concentrate on the technology aspects of their particular problems, not on expending considerable programming effort explicitly multi-tasking their code. For such users, the tools must not become more important than the problem being solved. The Alliant FX/Series currently caters to this class of users by automatic vectorization and multi-tasking of DO loops in a Fortran code.



Kuck and Associates[14] have developed an automatic DO loop parallelizer, called KAP, for the Sequent computer systems.

It is clear, however, that further large increases in speed are possible if code, originally designed for scalar architectures, is reworked to take explicit advantage of parallelism. A significant class of users is willing to expend further effort to gain such increases in speed. They are willing to exploit parallelism by explicitly reworking their Fortran codes but would like to do it in such a way that a program that has been multi-tasked on one computer will execute efficiently on a different parallel architecture. This requirement for portability immediately precludes the use of vendor-specific parallel extensions to Fortran 77.

This study mainly focuses on the existing software tools that implement different approaches to achieving portability of multi-tasked Fortran programs. They are:

(1) The SCHEDULE[7] package that allows users to structure Fortran programs as a system of statically or dynamically allocated processes by explicitly inserting calls to SCHEDULE subroutines in their code.

(2) The FORCE[12] package of macros that are inserted into Fortran code and are then interpreted by a preprocessor to produce standard Fortran 77 enhanced by machine-specific compiler directives.

(3) A conceptual package of compiler directives, named Parallel Fortran[3] and called the Language Layer for Concurrency, that would help compilers identify parallel constructs that currently resist automatic analysis.

(4) The Parallel Computing Forum (PCF)[15], a mixed organization of parallel computer vendors and others interested in parallel constructs, which is developing a common set of parallel Fortran extensions.

(5) The Linda[1,9] set of operators that conceptually could be added to many programming languages, Fortran or C, for example, to turn them into parallel programming languages.

(6) The Strand[8,18] parallel programming environment, based on parallel semantics (concurrent logic programming), which allows parallel streams to be mapped directly into executable Strand code to take advantage of the available parallel hardware.

The rest of the paper is divided as follows: Section 2 describes the SCHEDULE package. An overview of the FORCE package is given in Section 3. Section 4 discusses the Parallel Fortran compiler directives for multi-tasking. Section 5 contains a discussion of the PCF. Section 6 contains a discussion of the Linda operators as applied to Fortran. Section 7 contains a discussion of the Strand parallel programming tools. Section 8 concludes the document.

2. SCHEDULE

SCHEDULE is a package of Fortran-callable subroutines designed to permit the writing of portable, multiprocessing programs. The SCHEDULE package was developed by Jack Dongarra and Danny Sorensen[7] of Argonne National Laboratory.

The SCHEDULE package is designed for parallelism at the subroutine level. Thus, Fortran DO loops cannot be parallelized without significant modification of the original program. However, offsetting this disadvantage, a SCHEDULE program can call existing,


unmodified Fortran subroutines. This protects investment in previous (scalar) software.

The SCHEDULE package also has the capability to log library events with a trace file created at the end of a run. This trace file can be used to create a graphical representation of the multiprocessing activity performed and to assist debugging.

To implement the concept on a particular computer, a machine-specific version of SCHEDULE must be made available as a part of the operating system. SCHEDULE is at present implemented on the Alliant FX/8, CRAY-2, Sequent Balance and Symmetry, Sun, and Encore computer systems.

2.1. The SCHEDULE approach

SCHEDULE requires the user to define tasks (consisting of subroutines that are independently executable) and the data dependencies between the tasks. SCHEDULE, in conjunction with the local operating system, then schedules the tasks for parallel execution appropriately.

The design of a SCHEDULE program thus proceeds in several stages:

(a) Identify parallel structures. That is, identify subroutines that can be executed in parallel with other subroutines.

(b) Identify data dependencies. That is, for each subroutine from step (a), determine how its inputs depend on the prior completion of the other subroutines, and which of its outputs are required as inputs for the subsequent execution of other subroutines. This set of subroutines and data dependencies determines the parallel execution of the program.

It may be that an existing program is already structured in such a way that a large degree of parallelism can be obtained with no essential change of the underlying algorithms or code. However, in many cases, algorithms and code will have to be significantly modified to take advantage of parallelism. This is true for all scalar code being executed in a parallel environment using the various extensions to Fortran described in this paper.

(c) Write a Fortran code using the SCHEDULE extensions. This is done by calling SCHEDULE subroutines to define tasks. Each task consists of a user subroutine name, from step (a), and its actual arguments, together with the data dependency information from step (b). The programmer must also provide a unique positive number for each task.

As a result of steps (a), (b), and (c), a program will be produced that can be run in either serial or parallel mode.

SCHEDULE operates by considering each task as a node in a dependency graph. Thus, in Figure 1, tasks T1, T2, and T4 have no dependencies on predecessor tasks and could begin executing immediately. Task T3 cannot start until tasks T1 and T2 complete. Task T5 cannot start until tasks T3 and T4 complete. Tasks T6 and T7 cannot start until task T3 completes.

Figure 1. Example of a data dependency graph

SCHEDULE will not start executing a task until all the predecessor dependencies have been satisfied, that is, until all the inputs from predecessor tasks have been calculated. Initially, in a well-formed program, it is clear there must be at least one task that can start executing. When SCHEDULE completes a task, the predecessor dependencies of all task nodes that depend on it are decremented by 1.

When a task is complete, SCHEDULE then selects one of the remaining tasks that now have no unsatisfied predecessor dependencies, continuing until all tasks have been executed. In a well-formed program this process will always be possible.

The number of virtual processors to work on a program segment may be set dynamically. Generally, it is advisable to set the number of virtual processors so as to not exceed the number of actual processors available on a system. Both static and dynamic task scheduling are possible, and tasks may also spawn one sublevel of processes. The reader is referred to Reference 7 for further information.
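To make steps (a)-(c) concrete, the fragment below sketches how the dependency graph of Figure 1 might be expressed. This is a minimal sketch only: the entry-point names DEP and PUTQ and their argument orders are our assumptions about the flavor of the interface, not the documented calling sequences, for which the reader should consult Reference 7.

      PROGRAM FIG1
C     Sketch only: the DEP/PUTQ names and argument orders below are
C     assumed for illustration; see Reference 7 for the actual
C     SCHEDULE interface.
      EXTERNAL T1, T2, T3, T4, T5, T6, T7
      INTEGER S1(1), S2(1), S3(3), S4(1), NONE(1)
C     Successor lists: tasks 1 and 2 release task 3; task 3 releases
C     tasks 5, 6 and 7; task 4 releases task 5 (cf. Figure 1).
      DATA S1 /3/, S2 /3/, S3 /5, 6, 7/, S4 /5/
C     DEP(tag, icango, nsucc, succ): task "tag" has "icango" unmet
C     predecessor dependencies and notifies "nsucc" successor tasks.
      CALL DEP(1, 0, 1, S1)
      CALL DEP(2, 0, 1, S2)
      CALL DEP(3, 2, 3, S3)
      CALL DEP(4, 0, 1, S4)
      CALL DEP(5, 2, 0, NONE)
      CALL DEP(6, 1, 0, NONE)
      CALL DEP(7, 1, 0, NONE)
C     PUTQ(tag, subr): queue the user subroutine that performs the
C     work of each task (actual arguments omitted in this sketch).
      CALL PUTQ(1, T1)
      CALL PUTQ(2, T2)
      CALL PUTQ(3, T3)
      CALL PUTQ(4, T4)
      CALL PUTQ(5, T5)
      CALL PUTQ(6, T6)
      CALL PUTQ(7, T7)
      END

Tasks 1, 2 and 4 start immediately (zero unmet dependencies); as each task completes, the counts of its successors are decremented, exactly as described above.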

2.2. Comments on SCHEDULE

SCHEDULE is a portable parallel programming tool suitable for both local and shared memory multi-processor systems. Some other advantages of the SCHEDULE package are:

- Both static and dynamic scheduling of processes are allowed.
- It has the capability of executing parallel programs in serial mode.
- It can spawn parallel processes at nested levels other than DO loops.
- Existing (scalar) libraries of subroutines can be called in parallel without any modifications.
- It supports both functional and data decomposition parallelism.
- Graphical display tools for debugging purposes are provided.

The SCHEDULE tool, however, is not a final solution to the portable parallel programming problem; this has been stated by the authors of the package. SCHEDULE basically provides users with a portable way of structuring Fortran programs as a system of statically or dynamically allocated processes. Parallelism is only permitted at the subroutine level; fine-grained parallelism is not possible. The burden of specifying the communication between processes, by the data dependency graph, is on the user. Process management is also explicitly done by the user. The SCHEDULE approach also involves a complicated set of primitives. Too much responsibility for managing the parallel execution may not be attractive to many engineers and scientists in the computational community.


The basic concept behind SCHEDULE has provided a good starting point for the development of other such tools. Mark Seager and his colleagues at Lawrence Livermore National Laboratory started the software packages libCray.a[6] and GMAT[16], Cray-compatible tools for the Alliant FX/8 computer, with SCHEDULE-based concepts.

3. THE FORCE

The FORCE is a macro preprocessor that provides a set of directives to Fortran. The directives permit small- and large-grain parallelism in a shared-memory multiprocessor environment. The FORCE was designed by Harry Jordan and his students at the University of Colorado[12].

In the FORCE environment, parallelism is achieved by defining a fixed number of processes at the beginning of program execution (the fixed number could be one). All processes are then active from the beginning to the end of the program, with each process having access to the complete program. Each process, however, can perform different parts of the total work with various synchronization constructs controlling their flow.

The FORCE is implemented at present in a Unix-based environment on the Alliant FX/8, Encore Multimax, Sequent Balance, and CRAY-2 parallel/vector computers.

3.1. The FORCE approach

At the beginning of a FORCE program, all processes start to execute the program. To control their flow, barriers and critical regions are inserted into the program. Unlike SCHEDULE, the processes are anonymous to the programmer and have no individual identification. Thus, there is no direct process synchronization or management required of the programmer. A barrier or critical region can be inserted around any section of executable code that is contained within a single program unit.

The code within a barrier does not execute until all the processes reach the beginning of the barrier. When this occurs, one process is arbitrarily selected to execute the code within the barrier. When the selected process completes execution of the code within the barrier all processes then start executing the code following the end of the barrier.

The code within a critical region must be executed by each process, one at a time.

In a FORCE program, variables and COMMON blocks may be declared to be either PRIVATE or SHARED. The default is PRIVATE. Each process gets its own copy of PRIVATE variables and COMMON blocks. Consequently, when a process changes a PRIVATE entity the change is confined to that process; the values of the corresponding entity in the other processes remain unaltered. By contrast, SHARED variables are available to all processes. If any process changes a SHARED variable, that change affects all other processes subsequently referencing that variable. The concept of shared and private variables assists in keeping the processes anonymous to the programmer: variables do not have to be explicitly associated with processes.

There are several other FORCE directives, controlling the flow of processes, added to standard Fortran 77. The parallel CASE statement defines blocks to be executed in parallel by arbitrary processes. There are two kinds of parallel DO statements. The first form assigns processes to perform the calculation within the DO systematically, based on the DO indices. This form should be used when the times needed to complete the calculations are essentially independent of the value of the DO index. The second form is driven by the availability of processes. This second form involves more processor synchronization activity and should be used when the time to perform the calculations within the DO block depends strongly on the value of the DO index. Each of the parallel DO statements may be nested two deep.
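As an illustration, the fragment below sums the squares of an array with a prescheduled parallel DO (the first form, since every iteration costs about the same), protecting the shared total with a barrier and a critical region. The macro spellings are our rendering of the Force notation and should be checked against Reference 12.

      Force SUMSQ of NP ident ME
      Shared REAL A(100), TOTAL
      Private REAL PART
      Private INTEGER I
      End declarations
c     One arbitrarily selected process initializes the shared total;
c     the remaining processes wait at the end of the barrier.
      Barrier
      TOTAL = 0.0
      End barrier
      PART = 0.0
c     Iterations are dealt to the NP processes based on the DO index
c     (the first, prescheduled, form of the parallel DO).
      Presched DO 10 I = 1, 100
      PART = PART + A(I)**2
 10   End Presched DO
c     Each process in turn folds its partial sum into TOTAL.
      Critical SUMLCK
      TOTAL = TOTAL + PART
      End critical
      Join
      End

Note that the same source runs correctly with NP = 1, which is how the FORCE supports serial testing of a parallel program.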

Not all of the FORCE’s constructs have been given here. For more information the reader is referred to Reference 12.

The FORCE may be readily transported to new shared-memory parallel environments. Mutual exclusion and process counting are sufficient to implement most of the FORCE's parallel constructs.

3.2. Comments on the FORCE

The FORCE is a portable (shared memory multiprocessor) programming tool for parallel execution of many processes all working to solve a single problem. The highlights of the package are:

- Process management is invisible to the user. Processes are created and terminated at the top of the program hierarchy.
- Parallelism is expressed using a generic synchronization mechanism.
- Programs are independent of the number of processes specified.
- It is highly suitable for tightly coupled parallel programming.
- Ease of programming (a small set of primitives to learn).
- Parallel constructs are allowed at any level of the program hierarchy; hence both coarse- and fine-grained parallelism are supported.
- Existing (scalar) libraries of subroutines can be called in parallel.
- Correctness of program execution can be tested with one process, independently of effects due to improper synchronization.
- It supports primarily data decomposition parallelism, as compared to functional decomposition.

A major weakness in the set of FORCE macros (in its present version) is that it does not efficiently support functional decomposition parallelism of a program, an often-desired feature at the upper levels of a program's hierarchy. Harry Jordan and his group have suggested the macro Resolve (since the macro Pcase allows only one process to execute each of the parallel functions), which will resolve the Force into components executing different parallel code sections. The implementation of Resolve is complicated by the conflicting demands of generality and efficiency (that is, if complete independence from the number of processes is required). It is our understanding that an implementation of the Resolve macro, which will produce process rescheduling at every possible deadlock point, and is still efficient when the number of processes exceeds the number of components, is under development at the University of Colorado.

4. PARALLEL FORTRAN

Since Fortran is the first choice for a programming language in most physical science and engineering applications, vendors and researchers in the supercomputing community are making a serious effort to provide parallel processing within Fortran. Unfortunately,


many of the basic constructs of standard Fortran strongly conflict with the most obvious primitives of parallel processing. Clifford Arnold and his associates[3] have announced the development of a set of compiler directives to alter Fortran for parallel processing without any formal language extensions.

The basic aim of this approach was to extend the Fortran standards informally by compiler directives without violating the ANSI language definition. The goal was to develop a strategy that allows the compiler to generate multitasking code for a larger context than DO loop nests.

4.1. The parallel Fortran approach

The proposed approach (for the solution of parallel computational problems) is that of the so-called language layer for concurrency. This high-level layer for concurrency is used to implement parallel concepts such as: do all, do parallel, common, task common in Fortran itself. It is claimed that the compiler directives used for development will be portable and will allow efficient code to be generated for a large variety of shared memory or other parallel architectures.

The goal is to help the compiler understand code sequences and generate multi-tasking code in situations that currently resist automatic analysis. Code segments that are parallelizable, where the ratio of computing to I/O is greater than 9:1, are the prime targets for this approach. The idea is to make maximum use of the compiler's analysis capability while allowing a user to provide the extra information needed to assist in more complete parallelization.

The approach is based on a set of optional user directives for multitasking. The compiler interprets them in order to dispose of problems inhibiting automatic parallelization. If the directives are not interpreted, the code will run serially on one processor. If the directives are used but not all of the barriers to parallelization are removed, the compiler will generate uniprocessor code with appropriate diagnostics about the remaining problems. These diagnostics will inform the user of the additional multitasking directives needed to generate parallelized code.

The general form of a multitasking directive used in this approach is:

C#MTL directive[(arg1, ..., argn)]

The set of directives includes: INITIALIZE, SETTINGS, BEGIN PARALLEL, END PARALLEL, PARALLEL DO, END PARALLEL DO, SHARE, GET, PUT, BARRIER, WAIT, BROADCAST, MULTIPLE BLOCK, NEXT BLOCK, END MULTIPLE BLOCK, GUARD, END GUARD.

The share directive informs the compiler of COMMON blocks that are to be shared among processes within a particular begin parallel...end parallel section. The parallel do directive instructs the compiler to partition candidate DO loops into tasks that will execute in parallel; end parallel do marks the end of a parallel do structure. The get directive tells the compiler to get data from a portion of an array contained in a shared COMMON block and put it in a task's local array of the same name. The put directive performs the reverse function. For a more detailed description of these directives see Reference 3.
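Using the directive names listed above, a parallelized loop might be written as follows. Since no implementation of these directives yet exists (see Section 4.2), the argument syntax and placement shown are our guesses; if the directives were ignored, the code would simply run serially, consistent with the design's intent.

      PROGRAM SAXPY
C     Sketch only: directive spellings follow the list above, but the
C     argument forms are assumed. Ignoring the C#MTL lines leaves a
C     valid serial Fortran 77 program.
      COMMON /WORK/ X(1000), Y(1000)
      REAL A
      A = 2.0
C#MTL INITIALIZE
C#MTL BEGIN PARALLEL
C#MTL SHARE (/WORK/)
C#MTL PARALLEL DO
      DO 10 I = 1, 1000
         Y(I) = A*X(I) + Y(I)
   10 CONTINUE
C#MTL END PARALLEL DO
C#MTL END PARALLEL
      END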


4.2. Comments on parallel Fortran

This appears to be a good direction to investigate. It provides a set of multi-tasking directives that are clear, comprehensive, and portable. However, as yet, no implementations have occurred. Practical experience is needed before this proposal can be fully evaluated.

5. THE PARALLEL COMPUTING FORUM (PCF)

The Parallel Computing Forum (PCF) was organized by Bruce Leasure of Kuck and Associates[15]. It consists of members from various computer vendors and users interested in the development of a set of portable extensions, for parallel programming, to Fortran and C. The forum has been meeting regularly for several years to define such parallel extensions.

5.1. The PCF approach

PCF has extended Fortran 77 by adding parallel constructs. The set of parallel extensions to Fortran 77 described below is based on a PCF draft dated 21 January 1990. It should be realized that the draft standard is subject to change.

A program is executed, in parallel, by one or more processes working in collaboration. A single process (the base process) begins execution of a program. When the process encounters a parallel construct, the system may assign extra processes (a team) to enter the construct and assist in its execution. The base process may, at the operating system's discretion, assist in the execution of the parallel construct. The number of extra processes involved is determined by the operating system; the only control the programmer has is that a maximum number of processes may be specified. At any time, as long as there is work remaining to be done, the system may assign extra processes to the team executing a parallel construct. Each process in a team begins execution of the parallel construct with the first statement of the construct.

There are special constructs called work-sharing constructs; except for the code in a work-sharing construct, all processes in a team execute all parts of the parallel construct redundantly.

For a work-sharing construct, the operating system determines which members of the team will participate in its execution.

Each participating process performs some part of the required work. Note the distinction from a parallel construct where each process in a team will execute the code redundantly.

When a participating process completes its assignment it will wait at the end of the work-sharing construct for the rest of the participants to complete their work. However, if there is more unassigned work to be performed in the work-sharing construct then the process may be assigned to perform some portion of that work.

When all the work in a work-sharing construct is completed, all the participating processes can then resume execution of the next statement in the code.

When all the processes in a team have completed their work, the base process, which waits at the end of the parallel construct for this to occur, continues execution of the next statement. The processes in the team do not perform any further work.

Nested parallelism may occur when a member of a team itself encounters a parallel construct. In this case, the above process is repeated recursively, with the team member that encountered the parallel construct itself becoming a base member for a team of associated processes.

An object may be labelled private or shared. If an object is labelled private then each process in a team has its own copy of the object. If an object is shared then each process on the team shares the object's storage with the base process. The default is shared. Thus, if a process changes the value of a private object, that is a local change for that process only, and other processes in the team do not see this change in their copy of the object. The converse is true for shared objects.

Some of the parallel constructs in the language are LOCK, UNLOCK, CRITICAL SECTION, PARALLEL DO, PARALLEL SECTIONS, PRIVATE, and SHARED. The PARALLEL DO and the PARALLEL SECTIONS are examples of work-sharing constructs.
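The fragment below suggests how several of these constructs might combine to form a parallel reduction. The keyword spellings follow the construct names above, but the exact syntax is illustrative only, since the draft cited in Section 5.1 was still subject to change.

C     Sketch only: based on the construct names in the 21 January
C     1990 PCF draft; the eventual standard syntax may differ.
      REAL A(1000), SUM, T
      SUM = 0.0
      PARALLEL DO I = 1, 1000
      PRIVATE T
         T = A(I)**2
C        A critical section serializes updates of the shared sum,
C        since SUM defaults to shared storage.
         CRITICAL SECTION
            SUM = SUM + T
         END CRITICAL SECTION
      END PARALLEL DO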

5.2. Comments on PCF

In September 1989 the PCF became an American National Standards Committee, X3H5: Parallel Processing Constructs for High-Level Programming Languages.

The parallel computing extensions to standard Fortran 77 are entirely within the spirit of Fortran, giving the programmer detailed control over parallelism. However, the programmer must give meticulous attention to process synchronization and data-sharing issues. A parallel program will almost always be more complex to write and debug than its non-parallel equivalent.

6. LINDA

Linda is a parallel programming tool based on an associative object memory model. It provides parallelism in Fortran (and other programming languages) in a particularly simple manner by providing an additional set of four operations. A Linda program consists of a collection of objects occupying a region called tuple space. Programs write and read n-tuples (of various lengths) of data into tuple space. Tuple space is a shared environment accessible to all parts of a program. In addition, special n-tuples can be launched into tuple space that consist of a set of n functions that are to be evaluated. Such an n-tuple becomes a separate task that, when it completes execution, resolves itself into an n-tuple of values in the tuple space. Arbitrary nesting of Linda operations is permitted; thus, a Linda task can itself launch Linda tasks.

6.1. The Linda approach

In Linda, parallelism is achieved by creating a tuple space of data and processes. There are no explicit synchronization constructs in Linda. Processes create and consume entities in tuple space. Once a tuple is in the tuple space there is no communication back to the process that created it.


There are only four constructs in Linda, two for adding entities to tuple space and two for reading entities in tuple space.

The OUT(e1, e2, ..., en) construct evaluates e1, e2, ..., en and then adds the resulting n-tuple of values (e1, e2, ..., en) to tuple space, where e1, e2, ..., en are each expressions. Thus OUT('point', x, y) and OUT(x, SQRT(1. - x**2), 1., 'CIRCLE', 'SHADE') would add a 3-tuple and a 5-tuple, respectively, to the tuple space.

The EVAL(f1, f2, ..., fn) construct adds a process to the tuple space, where f1, ..., fn are each user-defined functions. The process may be executed in parallel with any other process in the tuple space. The program does not wait for the process to complete but goes on to the next statement. Thus EVAL('bearing', 23, f(x, y), g(z)) (where f and g are user-defined functions) creates a separate process that, when it executes, will evaluate the four expressions in the tuple (the first two evaluations are trivial, being constants) and add the resulting 4-tuple to tuple space. The process will be added to the list of processes waiting to be executed. The order of selection of processes for execution is indeterminate.

The IN and RD constructs both read data from tuple space. The only difference is that IN removes the item from the tuple space while RD leaves it behind for possible subsequent reading by other parts of the program. They both function by pattern matching. The form is IN(a1, ..., an), where each ai is either a constant or a variable name. The effect of this operation is to search the tuple space for an entity that matches the pattern specified (in position and value) by the constant ai's. When such an entity is found, the values of the variable ai's are set to the corresponding values in the item found in the tuple space.

The matching entity will then be removed from the tuple space. If such an entity is not found in the tuple space, the process executing the IN operation will suspend until success eventually occurs.

The RD construct functions similarly, except that the matching entity, when found, is not removed from the tuple space.

There is no ordering in tuple space, so the order of defining processes via EVAL does not specify an execution order. Also, the order of writing entities into tuple space bears no connection with the order in which they may be found by an IN or a RD operation.
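The following master/worker fragment shows the four operations cooperating. The spellings of OUT, EVAL, and IN follow the examples above (the code is assumed to pass through a Fortran-Linda preprocessor); the WORKER function and the tuple layouts are ours, for illustration only.

C     Master: start NW workers, post N task tuples, gather N results.
      DO 10 K = 1, NW
         EVAL('worker', WORKER())
   10 CONTINUE
      DO 20 I = 1, N
         OUT('task', I)
   20 CONTINUE
      TOTAL = 0.0
      DO 30 K = 1, N
C        J and R are variables, so they receive the values of the
C        matched tuple; results arrive in no particular order.
         IN('result', J, R)
         TOTAL = TOTAL + R
   30 CONTINUE

C     Worker: withdraw tasks and emit result tuples until the run
C     ends (termination handling is omitted from this sketch).
      REAL FUNCTION WORKER()
   40 CONTINUE
      IN('task', J)
      OUT('result', J, SQRT(REAL(J)))
      GO TO 40
      END

Because IN blocks when no matching tuple exists, the workers simply suspend when the task supply is exhausted; no explicit synchronization code is needed.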

6.2. Comments on Linda

Linda was initially defined by David Gelernter et al.[1], and then subsequently refined by Gelernter, Carriero, Chandran, and Chang[9].

When compared with the other methods presented here, Linda is striking in its simplicity: the programmer does not have to consider the synchronization of processes. Observe that the Linda constructs cannot stand by themselves; they must be embedded in a high-level language, Fortran or C, for example.

An unanswered question is efficiency. Until we have access to a Fortran version of Linda we cannot answer this question.

7. STRAND

Strand is the most radical departure from the Fortran style of thinking of any of the


concepts presented here. It is based on concurrent logic programming[19]. Because of its differences from Fortran, we will present the basic concepts of Strand in more detail than we have done for the other methods described in this paper. It is still, however, far from a complete description of Strand.

7.1. The Strand approach

The data structures in Strand are called terms. They are of four basic types:

(1) Numbers: integers or reals. For example, 347, 23.96.

(2) Strings: sequences of characters delimited by single or double quotes. For example, 'cat', "Dog*@!".

(3) Variables: sequences of characters and numbers. They must begin with a capital letter or an underscore (_). For example, Altitude, X_32. Unlike Fortran, a variable may only be assigned once.

(4) Structures: these are collections of data items (which could themselves be structures). There are two basic types of structures:

(a) Tuples: these are n-ary trees, represented as a set of data items enclosed in braces. For example, {'abc', {3, 4.2}, 79}. This is a 3-ary tree containing a 2-ary tree as an element.

(b) Lists: these are binary trees in which each element is denoted by [Head|Tail]. For example, ['abc'|[{3, 4.2}|[79|[]]]]. Note the empty list, []. For convenience, this can be written as ['abc', {3, 4.2}, 79].

A Strand program consists of a pool of interacting processes and a set of rules. Strand executes a program by randomly selecting a process from the pool of unexecuted processes and determining whether it can be reduced (described below). If it can, it is removed from the pool of processes and the process commits. This continues until no processes are left in the pool.

A process is defined as a term of the form

p(T1, ..., Tn)

where p is the process name and T1, ..., Tn is the process data state. (The Ts are similar to the Fortran concept of subprogram arguments.) A process receives a local copy of the data state. Note that, unlike Fortran, a process may be referenced with different numbers of arguments. Thus the pair p/n identifies the process definition.

Programs are defined as a set of rules. Each rule defines a single action (depending on the initial data state). This is different from Fortran, where the definition of a subprogram is contained in one program body. For example, a Fortran subprogram that had a two-way branch in it would be represented in Strand as two rules, one for each branch.

To define a rule, a concise notation is used:

H :- G1, ..., Gm | B1, ..., Bn.        (m, n ≥ 0)

where H is the rule head and has the same form as a process, :- is the implies operator, the Gs are the rule guard, and the Bs are the rule body.


Before a process can be reduced (removed from the pool of processes) two preconditions must be satisfied. The first is matching. That is, the data structure of the process must match (in number, position, and data type) the data structure of some rule head. Secondly, each element of the rule guard specifies a logical condition that must be satisfied by the process data state before the rule will be executed. The guards correspond to Fortran IF tests.

If these two conditions are satisfied, the invoking process is removed from the pool of unexecuted processes, it changes state to one of B1, ..., Bn (selected randomly), and forks the remaining (n - 1) B processes (that is, it adds them to the pool of unexecuted processes).

The following simple example illustrates some of the Strand concepts. This Strand program will find the maximum element in a list and put it in M. The program is intended to illustrate the Strand philosophy, not to demonstrate the most efficient way such a problem could be solved in Strand.

% R1  Initialize with the first element
max([X|Xs], M) :-
    max1(Xs, X, M).

% R2  Replace the maximum to date
max1([Y|Ys], Q, M) :- Y >= Q |
    Q1 := Y, max1(Ys, Q1, M).

% R3  Examine next element
max1([Y|Ys], Q, M) :- Y < Q |
    max1(Ys, Q, M).

% R4  Set global maximum
max1([], Q, M) :- M := Q.

Table 1 shows a possible execution sequence for this program. It can be seen that a Strand program is driven by data availability, not process synchronization.

Table 1

Step  Process chosen  Result               Process pool              Comments
0     -               -                    max([7, 1, 9], M)         Only R1 can be applied
1     1               Change state         max1([1, 9], 7, M)        Only R3 can be applied
2     1               Change state         max1([9], 7, M)           Only R2 can be applied
3     1               Change state & fork  Q1:=9, max1([], Q1, M)    Either process can be selected
4     2               Change state         Q1:=9, M:=Q1              Only R4 can be applied
5     1               Terminate            M:=9                      Apply intrinsic assignment
6     1               Terminate            Empty, and M = 9          Program terminates


If the process pool initially was max([7, 1, 9], M1), max([3, 6, 1, 2], M2), max([M1, M2], M), then the program could interleave the selection of the three sets of processes generated by each of the original max processes. The final result will, of course, be that M = 9.

Using the ability to define data structures, Strand processes can interact with existing code written in Fortran (and other languages). This permits the reuse, in the Strand environment, of the large investment already made in existing Fortran code.

If multiple processors are available they may all be selecting processes from the pool of unexecuted processes. This is how parallelism may be achieved in Strand. In fact, Strand has specific directives for defining common structures for parallel architectures. Thus, a ring or a torus of processors may be defined and processes may be directed to execute on different processors. In the second example above, two or more processors could be used to process the sets of max processes being generated.

We appreciate the assistance provided to us by Timothy G. Mattson of Strand Software Technologies Inc. Any errors in the description of Strand in this paper are, of course, the fault of the authors.

7.2. Comments on Strand

Strand is based on the work of Foster and Taylor[8] and is currently available from Strand Software Technologies Inc.[18].

While Strand, of the ideas presented here, is the furthest conceptually from Fortran, it is straightforward once the different philosophy is grasped. It has the advantage of permitting the use of existing code in other languages, Fortran and C, for example. It is available for Apollo, Cogent, Encore Multimax, Intel iPSC/2 and iPSC/860, Macintosh, NeXT, Sequent, and Sun computers, and transputer-based systems.

8. CONCLUSIONS

An analytical overview of the portable multi-tasking tools presently available for parallel Fortran programming has been presented. The four most commonly used methods for developing portable multi-tasking programming tools are covered: (i) a multitasking library, (ii) compiler directives, (iii) extending the Fortran language, and (iv) using Fortran as a subset of a higher-level language.

At present, it does not seem possible to draw conclusions about which one of the techniques presented here will prevail. The main factors determining this will be the availability of the tool on a wide variety of multiprocessor systems, ease of use, and efficiency. We will address the last two issues in a subsequent paper. We expect all the approaches described here to be vigorously pursued.

Finally, we note that there is ordinarily some performance degradation when using the portable multi-tasking tools described in this paper: some performance is sacrificed for the sake of portability. This issue has not been addressed here and deserves careful investigation in the future.

REFERENCES

1. S. Ahuja, N. Carriero and D. Gelernter, 'Linda and her friends', IEEE Computer, No. 19, 26-34 (1986).

2. Alliant Computer Systems Corporation, Alliant FX/Series, Product Summary, Acton, MA, June 1985.

3. Clifford Arnold, ETA Systems Multiprocessing Library Specifications, ETA Systems Inc. (Internal Report) (1987).

4. BBN Inc., The Butterfly Parallel Processor, (BBN Report) May 1986.

5. Cray Research, Inc., Multitasking User Guide, Document #SN 0222, Cray Research, Inc. (1986).

6. Kent Crispin and Robert Strout, NSYSLIB Library Reference Manuals, LCSD 912, Lawrence Livermore National Laboratory (1985).

7. Jack Dongarra and Danny Sorensen, SCHEDULE: Tools for Developing and Analyzing Parallel Fortran Programs, ANL/MCS TM 86, Argonne National Laboratory, Math. and Computer Science Division (1986).

8. Ian Foster and Steven Taylor, STRAND: New Concepts in Parallel Programming, Prentice Hall, Englewood Cliffs, NJ (1990).

9. D. Gelernter, N. Carriero, S. Chandran and S. Chang, 'Parallel programming in Linda', Proceedings of the 1985 International Conference on Parallel Processing, IEEE Computer Society, 255-263 (1985).

10. W. D. Hillis, The Connection Machine, The MIT Press, Cambridge, MA.

11. Intel Scientific Computers Inc., The iPSC Multiprocessor, Beaverton, OR.

12. Harry Jordan et al., Force User's Manual, Department of Electrical and Computer Engineering, University of Colorado (1986).

13. Bruce Kelly, MAT: Multitasking Analysis Tool, LCSD 347, Lawrence Livermore National Laboratory (March, 1986).

14. Kuck and Associates, KAP: KAP/Sequent User's Guide, Kuck and Associates Inc., Champaign, IL (1989).

15. Parallel Computing Forum, PCF Fortran: Language Definition, version 1, B. Leasure, Editor, 16 August (1988).

16. Mark Seager et al., Graphical Multiprocessing Analysis Tool (GMAT), Lawrence Livermore National Laboratory, document #ISCR 87 2 (1987).

17. Sequent Computer Systems, Guide to Parallel Programming (1987).

18. Strand Software Technologies, A General Purpose Programming System for Concurrent Computers, Watford, Hertfordshire, UK.

19. S. Taylor, Parallel Logic Programming Techniques, Prentice Hall, Englewood Cliffs, NJ (1989).