

[IEEE Second Euromicro Workshop on Parallel and Distributed Processing, Malaga, Spain, January 26-28, 1994] Proceedings, Second Euromicro Workshop on Parallel and Distributed Processing

The Mathematician's Devil: An Experiment in Automating the Production of Parallel Linear Algebra Software

P. Milligan, R. McConnell* and T. J. G. Benson

Department of Computer Science, The Queen's University of Belfast, N. Ireland. *Parallel Computer Science, The Queen's University of Belfast, N. Ireland.

Abstract

The Mathematician's Devil is a prototype environment for the development of sequential linear algebra programs for execution on transputer based multiprocessor architectures. Programs are expressed using the SIMPL notation and the system will parallelize the input program and match the parallelized components with optimized parallel routines selected from a library. The output takes the form of a list of recommendations, i.e. a list of parallel routines together with the required target topologies.

1: Introduction

The last decade saw the emergence of trends in hardware design that resulted in the development of computer systems that can be grouped under the broad title of multiprocessors, e.g. the INTEL iPSC, the Meiko Computing Surface and networks of INMOS transputers. These systems cost considerably less than the old supercomputers, i.e. the vector and array processors, and offered supercomputer performance levels. Hence they appeared to be a very attractive proposition to any user seeking high processing capability at a relatively low cost.

Furthermore the development of parallel computer architectures appears to continue almost unabated. The early systems composed of relatively simple components such as transputers are being replaced by sophisticated architectures offering increased processing potential. Recently the major supercomputer manufacturers have announced new product lines, the so-called massively parallel processors, which promise achievable teraflop performance levels.

While the hardware side of parallel computer systems appears to move from strength to strength the same cannot be said of the supporting software. On the earlier systems users were expected to adapt to environments such as the Transputer Development System (TDS) [1]. While this system offered little in the way of difficulty to users from a Computer Science background it proved to be a major stumbling block to many traditional supercomputer users, e.g. theoretical physicists, numerical mathematicians, engineers, etc. One of the fundamental requirements of several generations of TDS was the use of Occam as a programming language. Unfortunately the vast majority of the user base was confirmed in the use of Fortran and there was a high degree of resistance to Occam and similar languages.

The objection to the use of a programming language other than Fortran was understandable for a variety of reasons:

(i) Fortran offered a stable programming environment that did not require the explicit expression of parallelism,

(ii) the existing supercomputers provided good ‘vectorizing’ compilers and as a result there was no need to understand any form of parallelization process,

(iii) a considerable number of man-years would be required to translate existing 'dusty deck' Fortran codes,

(iv) the lack of uniform or general environments for parallel programs meant users had to adapt their approach to program development, and

(v) the majority of the users were domain experts rather than system experts, i.e. they were expert physicists or mathematicians but had very little or no expertise in the fine tuning required to extract the best performance from parallel systems.

To overcome these and other problems requires the provision of an interactive, portable programming environment that will bridge the gap between potential users and the 'new' parallel architectures. Early research in this area tended to follow one of three directions, namely:

(i) the provision of autoparallelization tools to automatically vectorize and parallelize the loop constructs of existing sequential programs,

(ii) the provision of parallel versions of existing sequential languages, e.g. Fortran [2] and C [3], and

0-8186-5370-1/94 $3.00 © 1994 IEEE


(iii) the development of libraries of parallel subroutines, i.e. subroutines specially optimized for specific system architectures [4,5].

It was against this background that work on the Mathematician’s Devil project commenced.

2: The Mathematician’s Devil

The term Mathematician's Devil is coined from the term printer's devil. In the early printing houses of Europe the printer was considered to be too important to undertake tasks such as mixing ink or cutting paper. These menial tasks were left to an assistant who was referred to as the printer's devil. In the same way the work of the computational physicist or engineer should not be made more complex by the need to understand the vagaries of different parallel architectures. The work of parallelizing a program should be the responsibility of an assistant, albeit a very expert assistant. The Mathematician's Devil was intended to fill this role.

To attempt to develop a system that would be sufficiently general in that it would undertake the task of parallelizing any user program represented a major undertaking. Rather than pursue this course of action the development of a prototype was proposed. The prototype would concentrate on a specific area and would act as a test bed for experimenting with different development strategies.

3: The Prototype System

The prototype version of the Mathematicians Devil concentrates on the development of parallel linear algebra code for execution on networks of transputers. Linear algebra was selected as the target development area as it is rich in the use of matrix manipulation techniques which are central to the solution of many computationally demanding problems. Transputer networks were selected as the target topology as several such systems were readily available. In addition there was ongoing work concerned with the development of libraries of parallel routines for the transputer networks [6,7].

An important feature of the prototype was that it would not produce source code ready for compilation and execution on the target system. Instead the output would take the form of a list of recommendations. A recommendation ties a component of the original user's program to a specific parallel routine selected from a library. In addition the nature of the target architecture is suggested by the system, i.e. it will identify how many transputers should be used and how they should be arranged.

The system can be viewed as a number of distinct components. There has to be a mechanism for programming the target topology. This is provided by a special purpose programming notation called SIMPL. Once a program is written it is converted into an internal representation, transformed and mapped onto the target topology. The various strategies associated with these processes are described in the following sections.

4: The User Interface

The developed prototype offers very limited interface facilities. Only two components are available, namely a programming language and the recommendations list produced as output by the system.

The programming notation, referred to as SIMPL, offers a user the ability to input a precise definition of a solution to a linear algebra problem. Initially, as is the case with most languages, SIMPL enabled a user to give data descriptions and associated data manipulations. In addition to these two standard components the language contained a third section, namely a topology description section. This third element is necessary as the process of selecting appropriate matrix manipulation subroutines from a library will be driven in part by the nature of the target topology.

The programming and data structuring facilities of SIMPL are a subset of Pascal, apart from a few notable exceptions. In general, Pascal programs are made up of type definition and variable declaration sections followed by code description sections (procedures, functions and main bodies). SIMPL programs are described in a similar manner. They are regarded as a composite of three distinct sections, namely a definition section, a code section and a topology section.

As the name suggests the definition section will contain all of the necessary constant, type and variable declarations required within a specific program. The more elaborate type description facilities of Pascal (e.g. array, set, record and pointer types) are not supported in SIMPL. In keeping with the chosen application area a new type, matrix, is introduced together with a broad range of matrix classification primitives, e.g. diagonal, tridiagonal, sparse, symmetric, skew-symmetric, positive definite, etc.

By combining these primitives a user is able to give precise definitions of any matrix and where the laws of linear algebra preclude certain types of matrix, e.g. symmetric upper triangular, syntax checking within the SIMPL compiler will flag these as errors.
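As an illustration only, the kind of consistency check described above could be sketched as follows. The paper does not publish the compiler's rules, so the primitive names and the conflict table here are assumptions:

```python
# Hypothetical sketch of SIMPL-style matrix classification checking.
# The conflict table is illustrative, not the compiler's actual rules;
# it lists pairs of primitives that cannot both describe one matrix.
CONFLICTS = {
    frozenset({"symmetric", "uppertriangular"}),
    frozenset({"symmetric", "lowertriangular"}),
    frozenset({"symmetric", "skewsymmetric"}),
}

def check_matrix_properties(props):
    """Return every pair of declared properties that conflicts."""
    names = sorted(props)          # deterministic pair ordering
    errors = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if frozenset({a, b}) in CONFLICTS:
                errors.append((a, b))
    return errors
```

A declaration combining symmetric and uppertriangular would then be flagged as an error, while a combination such as sparse symmetric passes.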

Only a limited set of programming constructs are provided by SIMPL. These are assignment-statements, if-statements, while-statements, compound-statements and simple input/output operations. In the initial version of the language no provision is made for user definition of procedures or functions. If these are required they can be coded and held in the library of parallel routines. To assist the user in the development of useful programs a set of matrix manipulation functions are provided. These are divided into three groups, namely

(i) those functions which are applicable to individual elements of a matrix, e.g. Sin, Sqrt, ScalarMult, etc.,

(ii) those functions which can be viewed as operating on the matrix as a whole, e.g. Transpose and

(iii) those functions which describe the interaction of matrices, e.g. Mult, LinEqSolve, etc.

A full description of the matrix definition facilities and the manipulation functions can be found elsewhere [8]. This range of functions means that a user is never required to consider the detailed coding requirements of mathematical methods, e.g. to express methods such as QR factorization or inversion the following SIMPL code statements are all that is required

D := eigenvalue (B) ;
A := inverse (B) ;

As mentioned above the system will map the user requirements onto appropriate library subroutines. In many cases, such as the use of the eigenvalue function, more than one method (subroutine) may be appropriate. The choice of 'appropriate' routine will depend on the operand matrix type and the nature of the target topology.

The final component of a program contains a description of the user's target topology. SIMPL offers six standard topologies, namely free, hypercube, mesh, pipe, ring and twotree. So for example free(16) denotes that the target topology is made up of sixteen transputers which are free to be arranged as required whereas pipe(8) would indicate that eight transputers were available but that they were arranged as a pipeline of processors. Twotree denotes a binary tree topology. The provision of two additional reserved words, array and vector, gives the users the power to describe a wide range of MIMD, SIMD and vector systems, e.g.

(i) free (1, array (32, 32)) denotes a single node comprised of a 32x32 bit processor array, e.g. an AMT DAP 510,

(ii) pipe (1, vector (64)) denotes a single node comprised of a 64-bit vector processor, e.g. a CRAY-1 and

(iii) free (4, vector (64)) denotes four nodes each comprised of a 64-bit vector processor, e.g. a CRAY X-MP.

Although the intended targets for the prototype system are networks of transputers, as the above examples show, the topology description facilities allow a wide range of machine types to be defined.
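The topology terms above can be modelled very simply. The following sketch shows one possible representation; the names `Topology` and `Node` are invented here for illustration and are not part of SIMPL:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str = "transputer"   # "transputer", "vector" or "array"
    width: int = 1             # vector length, or array rows
    depth: int = 1             # array columns (array nodes only)

@dataclass
class Topology:
    shape: str                         # "free", "pipe", "ring", "mesh", ...
    count: int                         # number of nodes available
    node: Node = field(default_factory=Node)  # what each node is

# free(16): sixteen transputers, free to be arranged as required
free16 = Topology("free", 16)
# free(4, vector(64)): four 64-bit vector processors, e.g. a CRAY X-MP
xmp_like = Topology("free", 4, Node("vector", 64))
```

Representing the topology as plain data like this is what lets the later selection stage compare a routine's required topology against what the user declared.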

The second interface component of the system is the recommendations list or process-to-processor map. This is the final listing produced by the current version of the system and is best described by a simple example. Consider the following statement

CommonMatrix := Mult (Transpose (MatrixA), MatrixC)

which describes the multiplication of the transpose of MatrixA by MatrixC. If the stated target topology was a ring of four processors then the corresponding map would be

level (2, [choice("General method", "RSGETPS", ring(4), [1,2,3,4], 6.000000, transpose(8,8,0,real), ["MatrixA"], temp(4))]),

level (3, [choice("Sliced method", "REGEMUL", ring(4), [1,2,3,4], 1536.000000, mult(8,8,0,real,8,8,0,'diagonal'), [temp(4),"MatrixC"], "CommonMatrix")])

This shows that the required transpose of MatrixA is being performed on the ring and that the result will be placed in a temporary variable, temp(4). Subsequently temp(4) and MatrixC are multiplied again using all four processors in the ring. The phrase 'sliced method' at the start of the second statement is a simple text indicator to the user flagging the type of multiplication algorithm that was selected from the library. The following component, REGEMUL, is the name of the chosen subroutine. This information can be used to construct the appropriate program.

However between the input of a user's program and the output of the associated recommendations list a considerable amount of work is carried out on the program. This work is described in the following sections.

5: Source Handling

This section of the system has two primary functions, namely to ensure that input programs conform to the programming language definition in terms of syntax and semantics and to convert the input program to the intermediate representation used throughout the transformation process. The first of the functions is achieved via the use of a standard recursive descent



analyser as described by Gries [9]. This analyser forms the basis of the process which constructs the graph based representation (the intermediate or internal representation) of a user program. By a simple process of enrichment additional function calls are added to the analyser and these functions construct the program graph. The graph can be viewed as a representation of the syntactic structure of a user program in that it is a list of statement-descriptor tuples. Tuples are of four types, namely assign-tuples, ifthenelse-tuples, while-tuples and call-tuples reflecting the statement types of the SIMPL language. To illustrate this representation consider the following variable declarations and SIMPL statement:

G, F : matrix [8, 8] of real ;
G := eigenvectors (F) ;

Analysis of this statement would result in the construction of the following graph tuple:

assign (variable ("G", matrix (8, 8, 0, realtype)),
        funccall ("eigenvectors", variable ("F", matrix (8, 8, 0, realtype)),
                  matrix (8, 8, 0, realtype)))

As can be observed the tuple (an assign-tuple) accurately represents the original assignment statement containing details of the variables on the left and right hand sides of the assignment and recording the function called. A list of tuples like this example forms the input to the next stage of the system, namely the parallelization process.
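For illustration, a tuple of this shape can be built mechanically from small constructor functions. This is a sketch in Python assuming a simplified record layout, not the system's actual representation:

```python
# Constructors mirroring the paper's tuple notation; the exact
# field layout is an assumption for illustration.
def matrix_type(rows, cols, props, base):
    return ("matrix", rows, cols, props, base)

def variable(name, vtype):
    return ("variable", name, vtype)

def funccall(name, arg, result_type):
    return ("funccall", name, arg, result_type)

def assign(lhs, rhs):
    return ("assign", lhs, rhs)

# G := eigenvectors (F), with G, F : matrix [8, 8] of real
m8 = matrix_type(8, 8, (), "realtype")
tup = assign(variable("G", m8),
             funccall("eigenvectors", variable("F", m8), m8))
```

Keeping the graph as nested tuples like this makes later passes (dependence scanning, level labelling) simple structural walks over a list of such records.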

6: Parallelization

There are many examples of research work concerned with the topic of parallelization of programs. In general the work can be organized into distinct areas. Some people work on fundamental issues such as data dependence analysis and removal, e.g. Kuck [10], or the reorganization of data to minimize memory contention, e.g. Dongarra [11] while others package this material to produce complete tools such as compilers that offer autoparallelization facilities, e.g. Zima [12].

The parallelization process used in the Mathematician's Devil [13] is considerably simpler but is sufficient given the nature of the programming model being used. As mentioned above the SIMPL language makes extensive use of libraries of parallelized code and as a result any programmer who makes use of these routines implicitly develops a parallel program. However this is not the only parallelism offered by the system.

In a paper published a considerable time ago Ramamoorthy and Gonzalez [14] reviewed a number of

simple algorithms for parallelizing programs. At a trivial level these algorithms function perfectly well, e.g. consider the simple statement

a := b*c + d*e + f/g - h^i

then by applying several of the techniques this can be reorganized into a set of sub-statements

temp1 := b*c
temp2 := d*e
temp3 := f/g
temp4 := h^i

and as there are no inter-statement dependences these simple statements could be executed in parallel. Subsequently the temporary results can be combined in parallel in two stages, namely

temp5 := temp1 + temp2
temp6 := temp3 - temp4

and

temp7 := temp5 + temp6.

In other words a program is restructured into a series of levels where the components in a level can be executed in parallel but the levels must be executed sequentially. In addition the actual algorithms that perform this splitting operation are very straightforward and are easily grasped and implemented. However the usefulness of the activity depends on the nature of the data objects. If these are simple objects, e.g. simple variables, then the parallel activity is useless as the overheads in communication, i.e. transferring the data to independent processors and copying the results back, would completely dominate the activity. However if the objects were coarse grained and if the operations were complex activities rather than simple additions then the process may be useful. Hence in the Mathematician's Devil where the data objects are large vectors or matrices and the operations are correspondingly time consuming, e.g. the calculation of eigenvectors, such a simple scheme is viable.

An approach based on these ideas, i.e. developing a hierarchy of processing levels, was adopted for use within the Mathematician's Devil. In doing so a two-level parallel model was produced. As the trivial example illustrates the graph based representation of a program will be reorganized into a hierarchy with parallel activity within each level. In addition as the system supports the use of parallelized routines for many of the matrix manipulations an additional layer of parallelism is provided.
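The Ramamoorthy and Gonzalez style of restructuring can be sketched as pairing up independent terms level by level, assuming the sub-expressions have already been evaluated into separate results. The function name and data shapes below are invented for illustration, and the temporary numbering differs from the worked example above:

```python
# Sketch of tree-height reduction over a left-to-right chain of terms.
# terms: the independent sub-expressions (e.g. the computed products);
# ops:   the operators joining consecutive terms.
def split_into_levels(terms, ops):
    """Return a list of levels; each level is a list of
    (target, left, op, right) statements executable in parallel."""
    levels, n = [], 0
    work, pending_ops = list(terms), list(ops)
    while len(work) > 1:
        level, nxt, nxt_ops = [], [], []
        i = 0
        while i < len(work):
            if i + 1 < len(work):                 # pair two neighbours
                n += 1
                t = f"temp{n}"
                level.append((t, work[i], pending_ops[i], work[i + 1]))
                nxt.append(t)
                if i + 1 < len(pending_ops):      # carry the joining op up
                    nxt_ops.append(pending_ops[i + 1])
                i += 2
            else:                                  # odd term carried up as-is
                nxt.append(work[i])
                if i < len(pending_ops):
                    nxt_ops.append(pending_ops[i])
                i += 1
        levels.append(level)
        work, pending_ops = nxt, nxt_ops
    return levels
```

For the four product terms of the example, the first returned level combines the pairs in parallel and the second level produces the final result, giving the sequential-level, parallel-within-level structure described above.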

The first phase of the parallelization process is a scan of the user program graph to identify and remove data


dependences. During this process each tuple is expanded to contain a unique label and a data dependence list, i.e. a list of all tuples that must be executed before the current tuple can be executed. The resultant restructured graph has the form of a series of graph levels. Each level represents one stage in the execution hierarchy.

This restructured graph or parallel graph is passed to the final phase of the parallelization process. The output from this phase is a process-to-processor map. It is this map that represents a distributed parallel version of the original program.

The map is produced by the action of two distinct algorithms which are referred to as the selection algorithm and the distribution algorithm. The function of the selection algorithm is to transform the input parallel graph into a form referred to as a subroutine-selection graph. This process is composed of two stages, namely a labelling stage and an allocation stage. The distribution algorithm accepts as input a subroutine-selection graph and outputs the process-to-processor map.

The selection algorithm requires three inputs, namely

(i) a parallel graph produced by the graph transformation component representing the parallelized form of a user's program;

(ii) a topology graph defining the target architecture derived from a user’s program, and

(iii) a library graph defining the available subroutine libraries derived from the system specification.

As mentioned in the introduction the prototype system does not permit the use of any library other than the transputer oriented version developed in-house. As a result the library graph is fixed and the topology and parallel graphs are user program dependent. The parallel graph forms the input to the labelling stage of the process and the output labelled parallel graph together with the topology and library graphs form the inputs to the allocation stage. All of the tuples in a parallel graph have an element

which describes their data dependence relationship with the other tuples in the graph. This information is referred to as a precondition-list. The labelling algorithm scans the parallel graph to find all operations which contain empty precondition-lists. Such tuples require the pre-execution of no other tuples and the null precondition-lists are replaced by a level one label. During this scan a secondary operation removes all references to any labelled level one tuple from other tuple precondition-lists. To illustrate this consider the following graph fragment

assignment (1, [], variable("A", inttype), mult-op,
    variable("B", inttype), variable(temp(1), inttype))
assignment (2, [], variable("C", inttype), mult-op,
    variable("D", inttype), variable(temp(2), inttype))
assignment (3, [1,2], variable(temp(1), inttype), add-op,
    variable(temp(2), inttype), variable(temp(3), inttype))

which represents the parallel graph that would have been produced from the trivial SIMPL code used above

A*B + C*D + ....

After the first scan of the labelling algorithm the graph fragment would be transformed to

assignment (1, 1, variable("A", inttype), mult-op,
    variable("B", inttype), variable(temp(1), inttype))
assignment (2, 1, variable("C", inttype), mult-op,
    variable("D", inttype), variable(temp(2), inttype))
assignment (3, [], variable(temp(1), inttype), add-op,
    variable(temp(2), inttype), variable(temp(3), inttype))

Note that the precondition-lists for the first two statements have been replaced by level 1 labels and as a result the precondition-list for the third statement has become empty. This graph forms the input to the next scan of the labelling algorithm where all null precondition-lists will be replaced by level two labels and so on until all precondition-lists have been replaced by appropriate labels. The final version of the simple example given above would be

assignment (1, 1, variable("A", inttype), mult-op,
    variable("B", inttype), variable(temp(1), inttype))
assignment (2, 1, variable("C", inttype), mult-op,
    variable("D", inttype), variable(temp(2), inttype))
assignment (3, 2, variable(temp(1), inttype), add-op,
    variable(temp(2), inttype), variable(temp(3), inttype))

When compound and repetition tuples occur they are treated in a straightforward manner. They are regarded as single composite entities during a scan with only the overall precondition-list being checked and altered as appropriate. Once the overall precondition-list becomes null then the labelling algorithm is applied to each sub-precondition-list within the compound entity. This creates a set of appropriate sub-levels.
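The labelling scans described above amount to a level-by-level topological sort over the precondition-lists. A sketch, under the assumption that each tuple is reduced to an identifier plus its precondition set:

```python
# Sketch of the labelling stage: repeatedly give the current level
# number to every tuple whose precondition-list is empty, then delete
# those tuples from the remaining precondition-lists.
def label_levels(tuples):
    """tuples: {tuple_id: set of tuple_ids it depends on}.
    Returns {tuple_id: level}, with levels starting at 1."""
    remaining = {tid: set(pre) for tid, pre in tuples.items()}
    levels, current = {}, 1
    while remaining:
        ready = [tid for tid, pre in remaining.items() if not pre]
        if not ready:
            raise ValueError("cyclic dependences: no tuple is ready")
        for tid in ready:
            levels[tid] = current
            del remaining[tid]
        for pre in remaining.values():     # secondary operation:
            pre.difference_update(ready)   # remove labelled tuples
        current += 1
    return levels
```

Applied to the three-statement fragment above (statement 3 depending on 1 and 2), this yields level 1 for the first two statements and level 2 for the third, matching the worked example.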

The function of the allocation stage is to create the most efficient distribution of operations from each level given the overall target topology. Within the current version of the prototype four types of topology are possible, namely

(i) single, fixed topologies, e.g. mesh (4, 6),

(ii) multiple, fixed topologies, e.g. mesh (4, 6); pipe (5); ring (8),

(iii) free topologies, e.g. free (20), and

(iv) hybrid topologies, i.e. topologies involving a mix of



fixed and free components, e.g. mesh (4, 6); free (20).

Obviously when dealing with single fixed and multiple fixed topologies only those subroutines which match with the required topology will be selected. The third group of topologies, free topologies, allows the topology to be configured into any of the defined fixed topology forms. When this form of topology is chosen then all subroutines for a particular operation must be considered. This has the effect of widening the search considerably. As a result the allocation stage will take a subroutine-selection graph, a library graph and a topology graph and produce a suitable map. This map is produced with the aid of an expert system. The strategy used to produce a map can be thought of as being analogous to a curve fitting algorithm and can be described in terms of high-level pseudo-code as

function AdaptedFittingAlgorithm ;

  function FitSelections (PreviousListOfSelections, TopologyAvailable) : ListOfSelections ;
  begin
    repeat
      SmallerTopology := DecreaseTopologyByOneUnit (TopologyAvailable) ;
      ListOfSelections := SelectFastestSubroutines (PreviousListOfSelections, SmallerTopology) ;
      SlowestSelection := FindSlowestSelection (ListOfSelections) ;
      if TimeRequired (SlowestSelection) > CurrentSlowestTime then
        FailThisAttemptFlag := true
      else
        FailThisAttemptFlag := false ;
        RemainingTopology := SubtractATopology (TopologyAvailable, SlowestTopology) ;
        RemainingSelections := SubtractASelection (ListOfSelections, SlowestSelection) ;
        SubListOfSelections := FitSelections (RemainingSelections, RemainingTopology)
      endif
    until FailThisAttemptFlag or not EmptyList (SubListOfSelections) ;
    if FailThisAttemptFlag then
      return EmptyList
    else
      return Append (SlowestSelection, SubListOfSelections)
    endif
  endfunction FitSelections ;

begin
  InitialSelections := SelectFastestSubroutines (FullListOfOperations, FullTopology) ;
  FinalListOfSelections := FitSelections (InitialSelections, FullTopology) ;
  return FinalListOfSelections
endfunction AdaptedFittingAlgorithm ;

The final phase of the selection/distribution module is the distributor which can be described as a series of steps that are applied separately to each level of operations:

(i) sort the operations into descending order given the size of the allocated topology;

(ii) sort the target topology description into ascending order of size;

(iii) select the first operation and the first element of the target topology;

(iv) attempt to match the operation onto a portion of the topology, and

(v) delete the first operation and the first target topology from their respective lists and repeat step two above.
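The steps above can be sketched as a greedy matcher. The data shapes used here (operations as name/processor-count pairs, the topology as a list of piece sizes) are assumptions for illustration:

```python
# Sketch of the per-level distribution steps: largest operations first,
# smallest suitable topology pieces first, matched greedily.
def distribute_level(operations, topology_pieces):
    """operations: [(name, processors_needed)];
    topology_pieces: [piece_size].
    Returns [(name, piece_index)] placements."""
    ops = sorted(operations, key=lambda op: op[1], reverse=True)      # step (i)
    pieces = sorted(enumerate(topology_pieces), key=lambda p: p[1])   # step (ii)
    placed = []
    for name, need in ops:                         # steps (iii) and (v)
        for k, (idx, size) in enumerate(pieces):
            if size >= need:                       # step (iv): match found
                placed.append((name, idx))
                del pieces[k]                      # piece is now consumed
                break
    return placed
```

Sorting the operations descending and the pieces ascending means each operation takes the smallest piece that fits it, which is what keeps the larger pieces available for the larger operations.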

This material has been described elsewhere [15] with the aid of some examples. The full system is implemented using Prolog on a workstation and is described elsewhere [16].

7: Evaluation

When setting out to evaluate the prototype it is all too easy to identify the many failings of the system, namely

(i) by avoiding Fortran the likely user base is substantially reduced,

(ii) by selecting a limited application area the user base is reduced,

(iii) no explicit code is generated, and

(iv) the user interface is poor, etc.

In spite of these obvious problems the system can be regarded as a successful experiment. A working version of the prototype was produced and as a result useful information on the implementation of data dependence analysis and removal techniques was gained. In addition the use of an expert system to assist with the selection/mapping element of the parallelization process was successful and demonstrated the viability of this approach.

8: Conclusions

The main purpose in developing a prototype is to provide an experimental base from which useful knowledge can be derived and this was the case with the Mathematician’s Devil. As a result of this work a new



system, called FortPort [17], has been specified and is under development. The new system offers a development and migration environment for Fortran codes and will provide user friendly interfaces to a range of analysis, parallelization, evaluation and visualization tools.

References

[1] Inmos Limited, IMS D700 Transputer Development System, November 1985.

[2] Inmos Limited, D711D 3L Parallel Fortran, February 1989.

[3] Inmos Limited, D713D 3L Parallel C, April 1989.

[4] N G Brown, L M Delves, C Howard, S Downing and C Philips, "Numeric Library Development for Transputer Arrays", International Conference on the Applications of Transputers, Liverpool, August 1989.

[5] M Louter-Nool, "Basic Linear Algebra Subprograms (BLAS) on the CDC Cyber 205", Parallel Computing 4 (1987), pp 143-165, North-Holland.

[6] M M Chesney, "A Matrix Manipulation Library for a MIMD Environment", MSc Dissertation, The Queen's University of Belfast, September 1989.

[7] G E Moore, "A MIMD Matrix Manipulation Library", MSc Dissertation, The Queen's University of Belfast, September 1990.

[8] T J G Benson, P Milligan and N S Scott, "Program Development within the Mathematician's Devil", Microprocessing and Microprogramming, Vol. 30, pp 593-597, August 1990.

[9] D Gries, Compiler Construction for Digital Computers, Wiley, 1971.

[10] D J Kuck, R H Kuhn, D A Padua, B Leasure and M Wolfe, "Dependence graphs and compiler optimizations", Proc. 8th ACM Symposium on Principles of Programming Languages, pp 207-218, 1981.

[11] J J Dongarra, O Brewer, J A Kohl and S Fineburg, "A Tool to Aid in the Design, Implementation and Understanding of Matrix Algorithms for Parallel Processors", Journal of Parallel and Distributed Computing, 9, pp 185-202, 1990.

[12] H Zima, "Automatic vectorization and parallelization for supercomputers", in Software for Parallel Computers, pp 107-120, Chapman and Hall, 1992.

[13] P Milligan, T J G Benson, R McConnell and A Rea, "Detecting Components for Parallel Execution within the Mathematician's Devil", Microprocessing and Microprogramming, Vol. 37, pp 65-68, January 1993.

[14] C V Ramamoorthy and M J Gonzalez, "A survey of techniques for recognizing parallel processable streams in computer programs", Proc. AFIPS Fall Joint Computer Conference, pp 1-15, 1969.

[15] P Milligan, T J G Benson, R McConnell and A Rea, "Process to Processor Mapping within the Mathematician's Devil", to appear, Microprocessing and Microprogramming.

[16] T J G Benson, "Towards the development of a Mathematician's Assistant for the specification and implementation of parallel linear algebra software", PhD Thesis, The Queen's University of Belfast, September 1992.

[17] P Milligan, R McConnell, S Rea and P P Sage, "FortPort: An Environment for the Development of Parallel Fortran Programs", Microprocessing and Microprogramming, Vol. 34, pp 73-76, 1992.
