
  • Publication Series of the John von Neumann Institute for Computing (NIC)

    NIC Series Volume 1

  • John von Neumann Institute for Computing (NIC)

    Johannes Grotendorst (Editor)

    Modern Methods and Algorithms of Quantum Chemistry

    Winterschool, 21 - 25 February 2000

    Forschungszentrum Jülich, Germany

    Proceedings

    organized by

    John von Neumann Institute for Computing

    in cooperation with

    Arbeitsgemeinschaft für Theoretische Chemie

    NIC Series Volume 1

    ISBN 3-00-005618-1

  • Die Deutsche Bibliothek – CIP-Cataloguing-in-Publication-Data
    A catalogue record for this publication is available from Die Deutsche Bibliothek.

    Modern Methods and Algorithms of Quantum Chemistry
    Proceedings, Winterschool, 21 - 25 February 2000
    Forschungszentrum Jülich, Germany
    edited by Johannes Grotendorst

    Publisher: NIC-Directors

    Distributor: NIC-Secretariat
    Research Centre Jülich
    52425 Jülich
    Germany

    Internet: www.fz-juelich.de/nic

    Printer: Graphische Betriebe, Forschungszentrum Jülich

    © 2000 by John von Neumann Institute for Computing
    Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise requires prior specific permission by the publisher mentioned above.

    NIC Series Volume 1
    ISBN 3-00-005618-1

  • PREFACE

    Computational quantum chemistry has long promised to become a major tool for the study of molecular properties and reaction mechanisms. The fundamental methods of quantum chemistry date back to the earliest days of quantum mechanics in the first decades of the twentieth century. However, widespread quantitative applications have only become common practice in recent times, primarily because of the explosive developments in computer hardware and the associated achievements in the design of new and improved theoretical methods and computational techniques. The significance of these advances in computational quantum chemistry is underlined by the 1998 chemistry Nobel prize to Walter Kohn and John Pople; this award also documents the increasing acceptance of computer simulations and scientific computing as an important research method in chemistry.

    Nearly one third of the projects which use the supercomputing facilities provided by the John von Neumann Institute for Computing (NIC) pertain to the area of computational chemistry. For projects in quantum chemistry, the Central Institute for Applied Mathematics (ZAM), which runs the Cray supercomputers and networks at the Research Centre Jülich, offers several extensive software packages running on its supercomputer complex. The computational requirements of large quantum-chemical calculations are enormous. They have made the use of parallel computers indispensable and have led to the development of a broad range of advanced algorithms for these machines. The goal of this Winterschool is to bring together experts from the fields of quantum chemistry, computer science and applied mathematics in order to present recent methodological and computational advances to research students in the field of theoretical chemistry and its applications. The participants will also learn about recent software developments and about implementation issues that are encountered in quantum chemistry codes, particularly in the context of high-performance computing (topics not yet included in typical university courses). The emphasis of the Winterschool is on method development and algorithms, but state-of-the-art applications will also be demonstrated for illustration. The following topics are covered by twenty lectures:

    • Density functional theory

    • Ab initio molecular dynamics

    • Post-Hartree-Fock methods

    • Molecular properties

    • Heavy-element chemistry

    • Linear scaling approaches

    • Semiempirical and hybrid methods

    • Parallel programming models and tools

    • Numerical techniques and automatic differentiation

    • Industrial applications

  • The programme was compiled by Johannes Grotendorst (Forschungszentrum Jülich), Marius Lewerenz (Université Pierre et Marie Curie, Paris), Walter Thiel (MPI für Kohlenforschung, Mülheim an der Ruhr) and Hans-Joachim Werner (Universität Stuttgart).

    Fostering education and training in important fields of scientific computing by symposia, workshops, schools and courses is a major objective of NIC. This Winterschool continues a series of workshops and conferences in the field of computational chemistry organized by the ZAM in the last years; it provides a forum for the scientific exchange between young research students and experts from different academic disciplines. The NIC Winterschool will host more than two hundred participants from sixteen countries, and more than fifty abstracts have been submitted for the poster session. This overwhelming international resonance clearly reflects the attractiveness of the programme. The excellent support of the Arbeitsgemeinschaft für Theoretische Chemie in preparing the Winterschool is highly appreciated.

    As in previous conferences, many people have made significant contributions to the success of this Winterschool. The local organization at Research Centre Jülich was perfectly done by Elke Bielitza, Rüdiger Esser, Bernd Krahl-Urban, Monika Marx, Renate Mengels, and Margarete Reiser. We are grateful for the generous financial support by the Federal Ministry for Education and Research (BMBF) and by the Research Centre Jülich, and for the help provided by its Conference Service. We also thank the authors for their willingness to provide a written version of their lecture notes. Special thanks go to Monika Marx for her commitment concerning the composition and realization of these proceedings and the book of abstracts of the poster session. Finally, we are indebted to Beate Herrmann, who supported the typesetting of these books with professionalism and great care.

    Jülich, February 2000                                        Johannes Grotendorst

  • CONTENTS

    Industrial Challenges for Quantum Chemistry

    Ansgar Schäfer 1

    1 Introduction 1

    2 Application Fields of Quantum Chemistry in Industry 2

    3 Unsolved Problems 3

    4 Conclusion 5

    Methods to Calculate the Properties of Large Molecules

    Reinhart Ahlrichs 7

    Parallel Programming Models, Tools and Performance Analysis

    Michael Gerndt 9

    1 Introduction 9

    2 Programming Models 12

    3 Parallel Debugging 18

    4 Performance Analysis 18

    5 Summary 25

    Basic Numerical Libraries for Parallel Systems

    Inge Gutheil 29

    1 Introduction 29

    2 Data Distributions 30

    3 User-Interfaces 33

    4 Performance 36

    5 Conclusions 44

    Tools for Parallel Quantum Chemistry Software

    Thomas Steinke 49

    1 Introduction 49

    2 Basic Tasks in Typical Quantum Chemical Calculations 50

    3 Parallel Tools in Today's Production Codes 52

    4 The TCGMSG Library 53

    5 The Global Array Toolkit 54

    6 The Distributed Data Interface Used in GAMESS (US) 60

    7 Further Reading 63

    Ab Initio Methods for Electron Correlation in Molecules

    Peter Knowles, Martin Schütz, and Hans-Joachim Werner 69

    1 Introduction 69

    2 Closed-Shell Single-Reference Methods 83

  • 3 Open-Shell Single-Reference Methods 99

    4 Linear Scaling Local Correlation Methods 102

    5 Multireference Electron Correlation Methods 112

    6 Integral-Direct Methods 133

    R12 Methods, Gaussian Geminals

    Wim Klopper 153

    1 Introduction 153

    2 Errors in Electronic-Structure Calculations 154

    3 The Basis-Set Error 155

    4 Coulomb Hole 163

    5 Many-Electron Systems 166

    6 Second Quantization 167

    7 Explicitly Correlated Coupled-Cluster Doubles Model 168

    8 Weak Orthogonality Techniques 171

    9 R12 Methods 173

    10 Explicitly Correlated Gaussians 178

    11 Similarity Transformed Hamiltonians 179

    12 MP2-Limit Corrections 180

    13 Computational Aspects of R12 Methods 183

    14 Numerical Examples 188

    15 Concluding Remark 191

    Direct Solvers for Symmetric Eigenvalue Problems

    Bruno Lang 203

    1 Setting the Stage 203

    2 Eigenvalue Computations: How Good, How Fast, and How? 205

    3 Tools of the Trade: Basic Orthogonal Transformations 210

    4 Phase I: Reduction to Tridiagonal Form 213

    5 Phase II: Methods for Tridiagonal Matrices 219

    6 Methods without Initial Tridiagonalization 226

    7 Available Software 228

    Semiempirical Methods

    Walter Thiel 233

    1 Introduction 233

    2 Established Methods 234

    3 Beyond the MNDO Model: Orthogonalization Corrections 238

    4 NMR Chemical Shifts 241

    5 Analytic Derivatives 246

    6 Parallelization 248

    7 Linear Scaling and Combined QM/MM Approaches 250

    8 Conclusions 252

  • Hybrid Quantum Mechanics/Molecular Mechanics Approaches

    Paul Sherwood 257

    1 Introduction 257

    2 Terminology 257

    3 Overview of QM/MM Schemes 258

    4 The Issue of Conformational Complexity 268

    5 Software Implementation 269

    6 Summary and Outlook 271

    Subspace Methods for Sparse Eigenvalue Problems

    Bernhard Steffen 279

    1 Introduction 279

    2 Eigenvalue Extraction 280

    3 Update Procedures 282

    4 Problems of Implementation and Parallelization 284

    5 Conclusions 285

    Computing Derivatives of Computer Programs

    Christian Bischof and Martin Bücker 287

    1 Introduction 287

    2 Basic Modes of Automatic Differentiation 290

    3 Design of Automatic Differentiation Tools 291

    4 Using Automatic Differentiation Tools 293

    5 Concluding Remarks 297

    Ab Initio Molecular Dynamics: Theory and Implementation

    Dominik Marx and Jürg Hutter 301

    1 Setting the Stage: Why Ab Initio Molecular Dynamics ? 301

    2 Basic Techniques: Theory 305

    3 Basic Techniques: Implementation within the CPMD Code 343

    4 Advanced Techniques: Beyond ... 392

    5 Applications: From Materials Science to Biochemistry 418

    Relativistic Electronic-Structure Calculations for Atoms and Molecules

    Markus Reiher and Bernd Heß 451

    1 Qualitative Description of Relativistic Effects 451

    2 Fundamentals of Relativistic Quantum Chemistry 452

    3 Numerical 4-Component Calculations for Atoms 453

    4 A Short History of Relativistic Atomic Structure Calculations 453

    5 Molecular Calculations 463

    6 Epilogue 472

  • Effective Core Potentials

    Michael Dolg 479

    1 Introduction 479

    2 All-Electron Hamiltonian 484

    3 Valence-Only Hamiltonian 487

    4 Analytical Form of Pseudopotentials 493

    5 Adjustment of Pseudopotentials 495

    6 Core Polarization Potentials 497

    7 Calibration Studies 498

    8 A Few Hints for Practical Calculations 501

    Molecular Properties

    Jürgen Gauss 509

    1 Introduction 509

    2 Molecular Properties as Analytical Derivatives 510

    3 Magnetic Properties 527

    4 Frequency-Dependent Properties 545

    5 Summary 553

    Tensor Concepts in Electronic Structure Theory:

    Application to Self-Consistent Field Methods and

    Electron Correlation Techniques

    Martin Head-Gordon 561

  • INDUSTRIAL CHALLENGES FOR QUANTUM CHEMISTRY

    ANSGAR SCHÄFER

    BASF Aktiengesellschaft
    Scientific Computing
    ZDP/C - C13
    67056 Ludwigshafen
    Germany

    E-mail: [email protected]

    The current fields of application of quantum chemical methods in the chemical industry are described. Although there are a lot of questions that can already be tackled with modern algorithms and computers, there are still important problems left which will need further improved methods. A few examples are given in this article.

    1 Introduction

    Already in the 1970's and 80's, quantum chemical methods were very successful in describing the structure and properties of organic and main-group inorganic molecules. The Hartree-Fock (HF) method and its simplified semi-empirical modifications became standard tools for a vivid rationalization of chemical processes. The underlying molecular orbital (MO) picture was, and still is, the most important theoretical concept for the interpretation of reactivity and molecular properties. Nevertheless, quantum chemical methods were not used extensively for industrial problems, although most of the industrial chemistry produces organic compounds. One reason can be found in the fact that almost all industrial processes are catalytic. The catalysts are predominantly transition metal compounds, which in general have a more complicated electronic structure than main-group compounds, since their variability in the occupation of the d orbitals results in a subtle balance of several close-lying energy levels. HF and post-HF methods based on a single electron configuration are not able to describe this situation correctly. Furthermore, the catalyst systems were generally too big to be handled. The usual approach was to choose small model systems, e.g. with PH3 substituting any phosphine ligand or a cluster representing a solid surface. Such investigations provided only a basic understanding of the catalytic reaction, but no detailed knowledge of steric and electronic dependencies.

    With the improvement of both the methodology and the algorithms of density functional theory (DFT) in the last two decades, the situation changed significantly. DFT appears to be less sensitive to near degeneracy of electronic states, and furthermore incorporates some effects of electron correlation. The development of new functionals with an improved description of non-uniform electron distributions in molecules or on surfaces paved the way for a qualitative or even quantitative quantum chemical treatment of a large variety of transition metal compounds and their reactions. When functionals without partial inclusion of HF exchange contributions are used, an approximate treatment of the Coulomb interaction of the electrons (density fitting, resolution of the identity (RI) approach) allows for a very efficient treatment of large systems. Therefore, with efficiently parallelized programs of this kind, it is routinely possible today to calculate the structure of molecules with 100-200 atoms.

    Although the chemical processes in many cases involve transition metal compounds, calculations on pure organic molecules are still important to predict properties like thermodynamic data or various types of spectra. However, for a quantitative agreement between calculated and experimental results, DFT very often is not reliable enough, and approaches going beyond the HF approximation are necessary to assess the effects caused by electron correlation. These methods are computationally very demanding, and highly accurate calculations are still limited to small molecules with not more than about 10 atoms. Therefore, still only few problems of industrial relevance can be tackled by these methods at the moment.

    2 Application fields of quantum chemistry in industry

    2.1 Catalysis

    As already mentioned, many of the industrial chemical processes involve catalysts. Most of the catalysts are in the solid state (heterogeneous catalysis), but, with the extensive developments in organometallic chemistry in the last decades, catalytic processes in the liquid phase become more and more important (homogeneous catalysis). For the development of a catalyst, besides economic considerations, three issues are of central importance:

    • Activity: The catalyst must be efficient (high turnover numbers).

    • Selectivity: By-products should be avoided.

    • Stability: Deactivation or decomposition of the catalyst must be slow.

    To improve a catalyst with respect to these criteria, a detailed understanding of the reaction pathways is necessary. This is one point where quantum chemical methods can be of enormous value, since the experimental investigation of steps in a complicated mechanism is rather difficult. Once the crucial parameters are found, new guesses for better performing catalysts can be deduced and immediately be tested in the calculations. Thus, when theory is used for a rough screening and only the most promising candidates have to be tested in experiment, the development process for new catalysts can be shortened significantly. DFT is used almost exclusively for both homogeneous and heterogeneous applications. In the latter case, solids are treated with periodic boundary conditions or QM/MM approaches.

    2.2 Process design

    The design of chemical processes and plants requires the knowledge of accurate thermodynamic and kinetic data for all substances and reactions involved. The experimental determination of such data is rather time-consuming and expensive. The substances have to be prepared in high purity, and precise calorimetric or kinetic measurements have to be done for well-defined reactions. This effort must be invested before the actual technical realization of a new process, because for economic and safety reasons the data has to be as accurate as possible. However, for an early assessment of the practicability and profitability of a process, a fast but nevertheless reliable estimate for the thermodynamics usually is sufficient. A number of empirical methods based on group contributions are used for this purpose, but they are not generally applicable and often not reliable enough. For example, these methods often cannot discriminate between isomers containing the same number and type of groups. Alternatively, reaction enthalpies and entropies can also be calculated by quantum chemical methods in combination with statistical mechanics. For the rotational and vibrational energy levels, a correct description of the molecular structure and the shape of the energy surface is needed, which can very efficiently be obtained with DFT. The crucial point for the overall accuracy, however, lies in the differences in electronic energies, and high-level ab initio methods like coupled cluster theories are often needed to get the error down to a few kcal/mol. That such a high accuracy is needed can be illustrated by the fact that a change in the Gibbs free energy of only 1.4 kcal/mol already changes an equilibrium constant by an order of magnitude at room temperature.
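
    This order-of-magnitude estimate follows from the standard relation between the equilibrium constant and the Gibbs free energy of reaction (a textbook relation added here for illustration, not part of the original article); with RT ≈ 0.59 kcal/mol at T = 298 K,

      K = \exp\!\left(-\frac{\Delta G}{RT}\right), \qquad
      \frac{K'}{K} = \exp\!\left(\frac{\Delta G - \Delta G'}{RT}\right)
                   = \exp\!\left(\frac{1.4~\mathrm{kcal/mol}}{0.59~\mathrm{kcal/mol}}\right) \approx 10 .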

    2.3 Material properties

    When a desired property of a material can be connected to quantities on the atomic or molecular scale, quantum chemistry can be a useful tool in the process of improving such materials. Typical examples are dyes and pigments, for which color and brilliance depend on the energies and nature of electronic excitations. Organic dyes typically have delocalized pi systems and functional groups chosen appropriately to tune the optical properties. Since the molecules normally are too big for an ab initio treatment with the desired accuracy, semi-empirical methods are currently used to calculate the excited states.

    3 Unsolved problems

    3.1 Treatment of the molecular environment

    Quantum chemical methods are predominantly applied to isolated molecules, which corresponds to the state of an ideal gas. Most chemical processes, however, take place in condensed phase, and the interaction of a molecule with its environment can generally not be neglected. Two important examples are given here.

    3.1.1 Solvent effects

    Solvent molecules can directly interact with the reacting species, e.g. by coordination to a metal center or by formation of hydrogen bonds. In such cases it is necessary to explicitly include solvent molecules in the calculation. Depending on the size of the solvent molecules and their number needed to get the calculated properties converged, the overall size of the molecular system and the resulting computational effort can be increased significantly. Currently, only semi-empirical methods are able to handle several hundred atoms, but the developments towards linear scaling approaches in DFT are very promising. An alternative would be a mixed quantum mechanical (QM) and molecular mechanical (MM) treatment (QM/MM method).

    If there are no specific solute-solvent interactions, the main effect of the solvent is electrostatic screening, depending on its dielectric constant. This can be described very efficiently by continuum solvation models (CSM).

    3.1.2 Enzymes

    Biomolecules like enzymes usually consist of many thousands of atoms and therefore cannot be handled by quantum chemical methods. Although there exist several very elaborate force field methods which quite reliably reproduce protein structures, the accurate description of the interaction of the active site with a ligand bound to it still is an unsolved problem. Quantum mechanics is needed for a quantitative assessment of polarization effects and for the description of reactions. Treating only the active site quantum mechanically usually does not give the correct results because of the strong electrostatic influence of the whole enzyme and the water molecules included. QM/MM approaches can be used for the treatment of the whole system, but the QM/MM coupling still needs improvement.

    A very important quantity of an enzyme/ligand complex is its binding free energy. It determines the measured equilibrium constant for a ligand exchange and is therefore important for the development of enzyme inhibitors. A reliable calculation of such binding constants is currently not possible. It certainly must involve quantum chemistry and molecular dynamics.

    3.2 Accurate thermochemistry and kinetics

    The highly accurate calculation of thermochemical data with ab initio methods is currently possible only for small molecules up to about 10 atoms. However, many of the data for molecules of this size are already known, whereas accurate experiments for larger compounds are quite rare. Therefore, efficient ab initio methods are needed which are able to treat molecules with 30-50 atoms at the same level of accuracy. The currently developed local treatments of electron correlation are very promising in this direction.

    Another problem arises for large molecules. They often have a high torsional flexibility, and the calculation of partition functions based on a single conformer is therefore not correct. Quantum molecular dynamics could probably give better answers, but is in many cases too expensive.

    For the assessment of catalytic mechanisms, kinetic data are needed to discriminate between different reaction pathways. As described above, ab initio methods are often not applicable when transition metals are involved. The DFT results for reaction energies are usually reliable enough to give correct trends, but calculated activation barriers can easily be wrong by a factor of two or more. The problem lies in the functionals, which are best at describing the electron distribution in molecular ground states.

    3.3 Spectroscopy for large molecules

    Calculated molecular spectroscopic properties are very helpful in the assignment and interpretation of measured spectra, provided that the accuracy is sufficiently high. In many cases IR, Raman and NMR spectra can be obtained with reasonable accuracy on the DFT or MP2 level, but UV/VIS spectra normally require more elaborate theories like configuration interaction, which are only applicable to very small molecules. To improve on the currently applied semi-empirical approaches it would be necessary to calculate accurate excitation energies also for, e.g., organic dyes with 50 or even more atoms.

    4 Conclusion

    The improvement of the efficiency of the algorithms and the enormous increase of the available computer power have already made quantum chemistry applicable to a lot of industrial problems. However, there are still many aspects concerning the accuracy, completeness and efficiency of the quantum chemical treatment which will need more attention in the future.

    Currently, DFT is the most widely used quantum theoretical method in industry. Also of high importance are semi-empirical methods for certain applications. Both approaches have in common that they do not offer a systematic way for the improvement of results if they are found to be not reliable enough. Therefore, there is still need for efficient ab initio methods, at least as a reference.

  • METHODS TO CALCULATE THE PROPERTIES OF LARGE MOLECULES

    REINHART AHLRICHS

    Universität Karlsruhe (TH)
    Institut für Physikalische Chemie und Elektrochemie
    Kaiserstr. 12, 76128 Karlsruhe

    E-Mail: [email protected]

    Lecture notes were not available at printing time.

  • PARALLEL PROGRAMMING MODELS, TOOLS AND PERFORMANCE ANALYSIS

    MICHAEL GERNDT

    John von Neumann Institute for Computing
    Central Institute for Applied Mathematics
    52425 Jülich
    Germany

    E-mail: [email protected]

    The major parallel programming models for scalable parallel architectures are the message passing model and the shared memory model. This article outlines the main concepts of those models as well as the industry-standard programming interfaces MPI and OpenMP. To exploit the potential performance of parallel computers, programs need to be carefully designed and tuned. We will discuss design decisions for good performance as well as programming tools that help the programmer in program tuning.

    1 Introduction

    Although the performance of sequential computers increases incredibly fast, it is insufficient for a large number of challenging applications. Applications requiring much more performance are numerical simulations in industry and research as well as commercial applications such as query processing, data mining, and multi-media applications. Architectures offering high performance not only exploit parallelism at a very fine grain within a single processor but also apply a medium to large number of processors concurrently to a single application. High-end parallel computers deliver up to 3 Teraflop/s (10^12 floating point operations per second) and are developed and exploited within the ASCI (Accelerated Strategic Computing Initiative) program of the Department of Energy in the USA.

    This article concentrates on programming numerical applications on distributed memory computers introduced in Section 1.1. Parallelization of those applications centers around selecting a decomposition of the data domain onto the processors such that the workload is well balanced and the communication is reduced (Section 1.2) [7].

    The parallel implementation is then based on either the message passing or the shared memory model (Section 2). The standard programming interface for the message passing model is MPI (Message Passing Interface) [13,10], offering a complete set of communication routines (Section 2.1). OpenMP [5,12] is the standard for directive-based shared memory programming and will be introduced in Section 2.2.

    Since parallel programs exploit multiple threads of control, debugging is even more complicated than for sequential programs. Section 3 outlines the main concepts of parallel debuggers and presents TotalView, the most widely available debugger for parallel programs.

    Although the domain decomposition is key to good performance on parallel architectures, program efficiency also depends heavily on the implementation of the communication and synchronization required by the parallel algorithm and on the implementation techniques chosen for sequential kernels. Optimizing those aspects is very system dependent, and thus an interactive tuning process, consisting of measuring performance data and applying optimizations, follows the initial coding of the application. The tuning process is supported by programming-model-specific performance analysis tools. Section 4 presents basic performance analysis techniques and introduces the three performance analysis tools for MPI programs available on the CRAY T3E.

    1.1 Parallel Architectures

    Parallel computers that scale beyond a small number of processors circumvent the main memory bottleneck by distributing the memory among the processors. Current architectures [4] are composed of single-processor nodes with local memory or of multiprocessor nodes where the node's main memory is shared among the node's processors. In the following it is assumed that nodes have only a single CPU, and the terms node and processor will be used interchangeably. The most important characteristic of this distributed memory architecture is that access to the local memory is faster than to remote memory. It is the challenge for the programmer to assign data to the processors such that most of the data accessed during the computation are already in the node's local memory.

    Three major classes of distributed memory computers can be distinguished:

    No Remote Memory Access (NORMA) computers do not have any hardware support to access another node's local memory. Processors obtain data from remote memory only by exchanging messages between processes on the requesting and the supplying node.

    Remote Memory Access (RMA) computers allow access to remote memory via specialized operations implemented in hardware. The accessed memory location is not determined via an address in a shared linear address space but via a tuple consisting of the processor number and the local address in the target processor's address space.

    Cache-Coherent Non-Uniform Memory Access (ccNUMA) computers do have a shared physical address space. All memory locations can be accessed via the usual load and store operations. Access to a remote location results in a copy of the appropriate cache line in the processor's cache. Coherence algorithms ensure that multiple copies of a cache line are kept coherent, i.e. the copies do have the same value.

    While most of the early parallel computers were NORMA systems, today's systems are either RMA or ccNUMA computers. This is because remote memory access is a light-weight communication protocol that is more efficient than standard message passing, since data copying and process synchronization are eliminated. In addition, ccNUMA systems offer the abstraction of a shared linear address space resembling physical shared memory systems. This abstraction simplifies the task of program development but does not necessarily facilitate program tuning.

    Typical examples of the three classes are clusters of workstations (NORMA), CRAY T3E (RMA), and SGI Origin 2000 (ccNUMA).

    [Figure 1. Structure of the matrix during Gaussian elimination.]

    1.2 Data Parallel Programming

    Applications that scale to a large number of processors usually perform computations on large data domains. For example, crash simulations are based on partial differential equations that are solved on a large finite element grid, and molecular dynamics applications simulate the behavior of a large number of atoms. Other parallel applications apply linear algebra operations to large vectors and matrices. The elemental operations on each object in the data domain can be executed in parallel by the available processors.

    The scheduling of operations to processors is determined according to a selected domain decomposition [8]. Processors execute those operations that determine new values for local elements (owner-computes rule). While processors execute an operation, they might need values from other processors. The domain decomposition has thus to be chosen so that the distribution of operations is balanced and the communication is minimized. The third goal is to optimize single-node computation, i.e. to be able to exploit the processor's pipelines and the processor's caches efficiently.

    A good example of the design decisions taken when selecting a domain decomposition is Gaussian elimination [2]. The main structure of the matrix during the iterations of the algorithm is outlined in Figure 1.

    The goal of this algorithm is to eliminate all entries in the matrix below the main diagonal. It starts at the top diagonal element and subtracts multiples of the first row from the second and subsequent rows to end up with zeros in the first column. This operation is repeated for all the rows. In later stages of the algorithm the actual computations have to be done on rectangular sections of decreasing size.

    If the main diagonal element of the current row is zero, a pivot operation has to be performed. The subsequent row with the maximum value in this column is selected and exchanged with the current row.

    A possible distribution of the matrix is to decompose its columns into blocks, one block for each processor. The elimination of the entries in the lower triangle can then be performed in parallel, where each processor computes new values for its columns only. The main disadvantage of this distribution is that in later computations of the algorithm only a subgroup of the processes is actually doing any useful work, since the computed rectangle is getting smaller.

    To improve load balancing, a cyclic column distribution can be applied. The computations in each step of the algorithm executed by the processors differ only in one column.

    In addition to load balancing, communication also needs to be minimized. Communication occurs in this algorithm for broadcasting the current column to all the processors, since it is needed to compute the multiplication factor for the row. If the domain decomposition is a row distribution, which eliminates the need to communicate the current column, only the main diagonal element needs to be broadcast to the other processors.

    If we also consider the pivot operation, communication is necessary to select the best row when a rowwise distribution is applied, since the computation of the global maximum in that column requires a comparison of all values.

    Selecting the best domain decomposition is further complicated by the need to optimize single-node performance. In this example, it is advantageous to apply BLAS3 operations for the local computations. Those operations make use of blocks of rows to improve cache utilization. Blocks of rows can only be obtained if a block-cyclic distribution is applied, i.e. columns are not distributed individually but blocks of columns are cyclically distributed.
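
    The column distributions discussed above can be made concrete with a small sketch (added for illustration; the helper names are not from the lecture notes). It maps a global column index to its owning processor for a block, a cyclic, and a block-cyclic distribution:

      /* Illustrative mapping of a global column index j to its owner.
         ncols: number of columns, nprocs: number of processors,
         nb: block size of the block-cyclic distribution. */
      int owner_block(int j, int ncols, int nprocs)
      {
          int block = (ncols + nprocs - 1) / nprocs;   /* columns per processor      */
          return j / block;
      }

      int owner_cyclic(int j, int nprocs)
      {
          return j % nprocs;                           /* column j to processor j mod p */
      }

      int owner_block_cyclic(int j, int nb, int nprocs)
      {
          return (j / nb) % nprocs;                    /* blocks of nb columns, dealt out cyclically */
      }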

    This discussion makes clear that choosing a domain decomposition is a very complicated step in program development. It requires deep knowledge of the algorithm's data access patterns as well as the ability to predict the resulting communication.

    2 Programming Models

    The two main programming models, message passing and shared memory, offer different features for implementing applications parallelized by domain decomposition.

    The message passing model is based on a set of processes with private data structures. Processes communicate by exchanging messages with special send and receive operations. The domain decomposition is implemented by developing a code describing the local computations and local data structures of a single process. Thus, global arrays have to be split up and only the local part allocated in a process. This handling of global data structures is called data distribution. Computations on the global arrays also have to be transformed, e.g. by adapting the loop bounds, to ensure that only local array elements are computed. Accesses to remote elements have to be implemented via explicit communication: temporary variables have to be allocated, messages set up and transmitted to the target process.

    The shared memory model is based on a set of threads that are created when parallel operations are executed. This type of computation is also called fork-join parallelism. Threads share a global address space and thus access array elements via a global index.

    The main parallel operations are parallel loops and parallel sections. Parallel loops are executed by a set of threads, also called a team. The iterations are distributed onto the threads according to a predefined strategy. This scheduling strategy implements the chosen domain decomposition. Parallel sections are also executed by a team of threads, but the tasks assigned to the threads implement different operations. This feature can for example be applied if domain decomposition itself does not generate enough parallelism and whole operations can be executed in parallel since they access different data structures.

    In the shared memory model, the distribution of data structures onto the node memories is not enforced by decomposing global arrays into local arrays; instead, the global address space is distributed onto the memories on system level. For example, the pages of the virtual address space can be distributed cyclically or can be assigned on a first-touch basis. The chosen domain decomposition thus has to take into account the granularity of the distribution, i.e. the size of pages, as well as the system-dependent allocation strategy.

    While the domain decomposition has to be hardcoded into a message passing program, it can easily be changed in a shared memory program by selecting a different scheduling strategy for parallel loops.

    Another advantage of the shared memory model is that automatic and incremental parallelization is supported. While automatic parallelization leads to a first working parallel program, its efficiency typically needs to be improved. The reason for this is that parallelization techniques work on a loop-by-loop basis and do not globally optimize the parallel code via a domain decomposition. In addition, dependence analysis, the prerequisite for automatic parallelization, is limited to statically known access patterns.

    In the shared memory model, a first parallel version is relatively easy to implement and can be incrementally tuned. In the message passing model, in contrast, the program can be tested only after finishing the full implementation. Subsequent tuning by adapting the domain decomposition is usually time consuming.

    2.1 MPI

    The Message Passing Interface (MPI) [13,10] was developed between 1993 and 1997. It includes routines for point-to-point communication, collective communication, one-sided communication, and parallel IO. While the basic communication primitives have been defined since May 1994 and are implemented on almost all parallel computers, remote memory access and parallel IO routines are part of MPI 2.0 and are only available on a few machines.

    2.1.1 MPI basic routines

    MPI consists of more than 120 functions, but realistic programs can already be developed based on no more than six functions (a minimal example follows the list below):

    MPI_Init initializes the library. It has to be called at the beginning of a parallel operation before any other MPI routines are executed.

    MPI_Finalize frees any resources used by the library and has to be called at the end of the program.

    MPI_Comm_size determines the number of processors executing the parallel program.

    MPI_Comm_rank returns the unique process identifier.

    MPI_Send transfers a message to a target process. This operation is a blocking send operation, i.e. it terminates when the message buffer can be reused, either because the message was copied to a system buffer by the library or because the message was delivered to the target process.

    MPI_Recv receives a message. This routine terminates when a message has been copied into the receive buffer.
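
    As an illustration (a minimal sketch added here, not part of the lecture notes), the following program uses only these six routines; process 0 sends one integer to process 1:

      /* Minimal sketch of the six basic MPI routines.
         Run with at least two processes; error handling omitted. */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int size, rank, value;
          MPI_Status status;

          MPI_Init(&argc, &argv);                  /* initialize the library         */
          MPI_Comm_size(MPI_COMM_WORLD, &size);    /* number of processes            */
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* identifier of this process     */

          if (size >= 2) {
              if (rank == 0) {
                  value = 42;
                  /* blocking send of one integer with tag 0 to process 1 */
                  MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
              } else if (rank == 1) {
                  /* blocking receive of the matching message from process 0 */
                  MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
                  printf("process %d received %d\n", rank, value);
              }
          }

          MPI_Finalize();                          /* release MPI resources          */
          return 0;
      }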

    2.1.2 MPI communicator

    All communication routines depend on the concept of a communicator. A communicator consists of a process group and a communication context. The processes in the process group are numbered from zero to process count - 1. The process number returned by MPI_Comm_rank is the identification in the process group of the communicator which is passed as a parameter to this routine.

    The communication context of the communicator is important in identifying messages. Each message has an integer number called a tag which has to match a given selector in the corresponding receive operation. The selector depends on the communicator and thus on the communication context. It selects only messages with a fitting tag that have been sent relative to the same communicator. This feature is very useful in building parallel libraries, since messages sent inside the library will not interfere with messages outside if a special communicator is used in the library. The default communicator that includes all processes of the application is MPI_COMM_WORLD.

    2.1.3 MPI collective operations

    Another important class of operations are collective operations. Collective operations are executed by a process group identified via a communicator. All the processes in the group have to perform the same operation. Typical examples of such operations are:

    MPI_Reduce performs a global operation on the data of each process in the process group. For example, the sum of all values of a distributed array can be computed by first summing up all local values in each process and then summing up the local sums to get a global sum (see the sketch after this list). The latter step can be performed by the reduction operation with the parameter MPI_SUM. The result is delivered to a single target processor.

    MPI_Comm_split is an administration routine for communicators. It allows the creation of multiple new communicators based on a given coloring scheme. All processes of the original communicator have to take part in that operation.

    MPI_Barrier synchronizes all processes. None of the processes can proceed beyond the barrier until all the processes have started execution of that routine.
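
    The global-sum example mentioned for MPI_Reduce could be sketched as follows (an added illustration; the local array contents are placeholders):

      /* Sketch: global sum of a distributed array with MPI_Reduce. */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int rank;
          double a[4] = {1.0, 2.0, 3.0, 4.0};      /* local part of the array      */
          double local_sum = 0.0, global_sum = 0.0;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          for (int i = 0; i < 4; i++)              /* sum up the local elements    */
              local_sum += a[i];

          /* combine the local sums; the global sum is delivered to rank 0 */
          MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
                     MPI_COMM_WORLD);

          if (rank == 0)
              printf("global sum = %f\n", global_sum);

          MPI_Finalize();
          return 0;
      }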

    2.1.4 MPI IO

    Data parallel applications make use of the IO subsystem to read and write big data sets. These data sets result from replicated or distributed arrays. The reasons for IO are to read input data, to pass information to other programs, e.g. for visualization, or to store the state of the computation to be able to restart the computation in case of a system failure or if the computation has to be split into multiple runs due to its resource requirements.

    IO can be implemented in three ways:

    1. Sequential IO: A single node is responsible for performing the IO. It gathers information from the other nodes and writes it to disc, or reads information from disc and scatters it to the appropriate nodes. While the IO is sequential and thus need not be parallelized, the full performance of the IO subsystem might not be utilized. Modern systems provide high-performance IO subsystems that are fast enough to support multiple IO requests from different nodes in parallel.

    2. Private IO: Each node accesses its own files. The big advantage of this implementation is that no synchronization among the nodes is required and very high performance can be obtained. The major disadvantage is that the user has to handle a large number of files. For input, the original data set has to be split according to the distribution of the data structure, and for output the process-specific files have to be merged into a global file for postprocessing.

    3. Parallel IO: In this implementation all the processes access the same file. They read and write only those parts of the file with relevant data. The main advantages are that no individual files need to be handled and that reasonable performance can be reached. The disadvantage is that it is difficult to reach the same performance as with private IO. The parallel IO interface of MPI provides flexible and high-level means to implement applications with parallel IO.

    Files accessed via MPI IO routines have to be opened and closed by collective operations. The open routine allows the specification of hints to optimize the performance, such as whether the application might profit from combining small IO requests from different nodes, what size is recommended for the combined request, and how many nodes should be engaged in merging the requests.

    The central concept in accessing the files is the view. A view is defined for each process and specifies a sequence of data elements to be ignored and data elements to be read or written by the process. When reading or writing a distributed array, the local information can be described easily as such a repeating pattern. The IO operations read and write a number of data elements on the basis of the defined view, i.e. they access the local information only. Since the views are defined via runtime routines prior to the access, the information can be exploited in the library to optimize IO.

    MPI IO provides blocking as well as nonblocking operations. In contrast to blocking operations, the nonblocking ones only start IO and terminate immediately. If the program depends on the successful completion of the IO, it has to check it via a test function. Besides the collective IO routines, which allow individual requests to be combined, noncollective routines are also available to access shared files.
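
    A minimal sketch of the view concept (added for illustration; the file name and block size are arbitrary): each process writes its contiguous block of a distributed integer array to one shared file, with the view simply skipping the blocks of the lower-ranked processes:

      /* Sketch: parallel IO into one shared file using a per-process view. */
      #include <mpi.h>

      int main(int argc, char **argv)
      {
          int rank, buf[100];
          MPI_File fh;
          MPI_Offset disp;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          for (int i = 0; i < 100; i++) buf[i] = rank;     /* local data block */

          /* collective open of the shared file */
          MPI_File_open(MPI_COMM_WORLD, "array.dat",
                        MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

          /* the view starts behind the blocks owned by lower-ranked processes */
          disp = (MPI_Offset)rank * 100 * sizeof(int);
          MPI_File_set_view(fh, disp, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

          /* collective write of the local block at the position given by the view */
          MPI_File_write_all(fh, buf, 100, MPI_INT, MPI_STATUS_IGNORE);

          MPI_File_close(&fh);
          MPI_Finalize();
          return 0;
      }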

    2.1.5 MPI remote memory access

    Remote memory access (RMA) operations allow access to the address space of other processes without participation of the other process. The implementation of this concept can either be in hardware, such as in the CRAY T3E, or in software via additional threads waiting for requests. The advantages of these operations are that the protocol overhead is much lower than for normal send and receive operations, and that no polling or global communication is required for setting up communication, such as in unstructured grid applications and multiparticle applications.

    In contrast to explicit message passing, where synchronization happens implicitly, accesses via RMA operations need to be protected by explicit synchronization operations.

    RMA communication in MPI is based on the window concept. Each process has to execute a collective routine that defines a window, i.e. the part of its address space that can be accessed by other processes.

    The actual access is performed via put and get operations. The address is defined by the target process number and the displacement relative to the starting address of the window for that process.

    MPI also provides special synchronization operations relative to a window. The MPI_Win_fence operation synchronizes all processes that make some address ranges accessible to other processes. It is a collective operation that ensures that all RMA operations started before the fence operation terminate before the target process executes the fence operation, and that all RMA operations of a process executed after the fence operation are executed after the target process has executed the fence operation.
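
    A small sketch of window-based RMA communication (added for illustration, assuming an MPI-2 implementation): each process exposes one integer and puts its rank into the window of its right neighbour, with fence operations delimiting the access epoch:

      /* Sketch: MPI one-sided communication with a window and fences. */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int rank, size, local = -1;
          MPI_Win win;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          /* collective window definition: expose one int of local memory */
          MPI_Win_create(&local, sizeof(int), sizeof(int),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &win);

          MPI_Win_fence(0, win);                 /* open the access epoch          */
          /* put this rank at displacement 0 of the right neighbour's window */
          MPI_Put(&rank, 1, MPI_INT, (rank + 1) % size, 0, 1, MPI_INT, win);
          MPI_Win_fence(0, win);                 /* complete all RMA operations    */

          printf("process %d received %d\n", rank, local);

          MPI_Win_free(&win);
          MPI_Finalize();
          return 0;
      }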

    2.2 OpenMP

    OpenMP [5,12] is a directive-based programming interface for the shared memory programming model. It is the result of an effort to standardize the different programming interfaces on the target systems. OpenMP is a set of directives and runtime routines for Fortran 77 (1997) and a corresponding set of pragmas for C and C++ (1998). An extension of OpenMP for Fortran 95 is under investigation.

    Directives are comments that are interpreted by the compiler. Directives have the advantage that the code is still a sequential code that can be executed on sequential machines, and thus no two versions, a sequential and a parallel one, need to be maintained.

    Directives start and terminate parallel regions. When the master thread hits a parallel region, a team of threads is created or activated. The threads execute the code in parallel and are synchronized at the beginning and the end of the computation. After the final synchronization the master thread continues execution after the parallel region. The main directives are:

    PARALLEL DO specifies a loop that can be executed in parallel. The DO loop's iterations can be distributed in various ways, including STATIC(CHUNK), DYNAMIC(CHUNK), and GUIDED(CHUNK), onto the set of threads (as defined in the OpenMP standard); a C sketch follows after this list. STATIC(CHUNK) distribution means that the set of iterations is consecutively distributed onto the threads in blocks of CHUNK size (resulting in block and cyclic distributions). DYNAMIC(CHUNK) distribution implies that iterations are distributed in blocks of CHUNK size to threads on a first-come-first-served basis. GUIDED(CHUNK) means that blocks of exponentially decreasing size are assigned on a first-come-first-served basis. The size of the smallest block is determined by CHUNK size.

    PARALLEL SECTION starts a set of sections that are executed in parallel by a team of threads.

    PARALLEL REGION introduces a code region that is executed redundantly by the threads. It has to be used very carefully, since assignments to global variables will lead to conflicts among the threads and possibly to nondeterministic behavior.

    PDO is a work-sharing construct and may be used within a parallel region. All the threads executing the parallel region have to cooperate in the execution of the parallel loop. There is no implicit synchronization at the beginning of the loop, but a synchronization at the end. After the final synchronization all threads continue after the loop in the replicated execution of the program code. The main advantage of this approach is that the overhead for starting up the threads is eliminated. The team of threads exists during the execution of the parallel region and need not be built before each parallel loop.

    PSECTION is also a work-sharing construct that enforces that the current team of threads executing the surrounding parallel region cooperates in the execution of the parallel section.
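
    In the C binding mentioned above, the PARALLEL DO directive with a STATIC(CHUNK) distribution corresponds to a parallel for pragma with a schedule clause; a minimal sketch (added for illustration):

      /* Sketch: parallel loop with a static block-cyclic schedule.
         Iterations are handed to the team in blocks of 16. */
      #include <omp.h>

      void scale(double *a, int n, double factor)
      {
          int i;
          #pragma omp parallel for schedule(static, 16)
          for (i = 0; i < n; i++)
              a[i] *= factor;          /* each thread updates its own iterations */
      }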

    Program data can either be shared or private. While threads do have their own copy of private data, only one copy exists of shared data. This copy can be accessed by all threads. To ensure program correctness, OpenMP provides special synchronization constructs. The main constructs are barrier synchronization, enforcing that all threads have reached this synchronization operation before execution continues, and critical sections. Critical sections ensure that only a single thread can enter the section, and thus data accesses in such a section are protected from race conditions. A common situation for a critical section is the accumulation of values. Since an accumulation consists of a read and a write operation, unexpected results can occur if both operations are not surrounded by a critical section.
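
    A sketch of such an accumulation protected by a critical section (added for illustration; in practice a reduction clause would usually be more efficient):

      /* Sketch: shared accumulator protected by a critical section. */
      double sum_squares(const double *a, int n)
      {
          double sum = 0.0;            /* shared by the whole team               */
          int i;
          #pragma omp parallel for
          for (i = 0; i < n; i++) {
              double term = a[i] * a[i];   /* private to each thread             */
              #pragma omp critical
              sum += term;                 /* read and update without races      */
          }
          return sum;
      }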

    3 Parallel Debugging

    Debugging parallel programs is more difficult than debugging sequential programs, not only because multiple processes or threads need to be taken into account, but also because program behaviour might not be deterministic and might not be reproducible. These problems are not attacked by current state-of-the-art commercial parallel debuggers; only the first one is alleviated. Current debuggers provide menus, displays, and commands that allow the programmer to inspect individual processes and to execute commands on individual or all processes.

    The most widely used debugger is TotalView from Etnus Inc. [14]. TotalView provides breakpoint definition, single stepping, and variable inspection via an interactive interface. The programmer can execute those operations for individual processes and groups of processes. TotalView also provides some means to summarize information, such that identical information from multiple processes is combined and not repeated redundantly.

4 Performance Analysis

Performance analysis is an iterative subtask during program development. The goal is to identify program regions that do not perform well. Performance analysis is structured into four phases:

1. Measurement
   Performance analysis is based on information about runtime events gathered during program execution. Basic events are, for example, cache misses, the termination of a floating point operation, or the start and stop of a subroutine or message passing operation. The information on individual events can either be summarized during program execution, or individual trace records can be collected for each event.
   Summary information has the advantage of being of moderate size, while trace information tends to be very large. The disadvantage is that it is not as fine grained; the behavior of individual instances of subroutines can, for example, not be investigated since all the information has been summed up.

2. Analysis
   During analysis the collected runtime data are inspected to detect performance problems. Performance problems are based on performance properties, such as the existence of message passing in a program region, which have a condition for identifying them and a severity function that specifies their importance for program performance (a minimal sketch of such a property is given after this list).
   Current tools support the user in checking the conditions and the severity by visualizing program behavior. Future tools might be able to detect performance properties automatically based on a specification of possible properties.
   During analysis the programmer applies a threshold. Only performance properties whose severity exceeds this threshold are considered to be performance problems.

3. Ranking
   During program analysis the severest performance problems need to be identified. This means that the problems need to be ranked according to their severity. The most severe problem is called the program bottleneck. This is the problem the programmer tries to resolve by applying appropriate program transformations.

4. Refinement
   The performance problems detected in the previous phases might not be precise enough to allow the user to start optimization. At the beginning of performance analysis, summary data can be used to identify critical regions. Those summary data might not be sufficient to identify why, for example, a region has high message passing overhead. The reason, e.g. very big messages or load imbalance, might be identified only with more detailed information. Therefore the performance problem should be refined into hypotheses about the real reason, and additional information should be collected in the next performance analysis cycle.
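To make the notions of a performance property, its condition, and its severity function more concrete, here is a minimal C sketch. All names (RegionProfile, comm_time, the 5 % threshold, the example numbers) are illustrative assumptions and are not taken from any particular tool.

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative only: a per-region profile record as a tool might collect it. */
    typedef struct {
        const char *region;     /* e.g. subroutine name            */
        double exec_time;       /* total execution time in seconds */
        double comm_time;       /* time spent in message passing   */
    } RegionProfile;

    /* Condition of the property "message passing in a region":
       communication time was observed at all. */
    static bool has_message_passing(const RegionProfile *r)
    {
        return r->comm_time > 0.0;
    }

    /* Severity: fraction of the total run time lost to communication
       inside this region. */
    static double severity(const RegionProfile *r, double total_time)
    {
        return r->comm_time / total_time;
    }

    /* A region becomes a performance problem only if its severity
       exceeds the user-chosen threshold (here 0.05 = 5 %). */
    static bool is_problem(const RegionProfile *r, double total_time,
                           double threshold)
    {
        return has_message_passing(r) && severity(r, total_time) > threshold;
    }

    int main(void)
    {
        RegionProfile regions[] = {
            { "exchange_halo", 12.0, 9.5 },   /* made-up example numbers */
            { "compute",       80.0, 0.0 },
        };
        double total = 100.0, threshold = 0.05;
        for (int i = 0; i < 2; i++)
            if (is_problem(&regions[i], total, threshold))
                printf("%s: severity %.2f\n", regions[i].region,
                       severity(&regions[i], total));
        return 0;
    }

Ranking then simply means sorting the regions by the value returned by severity() and treating the largest one as the candidate bottleneck.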

Current techniques for performance data collection are profiling and tracing. Profiling collects summary data only. This can be done via sampling: the program is regularly interrupted, e.g. every 10 ms, and the information is added up for the source code location that was being executed at that moment. For example, the UNIX profiling tool prof applies this technique to determine the fraction of the execution time spent in individual subroutines.

A more precise profiling technique is based on instrumentation, i.e. special calls to a monitoring library are inserted into the program. This can either be done in the source code by the compiler or by specialized tools, or it can be done in the object code. While the first approach allows more types of regions to be instrumented, for example loops and vector statements, the latter allows data to be measured for programs where no source code is available. The monitoring library collects the information and adds it to special counters for the specific region.
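The following self-contained C sketch illustrates what such instrumentation amounts to. The monitor_enter/monitor_exit calls and the per-region counters are hypothetical stand-ins for a real monitoring library, inserted here by hand where a compiler or tool would insert them.

    #include <stdio.h>
    #include <time.h>

    /* Minimal stand-in for a monitoring library; the names and the simple
       counter-per-region design are assumptions made for this example. */
    #define N_REGIONS 16
    static double  region_time[N_REGIONS];
    static clock_t region_start[N_REGIONS];

    static void monitor_enter(int id) { region_start[id] = clock(); }
    static void monitor_exit(int id)
    {
        region_time[id] += (double)(clock() - region_start[id]) / CLOCKS_PER_SEC;
    }

    enum { REGION_COMPUTE = 0 };

    /* After instrumentation, the inserted calls bracket the routine body. */
    static void compute(void)
    {
        monitor_enter(REGION_COMPUTE);    /* inserted by compiler or tool */
        /* ... original body of the routine ... */
        monitor_exit(REGION_COMPUTE);     /* inserted by compiler or tool */
    }

    int main(void)
    {
        compute();
        printf("region %d: %.6f s\n", REGION_COMPUTE, region_time[REGION_COMPUTE]);
        return 0;
    }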

Tracing is a technique that collects information for each event. This results, for example, in very detailed information for each instance of a subroutine and for each message sent to another process. The information is stored in specialized trace records for each event type. For example, for each start of a send operation the time stamp, the message size, and the target process can be recorded, while for the end of the operation the time stamp and the bandwidth are stored.

The trace records are stored in the memory of each process and are output either because the buffer is filled up or because the program terminated. The individual trace files of the processes are merged into one trace file ordered according to the time stamps of the events.
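As an illustration, trace records for these events could be laid out as in the following C sketch. The field names, the union layout, and the buffer size are assumptions made for this example; every real tool defines its own record format.

    #include <stdint.h>

    /* Illustrative trace-record layout. */
    typedef enum { EV_SEND_START, EV_SEND_END, EV_ENTER, EV_EXIT } EventType;

    typedef struct {
        uint64_t  timestamp;    /* clock ticks when the event occurred    */
        int32_t   process;      /* rank of the process writing the record */
        EventType type;
        union {
            struct { int32_t target; int32_t size; } send_start; /* dest, bytes */
            struct { double bandwidth; } send_end;
            struct { int32_t region; } region;                   /* enter/exit  */
        } data;
    } TraceRecord;

    /* Records are buffered per process and flushed when the buffer is full
       or at program end; afterwards the per-process files are merged by
       ascending timestamp into a single trace file. */
    #define TRACE_BUFFER_LEN 4096
    static TraceRecord trace_buffer[TRACE_BUFFER_LEN];
    static int trace_fill;   /* number of records currently buffered */

    static void record_event(TraceRecord r)
    {
        if (trace_fill < TRACE_BUFFER_LEN)
            trace_buffer[trace_fill++] = r;
        /* else: flush the buffer to the per-process trace file (omitted) */
    }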

The following sections describe the performance analysis tools available on the CRAY T3E.

4.1 The CRAY T3E Performance Analysis Environment

The programming environment of the CRAY T3E supports performance analysis via interactive tools. Cray itself provides two tools, Apprentice and PAT, which are both based on summary information. In addition, Apprentice accesses source code information to map the performance information to the statements of the original source code. Besides these two tools, programmers can use VAMPIR, a trace analysis tool developed at our institute. Within a collaboration with Cray, instrumentation and trace generation for VAMPIR is being integrated into the next version of PAT.

4.2 Compiler Information File

The F90 and C compilers on the CRAY T3E generate, on request, a compiler information file (CIF) for each source file. This file includes information about the compilation process (applied compiler options, target machine characteristics, and compiler messages) and source information of the compilation units (procedure information, symbol information, information about loops and statement types, and cross-reference information for each symbol in the source file).

Apprentice requires this information to link the performance information back to the source code. CIFs are initially in ASCII format but can be converted to binary format. The information can be easily accessed in both formats via a library interface.

4.3 Apprentice

Apprentice is a post-execution performance analysis tool for message passing programs3. Originally it was designed to support the CRAFT programming model on the CRAY T3E predecessor system, the CRAY T3D. Apprentice analyzes summary information collected at runtime via an instrumentation of the source program.

The instrumentation is performed by the compiler and is triggered via an appropriate compiler switch. To reduce the overhead of the instrumentation, the programmer can selectively compile source files with and without instrumentation. The instrumentation is done in a late phase of the compilation, after all optimizations have already occurred. This prevents the instrumentation from affecting the way the code is compiled. During runtime, summary information is collected on each processor for each basic block. This information comprises:

• execution time
• number of floating point, integer, and load/store operations
• instrumentation overhead

For each subroutine call the execution time as well as the pass count is determined. At the end of a program run, the information of all processors is summed up and written to the runtime information file (RIF). In addition to the summed-up execution times and pass counts of subroutine calls, their mean value and standard deviation as well as the minimum and maximum values are stored.

The size of the resulting RIF is typically less than one megabyte. But the overhead due to the instrumentation can easily be a factor of two, which results from instrumenting each basic block. This severe drawback of the instrumentation is partly compensated in Apprentice by correcting the timings based on the measured overhead.

When the user starts Apprentice to analyze the collected information, the tool first reads the RIF as well as the CIFs of the individual source files. The performance data measured for the optimized code are related back to the original source code. Apprentice distinguishes between:

• parallel work: user-level subroutines
• I/O: system subroutines for performing I/O
• communication overhead: MPI and PVM routines, SHMEM routines
• uninstrumented code

The available barcharts allow the user to identify critical code regions that take most of the execution time or that have a lot of I/O and communication overhead. Since all the values have been summed up, no specific behavior of individual processors can be identified. Load balance problems can be detected by inspecting the execution times of calls to synchronization subroutines, such as global sums or barriers. Based on the available information, the processors with the lowest and the highest execution time can be identified.

While Apprentice does not evaluate the hardware performance counters of the DEC Alpha, it estimates the loss due to cache misses and suboptimal use of the functional units. Based on the number of instructions and a very simple cost model (fixed cycles for each type of instruction) it determines the loss as the difference between the estimated optimal and the measured execution time.
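A minimal sketch of such a fixed-cycles cost model is given below. The operation classes follow the counts listed above, but the cycle costs are placeholder numbers chosen for illustration; they are not the values Apprentice actually uses.

    /* Estimated optimal time = sum(count_i * cycles_i) / clock frequency;
       the loss is the measured time minus this estimate. */
    typedef struct {
        long flops, int_ops, loads_stores;
        double measured_time;           /* seconds, from the instrumented run */
    } BlockCounts;

    static double estimated_loss(const BlockCounts *b, double clock_hz)
    {
        const double cyc_flop = 1.0, cyc_int = 1.0, cyc_mem = 2.0;  /* placeholders */
        double optimal = (b->flops * cyc_flop + b->int_ops * cyc_int
                          + b->loads_stores * cyc_mem) / clock_hz;
        return b->measured_time - optimal;  /* time lost to cache misses etc. */
    }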

4.4 VAMPIR

VAMPIR (Visualization and Analysis of MPI Resources) is an event trace analysis tool11 which was developed by the Central Institute for Applied Mathematics of the Research Centre Jülich and is now commercially distributed by a German company named PALLAS. Its main application area is the analysis of parallel programs based on the message passing paradigm, but it has also been used successfully in other areas (e.g., for SVM-Fortran traces to analyze shared virtual memory page transfer behavior9 or to analyze CRAY T3E usage based on accounting data). VAMPIR has three components:

• The VAMPIR tool itself is a graphical event trace browser implemented for the X11 Window System using the Motif toolkit. It is available for any major UNIX platform.

• The VAMPIR runtime library provides an API for collecting, buffering, and generating event traces as well as a set of wrapper routines for the most commonly used MPI and PVM communication routines which record message traffic in the event trace.

• In order to observe functions or subroutines in the user program, their entry and exit have to be instrumented by inserting calls to the VAMPIR runtime library. Observing message passing functions is handled by linking the program with the VAMPIR wrapper function library.

VAMPIR comes with a source instrumenter for ANSI Fortran 77. Programs written in other programming languages (e.g., C or C++) have to be instrumented manually. To improve this situation, our institute, in collaboration with CRAY Research, is currently implementing an object code instrumenter for the CRAY T3E. This is described in the next section.

During the execution of the instrumented user program, the VAMPIR runtime library records entries to and exits from instrumented user and message passing functions as well as the sending and receiving of messages. For each message, its tag, communicator, and length are recorded. Through the use of a configuration file, it is possible to switch the runtime observation of specific functions on and off. This way, the program does not have to be re-instrumented and re-compiled for every change in the instrumentation.

Large parallel programs consist of several dozens or even hundreds of functions. To ease the analysis of such complex programs, VAMPIR arranges the functions into groups, e.g., user functions, MPI routines, I/O routines, and so on. The user can control and change the assignment of functions to groups and can also define new groups.

VAMPIR provides a wide variety of graphical displays to analyze the recorded event traces:

• The dynamic behavior of the program can be analyzed by timeline diagrams for either the whole program or a selected set of nodes. By default, the displays show the whole event trace, but the user can zoom in to any arbitrary region of the trace. Also, the user can change the display style of the lines representing messages based on their tag/communicator or their length. This way, message traffic of different modules or libraries can easily be separated visually.

• The parallelism display shows the number of nodes in each function group over time. This makes it easy to locate specific parts of the program, e.g., parts with heavy message traffic or I/O.

• VAMPIR also provides a large number of statistical displays. It calculates how often each function or group of functions got called and the time spent in them. Message statistics show the number of messages sent and the minimum, maximum, sum, and average length or transfer rate between any two nodes. The statistics can be displayed as barcharts, histograms, or textual tables. A very useful feature of VAMPIR is that the statistic displays can be linked to the timeline diagrams. By this, statistics can be calculated for any arbitrary, user-selectable part of the program execution.

• If the instrumenter/runtime library provides the necessary information in the event trace header, the information provided by VAMPIR can be related back to source code. VAMPIR provides a source code display and a call graph display to show selected functions or the location of the send and the receive of a selected message.

In summary, VAMPIR is a very powerful and highly configurable event trace browser. It displays trace files in a variety of graphical views and provides flexible filter and statistical operations that condense the displayed information to a manageable amount. Rapid zooming and instantaneous redraw allow the user to identify and focus on the time interval of interest.

4.5 PAT

PAT (Performance Analysis Tool) is the second performance tool available from CRAY Research for the CRAY T3E. The two main differences to Apprentice are, first, that no source code instrumentation or special compiler support is necessary; the user only needs to re-link his/her application against the PAT runtime library (because CRAY Unicos does not support dynamic linking). Second, PAT aims at keeping the additional overhead of measuring and observing program behavior as low as possible. PAT is actually three performance tools in one:

1. PAT allows the user to get a rough overview of the performance of the parallel program through a method called sampling, i.e., interrupting the program at regular intervals and evaluating the program counter. PAT can then calculate the percentage of time spent in each function. The sampling rate can be changed by the user to adapt it to the execution time of the program and to keep the overhead low. Because the sampling method provides only a statistical estimate of the actual time spent in a function, the tool also provides a measure of confidence in the sampling estimate.
   In addition, PAT determines the total, user, and system time of the execution run, the number of cache misses, and the number of either floating point, integer, store, or load operations. These are measured through the DEC Alpha hardware counters. The user can select the hardware counter by setting an environment variable. All this statistical information is stored after the execution in a so-called Performance Information File (PIF).

2. If a more detailed analysis is necessary, PAT can be used to instrument and analyze a specific function or set of functions in a second phase. PAT can instrument object code (however, only at the function level). This is a big advantage, especially for large complex programs, because they do not have to be re-compiled for instrumentation. In addition, it is possible to analyze functions contained in system or third-party libraries. A third advantage is that programs written in more than one language can be handled. The big disadvantage is that it is more difficult to relate the results back to the source code.
   This detailed investigation of function behavior is called a Call Site Report by PAT. It records for each call site of the instrumented functions how often it got called and the time spent in this instantiation of the function. Execution times are measured with a high-resolution timer. The results are available for each CPU used in the parallel program. The next version of PAT will allow hardware counter statistics to be gathered for instrumented functions as well.

3. Last, if a very detailed analysis of the program behavior is necessary, PAT also supports event tracing. The object instrumenter of PAT can also be used to insert calls to entry and exit trace routines around calls to user or library routines. Entry and exit trace routines can be provided in two ways:

   • The user can supply function-specific wrapper functions. The routines must be written in C, they must have the same number and types of arguments as the routine they are tracing, and, finally, the wrapper function name for a function func must be func_trace_entry for entry trace routines and func_trace_exit for exit trace routines (a sketch of such wrapper routines is given at the end of this item).

   • If specific wrapper routines for the requested function are not available, PAT uses generic wrapper code which just records the entry and exit of the function in the event trace.

   In addition, PAT provides extra tracing runtime system calls, which can be inserted in the source code and allow tracing to be switched on and off and additional information to be inserted into the trace (e.g., information unrelated to functions).
   The tracing features of PAT were developed in a collaboration of Research Centre Jülich with CRAY Research. Our institute implemented all the necessary special wrapper functions for all message passing functions available on the T3E (MPI, PVM, and SHMEM, for both the C and Fortran interfaces) which record the message traffic in the event trace. In addition, we implemented a tool for converting the event traces contained in PIF files into VAMPIR trace format.
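As an illustration of the first variant, here is a sketch of user-supplied wrapper routines for a hypothetical traced function double dot(double *x, double *y, int n). The underscore form of the wrapper names is assumed from the naming convention quoted above, and record_trace() is a made-up stand-in for whatever the wrapper actually writes into the event trace; it is not a PAT routine.

    #include <stdio.h>

    /* Hypothetical stand-in for recording an event; a real wrapper would
       emit a trace record instead of printing. */
    static void record_trace(const char *what, const char *name)
    {
        printf("%s %s\n", what, name);
    }

    /* Entry wrapper: same argument list as the traced routine
       double dot(double *x, double *y, int n). */
    void dot_trace_entry(double *x, double *y, int n)
    {
        record_trace("entry", "dot");
        (void)x; (void)y; (void)n;   /* arguments could be inspected/recorded here */
    }

    /* Exit wrapper, called after dot() returns. */
    void dot_trace_exit(double *x, double *y, int n)
    {
        record_trace("exit", "dot");
        (void)x; (void)y; (void)n;
    }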

The major drawback of PAT's object instrumentation is the very low-level interface for specifying the functions to be instrumented. The user has to specify the function names as they appear in the object code, i.e., C++ functions or F90 functions which are local or contained in modules have to be specified in their mangled form (e.g., "0FDfooPd" instead of the C++ function name "int foo(double *)"). Clearly, a more user-friendly or automatic instrumenter interface needs to be added to PAT.

In addition, the combination of three different instrumentation/analysis techniques in a single tool is confusing for users. This confusion is further increased because the supported techniques overlap with the techniques applied in the other tools.

4.6 Summary

The previous sections pointed out that the CRAY T3E has a programming environment that includes the most advanced performance analysis tools. On the other hand, each of these tools comes with its own instrumentation, provides partially overlapping information, and has a totally different user interface. The programmer has to understand the advantages and disadvantages of all the tools to be able to select and apply the right ones. Table 1 summarizes the main features of these three tools.

                    Apprentice         VAMPIR             PAT (sampling)    PAT (call sites)   PAT (tracing)
data collection     instrumentation    source instru-     sampling          object code        object code
                    via compiler       mentation by                         instrumentation    instrumentation
                                       preprocessor
intrusion           high / corrected   high               low               high               high
selective instr.    not required,      required           -                 required           required
                    optional
selection           compiler switch    GUI or ASCII       -                 ASCII interface    ASCII interface
interface                              file                                 or file            or file
level of detail     summary            event trace        statistics        summary            event trace
                    information                                             information
information         total time and     subroutine         statistical       total time and     subroutine
                    oper. counts for   start/stop and     distr. of time    pass counts for    start/stop and
                    basic blocks       send/recv events   (subroutines)     call sites         send/recv events
                    and call sites
strength            source-level,      many displays      low overhead      total time for     object code
                    analysis of        for message        profiling         call sites, no     instr. for
                    loops              passing history                      recompilation      VAMPIR, no
                                       and for                                                 recompilation
                                       statistics of
                                       arbitrary
                                       execution phases

Table 1. User interface and properties of performance analysis tools on CRAY T3E

5 Summary

This article gave an overview of parallel programming models as well as programming tools. Parallel programming will always be a challenge for programmers. Higher-level programming models and appropriate programming tools only facilitate the process; they do not make it a simple task.

While programming in MPI offers the greatest potential performance, shared memory programming with OpenMP is much more comfortable due to the global style of the resulting program. The sequential control flow among the parallel loops and regions matches much better the sequential programming model that all programmers are trained for.

Although programming tools have been developed over many years, the current situation is not very satisfactory. Program debugging is done on a per-thread basis, a technique that does not scale to larger numbers of processors. Performance analysis tools also suffer from scalability limitations and, in addition, are complicated to use. Programmers have to be experts in performance analysis to understand potential performance problems, their proof conditions, and their severity. In addition, they have to master powerful but also complex user interfaces.

Future research in this area has to try to automate performance analysis tools, such that frequently occurring performance problems can be identified automatically. It is the goal of the ESPRIT IV working group APART on Automatic Performance Analysis: Resources and Tools to investigate base technologies for future, more intelligent tools1. A first result of this work is a collection of performance problems for parallel programs that have been formalized with the ASL, the APART Specification Language6. This approach will lead to a formal representation of the knowledge applied in the manually executed performance analysis process and thus will make this knowledge accessible for automatic processing.

A second important trend that will affect parallel programming in the future is the move towards clustered shared memory systems. Within the GoSMP study carried out for the Federal Ministry for Education and Research, its influence on program development is investigated. Clearly, a hybrid programming approach will be applied on those systems for best performance, combining message passing between the individual SMP nodes with shared memory programming within a node. This programming model will lead to even more complex programs, and program development tools have to be enhanced to be able to help the user in developing such codes.

References

1. APART: ESPRIT Working Group on Automatic Performance Analysis Resources and Tools, www.fz-juelich.de/apart, 1999.
2. D.P. Bertsekas, J.N. Tsitsiklis: Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, ISBN 0-13-648759-9, 1989.
3. Cray Research: Cray MPP Fortran: Reference Manual, Cray SR-2504, 1994.
4. D.E. Culler, J.P. Singh, A. Gupta: Parallel Computer Architecture - A Hardware/Software Approach, Morgan Kaufmann Publishers, ISBN 1-55860-343-3, 1999.
5. L. Dagum, R. Menon: OpenMP: An Industry-Standard API for Shared-Memory Programming, IEEE Computational Science & Engineering, 1998.
6. Th. Fahringer, M. Gerndt, G. Riley, J. Traff: Knowledge Specification for Automatic Performance Analysis, APART Technical Report, Research Centre Jülich FZJ-ZAM-9918, 1999.
7. I. Foster: Designing and Building Parallel Programs, Addison Wesley, ISBN 0-201-57594-9, 1994.
8. G. Fox: Domain Decomposition in Distributed and Shared Memory Environments, International Conference on Supercomputing, June 8-12, 1987, Athens, Greece, Lecture Notes in Computer Science 297, edited by Constantine Polychronopoulos, 1987.
9. M. Gerndt, A. Krumme, S. Ozmen: Performance Analysis for SVM-Fortran with OPAL, Proceedings Int. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA'95), Athens, Georgia, pp. 561-570, 1995.
10. MPI Forum: Message Passing Interface, www.mcs.anl.gov/mpi, 1999.
11. W.E. Nagel, A. Arnold, M. Weber, H.-C. Hoppe, K. Solchenbach: VAMPIR: Visualization and Analysis of MPI Resources, Supercomputer 63, Vol. 12, No. 1, pp. 69-80, 1996.
12. OpenMP Forum: OpenMP Standard, www.openmp.org, 1999.
13. M. Snir, St. Otto, St. Huss-Lederman, D. Walker, J. Dongarra: MPI - The Complete Reference, MIT Press, ISBN 0-262-69216-3, 1998.
14. Etnus Inc.: TotalView, www.etnus.com/products/totalview/index.html, 1999.

BASIC NUMERICAL LIBRARIES FOR PARALLEL SYSTEMS

I. GUTHEIL

John von Neumann Institute for Computing
Central Institute for Applied Mathematics
Research Centre Jülich, 52425 Jülich, Germany
E-mail: [email protected]

Three public domain libraries with basic numerical operations for distributed memory parallel systems are presented: ScaLAPACK, PLAPACK, and Global Arrays. They are compared not only with respect to performance on the CRAY T3E but also with respect to user-friendliness.

1 Introduction

There are many projects for the parallelization of numerical software. An overview of public domain libraries for high performance computing can be found on the HPC-Netlib homepage1. Often these libraries are very specialized, either concerning the problem which is treated or the platform on which they run. Three of the more general packages based on message passing with MPI will be presented here in some detail: ScaLAPACK2, PLAPACK3, and Global Arrays4.

Often there is an MPI implementation on shared-memory multiprocessor systems, hence libraries based on MPI can also be used there, and often very efficiently, as message-passing programs take great care of data locality.

1.1 ScaLAPACK, Scalable Linear Algebra PACKage

The largest and most flexible public domain library with basic numerical operations for distributed memory parallel systems up to now is ScaLAPACK. Within the ScaLAPACK project many LAPACK5 routines were ported to distributed memory computers using message passing.

The communication in ScaLAPACK is based on the BLACS (Basic Linear Algebra Communication Subroutines)6. Public domain versions of the BLACS based on MPI and PVM are available. For the CRAY T3E there is also a version of the BLACS in the Cray scientific libraries (libsci)7 using Cray shared memory routines (shmem), which is often faster than the public domain versions.

The basic routines of ScaLAPACK are the PBLAS (Parallel Basic Linear Algebra Subroutines). They contain parallel versions of the BLAS (BLAS 1 contains vector-vector operations, e.g. the dot product; BLAS 2 matrix-vector operations, e.g. matrix-vector multiplication; BLAS 3 matrix-matrix operations, e.g. matrix-matrix multiplication), which are parallelized using the BLACS for communication and sequential BLAS for computation. As most vendors offer optimized sequential BLAS, the BLAS and PBLAS deliver very good performance on most parallel computers.

Based on the BLACS and PBLAS, ScaLAPACK contains parallel solvers for dense linear systems and for linear systems with banded system matrix as well as parallel routines for the solution of linear least squares problems and for singular value decomposition. Routines for the computation of all or some of the eigenvalues and eigenvectors of dense real symmetric matrices and dense complex Hermitian matrices and for the generalized symmetric definite eigenproblem are also included in ScaLAPACK.

ScaLAPACK also contains additional libraries to treat distributed matrices and vectors. One of them is the TOOLS library, which offers useful routines, for example, to find out which part of the global matrix a local process has in its memory or to find out the global index of a matrix element corresponding to its local index and vice versa. Unfortunately these routines are documented only in the source code and not in the Users' Guide. Another library is the REDIST library, which is documented in the ScaLAPACK Users' Guide. It contains routines to copy any block-cyclically distributed (sub)matrix to any other block-cyclically distributed (sub)matrix.

ScaLAPACK is a Fortran 77 library which uses C subroutines internally to allocate additional workspace. Especially the PBLAS allocate additional workspace.
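To give an impression of the local/global bookkeeping handled by the TOOLS routines (such as NUMROC and INDXG2L), the following C sketch writes out the standard block-cyclic index arithmetic for one dimension; the same formulas are applied independently to rows and columns. It is only an illustration of the mapping, not a replacement for the library routines, and it uses 0-based indices while the Fortran routines are 1-based.

    #include <stdio.h>

    /* Process (row or column) owning global index ig, for block size nb,
       nprocs processes, and srcproc holding the first block. */
    static int owner(int ig, int nb, int nprocs, int srcproc)
    {
        return (srcproc + ig / nb) % nprocs;
    }

    /* Local index of global index ig on its owning process. */
    static int global_to_local(int ig, int nb, int nprocs)
    {
        return (ig / (nb * nprocs)) * nb + ig % nb;
    }

    /* Number of rows (or columns) of an n-element dimension stored by
       process iproc -- the quantity that NUMROC computes. */
    static int local_count(int n, int nb, int iproc, int srcproc, int nprocs)
    {
        int nblocks = n / nb;                        /* full blocks           */
        int mydist  = (nprocs + iproc - srcproc) % nprocs;
        int count   = (nblocks / nprocs) * nb;       /* evenly divided blocks */
        int extra   = nblocks % nprocs;
        if (mydist < extra)       count += nb;       /* one more full block   */
        else if (mydist == extra) count += n % nb;   /* the trailing partial  */
        return count;
    }

    int main(void)
    {
        /* Example: global index 1000 of a dimension of length 4000, block
           size 64, 8 processes in this grid direction, first block on 0. */
        printf("owner = %d, local index = %d, elements on process 0 = %d\n",
               owner(1000, 64, 8, 0),
               global_to_local(1000, 64, 8),
               local_count(4000, 64, 0, 0, 8));
        return 0;
    }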

1.2 PLAPACK, Parallel Linear Algebra PACKage

PLAPACK does not offer as many black-box solvers as ScaLAPACK but is designed as a parallel infrastructure for developing routines that solve linear algebra problems. With PLAPACK routines the user can create global matrices, vectors, and multiscalars and fill them with values with the help of an API (Application Programming Interface). To make the development of programs easier and to obtain good performance, PLAPACK includes parallel versions of most real BLAS routines as well as solvers for real dense linear systems using LU decomposition and for real symmetric positive definite systems applying Cholesky decomposition, all of which operate on the global data.

PLAPACK is a C library with a Fortran interface in Release 1.2. Unfortunately, up to now Release 1.2 does not run on the CRAY T3E.

1.3 Global Arrays

Like PLAPACK, Global Arrays supplies the user with global linear algebra objects and an interface to fill them with data. For the solution of linear equations there is an interface to special ScaLAPACK routines, which can be modified to use other ScaLAPACK routines, too. For the solution of the full symmetric eigenvalue problem, Global Arrays contains an interface to PeIGS8.

Global Arrays is a Fortran 77 library which uses a memory allocator library for dynamic memory management.

2 Data Distributions

There are many ways to distribute data, especially matrices, to processors. In the ScaLAPACK Users' Guide many of them are presented and discussed.

All three libraries described here distribute matrices to a two-dimensional processor grid, but they do it in different ways, and the user can influence the data distribution to a greater or lesser extent. Users who only want to use routines from the libraries do not have to care about the way data are distributed in PLAPACK and Global Arrays; the distribution is done automatically. To use ScaLAPACK, however, the user has to create and fill the local parts of the global matrix on his own.

2.1 ScaLAPACK: Two-dimensional Block-Cyclic Distribution

For performance and load balancing reasons, ScaLAPACK has chosen a two-dimensional block-cyclic distribution for full matrices (see the ScaLAPACK Users' Guide). First the matrix is divided into blocks of size MB × NB. These blocks are then uniformly distributed across the NP × NQ processor grid in a cyclic manner. As a result, every process owns a collection of blocks, which are