
  • Publication Series of the John von Neumann Institute for Computing (NIC)

    NIC Series Volume 1

  • John von Neumann Institute for Computing (NIC)

    Johannes Grotendorst (Editor)

    Modern Methods and Algorithms of Quantum Chemistry

    Winterschool, 21 - 25 February 2000

    Forschungszentrum Jülich, Germany

    Proceedings

    organized by

    John von Neumann Institute for Computing

    in cooperation with

    Arbeitsgemeinschaft für Theoretische Chemie

    NIC Series Volume 1

    ISBN 3-00-005618-1

  • Die Deutsche Bibliothek – CIP-Cataloguing-in-Publication-Data
    A catalogue record for this publication is available from Die Deutsche Bibliothek.

    Modern Methods and Algorithms of Quantum Chemistry
    Proceedings, Winterschool, 21 - 25 February 2000
    Forschungszentrum Jülich, Germany
    edited by Johannes Grotendorst

    Publisher: NIC-Directors

    Distributor: NIC-Secretariat
    Research Centre Jülich
    52425 Jülich
    Germany

    Internet: www.fz-juelich.de/nic

    Printer: Graphische Betriebe, Forschungszentrum Jülich

    © 2000 by John von Neumann Institute for Computing
    Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise requires prior specific permission by the publisher mentioned above.

    NIC Series Volume 1
    ISBN 3-00-005618-1

  • PREFACE

    Computational quantum chemistry has long promised to become a major tool for the study of molecular properties and reaction mechanisms. The fundamental methods of quantum chemistry date back to the earliest days of quantum mechanics in the first decades of the twentieth century. However, widespread quantitative applications have only become common practice in recent times, primarily because of the explosive developments in computer hardware and the associated achievements in the design of new and improved theoretical methods and computational techniques. The significance of these advances in computational quantum chemistry is underlined by the 1998 chemistry Nobel prize to Walter Kohn and John Pople; this award also documents the increasing acceptance of computer simulations and scientific computing as an important research method in chemistry.

    Nearly one third of the projects which use the supercomputing facilities provided by the John von Neumann Institute for Computing (NIC) pertain to the area of computational chemistry. For projects in quantum chemistry, the Central Institute for Applied Mathematics (ZAM), which runs the Cray supercomputers and networks at the Research Centre Jülich, offers several extensive software packages running on its supercomputer complex. The computational requirements of large quantum-chemical calculations are enormous. They have made the use of parallel computers indispensable and have led to the development of a broad range of advanced algorithms for these machines. The goal of this Winterschool is to bring together experts from the fields of quantum chemistry, computer science and applied mathematics in order to present recent methodological and computational advances to research students in the field of theoretical chemistry and its applications. The participants will also learn about recent software developments and about implementation issues that are encountered in quantum chemistry codes, particularly in the context of high-performance computing (topics not yet included in typical university courses). The emphasis of the Winterschool is on method development and algorithms, but state-of-the-art applications will also be demonstrated for illustration. The following topics are covered by twenty lectures:

    • Density functional theory

    • Ab initio molecular dynamics

    • Post-Hartree-Fock methods

    • Molecular properties

    • Heavy-element chemistry

    • Linear scaling approaches

    • Semiempirical and hybrid methods

    • Parallel programming models and tools

    • Numerical techniques and automatic differentiation

    • Industrial applications

  • The programme was compiled by Johannes Grotendorst (Forschungszentrum Jülich), Marius Lewerenz (Université Pierre et Marie Curie, Paris), Walter Thiel (MPI für Kohlenforschung, Mülheim an der Ruhr) and Hans-Joachim Werner (Universität Stuttgart).

    Fostering education and training in important fields of scientific computing by symposia, workshops, schools and courses is a major objective of NIC. This Winterschool continues a series of workshops and conferences in the field of computational chemistry organized by the ZAM in the last years; it provides a forum for the scientific exchange between young research students and experts from different academic disciplines. The NIC Winterschool will host more than two hundred participants from sixteen countries, and more than fifty abstracts have been submitted for the poster session. This overwhelming international resonance clearly reflects the attractiveness of the programme. The excellent support of the Arbeitsgemeinschaft für Theoretische Chemie in preparing the Winterschool is highly appreciated.

    As in previous conferences, many people have made significant contributions to the success of this Winterschool. The local organization at Research Centre Jülich was perfectly done by Elke Bielitza, Rüdiger Esser, Bernd Krahl-Urban, Monika Marx, Renate Mengels, and Margarete Reiser. We are grateful for the generous financial support by the Federal Ministry for Education and Research (BMBF) and by the Research Centre Jülich, and for the help provided by its Conference Service. We also thank the authors for their willingness to provide a written version of their lecture notes. Special thanks go to Monika Marx for her commitment concerning the composition and realization of these proceedings and the book of abstracts of the poster session. Finally, we are indebted to Beate Herrmann, who supported the typesetting of these books with professionalism and great care.

    Jülich, February 2000                                        Johannes Grotendorst

  • CONTENTS

    Industrial Challenges for Quantum Chemistry

    Ansgar Schäfer 1

    1 Introduction 1

    2 Application Fields of Quantum Chemistry in Industry 2

    3 Unsolved Problems 3

    4 Conclusion 5

    Methods to Calculate the Properties of Large Molecules

    Reinhart Ahlrichs 7

    Parallel Programming Models, Tools and Performance Analysis

    Michael Gerndt 9

    1 Introduction 9

    2 Programming Models 12

    3 Parallel Debugging 18

    4 Performance Analysis 18

    5 Summary 25

    Basic Numerical Libraries for Parallel Systems

    Inge Gutheil 29

    1 Introduction 29

    2 Data Distributions 30

    3 User-Interfaces 33

    4 Performance 36

    5 Conclusions 44

    Tools for Parallel Quantum Chemistry Software

    Thomas Steinke 49

    1 Introduction 49

    2 Basic Tasks in Typical Quantum Chemical Calculations 50

    3 Parallel Tools in Today's Production Codes 52

    4 The TCGMSG Library 53

    5 The Global Array Toolkit 54

    6 The Distributed Data Interface Used in GAMESS (US) 60

    7 Further Reading 63

    Ab Initio Methods for Electron Correlation in Molecules

    Peter Knowles, Martin Schütz, and Hans-Joachim Werner 69

    1 Introduction 69

    2 Closed-Shell Single-Reference Methods 83

  • 3 Open-Shell Single-Reference Methods 99

    4 Linear Scaling Local Correlation Methods 102

    5 Multireference Electron Correlation Methods 112

    6 Integral-Direct Methods 133

    R12 Methods, Gaussian Geminals

    Wim Klopper 153

    1 Introduction 153

    2 Errors in Electronic-Structure Calculations 154

    3 The Basis-Set Error 155

    4 Coulomb Hole 163

    5 Many-Electron Systems 166

    6 Second Quantization 167

    7 Explicitly Correlated Coupled-Cluster Doubles Model 168

    8 Weak Orthogonality Techniques 171

    9 R12 Methods 173

    10 Explicitly Correlated Gaussians 178

    11 Similarity Transformed Hamiltonians 179

    12 MP2-Limit Corrections 180

    13 Computational Aspects of R12 Methods 183

    14 Numerical Examples 188

    15 Concluding Remark 191

    Direct Solvers for Symmetric Eigenvalue Problems

    Bruno Lang 203

    1 Setting the Stage 203

    2 Eigenvalue Computations: How Good, How Fast, and How? 205

    3 Tools of the Trade: Basic Orthogonal Transformations 210

    4 Phase I: Reduction to Tridiagonal Form 213

    5 Phase II: Methods for Tridiagonal Matrices 219

    6 Methods without Initial Tridiagonalization 226

    7 Available Software 228

    Semiempirical Methods

    Walter Thiel 233

    1 Introduction 233

    2 Established Methods 234

    3 Beyond the MNDO Model: Orthogonalization Corrections 238

    4 NMR Chemical Shifts 241

    5 Analytic Derivatives 246

    6 Parallelization 248

    7 Linear Scaling and Combined QM/MM Approaches 250

    8 Conclusions 252

  • Hybrid Quantum Mechanics/Molecular Mechanics Approaches

    Paul Sherwood 257

    1 Introduction 257

    2 Terminology 257

    3 Overview of QM/MM Schemes 258

    4 The Issue of Conformational Complexity 268

    5 Software Implementation 269

    6 Summary and Outlook 271

    Subspace Methods for Sparse Eigenvalue Problems

    Bernhard Steffen 279

    1 Introduction 279

    2 Eigenvalue Extraction 280

    3 Update Procedures 282

    4 Problems of Implementation and Parallelization 284

    5 Conclusions 285

    Computing Derivatives of Computer Programs

    Christian Bischof and Martin Bücker 287

    1 Introduction 287

    2 Basic Modes of Automatic Differentiation 290

    3 Design of Automatic Differentiation Tools 291

    4 Using Automatic Differentiation Tools 293

    5 Concluding Remarks 297

    Ab Initio Molecular Dynamics: Theory and Implementation

    Dominik Marx and Jürg Hutter 301

    1 Setting the Stage: Why Ab Initio Molecular Dynamics ? 301

    2 Basic Techniques: Theory 305

    3 Basic Techniques: Implementation within the CPMD Code 343

    4 Advanced Techniques: Beyond ... 392

    5 Applications: From Materials Science to Biochemistry 418

    Relativistic Electronic-Structure Calculations for Atoms and Molecules

    Markus Reiher and Bernd Heß 451

    1 Qualitative Description of Relativistic Effects 451

    2 Fundamentals of Relativistic Quantum Chemistry 452

    3 Numerical 4-Component Calculations for Atoms 453

    4 A Short History of Relativistic Atomic Structure Calculations 453

    5 Molecular Calculations 463

    6 Epilogue 472

  • Effective Core Potentials

    Michael Dolg 479

    1 Introduction 479

    2 All-Electron Hamiltonian 484

    3 Valence-Only Hamiltonian 487

    4 Analytical Form of Pseudopotentials 493

    5 Adjustment of Pseudopotentials 495

    6 Core Polarization Potentials 497

    7 Calibration Studies 498

    8 A Few Hints for Practical Calculations 501

    Molecular Properties

    Jürgen Gauss 509

    1 Introduction 509

    2 Molecular Properties as Analytical Derivatives 510

    3 Magnetic Properties 527

    4 Frequency-Dependent Properties 545

    5 Summary 553

    Tensor Concepts in Electronic Structure Theory:

    Application to Self-Consistent Field Methods and

    Electron Correlation Techniques

    Martin Head-Gordon 561

  • INDUSTRIAL CHALLENGES FOR QUANTUM CHEMISTRY

    ANSGAR SCHÄFER

    BASF Aktiengesellschaft
    Scientific Computing
    ZDP/C - C13
    67056 Ludwigshafen
    Germany

    E-mail: [email protected]

    The current fields of application of quantum chemical methods in the chemical industry are described. Although there are a lot of questions that can already be tackled with modern algorithms and computers, there are still important problems left which will need further improved methods. A few examples are given in this article.

    1 Introduction

    Already in the 1970's and 80's, quantum chemical methods were very successful in describing the structure and properties of organic and main-group inorganic molecules. The Hartree-Fock (HF) method and its simplified semi-empirical modifications became standard tools for a vivid rationalization of chemical processes. The underlying molecular orbital (MO) picture was, and still is, the most important theoretical concept for the interpretation of reactivity and molecular properties. Nevertheless, quantum chemical methods were not used extensively for industrial problems, although most of the industrial chemistry produces organic compounds. One reason can be found in the fact that almost all industrial processes are catalytic. The catalysts are predominantly transition metal compounds, which in general have a more complicated electronic structure than main-group compounds, since their variability in the occupation of the d orbitals results in a subtle balance of several close-lying energy levels. HF and post-HF methods based on a single electron configuration are not able to describe this situation correctly. Furthermore, the catalyst systems were generally too big to be handled. The usual approach was to choose small model systems, e.g. with PH3 substituting any phosphine ligand or a cluster representing a solid surface. Such investigations provided only a basic understanding of the catalytic reaction, but no detailed knowledge of steric and electronic dependencies.

    With the improvement of both the methodology and the algorithms of density functional theory (DFT) in the last two decades, the situation changed significantly. DFT appears to be less sensitive to near degeneracy of electronic states, and furthermore incorporates some effects of electron correlation. The development of new functionals with an improved description of non-uniform electron distributions in molecules or on surfaces paved the way for a qualitative or even quantitative quantum chemical treatment of a large variety of transition metal compounds and their reactions. When functionals without partial inclusion of HF exchange contributions are used, an approximate treatment of the Coulomb interaction of the electrons (density fitting, resolution of the identity (RI) approach) allows for a very efficient treatment of large systems. Therefore, with efficiently parallelized programs of this kind, it is routinely possible today to calculate the structure of molecules with 100-200 atoms.

    Although the chemical processes in many cases involve transition metal compounds, calculations on pure organic molecules are still important to predict properties like thermodynamic data or various types of spectra. However, for a quantitative agreement between calculated and experimental results, DFT very often is not reliable enough, and approaches going beyond the HF approximation are necessary to assess the effects caused by electron correlation. These methods are computationally very demanding, and highly accurate calculations are still limited to small molecules with not more than about 10 atoms. Therefore, still only few problems of industrial relevance can be tackled by these methods at the moment.

    2 Application fields of quantum chemistry in industry

    2.1 Catalysis

    As already mentioned, many of the industrial chemical processes involve catalysts. Most of the catalysts are in the solid state (heterogeneous catalysis), but, with the extensive developments in organometallic chemistry in the last decades, catalytic processes in the liquid phase become more and more important (homogeneous catalysis). For the development of a catalyst, besides economic considerations, three issues are of central importance:

    • Activity: The catalyst must be efficient (high turnover numbers).

    • Selectivity: By-products should be avoided.

    • Stability: Deactivation or decomposition of the catalyst must be slow.

    To improve a catalyst with respect to these criteria, a detailed understanding of the reaction pathways is necessary. This is one point where quantum chemical methods can be of enormous value, since the experimental investigation of steps in a complicated mechanism is rather difficult. Once the crucial parameters are found, new guesses for better performing catalysts can be deduced and immediately be tested in the calculations. Thus, when theory is used for a rough screening and only the most promising candidates have to be tested in experiment, the development process for new catalysts can be shortened significantly. DFT is used almost exclusively for both homogeneous and heterogeneous applications. In the latter case, solids are treated with periodic boundary conditions or QM/MM approaches.

    2.2 Process design

    The design of chemical processes and plants requires the knowledge of accurate thermodynamic and kinetic data for all substances and reactions involved. The experimental determination of such data is rather time-consuming and expensive. The substances have to be prepared in high purity, and precise calorimetric or kinetic measurements have to be done for well-defined reactions. This effort must be invested before the actual technical realization of a new process, because for economic and safety reasons the data has to be as accurate as possible. However, for an early assessment of the practicability and profitability of a process, a fast but nevertheless reliable estimate for the thermodynamics usually is sufficient. A number of empirical methods based on group contributions are used for this purpose, but they are not generally applicable and often not reliable enough. For example, these methods often cannot discriminate between isomers containing the same number and type of groups. Alternatively, reaction enthalpies and entropies can also be calculated by quantum chemical methods in combination with statistical mechanics. For the rotational and vibrational energy levels, a correct description of the molecular structure and the shape of the energy surface is needed, which can very efficiently be obtained with DFT. The crucial point for the overall accuracy, however, lies in the differences in electronic energies, and high-level ab initio methods like coupled cluster theories are often needed to get the error down to a few kcal/mol. That such a high accuracy is needed can be illustrated by the fact that a change in the Gibbs free energy of only 1.4 kcal/mol already changes an equilibrium constant by an order of magnitude at room temperature.
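
    This order-of-magnitude estimate follows from the standard relation between the equilibrium constant and the Gibbs free energy of reaction (a textbook relation added here for illustration, not part of the original article); with RT ≈ 0.59 kcal/mol at T = 298 K,

      K = \exp\!\left(-\frac{\Delta G}{RT}\right), \qquad
      \frac{K'}{K} = \exp\!\left(\frac{\Delta G - \Delta G'}{RT}\right)
                   = \exp\!\left(\frac{1.4~\mathrm{kcal/mol}}{0.59~\mathrm{kcal/mol}}\right) \approx 10 .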

    2.3 Material properties

    When a desired property of a material can be connected to quantities on the atomic or molecular scale, quantum chemistry can be a useful tool in the process of improving such materials. Typical examples are dyes and pigments, for which color and brilliance depend on the energies and nature of electronic excitations. Organic dyes typically have delocalized pi systems and functional groups chosen appropriately to tune the optical properties. Since the molecules normally are too big for an ab initio treatment with the desired accuracy, semi-empirical methods are currently used to calculate the excited states.

    3 Unsolved problems

    3.1 Treatment of the molecular environment

    Quantum chemical methods are predominantly applied to isolated molecules, which corresponds to the state of an ideal gas. Most chemical processes, however, take place in condensed phase, and the interaction of a molecule with its environment can generally not be neglected. Two important examples are given here.

    3.1.1 Solvent effects

    Solvent molecules can directly interact with the reacting species, e.g. by coordination to a metal center or by formation of hydrogen bonds. In such cases it is necessary to explicitly include solvent molecules in the calculation. Depending on the size of the solvent molecules and their number needed to get the calculated properties converged, the overall size of the molecular system and the resulting computational effort can be increased significantly. Currently, only semi-empirical methods are able to handle several hundred atoms, but the developments towards linear scaling approaches in DFT are very promising. An alternative would be a mixed quantum mechanical (QM) and molecular mechanical (MM) treatment (QM/MM method).

    If there are no specific solute-solvent interactions, the main effect of the solvent is electrostatic screening, depending on its dielectric constant. This can be described very efficiently by continuum solvation models (CSM).

    3.1.2 Enzymes

    Biomolecules like enzymes usually consist of many thousands of atoms and therefore cannot be handled by quantum chemical methods. Although there exist several very elaborate force field methods which quite reliably reproduce protein structures, the accurate description of the interaction of the active site with a ligand bound to it still is an unsolved problem. Quantum mechanics is needed for a quantitative assessment of polarization effects and for the description of reactions. Treating only the active site quantum mechanically usually does not give the correct results because of the strong electrostatic influence of the whole enzyme and the water molecules included. QM/MM approaches can be used for the treatment of the whole system, but the QM/MM coupling still needs improvement.

    A very important quantity of an enzyme/ligand complex is its binding free energy. It determines the measured equilibrium constant for a ligand exchange and is therefore important for the development of enzyme inhibitors. A reliable calculation of such binding constants is currently not possible. It certainly must involve quantum chemistry and molecular dynamics.

    3.2 Accurate thermochemistry and kinetics

    The highly accurate calculation of thermochemical data with ab initio methods is currently possible only for small molecules up to about 10 atoms. However, many of the data for molecules of this size are already known, whereas accurate experiments for larger compounds are quite rare. Therefore, efficient ab initio methods are needed which are able to treat molecules with 30-50 atoms at the same level of accuracy. The currently developed local treatments of electron correlation are very promising in this direction.

    Another problem arises for large molecules. They often have a high torsional flexibility, and the calculation of partition functions based on a single conformer is therefore not correct. Quantum molecular dynamics could probably give better answers, but is in many cases too expensive.

    For the assessment of catalytic mechanisms, kinetic data are needed to discriminate between different reaction pathways. As described above, ab initio methods are often not applicable when transition metals are involved. The DFT results for reaction energies are usually reliable enough to give correct trends, but calculated activation barriers can easily be wrong by a factor of two or more. The problem lies in the functionals, which are best at describing the electron distribution in molecular ground states.

    3.3 Spectroscopy for large molecules

    Calculated molecular spectroscopic properties are very helpful in the assignment and interpretation of measured spectra, provided that the accuracy is sufficiently high. In many cases IR, Raman and NMR spectra can be obtained with reasonable accuracy on the DFT or MP2 level, but UV/VIS spectra normally require more elaborate theories like configuration interaction, which are only applicable to very small molecules. To improve on the currently applied semi-empirical approaches it would be necessary to calculate accurate excitation energies also for, e.g., organic dyes with 50 or even more atoms.

    4 Conclusion

    The improvement of the efficiency of the algorithms and the enormous increase of the available computer power have already made quantum chemistry applicable to a lot of industrial problems. However, there are still many aspects concerning the accuracy, completeness and efficiency of the quantum chemical treatment which will need more attention in the future.

    Currently, DFT is the most widely used quantum theoretical method in industry. Also of high importance are semi-empirical methods for certain applications. Both approaches have in common that they do not offer a systematic way for the improvement of results if they are found to be not reliable enough. Therefore, there is still need for efficient ab initio methods, at least as a reference.

  • METHODS TO CALCULATE THE PROPERTIES OF LARGE MOLECULES

    REINHART AHLRICHS

    Universität Karlsruhe (TH)
    Institut für Physikalische Chemie und Elektrochemie
    Kaiserstr. 12, 76128 Karlsruhe

    E-Mail: [email protected]

    Lecture notes were not available at printing time.

  • PARALLEL PROGRAMMING MODELS, TOOLS AND PERFORMANCE ANALYSIS

    MICHAEL GERNDT

    John von Neumann Institute for Computing
    Central Institute for Applied Mathematics
    52425 Jülich
    Germany

    E-mail: [email protected]

    The major parallel programming models for scalable parallel architectures are the message passing model and the shared memory model. This article outlines the main concepts of those models as well as the industry-standard programming interfaces MPI and OpenMP. To exploit the potential performance of parallel computers, programs need to be carefully designed and tuned. We will discuss design decisions for good performance as well as programming tools that help the programmer in program tuning.

    1 Introduction

    Although the performance of sequential computers increases incredibly fast, it is insufficient for a large number of challenging applications. Applications requiring much more performance are numerical simulations in industry and research as well as commercial applications such as query processing, data mining, and multi-media applications. Architectures offering high performance not only exploit parallelism at a very fine grain within a single processor but also apply a medium to large number of processors concurrently to a single application. High-end parallel computers deliver up to 3 Teraflop/s (10^12 floating point operations per second) and are developed and exploited within the ASCI (Accelerated Strategic Computing Initiative) program of the Department of Energy in the USA.

    This article concentrates on programming numerical applications on distributed memory computers introduced in Section 1.1. Parallelization of those applications centers around selecting a decomposition of the data domain onto the processors such that the workload is well balanced and the communication is reduced (Section 1.2) [7].

    The parallel implementation is then based on either the message passing or the shared memory model (Section 2). The standard programming interface for the message passing model is MPI (Message Passing Interface) [13,10], offering a complete set of communication routines (Section 2.1). OpenMP [5,12] is the standard for directive-based shared memory programming and will be introduced in Section 2.2.

    Since parallel programs exploit multiple threads of control, debugging is even more complicated than for sequential programs. Section 3 outlines the main concepts of parallel debuggers and presents TotalView, the most widely available debugger for parallel programs.

    Although the domain decomposition is key to good performance on parallel architectures, program efficiency also depends heavily on the implementation of the communication and synchronization required by the parallel algorithm and on the implementation techniques chosen for sequential kernels. Optimizing those aspects is very system dependent, and thus an interactive tuning process, consisting of measuring performance data and applying optimizations, follows the initial coding of the application. The tuning process is supported by programming-model-specific performance analysis tools. Section 4 presents basic performance analysis techniques and introduces the three performance analysis tools for MPI programs available on the CRAY T3E.

    1.1 Parallel Architectures

    Parallel computers that scale beyond a small number of processors circumvent the main memory bottleneck by distributing the memory among the processors. Current architectures [4] are composed of single-processor nodes with local memory or of multiprocessor nodes where the node's main memory is shared among the node's processors. In the following it is assumed that nodes have only a single CPU, and the terms node and processor will be used interchangeably. The most important characteristic of this distributed memory architecture is that access to the local memory is faster than to remote memory. It is the challenge for the programmer to assign data to the processors such that most of the data accessed during the computation are already in the node's local memory.

    Three major classes of distributed memory computers can be distinguished:

    No Remote Memory Access (NORMA) computers do not have any hardware support to access another node's local memory. Processors obtain data from remote memory only by exchanging messages between processes on the requesting and the supplying node.

    Remote Memory Access (RMA) computers allow access to remote memory via specialized operations implemented in hardware. The accessed memory location is not determined via an address in a shared linear address space but via a tuple consisting of the processor number and the local address in the target processor's address space.

    Cache-Coherent Non-Uniform Memory Access (ccNUMA) computers do have a shared physical address space. All memory locations can be accessed via the usual load and store operations. Access to a remote location results in a copy of the appropriate cache line in the processor's cache. Coherence algorithms ensure that multiple copies of a cache line are kept coherent, i.e. the copies do have the same value.

    While most of the early parallel computers were NORMA systems, today's systems are either RMA or ccNUMA computers. This is because remote memory access is a light-weight communication protocol that is more efficient than standard message passing, since data copying and process synchronization are eliminated. In addition, ccNUMA systems offer the abstraction of a shared linear address space resembling physical shared memory systems. This abstraction simplifies the task of program development but does not necessarily facilitate program tuning.

    Typical examples of the three classes are clusters of workstations (NORMA), CRAY T3E (RMA), and SGI Origin 2000 (ccNUMA).

    [Figure 1. Structure of the matrix during Gaussian elimination.]

    1.2 Data Parallel Programming

    Applications that scale to a large number of processors usually perform computations on large data domains. For example, crash simulations are based on partial differential equations that are solved on a large finite element grid, and molecular dynamics applications simulate the behavior of a large number of atoms. Other parallel applications apply linear algebra operations to large vectors and matrices. The elemental operations on each object in the data domain can be executed in parallel by the available processors.

    The scheduling of operations to processors is determined according to a selected domain decomposition [8]. Processors execute those operations that determine new values for local elements (owner-computes rule). While processors execute an operation, they might need values from other processors. The domain decomposition has thus to be chosen so that the distribution of operations is balanced and the communication is minimized. The third goal is to optimize single-node computation, i.e. to be able to exploit the processor's pipelines and the processor's caches efficiently.

    A good example of the design decisions taken when selecting a domain decomposition is Gaussian elimination [2]. The main structure of the matrix during the iterations of the algorithm is outlined in Figure 1.

    The goal of this algorithm is to eliminate all entries in the matrix below the main diagonal. It starts at the top diagonal element and subtracts multiples of the first row from the second and subsequent rows to end up with zeros in the first column. This operation is repeated for all the rows. In later stages of the algorithm the actual computations have to be done on rectangular sections of decreasing size.

    If the main diagonal element of the current row is zero, a pivot operation has to be performed. The subsequent row with the maximum value in this column is selected and exchanged with the current row.

    A possible distribution of the matrix is to decompose its columns into blocks, one block for each processor. The elimination of the entries in the lower triangle can then be performed in parallel, where each processor computes new values for its columns only. The main disadvantage of this distribution is that in later computations of the algorithm only a subgroup of the processes is actually doing any useful work, since the computed rectangle is getting smaller.

    To improve load balancing, a cyclic column distribution can be applied. The computations in each step of the algorithm executed by the processors differ only in one column.

    In addition to load balancing, communication also needs to be minimized. Communication occurs in this algorithm for broadcasting the current column to all the processors, since it is needed to compute the multiplication factor for the row. If the domain decomposition is a row distribution, which eliminates the need to communicate the current column, only the main diagonal element needs to be broadcast to the other processors.

    If we also consider the pivot operation, communication is necessary to select the best row when a rowwise distribution is applied, since the computation of the global maximum in that column requires a comparison of all values.

    Selecting the best domain decomposition is further complicated by the need to optimize single-node performance. In this example, it is advantageous to apply BLAS3 operations for the local computations. Those operations make use of blocks of rows to improve cache utilization. Blocks of rows can only be obtained if a block-cyclic distribution is applied, i.e. columns are not distributed individually but blocks of columns are cyclically distributed.
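
    The column distributions discussed above can be made concrete with a small sketch (added for illustration; the helper names are not from the lecture notes). It maps a global column index to its owning processor for a block, a cyclic, and a block-cyclic distribution:

      /* Illustrative mapping of a global column index j to its owner.
         ncols: number of columns, nprocs: number of processors,
         nb: block size of the block-cyclic distribution. */
      int owner_block(int j, int ncols, int nprocs)
      {
          int block = (ncols + nprocs - 1) / nprocs;   /* columns per processor      */
          return j / block;
      }

      int owner_cyclic(int j, int nprocs)
      {
          return j % nprocs;                           /* column j to processor j mod p */
      }

      int owner_block_cyclic(int j, int nb, int nprocs)
      {
          return (j / nb) % nprocs;                    /* blocks of nb columns, dealt out cyclically */
      }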

    This discussion makes clear that choosing a domain decomposition is a very complicated step in program development. It requires deep knowledge of the algorithm's data access patterns as well as the ability to predict the resulting communication.

    2 Programming Models

    The two main programming models, message passing and shared memory, offer different features for implementing applications parallelized by domain decomposition.

    The message passing model is based on a set of processes with private data structures. Processes communicate by exchanging messages with special send and receive operations. The domain decomposition is implemented by developing a code describing the local computations and local data structures of a single process. Thus, global arrays have to be split up and only the local part allocated in a process. This handling of global data structures is called data distribution. Computations on the global arrays also have to be transformed, e.g. by adapting the loop bounds, to ensure that only local array elements are computed. Accesses to remote elements have to be implemented via explicit communication: temporary variables have to be allocated, messages set up and transmitted to the target process.

    The shared memory model is based on a set of threads that are created when parallel operations are executed. This type of computation is also called fork-join parallelism. Threads share a global address space and thus access array elements via a global index.

    The main parallel operations are parallel loops and parallel sections. Parallel loops are executed by a set of threads, also called a team. The iterations are distributed onto the threads according to a predefined strategy. This scheduling strategy implements the chosen domain decomposition. Parallel sections are also executed by a team of threads, but the tasks assigned to the threads implement different operations. This feature can for example be applied if domain decomposition itself does not generate enough parallelism and whole operations can be executed in parallel since they access different data structures.

    In the shared memory model, the distribution of data structures onto the node memories is not enforced by decomposing global arrays into local arrays; instead, the global address space is distributed onto the memories on system level. For example, the pages of the virtual address space can be distributed cyclically or can be assigned on a first-touch basis. The chosen domain decomposition thus has to take into account the granularity of the distribution, i.e. the size of pages, as well as the system-dependent allocation strategy.

    While the domain decomposition has to be hardcoded into a message passing program, it can easily be changed in a shared memory program by selecting a different scheduling strategy for parallel loops.

    Another advantage of the shared memory model is that automatic and incremental parallelization is supported. While automatic parallelization leads to a first working parallel program, its efficiency typically needs to be improved. The reason for this is that parallelization techniques work on a loop-by-loop basis and do not globally optimize the parallel code via a domain decomposition. In addition, dependence analysis, the prerequisite for automatic parallelization, is limited to statically known access patterns.

    In the shared memory model, a first parallel version is relatively easy to implement and can be incrementally tuned. In the message passing model, in contrast, the program can be tested only after finishing the full implementation. Subsequent tuning by adapting the domain decomposition is usually time consuming.

    2.1 MPI

    The Message Passing Interface (MPI) [13,10] was developed between 1993 and 1997. It includes routines for point-to-point communication, collective communication, one-sided communication, and parallel IO. While the basic communication primitives have been defined since May 1994 and are implemented on almost all parallel computers, remote memory access and parallel IO routines are part of MPI 2.0 and are only available on a few machines.

    2.1.1 MPI basic routines

    MPI consists of more than 120 functions, but realistic programs can already be developed based on no more than six functions (a minimal example follows the list below):

    MPI_Init initializes the library. It has to be called at the beginning of a parallel operation before any other MPI routines are executed.

    MPI_Finalize frees any resources used by the library and has to be called at the end of the program.

    MPI_Comm_size determines the number of processors executing the parallel program.

    MPI_Comm_rank returns the unique process identifier.

    MPI_Send transfers a message to a target process. This operation is a blocking send operation, i.e. it terminates when the message buffer can be reused, either because the message was copied to a system buffer by the library or because the message was delivered to the target process.

    MPI_Recv receives a message. This routine terminates when a message has been copied into the receive buffer.
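
    As an illustration (a minimal sketch added here, not part of the lecture notes), the following program uses only these six routines; process 0 sends one integer to process 1:

      /* Minimal sketch of the six basic MPI routines.
         Run with at least two processes; error handling omitted. */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int size, rank, value;
          MPI_Status status;

          MPI_Init(&argc, &argv);                  /* initialize the library         */
          MPI_Comm_size(MPI_COMM_WORLD, &size);    /* number of processes            */
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* identifier of this process     */

          if (size >= 2) {
              if (rank == 0) {
                  value = 42;
                  /* blocking send of one integer with tag 0 to process 1 */
                  MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
              } else if (rank == 1) {
                  /* blocking receive of the matching message from process 0 */
                  MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
                  printf("process %d received %d\n", rank, value);
              }
          }

          MPI_Finalize();                          /* release MPI resources          */
          return 0;
      }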

    2.1.2 MPI communicator

    All communication routines depend on the concept of a communicator. A communicator consists of a process group and a communication context. The processes in the process group are numbered from zero to process count - 1. The process number returned by MPI_Comm_rank is the identification in the process group of the communicator which is passed as a parameter to this routine.

    The communication context of the communicator is important in identifying messages. Each message has an integer number called a tag which has to match a given selector in the corresponding receive operation. The selector depends on the communicator and thus on the communication context. It selects only messages with a fitting tag that have been sent relative to the same communicator. This feature is very useful in building parallel libraries, since messages sent inside the library will not interfere with messages outside if a special communicator is used in the library. The default communicator that includes all processes of the application is MPI_COMM_WORLD.

    2.1.3 MPI collective operations

    Another important class of operations are collective operations. Collective operations are executed by a process group identified via a communicator. All the processes in the group have to perform the same operation. Typical examples of such operations are:

    MPI_Reduce performs a global operation on the data of each process in the process group. For example, the sum of all values of a distributed array can be computed by first summing up all local values in each process and then summing up the local sums to get a global sum (see the sketch after this list). The latter step can be performed by the reduction operation with the parameter MPI_SUM. The result is delivered to a single target processor.

    MPI_Comm_split is an administration routine for communicators. It allows the creation of multiple new communicators based on a given coloring scheme. All processes of the original communicator have to take part in that operation.

    MPI_Barrier synchronizes all processes. None of the processes can proceed beyond the barrier until all the processes have started execution of that routine.
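
    The global-sum example mentioned for MPI_Reduce could be sketched as follows (an added illustration; the local array contents are placeholders):

      /* Sketch: global sum of a distributed array with MPI_Reduce. */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int rank;
          double a[4] = {1.0, 2.0, 3.0, 4.0};      /* local part of the array      */
          double local_sum = 0.0, global_sum = 0.0;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          for (int i = 0; i < 4; i++)              /* sum up the local elements    */
              local_sum += a[i];

          /* combine the local sums; the global sum is delivered to rank 0 */
          MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
                     MPI_COMM_WORLD);

          if (rank == 0)
              printf("global sum = %f\n", global_sum);

          MPI_Finalize();
          return 0;
      }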

    2.1.4 MPI IO

    Data parallel applications make use of the IO subsystem to read and write big data sets. These data sets result from replicated or distributed arrays. The reasons for IO are to read input data, to pass information to other programs, e.g. for visualization, or to store the state of the computation to be able to restart the computation in case of a system failure or if the computation has to be split into multiple runs due to its resource requirements.

    IO can be implemented in three ways:

    1. Sequential IO: A single node is responsible for performing the IO. It gathers information from the other nodes and writes it to disc, or reads information from disc and scatters it to the appropriate nodes. While the IO is sequential and thus need not be parallelized, the full performance of the IO subsystem might not be utilized. Modern systems provide high-performance IO subsystems that are fast enough to support multiple IO requests from different nodes in parallel.

    2. Private IO: Each node accesses its own files. The big advantage of this implementation is that no synchronization among the nodes is required and very high performance can be obtained. The major disadvantage is that the user has to handle a large number of files. For input, the original data set has to be split according to the distribution of the data structure, and for output the process-specific files have to be merged into a global file for postprocessing.

    3. Parallel IO: In this implementation all the processes access the same file. They read and write only those parts of the file with relevant data. The main advantages are that no individual files need to be handled and that reasonable performance can be reached. The disadvantage is that it is difficult to reach the same performance as with private IO. The parallel IO interface of MPI provides flexible and high-level means to implement applications with parallel IO.

    Files accessed via MPI IO routines have to be opened and closed by collective operations. The open routine allows the specification of hints to optimize the performance, such as whether the application might profit from combining small IO requests from different nodes, what size is recommended for the combined request, and how many nodes should be engaged in merging the requests.

    The central concept in accessing the files is the view. A view is defined for each process and specifies a sequence of data elements to be ignored and data elements to be read or written by the process. When reading or writing a distributed array, the local information can be described easily as such a repeating pattern. The IO operations read and write a number of data elements on the basis of the defined view, i.e. they access the local information only. Since the views are defined via runtime routines prior to the access, the information can be exploited in the library to optimize IO.

    MPI IO provides blocking as well as nonblocking operations. In contrast to blocking operations, the nonblocking ones only start IO and terminate immediately. If the program depends on the successful completion of the IO, it has to check it via a test function. Besides the collective IO routines, which allow individual requests to be combined, noncollective routines are also available to access shared files.
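
    A minimal sketch of the view concept (added for illustration; the file name and block size are arbitrary): each process writes its contiguous block of a distributed integer array to one shared file, with the view simply skipping the blocks of the lower-ranked processes:

      /* Sketch: parallel IO into one shared file using a per-process view. */
      #include <mpi.h>

      int main(int argc, char **argv)
      {
          int rank, buf[100];
          MPI_File fh;
          MPI_Offset disp;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          for (int i = 0; i < 100; i++) buf[i] = rank;     /* local data block */

          /* collective open of the shared file */
          MPI_File_open(MPI_COMM_WORLD, "array.dat",
                        MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

          /* the view starts behind the blocks owned by lower-ranked processes */
          disp = (MPI_Offset)rank * 100 * sizeof(int);
          MPI_File_set_view(fh, disp, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

          /* collective write of the local block at the position given by the view */
          MPI_File_write_all(fh, buf, 100, MPI_INT, MPI_STATUS_IGNORE);

          MPI_File_close(&fh);
          MPI_Finalize();
          return 0;
      }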

    2.1.5 MPI remote memory access

    Remote memory access (RMA) operations allow access to the address space of other processes without participation of the other process. The implementation of this concept can either be in hardware, such as in the CRAY T3E, or in software via additional threads waiting for requests. The advantages of these operations are that the protocol overhead is much lower than for normal send and receive operations, and that no polling or global communication is required for setting up communication, such as in unstructured grid applications and multiparticle applications.

    In contrast to explicit message passing, where synchronization happens implicitly, accesses via RMA operations need to be protected by explicit synchronization operations.

    RMA communication in MPI is based on the window concept. Each process has to execute a collective routine that defines a window, i.e. the part of its address space that can be accessed by other processes.

    The actual access is performed via put and get operations. The address is defined by the target process number and the displacement relative to the starting address of the window for that process.

    MPI also provides special synchronization operations relative to a window. The MPI_Win_fence operation synchronizes all processes that make some address ranges accessible to other processes. It is a collective operation that ensures that all RMA operations started before the fence operation terminate before the target process executes the fence operation, and that all RMA operations of a process executed after the fence operation are executed after the target process has executed the fence operation.
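
    A small sketch of window-based RMA communication (added for illustration, assuming an MPI-2 implementation): each process exposes one integer and puts its rank into the window of its right neighbour, with fence operations delimiting the access epoch:

      /* Sketch: MPI one-sided communication with a window and fences. */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int rank, size, local = -1;
          MPI_Win win;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          /* collective window definition: expose one int of local memory */
          MPI_Win_create(&local, sizeof(int), sizeof(int),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &win);

          MPI_Win_fence(0, win);                 /* open the access epoch          */
          /* put this rank at displacement 0 of the right neighbour's window */
          MPI_Put(&rank, 1, MPI_INT, (rank + 1) % size, 0, 1, MPI_INT, win);
          MPI_Win_fence(0, win);                 /* complete all RMA operations    */

          printf("process %d received %d\n", rank, local);

          MPI_Win_free(&win);
          MPI_Finalize();
          return 0;
      }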

    2.2 OpenMP

    OpenMP [5,12] is a directive-based programming interface for the shared memory programming model. It is the result of an effort to standardize the different programming interfaces on the target systems. OpenMP is a set of directives and runtime routines for Fortran 77 (1997) and a corresponding set of pragmas for C and C++ (1998). An extension of OpenMP for Fortran 95 is under investigation.

    Directives are comments that are interpreted by the compiler. Directives have the advantage that the code is still a sequential code that can be executed on sequential machines, and thus no two versions, a sequential and a parallel one, need to be maintained.

    Directives start and terminate parallel regions. When the master thread hits a parallel region, a team of threads is created or activated. The threads execute the code in parallel and are synchronized at the beginning and the end of the computation. After the final synchronization the master thread continues execution after the parallel region. The main directives are:

    PARALLEL DO specifies a loop that can be executed in parallel. The DO loop's iterations can be distributed in various ways, including STATIC(CHUNK), DYNAMIC(CHUNK), and GUIDED(CHUNK), onto the set of threads (as defined in the OpenMP standard); a C sketch follows after this list. STATIC(CHUNK) distribution means that the set of iterations is consecutively distributed onto the threads in blocks of CHUNK size (resulting in block and cyclic distributions). DYNAMIC(CHUNK) distribution implies that iterations are distributed in blocks of CHUNK size to threads on a first-come-first-served basis. GUIDED(CHUNK) means that blocks of exponentially decreasing size are assigned on a first-come-first-served basis. The size of the smallest block is determined by CHUNK size.

    PARALLEL SECTION starts a set of sections that are executed in parallel by a team of threads.

    PARALLEL REGION introduces a code region that is executed redundantly by the threads. It has to be used very carefully, since assignments to global variables will lead to conflicts among the threads and possibly to nondeterministic behavior.

    PDO is a work-sharing construct and may be used within a parallel region. All the threads executing the parallel region have to cooperate in the execution of the parallel loop. There is no implicit synchronization at the beginning of the loop, but a synchronization at the end. After the final synchronization all threads continue after the loop in the replicated execution of the program code. The main advantage of this approach is that the overhead for starting up the threads is eliminated. The team of threads exists during the execution of the parallel region and need not be built before each parallel loop.

    PSECTION is also a work-sharing construct that enforces that the current team of threads executing the surrounding parallel region cooperates in the execution of the parallel section.
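
    In the C binding mentioned above, the PARALLEL DO directive with a STATIC(CHUNK) distribution corresponds to a parallel for pragma with a schedule clause; a minimal sketch (added for illustration):

      /* Sketch: parallel loop with a static block-cyclic schedule.
         Iterations are handed to the team in blocks of 16. */
      #include <omp.h>

      void scale(double *a, int n, double factor)
      {
          int i;
          #pragma omp parallel for schedule(static, 16)
          for (i = 0; i < n; i++)
              a[i] *= factor;          /* each thread updates its own iterations */
      }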

    Program data can either be shared or private. While threads do have their own copy of private data, only one copy exists of shared data. This copy can be accessed by all threads. To ensure program correctness, OpenMP provides special synchronization constructs. The main constructs are barrier synchronization, enforcing that all threads have reached this synchronization operation before execution continues, and critical sections. Critical sections ensure that only a single thread can enter the section, and thus data accesses in such a section are protected from race conditions. A common situation for a critical section is the accumulation of values. Since an accumulation consists of a read and a write operation, unexpected results can occur if both operations are not surrounded by a critical section.
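
    A sketch of such an accumulation protected by a critical section (added for illustration; in practice a reduction clause would usually be more efficient):

      /* Sketch: shared accumulator protected by a critical section. */
      double sum_squares(const double *a, int n)
      {
          double sum = 0.0;            /* shared by the whole team               */
          int i;
          #pragma omp parallel for
          for (i = 0; i < n; i++) {
              double term = a[i] * a[i];   /* private to each thread             */
              #pragma omp critical
              sum += term;                 /* read and update without races      */
          }
          return sum;
      }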

    3 Parallel Debugging

    Debugging parallel programs is more difficult than debugging sequential programs, not only because multiple processes or threads need to be taken into account, but also because program behaviour might not be deterministic and might not be reproducible. These problems are not attacked by current state-of-the-art commercial parallel debuggers; only the first one is alleviated. Current debuggers provide menus, displays, and commands that allow the programmer to inspect individual processes and to execute commands on individual or all processes.

    The most widely used debugger is TotalView from Etnus Inc. [14]. TotalView provides breakpoint definition, single stepping, and variable inspection via an interactive interface. The programmer can execute those operations for individual processes and groups of processes. TotalView also provides some means to summarize information, such that identical information from multiple processes is combined and not repeated redundantly.

4 Performance Analysis

Performance analysis is an iterative subtask during program development. The goal is to identify program regions that do not perform well. Performance analysis is structured into four phases:

1. Measurement
   Performance analysis is based on information about runtime events gathered during program execution. Basic events are, for example, cache misses, the termination of a floating point operation, or the start and stop of a subroutine or message passing operation. The information on individual events can either be summarized during program execution, or individual trace records can be collected for each event.
   Summary information has the advantage of being of moderate size, while trace information tends to be very large. The disadvantage is that it is not as fine grained; the behavior of individual instances of subroutines can, for example, not be investigated since all the information has been summed up.

2. Analysis
   During analysis the collected runtime data are inspected to detect performance problems. Performance problems are based on performance properties, such as the existence of message passing in a program region, which have a condition for identifying them and a severity function that specifies their importance for program performance (a minimal sketch of such a property is given after this list).
   Current tools support the user in checking the conditions and the severity by visualizing program behavior. Future tools might be able to detect performance properties automatically based on a specification of possible properties.
   During analysis the programmer applies a threshold. Only performance properties whose severity exceeds this threshold are considered to be performance problems.

3. Ranking
   During program analysis the severest performance problems need to be identified. This means that the problems need to be ranked according to their severity. The most severe problem is called the program bottleneck. This is the problem the programmer tries to resolve by applying appropriate program transformations.

4. Refinement
   The performance problems detected in the previous phases might not be precise enough to allow the user to start optimization. At the beginning of performance analysis, summary data can be used to identify critical regions. Those summary data might not be sufficient to identify why, for example, a region has high message passing overhead. The reason, e.g. very big messages or load imbalance, might be identified only with more detailed information. Therefore the performance problem should be refined into hypotheses about the real reason, and additional information should be collected in the next performance analysis cycle.
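To make the notions of a performance property, its condition, and its severity function more concrete, here is a minimal C sketch. All names (RegionProfile, comm_time, the 5 % threshold, the example numbers) are illustrative assumptions and are not taken from any particular tool.

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative only: a per-region profile record as a tool might collect it. */
    typedef struct {
        const char *region;     /* e.g. subroutine name            */
        double exec_time;       /* total execution time in seconds */
        double comm_time;       /* time spent in message passing   */
    } RegionProfile;

    /* Condition of the property "message passing in a region":
       communication time was observed at all. */
    static bool has_message_passing(const RegionProfile *r)
    {
        return r->comm_time > 0.0;
    }

    /* Severity: fraction of the total run time lost to communication
       inside this region. */
    static double severity(const RegionProfile *r, double total_time)
    {
        return r->comm_time / total_time;
    }

    /* A region becomes a performance problem only if its severity
       exceeds the user-chosen threshold (here 0.05 = 5 %). */
    static bool is_problem(const RegionProfile *r, double total_time,
                           double threshold)
    {
        return has_message_passing(r) && severity(r, total_time) > threshold;
    }

    int main(void)
    {
        RegionProfile regions[] = {
            { "exchange_halo", 12.0, 9.5 },   /* made-up example numbers */
            { "compute",       80.0, 0.0 },
        };
        double total = 100.0, threshold = 0.05;
        for (int i = 0; i < 2; i++)
            if (is_problem(&regions[i], total, threshold))
                printf("%s: severity %.2f\n", regions[i].region,
                       severity(&regions[i], total));
        return 0;
    }

Ranking then simply means sorting the regions by the value returned by severity() and treating the largest one as the candidate bottleneck.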

Current techniques for performance data collection are profiling and tracing. Profiling collects summary data only. This can be done via sampling: the program is regularly interrupted, e.g. every 10 ms, and the information is added up for the source code location that was being executed at that moment. For example, the UNIX profiling tool prof applies this technique to determine the fraction of the execution time spent in individual subroutines.

A more precise profiling technique is based on instrumentation, i.e. special calls to a monitoring library are inserted into the program. This can either be done in the source code by the compiler or by specialized tools, or it can be done in the object code. While the first approach allows more types of regions to be instrumented, for example loops and vector statements, the latter allows data to be measured for programs where no source code is available. The monitoring library collects the information and adds it to special counters for the specific region.
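The following self-contained C sketch illustrates what such instrumentation amounts to. The monitor_enter/monitor_exit calls and the per-region counters are hypothetical stand-ins for a real monitoring library, inserted here by hand where a compiler or tool would insert them.

    #include <stdio.h>
    #include <time.h>

    /* Minimal stand-in for a monitoring library; the names and the simple
       counter-per-region design are assumptions made for this example. */
    #define N_REGIONS 16
    static double  region_time[N_REGIONS];
    static clock_t region_start[N_REGIONS];

    static void monitor_enter(int id) { region_start[id] = clock(); }
    static void monitor_exit(int id)
    {
        region_time[id] += (double)(clock() - region_start[id]) / CLOCKS_PER_SEC;
    }

    enum { REGION_COMPUTE = 0 };

    /* After instrumentation, the inserted calls bracket the routine body. */
    static void compute(void)
    {
        monitor_enter(REGION_COMPUTE);    /* inserted by compiler or tool */
        /* ... original body of the routine ... */
        monitor_exit(REGION_COMPUTE);     /* inserted by compiler or tool */
    }

    int main(void)
    {
        compute();
        printf("region %d: %.6f s\n", REGION_COMPUTE, region_time[REGION_COMPUTE]);
        return 0;
    }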

Tracing is a technique that collects information for each event. This results, for example, in very detailed information for each instance of a subroutine and for each message sent to another process. The information is stored in specialized trace records for each event type. For example, for each start of a send operation the time stamp, the message size, and the target process can be recorded, while for the end of the operation the time stamp and the bandwidth are stored.

The trace records are stored in the memory of each process and are output either because the buffer is filled up or because the program terminated. The individual trace files of the processes are merged into one trace file ordered according to the time stamps of the events.
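As an illustration, trace records for these events could be laid out as in the following C sketch. The field names, the union layout, and the buffer size are assumptions made for this example; every real tool defines its own record format.

    #include <stdint.h>

    /* Illustrative trace-record layout. */
    typedef enum { EV_SEND_START, EV_SEND_END, EV_ENTER, EV_EXIT } EventType;

    typedef struct {
        uint64_t  timestamp;    /* clock ticks when the event occurred    */
        int32_t   process;      /* rank of the process writing the record */
        EventType type;
        union {
            struct { int32_t target; int32_t size; } send_start; /* dest, bytes */
            struct { double bandwidth; } send_end;
            struct { int32_t region; } region;                   /* enter/exit  */
        } data;
    } TraceRecord;

    /* Records are buffered per process and flushed when the buffer is full
       or at program end; afterwards the per-process files are merged by
       ascending timestamp into a single trace file. */
    #define TRACE_BUFFER_LEN 4096
    static TraceRecord trace_buffer[TRACE_BUFFER_LEN];
    static int trace_fill;   /* number of records currently buffered */

    static void record_event(TraceRecord r)
    {
        if (trace_fill < TRACE_BUFFER_LEN)
            trace_buffer[trace_fill++] = r;
        /* else: flush the buffer to the per-process trace file (omitted) */
    }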

The following sections describe the performance analysis tools available on the CRAY T3E.

4.1 The CRAY T3E Performance Analysis Environment

The programming environment of the CRAY T3E supports performance analysis via interactive tools. Cray itself provides two tools, Apprentice and PAT, which are both based on summary information. In addition, Apprentice accesses source code information to map the performance information to the statements of the original source code. Besides these two tools, programmers can use VAMPIR, a trace analysis tool developed at our institute. Within a collaboration with Cray, instrumentation and trace generation for VAMPIR is being integrated into the next version of PAT.

4.2 Compiler Information File

The F90 and C compilers on the CRAY T3E generate, on request, a compiler information file (CIF) for each source file. This file includes information about the compilation process (applied compiler options, target machine characteristics, and compiler messages) and source information of the compilation units (procedure information, symbol information, information about loops and statement types, and cross-reference information for each symbol in the source file).

Apprentice requires this information to link the performance information back to the source code. CIFs are initially in ASCII format but can be converted to binary format. The information can be easily accessed in both formats via a library interface.

4.3 Apprentice

Apprentice is a post-execution performance analysis tool for message passing programs3. Originally it was designed to support the CRAFT programming model on the CRAY T3E predecessor system, the CRAY T3D. Apprentice analyzes summary information collected at runtime via an instrumentation of the source program.

The instrumentation is performed by the compiler and is triggered via an appropriate compiler switch. To reduce the overhead of the instrumentation, the programmer can selectively compile source files with and without instrumentation. The instrumentation is done in a late phase of the compilation, after all optimizations have already occurred. This prevents the instrumentation from affecting the way the code is compiled. During runtime, summary information is collected on each processor for each basic block. This information comprises:

• execution time
• number of floating point, integer, and load/store operations
• instrumentation overhead

For each subroutine call the execution time as well as the pass count is determined. At the end of a program run, the information of all processors is summed up and written to the runtime information file (RIF). In addition to the summed-up execution times and pass counts of subroutine calls, their mean value and standard deviation as well as the minimum and maximum values are stored.

The size of the resulting RIF is typically less than one megabyte. But the overhead due to the instrumentation can easily be a factor of two, which results from instrumenting each basic block. This severe drawback of the instrumentation is partly compensated in Apprentice by correcting the timings based on the measured overhead.

When the user starts Apprentice to analyze the collected information, the tool first reads the RIF as well as the CIFs of the individual source files. The performance data measured for the optimized code are related back to the original source code. Apprentice distinguishes between:

• parallel work: user-level subroutines
• I/O: system subroutines for performing I/O
• communication overhead: MPI and PVM routines, SHMEM routines
• uninstrumented code

The available barcharts allow the user to identify critical code regions that take most of the execution time or that have a lot of I/O and communication overhead. Since all the values have been summed up, no specific behavior of individual processors can be identified. Load balance problems can be detected by inspecting the execution times of calls to synchronization subroutines, such as global sums or barriers. Based on the available information, the processors with the lowest and the highest execution time can be identified.

While Apprentice does not evaluate the hardware performance counters of the DEC Alpha, it estimates the loss due to cache misses and suboptimal use of the functional units. Based on the number of instructions and a very simple cost model (fixed cycles for each type of instruction) it determines the loss as the difference between the estimated optimal and the measured execution time.
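A minimal sketch of such a fixed-cycles cost model is given below. The operation classes follow the counts listed above, but the cycle costs are placeholder numbers chosen for illustration; they are not the values Apprentice actually uses.

    /* Estimated optimal time = sum(count_i * cycles_i) / clock frequency;
       the loss is the measured time minus this estimate. */
    typedef struct {
        long flops, int_ops, loads_stores;
        double measured_time;           /* seconds, from the instrumented run */
    } BlockCounts;

    static double estimated_loss(const BlockCounts *b, double clock_hz)
    {
        const double cyc_flop = 1.0, cyc_int = 1.0, cyc_mem = 2.0;  /* placeholders */
        double optimal = (b->flops * cyc_flop + b->int_ops * cyc_int
                          + b->loads_stores * cyc_mem) / clock_hz;
        return b->measured_time - optimal;  /* time lost to cache misses etc. */
    }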

4.4 VAMPIR

VAMPIR (Visualization and Analysis of MPI Resources) is an event trace analysis tool11 which was developed by the Central Institute for Applied Mathematics of the Research Centre Jülich and is now commercially distributed by a German company named PALLAS. Its main application area is the analysis of parallel programs based on the message passing paradigm, but it has also been used successfully in other areas (e.g., for SVM-Fortran traces to analyze shared virtual memory page transfer behavior9 or to analyze CRAY T3E usage based on accounting data). VAMPIR has three components:

• The VAMPIR tool itself is a graphical event trace browser implemented for the X11 Window System using the Motif toolkit. It is available for any major UNIX platform.

• The VAMPIR runtime library provides an API for collecting, buffering, and generating event traces as well as a set of wrapper routines for the most commonly used MPI and PVM communication routines which record message traffic in the event trace.

• In order to observe functions or subroutines in the user program, their entry and exit have to be instrumented by inserting calls to the VAMPIR runtime library. Observing message passing functions is handled by linking the program with the VAMPIR wrapper function library.

VAMPIR comes with a source instrumenter for ANSI Fortran 77. Programs written in other programming languages (e.g., C or C++) have to be instrumented manually. To improve this situation, our institute, in collaboration with CRAY Research, is currently implementing an object code instrumenter for the CRAY T3E. This is described in the next section.

During the execution of the instrumented user program, the VAMPIR runtime library records entries to and exits from instrumented user and message passing functions as well as the sending and receiving of messages. For each message, its tag, communicator, and length are recorded. Through the use of a configuration file, it is possible to switch the runtime observation of specific functions on and off. This way, the program does not have to be re-instrumented and re-compiled for every change in the instrumentation.

Large parallel programs consist of several dozens or even hundreds of functions. To ease the analysis of such complex programs, VAMPIR arranges the functions into groups, e.g., user functions, MPI routines, I/O routines, and so on. The user can control and change the assignment of functions to groups and can also define new groups.

VAMPIR provides a wide variety of graphical displays to analyze the recorded event traces:

• The dynamic behavior of the program can be analyzed by timeline diagrams for either the whole program or a selected set of nodes. By default, the displays show the whole event trace, but the user can zoom in to any arbitrary region of the trace. Also, the user can change the display style of the lines representing messages based on their tag/communicator or their length. This way, message traffic of different modules or libraries can easily be separated visually.

• The parallelism display shows the number of nodes in each function group over time. This makes it easy to locate specific parts of the program, e.g., parts with heavy message traffic or I/O.

• VAMPIR also provides a large number of statistical displays. It calculates how often each function or group of functions got called and the time spent in them. Message statistics show the number of messages sent and the minimum, maximum, sum, and average length or transfer rate between any two nodes. The statistics can be displayed as barcharts, histograms, or textual tables. A very useful feature of VAMPIR is that the statistic displays can be linked to the timeline diagrams. By this, statistics can be calculated for any arbitrary, user-selectable part of the program execution.

• If the instrumenter/runtime library provides the necessary information in the event trace header, the information provided by VAMPIR can be related back to source code. VAMPIR provides a source code display and a call graph display to show selected functions or the location of the send and the receive of a selected message.

In summary, VAMPIR is a very powerful and highly configurable event trace browser. It displays trace files in a variety of graphical views and provides flexible filter and statistical operations that condense the displayed information to a manageable amount. Rapid zooming and instantaneous redraw allow the user to identify and focus on the time interval of interest.

4.5 PAT

PAT (Performance Analysis Tool) is the second performance tool available from CRAY Research for the CRAY T3E. The two main differences to Apprentice are, first, that no source code instrumentation or special compiler support is necessary; the user only needs to re-link his/her application against the PAT runtime library (because CRAY Unicos does not support dynamic linking). Second, PAT aims at keeping the additional overhead of measuring and observing program behavior as low as possible. PAT is actually three performance tools in one:

1. PAT allows the user to get a rough overview of the performance of the parallel program through a method called sampling, i.e., interrupting the program at regular intervals and evaluating the program counter. PAT can then calculate the percentage of time spent in each function. The sampling rate can be changed by the user to adapt it to the execution time of the program and to keep the overhead low. Because the sampling method provides only a statistical estimate of the actual time spent in a function, the tool also provides a measure of confidence in the sampling estimate.
   In addition, PAT determines the total, user, and system time of the execution run, the number of cache misses, and the number of either floating point, integer, store, or load operations. These are measured through the DEC Alpha hardware counters. The user can select the hardware counter by setting an environment variable. All this statistical information is stored after the execution in a so-called Performance Information File (PIF).

2. If a more detailed analysis is necessary, PAT can be used to instrument and analyze a specific function or set of functions in a second phase. PAT can instrument object code (however, only at the function level). This is a big advantage, especially for large complex programs, because they do not have to be re-compiled for instrumentation. In addition, it is possible to analyze functions contained in system or third-party libraries. A third advantage is that programs written in more than one language can be handled. The big disadvantage is that it is more difficult to relate the results back to the source code.
   This detailed investigation of function behavior is called a Call Site Report by PAT. It records for each call site of the instrumented functions how often it got called and the time spent in this instantiation of the function. Execution times are measured with a high-resolution timer. The results are available for each CPU used in the parallel program. The next version of PAT will allow hardware counter statistics to be gathered for instrumented functions as well.

3. Last, if a very detailed analysis of the program behavior is necessary, PAT also supports event tracing. The object instrumenter of PAT can also be used to insert calls to entry and exit trace routines around calls to user or library routines. Entry and exit trace routines can be provided in two ways:

   • The user can supply function-specific wrapper functions. The routines must be written in C, they must have the same number and types of arguments as the routine they are tracing, and, finally, the wrapper function name for a function func must be func_trace_entry for entry trace routines and func_trace_exit for exit trace routines (a sketch of such wrapper routines is given at the end of this item).

   • If specific wrapper routines for the requested function are not available, PAT uses generic wrapper code which just records the entry and exit of the function in the event trace.

   In addition, PAT provides extra tracing runtime system calls, which can be inserted in the source code and allow tracing to be switched on and off and additional information to be inserted into the trace (e.g., information unrelated to functions).
   The tracing features of PAT were developed in a collaboration of Research Centre Jülich with CRAY Research. Our institute implemented all the necessary special wrapper functions for all message passing functions available on the T3E (MPI, PVM, and SHMEM, for both the C and Fortran interfaces) which record the message traffic in the event trace. In addition, we implemented a tool for converting the event traces contained in PIF files into VAMPIR trace format.
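As an illustration of the first variant, here is a sketch of user-supplied wrapper routines for a hypothetical traced function double dot(double *x, double *y, int n). The underscore form of the wrapper names is assumed from the naming convention quoted above, and record_trace() is a made-up stand-in for whatever the wrapper actually writes into the event trace; it is not a PAT routine.

    #include <stdio.h>

    /* Hypothetical stand-in for recording an event; a real wrapper would
       emit a trace record instead of printing. */
    static void record_trace(const char *what, const char *name)
    {
        printf("%s %s\n", what, name);
    }

    /* Entry wrapper: same argument list as the traced routine
       double dot(double *x, double *y, int n). */
    void dot_trace_entry(double *x, double *y, int n)
    {
        record_trace("entry", "dot");
        (void)x; (void)y; (void)n;   /* arguments could be inspected/recorded here */
    }

    /* Exit wrapper, called after dot() returns. */
    void dot_trace_exit(double *x, double *y, int n)
    {
        record_trace("exit", "dot");
        (void)x; (void)y; (void)n;
    }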

The major drawback of PAT's object instrumentation is the very low-level interface for specifying the functions to be instrumented. The user has to specify the function names as they appear in the object code, i.e., C++ functions or F90 functions which are local or contained in modules have to be specified in their mangled form (e.g., "0FDfooPd" instead of the C++ function name "int foo(double *)"). Clearly, a more user-friendly or automatic instrumenter interface needs to be added to PAT.

In addition, the combination of three different instrumentation/analysis techniques in a single tool is confusing for users. This confusion is further increased because the supported techniques overlap with the techniques applied in the other tools.

4.6 Summary

The previous sections pointed out that the CRAY T3E has a programming environment that includes the most advanced performance analysis tools. On the other hand, each of these tools comes with its own instrumentation, provides partially overlapping information, and has a totally different user interface. The programmer has to understand the advantages and disadvantages of all the tools to be able to select and apply the right ones. Table 1 summarizes the main features of these three tools.

                    Apprentice         VAMPIR             PAT (sampling)    PAT (call sites)   PAT (tracing)
data collection     instrumentation    source instru-     sampling          object code        object code
                    via compiler       mentation by                         instrumentation    instrumentation
                                       preprocessor
intrusion           high / corrected   high               low               high               high
selective instr.    not required,      required           -                 required           required
                    optional
selection           compiler switch    GUI or ASCII       -                 ASCII interface    ASCII interface
interface                              file                                 or file            or file
level of detail     summary            event trace        statistics        summary            event trace
                    information                                             information
information         total time and     subroutine         statistical       total time and     subroutine
                    oper. counts for   start/stop and     distr. of time    pass counts for    start/stop and
                    basic blocks       send/recv events   (subroutines)     call sites         send/recv events
                    and call sites
strength            source-level,      many displays      low overhead      total time for     object code
                    analysis of        for message        profiling         call sites, no     instr. for
                    loops              passing history                      recompilation      VAMPIR, no
                                       and for                                                 recompilation
                                       statistics of
                                       arbitrary
                                       execution phases

Table 1. User interface and properties of performance analysis tools on CRAY T3E

5 Summary

This article gave an overview of parallel programming models as well as programming tools. Parallel programming will always be a challenge for programmers. Higher-level programming models and appropriate programming tools only facilitate the process; they do not make it a simple task.

While programming in MPI offers the greatest potential performance, shared memory programming with OpenMP is much more comfortable due to the global style of the resulting program. The sequential control flow among the parallel loops and regions matches much better the sequential programming model that all programmers are trained for.

Although programming tools have been developed over many years, the current situation is not very satisfactory. Program debugging is done on a per-thread basis, a technique that does not scale to larger numbers of processors. Performance analysis tools also suffer from scalability limitations and, in addition, are complicated to use. Programmers have to be experts in performance analysis to understand potential performance problems, their proof conditions, and their severity. In addition, they have to master powerful but also complex user interfaces.

Future research in this area has to try to automate performance analysis tools, such that frequently occurring performance problems can be identified automatically. It is the goal of the ESPRIT IV working group APART on Automatic Performance Analysis: Resources and Tools to investigate base technologies for future, more intelligent tools1. A first result of this work is a collection of performance problems for parallel programs that have been formalized with the ASL, the APART Specification Language6. This approach will lead to a formal representation of the knowledge applied in the manually executed performance analysis process and thus will make this knowledge accessible for automatic processing.

A second important trend that will affect parallel programming in the future is the move towards clustered shared memory systems. Within the GoSMP study carried out for the Federal Ministry for Education and Research, its influence on program development is investigated. Clearly, a hybrid programming approach will be applied on those systems for best performance, combining message passing between the individual SMP nodes with shared memory programming within a node. This programming model will lead to even more complex programs, and program development tools have to be enhanced to be able to help the user in developing such codes.

References

1. APART: ESPRIT Working Group on Automatic Performance Analysis Resources and Tools, www.fz-juelich.de/apart, 1999.
2. D.P. Bertsekas, J.N. Tsitsiklis: Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, ISBN 0-13-648759-9, 1989.
3. Cray Research: Cray MPP Fortran: Reference Manual, Cray SR-2504, 1994.
4. D.E. Culler, J.P. Singh, A. Gupta: Parallel Computer Architecture - A Hardware/Software Approach, Morgan Kaufmann Publishers, ISBN 1-55860-343-3, 1999.
5. L. Dagum, R. Menon: OpenMP: An Industry-Standard API for Shared-Memory Programming, IEEE Computational Science & Engineering, 1998.
6. Th. Fahringer, M. Gerndt, G. Riley, J. Traff: Knowledge Specification for Automatic Performance Analysis, APART Technical Report, Research Centre Jülich FZJ-ZAM-9918, 1999.
7. I. Foster: Designing and Building Parallel Programs, Addison Wesley, ISBN 0-201-57594-9, 1994.
8. G. Fox: Domain Decomposition in Distributed and Shared Memory Environments, International Conference on Supercomputing, June 8-12, 1987, Athens, Greece, Lecture Notes in Computer Science 297, edited by Constantine Polychronopoulos, 1987.
9. M. Gerndt, A. Krumme, S. Ozmen: Performance Analysis for SVM-Fortran with OPAL, Proceedings Int. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA'95), Athens, Georgia, pp. 561-570, 1995.
10. MPI Forum: Message Passing Interface, www.mcs.anl.gov/mpi, 1999.
11. W.E. Nagel, A. Arnold, M. Weber, H.-C. Hoppe, K. Solchenbach: VAMPIR: Visualization and Analysis of MPI Resources, Supercomputer 63, Vol. 12, No. 1, pp. 69-80, 1996.
12. OpenMP Forum: OpenMP Standard, www.openmp.org, 1999.
13. M. Snir, St. Otto, St. Huss-Lederman, D. Walker, J. Dongarra: MPI - The Complete Reference, MIT Press, ISBN 0-262-69216-3, 1998.
14. Etnus Inc.: TotalView, www.etnus.com/products/totalview/index.html, 1999.

BASIC NUMERICAL LIBRARIES FOR PARALLEL SYSTEMS

I. GUTHEIL

John von Neumann Institute for Computing
Central Institute for Applied Mathematics
Research Centre Jülich, 52425 Jülich, Germany
E-mail: [email protected]

Three public domain libraries with basic numerical operations for distributed memory parallel systems are presented: ScaLAPACK, PLAPACK, and Global Arrays. They are compared not only with respect to performance on the CRAY T3E but also with respect to user-friendliness.

1 Introduction

There are many projects for the parallelization of numerical software. An overview of public domain libraries for high performance computing can be found on the HPC-Netlib homepage1. Often these libraries are very specialized, either concerning the problem which is treated or the platform on which they run. Three of the more general packages based on message passing with MPI will be presented here in some detail: ScaLAPACK2, PLAPACK3, and Global Arrays4.

Often there is an MPI implementation on shared-memory multiprocessor systems, hence libraries based on MPI can also be used there, and often very efficiently, as message-passing programs take great care of data locality.

1.1 ScaLAPACK, Scalable Linear Algebra PACKage

The largest and most flexible public domain library with basic numerical operations for distributed memory parallel systems up to now is ScaLAPACK. Within the ScaLAPACK project many LAPACK5 routines were ported to distributed memory computers using message passing.

The communication in ScaLAPACK is based on the BLACS (Basic Linear Algebra Communication Subroutines)6. Public domain versions of the BLACS based on MPI and PVM are available. For the CRAY T3E there is also a version of the BLACS in the Cray scientific libraries (libsci)7 using Cray shared memory routines (shmem), which is often faster than the public domain versions.

The basic routines of ScaLAPACK are the PBLAS (Parallel Basic Linear Algebra Subroutines). They contain parallel versions of the BLAS (BLAS 1 contains vector-vector operations, e.g. the dot product; BLAS 2 matrix-vector operations, e.g. matrix-vector multiplication; BLAS 3 matrix-matrix operations, e.g. matrix-matrix multiplication), which are parallelized using the BLACS for communication and sequential BLAS for computation. As most vendors offer optimized sequential BLAS, the BLAS and PBLAS deliver very good performance on most parallel computers.

Based on the BLACS and PBLAS, ScaLAPACK contains parallel solvers for dense linear systems and for linear systems with banded system matrix as well as parallel routines for the solution of linear least squares problems and for singular value decomposition. Routines for the computation of all or some of the eigenvalues and eigenvectors of dense real symmetric matrices and dense complex Hermitian matrices and for the generalized symmetric definite eigenproblem are also included in ScaLAPACK.

ScaLAPACK also contains additional libraries to treat distributed matrices and vectors. One of them is the TOOLS library, which offers useful routines, for example, to find out which part of the global matrix a local process has in its memory or to find out the global index of a matrix element corresponding to its local index and vice versa. Unfortunately these routines are documented only in the source code and not in the Users' Guide. Another library is the REDIST library, which is documented in the ScaLAPACK Users' Guide. It contains routines to copy any block-cyclically distributed (sub)matrix to any other block-cyclically distributed (sub)matrix.

ScaLAPACK is a Fortran 77 library which uses C subroutines internally to allocate additional workspace. Especially the PBLAS allocate additional workspace.
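To give an impression of the local/global bookkeeping handled by the TOOLS routines (such as NUMROC and INDXG2L), the following C sketch writes out the standard block-cyclic index arithmetic for one dimension; the same formulas are applied independently to rows and columns. It is only an illustration of the mapping, not a replacement for the library routines, and it uses 0-based indices while the Fortran routines are 1-based.

    #include <stdio.h>

    /* Process (row or column) owning global index ig, for block size nb,
       nprocs processes, and srcproc holding the first block. */
    static int owner(int ig, int nb, int nprocs, int srcproc)
    {
        return (srcproc + ig / nb) % nprocs;
    }

    /* Local index of global index ig on its owning process. */
    static int global_to_local(int ig, int nb, int nprocs)
    {
        return (ig / (nb * nprocs)) * nb + ig % nb;
    }

    /* Number of rows (or columns) of an n-element dimension stored by
       process iproc -- the quantity that NUMROC computes. */
    static int local_count(int n, int nb, int iproc, int srcproc, int nprocs)
    {
        int nblocks = n / nb;                        /* full blocks           */
        int mydist  = (nprocs + iproc - srcproc) % nprocs;
        int count   = (nblocks / nprocs) * nb;       /* evenly divided blocks */
        int extra   = nblocks % nprocs;
        if (mydist < extra)       count += nb;       /* one more full block   */
        else if (mydist == extra) count += n % nb;   /* the trailing partial  */
        return count;
    }

    int main(void)
    {
        /* Example: global index 1000 of a dimension of length 4000, block
           size 64, 8 processes in this grid direction, first block on 0. */
        printf("owner = %d, local index = %d, elements on process 0 = %d\n",
               owner(1000, 64, 8, 0),
               global_to_local(1000, 64, 8),
               local_count(4000, 64, 0, 0, 8));
        return 0;
    }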

1.2 PLAPACK, Parallel Linear Algebra PACKage

PLAPACK does not offer as many black-box solvers as ScaLAPACK but is designed as a parallel infrastructure for developing routines that solve linear algebra problems. With PLAPACK routines the user can create global matrices, vectors, and multiscalars and fill them with values with the help of an API (Application Programming Interface). To make the development of programs easier and to obtain good performance, PLAPACK includes parallel versions of most real BLAS routines as well as solvers for real dense linear systems using LU decomposition and for real symmetric positive definite systems applying Cholesky decomposition, all of which operate on the global data.

PLAPACK is a C library with a Fortran interface in Release 1.2. Unfortunately, up to now Release 1.2 does not run on the CRAY T3E.

1.3 Global Arrays

Like PLAPACK, Global Arrays supplies the user with global linear algebra objects and an interface to fill them with data. For the solution of linear equations there is an interface to special ScaLAPACK routines, which can be modified to use other ScaLAPACK routines, too. For the solution of the full symmetric eigenvalue problem, Global Arrays contains an interface to PeIGS8.

Global Arrays is a Fortran 77 library which uses a memory allocator library for dynamic memory management.

2 Data Distributions

There are many ways to distribute data, especially matrices, to processors. In the ScaLAPACK Users' Guide many of them are presented and discussed.

All three libraries described here distribute matrices to a two-dimensional processor grid, but they do it in different ways, and the user can influence the data distribution to a greater or lesser extent. Users who only want to use routines from the libraries do not have to care about the way data are distributed in PLAPACK and Global Arrays; the distribution is done automatically. To use ScaLAPACK, however, the user has to create and fill the local parts of the global matrix on his own.

2.1 ScaLAPACK: Two-dimensional Block-Cyclic Distribution

For performance and load balancing reasons, ScaLAPACK has chosen a two-dimensional block-cyclic distribution for full matrices (see the ScaLAPACK Users' Guide). First the matrix is divided into blocks of size MB × NB. These blocks are then uniformly distributed across the NP × NQ processor grid in a cyclic manner. As a result, every process owns a collection of blocks, which are