
PT-Scotch and libScotch 5.1 User's Guide
(version 5.1.11)

François Pellegrini
Bacchus team, INRIA Bordeaux Sud-Ouest
ENSEIRB & LaBRI, UMR CNRS 5800
Université Bordeaux I
351 cours de la Libération, 33405 TALENCE, FRANCE
[email protected]

November 17, 2010

Abstract

This document describes the capabilities and operations of PT-Scotch and libScotch, a software package and a software library which compute parallel static mappings and parallel sparse matrix block orderings of graphs. It gives brief descriptions of the algorithms, details the input/output formats and instructions for use, describes the installation procedure, and provides a number of examples. PT-Scotch is distributed as free/libre software, and has been designed such that new partitioning or ordering methods can be added in a straightforward manner. It can therefore be used as a testbed for the easy and quick coding and testing of such new methods, and may also be redistributed, as a library, along with third-party software that makes use of it, either in its original or in updated forms.

Contents

1 Introduction
  1.1 Static mapping
  1.2 Sparse matrix ordering
  1.3 Contents of this document

2 The Scotch project
  2.1 Description
  2.2 Availability

3 Algorithms
  3.1 Parallel static mapping by Dual Recursive Bipartitioning
    3.1.1 Static mapping
    3.1.2 Cost function and performance criteria
    3.1.3 The Dual Recursive Bipartitioning algorithm
    3.1.4 Partial cost function
    3.1.5 Parallel graph bipartitioning methods
    3.1.6 Mapping onto variable-sized architectures
  3.2 Parallel sparse matrix ordering by hybrid incomplete nested dissection
    3.2.1 Hybrid incomplete nested dissection
    3.2.2 Parallel ordering
    3.2.3 Performance criteria
  3.3 Changes from version 5.0

4 Files and data structures
  4.1 Distributed graph files

5 Programs
  5.1 Invocation
  5.2 File names
    5.2.1 Sequential and parallel file opening
    5.2.2 Using compressed files
  5.3 Description
    5.3.1 dgmap / dgpart
    5.3.2 dgord
    5.3.3 dgpart
    5.3.4 dgscat
    5.3.5 dgtst

6 Library
  6.1 Running at proper thread level
  6.2 Calling the routines of libScotch
    6.2.1 Calling from C
    6.2.2 Calling from Fortran
    6.2.3 Compiling and linking
    6.2.4 Machine word size issues
  6.3 Data formats
    6.3.1 Distributed graph format
    6.3.2 Block ordering format
  6.4 Strategy strings
    6.4.1 Using default strategy strings
    6.4.2 Parallel mapping strategy strings
    6.4.3 Parallel graph bipartitioning strategy strings
    6.4.4 Parallel ordering strategy strings
    6.4.5 Parallel node separation strategy strings
  6.5 Distributed graph handling routines
    6.5.1 SCOTCH_dgraphAlloc
    6.5.2 SCOTCH_dgraphInit
    6.5.3 SCOTCH_dgraphExit
    6.5.4 SCOTCH_dgraphFree
    6.5.5 SCOTCH_dgraphLoad
    6.5.6 SCOTCH_dgraphSave
    6.5.7 SCOTCH_dgraphBuild
    6.5.8 SCOTCH_dgraphGather
    6.5.9 SCOTCH_dgraphScatter
    6.5.10 SCOTCH_dgraphCheck
    6.5.11 SCOTCH_dgraphSize
    6.5.12 SCOTCH_dgraphData
    6.5.13 SCOTCH_dgraphGhst
    6.5.14 SCOTCH_dgraphHalo
    6.5.15 SCOTCH_dgraphHaloAsync
    6.5.16 SCOTCH_dgraphHaloWait
  6.6 Distributed graph mapping and partitioning routines
    6.6.1 SCOTCH_dgraphPart
    6.6.2 SCOTCH_dgraphMap
    6.6.3 SCOTCH_dgraphMapInit
    6.6.4 SCOTCH_dgraphMapExit
    6.6.5 SCOTCH_dgraphMapSave
    6.6.6 SCOTCH_dgraphMapCompute
  6.7 Distributed graph ordering routines
    6.7.1 SCOTCH_dgraphOrderInit
    6.7.2 SCOTCH_dgraphOrderExit
    6.7.3 SCOTCH_dgraphOrderSave
    6.7.4 SCOTCH_dgraphOrderSaveMap
    6.7.5 SCOTCH_dgraphOrderSaveTree
    6.7.6 SCOTCH_dgraphOrderCompute
    6.7.7 SCOTCH_dgraphOrderPerm
    6.7.8 SCOTCH_dgraphOrderCblkDist
    6.7.9 SCOTCH_dgraphOrderTreeDist
  6.8 Centralized ordering handling routines
    6.8.1 SCOTCH_dgraphCorderInit
    6.8.2 SCOTCH_dgraphCorderExit
    6.8.3 SCOTCH_dgraphOrderGather
  6.9 Strategy handling routines
    6.9.1 SCOTCH_stratInit
    6.9.2 SCOTCH_stratExit
    6.9.3 SCOTCH_stratSave
    6.9.4 SCOTCH_stratDgraphMap
    6.9.5 SCOTCH_stratDgraphMapBuild
    6.9.6 SCOTCH_stratDgraphOrder
    6.9.7 SCOTCH_stratDgraphOrderBuild
  6.10 Other data structure routines
    6.10.1 SCOTCH_dmapAlloc
    6.10.2 SCOTCH_dorderAlloc
  6.11 Error handling routines
    6.11.1 SCOTCH_errorPrint
    6.11.2 SCOTCH_errorPrintW
    6.11.3 SCOTCH_errorProg
  6.12 Miscellaneous routines
    6.12.1 SCOTCH_randomReset
  6.13 ParMeTiS compatibility library
    6.13.1 ParMETIS_V3_NodeND
    6.13.2 ParMETIS_V3_PartGeomKway
    6.13.3 ParMETIS_V3_PartKway

7 Installation
  7.1 Thread issues
  7.2 File compression issues
  7.3 Machine word size issues

8 Examples

    1 Introduction

    1.1 Static mapping

    The efficient execution of a parallel program on a parallel machine requires that

    the communicating processes of the program be assigned to the processors of the

    machine so as to minimize its overall running time. When processes have a limited

    duration and their logical dependencies are accounted for, this optimization problem

    is referred to as scheduling. When processes are assumed to coexist simultaneously

    for the entire duration of the program, it is referred to as mapping. It amounts to

    balancing the computational weight of the processes among the processors of the

    machine, while reducing the cost of communication by keeping intensively inter-

    communicating processes on nearby processors.

    In most cases, the underlying computational structure of the parallel programs

    to map can be conveniently modeled as a graph in which vertices correspond to

    processes that handle distributed pieces of data, and edges reflect data dependencies.

    The mapping problem can then be addressed by assigning processor labels to the

    vertices of the graph, so that all processes assigned to some processor are loaded

    and run on it. In a SPMD context, this is equivalent to the distribution across

    processors of the data structures of parallel programs; in this case, all pieces of data

    assigned to some processor are handled by a single process located on this processor.

    A mapping is called static if it is computed prior to the execution of the program.

    Static mapping is NP-complete in the general case [10]. Therefore, many studies

    have been carried out in order to find sub-optimal solutions in reasonable time,

    including the development of specific algorithms for common topologies such as the

    hypercube [8, 16]. When the target machine is assumed to have a communication

    network in the shape of a complete graph, the static mapping problem turns into the

    partitioning problem, which has also been intensely studied [3, 17, 25, 26, 40]. How-

    ever, when mapping onto parallel machines the communication network of which is

    not a bus, not accounting for the topology of the target machine usually leads to

worse running times, because simple cut minimization can induce more expensive
long-distance communication [16, 43]; the static mapping problem is gaining
popularity as most of the newer massively parallel machines have a strongly NUMA
architecture.

    1.2 Sparse matrix ordering

    Many scientific and engineering problems can be modeled by sparse linear systems,

    which are solved either by iterative or direct methods. To achieve efficiency with di-

    rect methods, one must minimize the fill-in induced by factorization. This fill-in is a

    direct consequence of the order in which the unknowns of the linear system are num-

    bered, and its effects are critical both in terms of memory and of computation costs.

    Because there always exist large problem graphs which cannot fit in the memory

    of sequential computers and cost too much to partition, it is necessary to resort to

    parallel graph ordering tools. PT-Scotch provides such features.

    1.3 Contents of this document

    This document describes the capabilities and operations of PT-Scotch, a software

    package devoted to parallel static mapping and sparse matrix block ordering. It is

    the parallel extension of Scotch, a sequential software package devoted to static

    mapping, graph and mesh partitioning, and sparse matrix block ordering. While

    both packages share a significant amount of code, because PT-Scotch transfers

    control to the sequential routines of the libScotch library when the subgraphs on

which it operates are located on a single processor, the two sets of routines have

distinct user's manuals. Readers interested in the sequential features of Scotch

    should refer to the Scotch User’s Guide [35].

    The rest of this manual is organized as follows. Section 2 presents the goals

    of the Scotch project, and section 3 outlines the most important aspects of the

    parallel partitioning and ordering algorithms that it implements. Section 4 defines

    the formats of the files used in PT-Scotch, section 5 describes the programs of

    the PT-Scotch distribution, and section 6 defines the interface and operations of

    the parallel routines of the libScotch library. Section 7 explains how to obtain

    and install the Scotch distribution. Finally, some practical examples are given in

    section 8.

    2 The Scotch project

    2.1 Description

    Scotch is a project carried out at the Laboratoire Bordelais de Recherche en In-

formatique (LaBRI) of the Université Bordeaux I, and now within the ScAlApplix

    project of INRIA Bordeaux Sud-Ouest. Its goal is to study the applications of graph

    theory to scientific computing, using a “divide and conquer” approach.

    It focused first on static mapping, and has resulted in the development of the

    Dual Recursive Bipartitioning (or DRB) mapping algorithm and in the study of

    several graph bipartitioning heuristics [33], all of which have been implemented in

    the Scotch software package [37]. Then, it focused on the computation of high-

    quality vertex separators for the ordering of sparse matrices by nested dissection, by

    extending the work that has been done on graph partitioning in the context of static

    mapping [38, 39]. More recently, the ordering capabilities of Scotch have been

extended to native mesh structures, thanks to hypergraph partitioning algorithms.

    New graph partitioning methods have also been recently added [6, 34]. Version

    5.0 of Scotch was the first one to comprise parallel graph ordering routines [7],

    and version 5.1 now offers parallel graph partitioning features, while parallel static

    mapping will be available in the next release.

    2.2 Availability

    Starting from version 4.0, which has been developed at INRIA within the ScAlAp-

    plix project, Scotch is available under a dual licensing basis. On the one hand, it

    is downloadable from the Scotch web page as free/libre software, to all interested

    parties willing to use it as a library or to contribute to it as a testbed for new

    partitioning and ordering methods. On the other hand, it can also be distributed,

    under other types of licenses and conditions, to parties willing to embed it tightly

    into closed, proprietary software.

    The free/libre software license under which Scotch 5.1 is distributed is

    the CeCILL-C license [4], which has basically the same features as the GNU

    LGPL (“Lesser General Public License”) [29]: ability to link the code as a

    library to any free/libre or even proprietary software, ability to modify the

    code and to redistribute these modifications. Version 4.0 of Scotch was dis-

    tributed under the LGPL itself. This version did not comprise any parallel features.

    Please refer to section 7 to see how to obtain the free/libre distribution of

    Scotch.

    3 Algorithms

    3.1 Parallel static mapping by Dual Recursive Bipartitioning

    For a detailed description of the sequential implementation of this mapping algo-

    rithm and an extensive analysis of its performance, please refer to [33, 36]. In the

    next sections, we will only outline the most important aspects of the algorithm.

    3.1.1 Static mapping

    The parallel program to be mapped onto the target architecture is modeled by a val-

    uated unoriented graph S called source graph or process graph, the vertices of which

    represent the processes of the parallel program, and the edges of which the commu-

    nication channels between communicating processes. Vertex- and edge- valuations

    associate with every vertex vS and every edge eS of S integer numbers wS(vS) and

    wS(eS) which estimate the computation weight of the corresponding process and

    the amount of communication to be transmitted on the channel, respectively.

    The target machine onto which is mapped the parallel program is also modeled

    by a valuated unoriented graph T called target graph or architecture graph. Vertices

    vT and edges eT of T are assigned integer weights wT (vT ) and wT (eT ), which

    estimate the computational power of the corresponding processor and the cost of

    traversal of the inter-processor link, respectively.

    A mapping from S to T consists of two applications τS,T : V (S) −→ V (T ) and

    ρS,T : E(S) −→ P(E(T )), where P(E(T )) denotes the set of all simple loopless

    paths which can be built from E(T ). τS,T (vS) = vT if process vS of S is mapped

onto processor vT of T, and ρS,T(eS) = {e¹T, e²T, . . . , eⁿT} if communication channel

eS of S is routed through communication links e¹T, e²T, . . . , eⁿT of T. |ρS,T(eS)|

    denotes the dilation of edge eS , that is, the number of edges of E(T ) used to route

    eS .
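
As a hypothetical illustration of these definitions, consider a source edge eS whose two ends are mapped by τS,T onto two target vertices that are not adjacent in T but are linked through one intermediate target vertex: ρS,T(eS) is then a path of two target edges, and the dilation |ρS,T(eS)| of eS is 2. Conversely, an edge whose two ends are mapped onto the same target vertex is routed through no link at all, and its dilation is 0.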

    3.1.2 Cost function and performance criteria

    The computation of efficient static mappings requires an a priori knowledge of the

    dynamic behavior of the target machine with respect to the programs which are

    run on it. This knowledge is synthesized in a cost function, the nature of which

    determines the characteristics of the desired optimal mappings. The goal of our

    mapping algorithm is to minimize some communication cost function, while keeping

the load balance within a specified tolerance. The communication cost function fC

that we have chosen is the sum, for all edges, of their dilation multiplied by their

weight:

$$f_C(\tau_{S,T},\rho_{S,T}) \;\stackrel{\mathrm{def}}{=}\; \sum_{e_S \in E(S)} w_S(e_S)\,\left|\rho_{S,T}(e_S)\right| .$$

    This function, which has already been considered by several authors for hypercube

    target topologies [8, 16, 20], has several interesting properties: it is easy to compute,

    allows incremental updates performed by iterative algorithms, and its minimization

    favors the mapping of intensively intercommunicating processes onto nearby pro-

cessors; regardless of the type of routing implemented on the target machine (store-

    and-forward or cut-through), it models the traffic on the interconnection network

    and thus the risk of congestion.

    The strong positive correlation between values of this function and effective

    execution times has been experimentally verified by Hammond [16] on the CM-2,

    and by Hendrickson and Leland [21] on the nCUBE 2.

    The quality of mappings is evaluated with respect to the criteria for quality that

    we have chosen: the balance of the computation load across processors, and the

    minimization of the interprocessor communication cost modeled by function fC .

    These criteria lead to the definition of several parameters, which are described

    below.

    For load balance, one can define µmap, the average load per computational

    power unit (which does not depend on the mapping), and δmap, the load imbalance

ratio, as

$$\mu_{\mathrm{map}} \;\stackrel{\mathrm{def}}{=}\; \frac{\sum_{v_S \in V(S)} w_S(v_S)}{\sum_{v_T \in V(T)} w_T(v_T)}
\qquad\text{and}\qquad
\delta_{\mathrm{map}} \;\stackrel{\mathrm{def}}{=}\; \frac{\sum_{v_T \in V(T)} \left|\frac{1}{w_T(v_T)} \sum_{\substack{v_S \in V(S)\\ \tau_{S,T}(v_S)=v_T}} w_S(v_S) \;-\; \mu_{\mathrm{map}}\right|}{\sum_{v_S \in V(S)} w_S(v_S)} .$$

However, since the maximum load imbalance ratio is provided by the user as an

input to the mapping, the information given by these parameters is of little

interest: what matters is the minimization of the communication cost function

under this load balance constraint.

    For communication, the straightforward parameter to consider is fC . It can be

    normalized as µexp, the average edge expansion, which can be compared to µdil,

the average edge dilation; these are defined as

$$\mu_{\mathrm{exp}} \;\stackrel{\mathrm{def}}{=}\; \frac{f_C}{\sum_{e_S \in E(S)} w_S(e_S)}
\qquad\text{and}\qquad
\mu_{\mathrm{dil}} \;\stackrel{\mathrm{def}}{=}\; \frac{\sum_{e_S \in E(S)} \left|\rho_{S,T}(e_S)\right|}{|E(S)|} .$$

$\delta_{\mathrm{exp}} \stackrel{\mathrm{def}}{=} \mu_{\mathrm{exp}}/\mu_{\mathrm{dil}}$ is smaller than 1 when the mapper succeeds in putting heavily

intercommunicating processes closer to each other than it does for lightly communicating

processes; they are equal if all edges have the same weight.
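
As an illustration of these criteria, the following minimal C sketch computes fC, µexp, µdil and δexp from per-edge weights and dilations; it is not part of the libScotch API, and the array names used here are hypothetical:

#include <stdio.h>

/* Communication metrics of a mapping, computed from per-edge data.          */
/* edgewgttab[i] holds wS(e_i) and edgedlttab[i] holds |rhoS,T(e_i)|; these  */
/* arrays are assumed to have been filled by some mapper beforehand.         */
static void
mappingMetrics (int edgenbr, const int * edgewgttab, const int * edgedlttab)
{
  long    fc   = 0;                              /* Communication cost fC     */
  long    wsum = 0;                              /* Sum of edge weights       */
  long    dsum = 0;                              /* Sum of edge dilations     */
  int     edgenum;

  for (edgenum = 0; edgenum < edgenbr; edgenum ++) {
    fc   += (long) edgewgttab[edgenum] * edgedlttab[edgenum];
    wsum += edgewgttab[edgenum];
    dsum += edgedlttab[edgenum];
  }

  double  muexp = (double) fc   / (double) wsum;    /* Average edge expansion */
  double  mudil = (double) dsum / (double) edgenbr; /* Average edge dilation  */

  printf ("fC = %ld, muExp = %g, muDil = %g, deltaExp = %g\n",
          fc, muexp, mudil, muexp / mudil);
}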

    3.1.3 The Dual Recursive Bipartitioning algorithm

    Our mapping algorithm uses a divide and conquer approach to recursively allocate

    subsets of processes to subsets of processors [33].

    It starts by considering a set of processors, also called domain, containing all

    the processors of the target machine, and with which is associated the set of all

    the processes to map. At each step, the algorithm bipartitions a yet unprocessed

    domain into two disjoint subdomains, and calls a graph bipartitioning algorithm to

    split the subset of processes associated with the domain across the two subdomains,

    as sketched in the following.

mapping (D, P)
Set_Of_Processors  D;
Set_Of_Processes   P;
{
  Set_Of_Processors  D0, D1;
  Set_Of_Processes   P0, P1;

  if (|P| == 0) return;                      /* If nothing to do.     */
  if (|D| == 1) {                            /* If one processor in D */
    result (D, P);                           /* P is mapped onto it.  */
    return;
  }

  (D0, D1) = processor_bipartition (D);
  (P0, P1) = process_bipartition (P, D0, D1);
  mapping (D0, P0);                          /* Perform recursion. */
  mapping (D1, P1);
}

    The association of a subdomain with every process defines a partial mapping of

    the process graph. As bipartitionings are performed, the subdomain sizes decrease,

yielding a complete mapping when all subdomains are of size one.

The above algorithm relies on the ability to define five main objects:

    • a domain structure, which represents a set of processors in the target archi-

    tecture;

    • a domain bipartitioning function, which, given a domain, bipartitions it in

    two disjoint subdomains;

    • a domain distance function, which gives, in the target graph, a measure of the

    distance between two disjoint domains. Since domains may not be convex nor

    connected, this distance may be estimated. However, it must respect certain

    homogeneity properties, such as giving more accurate results as domain sizes

    decrease. The domain distance function is used by the graph bipartitioning

    algorithms to compute the communication function to minimize, since it allows

    the mapper to estimate the dilation of the edges that link vertices which belong

    to different domains. Using such a distance function amounts to considering

that all routings will use shortest paths on the target architecture, which

is what most parallel machines actually do. We have thus chosen that our

    program would not provide routings for the communication channels, leaving

    their handling to the communication system of the target machine;

    • a process subgraph structure, which represents the subgraph induced by a

    subset of the vertex set of the original source graph;

    • a process subgraph bipartitioning function, which bipartitions subgraphs in

    two disjoint pieces to be mapped onto the two subdomains computed by the

    domain bipartitioning function.

    All these routines are seen as black boxes by the mapping program, which can thus

    accept any kind of target architecture and process bipartitioning functions.

    3.1.4 Partial cost function

    The production of efficient complete mappings requires that all graph bipartition-

    ings favor the criteria that we have chosen. Therefore, the bipartitioning of a

    subgraph S′ of S should maintain load balance within the user-specified tolerance,

and minimize the partial communication cost function f′C, defined as

$$f'_C(\tau_{S,T},\rho_{S,T}) \;\stackrel{\mathrm{def}}{=}\; \sum_{\substack{v \in V(S')\\ \{v,v'\} \in E(S)}} w_S(\{v,v'\})\,\left|\rho_{S,T}(\{v,v'\})\right| ,$$

    which accounts for the dilation of edges internal to subgraph S′ as well as for the

    one of edges which belong to the cocycle of S′, as shown in Figure 1. Taking into

    account the partial mapping results issued by previous bipartitionings makes it pos-

    sible to avoid local choices that might prove globally bad, as explained below. This

    amounts to incorporating additional constraints to the standard graph bipartition-

    ing problem, turning it into a more general optimization problem termed as skewed

    graph partitioning by some authors [23].

a. Initial position.   b. After one vertex is moved.

    Figure 1: Edges accounted for in the partial communication cost function when

    bipartitioning the subgraph associated with domain D between the two subdomains

    D0 and D1 of D. Dotted edges are of dilation zero, their two ends being mapped

    onto the same subdomain. Thin edges are cocycle edges.

3.1.5 Parallel graph bipartitioning methods

    The core of our parallel recursive mapping algorithm uses process graph parallel

    bipartitioning methods as black boxes. It allows the mapper to run any type of

    graph bipartitioning method compatible with our criteria for quality. Bipartitioning

    jobs maintain an internal image of the current bipartition, indicating for every vertex

    of the job whether it is currently assigned to the first or to the second subdomain. It

    is therefore possible to apply several different methods in sequence, each one starting

    from the result of the previous one, and to select the methods with respect to the

    job characteristics, thus enabling us to define mapping strategies. The currently

    implemented graph bipartitioning methods are listed below.

    Band

    Like the multi-level method which will be described below, the band method

    is a meta-algorithm, in the sense that it does not itself compute partitions, but

    rather helps other partitioning algorithms perform better. It is a refinement

    algorithm which, from a given initial partition, extracts a band graph of given

    width (which only contains graph vertices that are at most at this distance

    from the separator), calls a partitioning strategy on this band graph, and

prolongs¹ back the refined partition on the original graph. This method was

    designed to be able to use expensive partitioning heuristics, such as genetic

    algorithms, on large graphs, as it dramatically reduces the problem space by

    several orders of magnitude. However, it was found that, in a multi-level

    context, it also improves partition quality, by coercing partitions in a problem

    space that derives from the one which was globally defined at the coarsest

level, thus preventing local optimization refinement algorithms from being trapped

    in local optima of the finer graphs [6].

    Diffusion

    This global optimization method, the sequential formulation of which is pre-

    sented in [34], flows two kinds of antagonistic liquids, scotch and anti-scotch,

    from two source vertices, and sets the new frontier as the limit between ver-

    tices which contain scotch and the ones which contain anti-scotch. In order to

    add load-balancing constraints to the algorithm, a constant amount of liquid

    disappears from every vertex per unit of time, so that no domain can spread

    across more than half of the vertices. Because selecting the source vertices is

essential to obtaining useful results, this method has been hard-coded

    so that the two source vertices are the two vertices of highest indices, since

    in the band method these are the anchor vertices which represent all of the

    removed vertices of each part. Therefore, this method must be used on band

    graphs only, or on specifically crafted graphs.

    Multi-level

    This algorithm, which has been studied by several authors [3, 18, 25] and

    should be considered as a strategy rather than as a method since it uses other

    methods as parameters, repeatedly reduces the size of the graph to bipartition

    by finding matchings that collapse vertices and edges, computes a partition for

    the coarsest graph obtained, and prolongs the result back to the original graph,

as shown in Figure 2. The multi-level method, when used in conjunction with

the banded diffusion method to refine the prolonged partitions at every level,

usually stabilizes quality irrespective of the number of processors which run

the parallel static mapper.

Figure 2: The multi-level partitioning process. In the uncoarsening phase, the light
and bold lines represent for each level the prolonged partition obtained from the
coarser graph, and the partition obtained after refinement, respectively.

¹ While a projection is an application to a space of lower dimension, a prolongation
refers to an application to a space of higher dimension. Yet, the term projection is
also commonly used to refer to such a propagation, most often in the context of a
multilevel framework.
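
To make the structure of this strategy more concrete, the multi-level bipartitioning process of Figure 2 can be sketched in the same pseudocode style as the mapping() routine of section 3.1.3. The coarsen(), initial_bipartition(), prolong() and refine() helpers are hypothetical placeholders, not libScotch routines:

multilevel_bipartition (G)
Graph  G;
{
  Graph      C;                        /* Coarser version of G                 */
  Partition  PC, PG;

  if (|V(G)| <= threshold)             /* If graph is small enough,            */
    return (initial_bipartition (G));  /* compute a partition directly.        */

  C  = coarsen (G);                    /* Collapse matched vertices and edges. */
  PC = multilevel_bipartition (C);     /* Recurse on the coarser graph.        */
  PG = prolong (PC, C, G);             /* Prolong partition back to G.         */
  PG = refine (PG, G);                 /* Refine it, e.g. with banded methods. */
  return (PG);
}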

    3.1.6 Mapping onto variable-sized architectures

    Several constrained graph partitioning problems can be modeled as mapping the

    problem graph onto a target architecture, the number of vertices and topology of

    which depend dynamically on the structure of the subgraphs to bipartition at each

    step.

    Variable-sized architectures are supported by the DRB algorithm in the follow-

    ing way: at the end of each bipartitioning step, if any of the variable subdomains

    is empty (that is, all vertices of the subgraph are mapped only to one of the sub-

    domains), then the DRB process stops for both subdomains, and all of the vertices

    are assigned to their parent subdomain; else, if a variable subdomain has only one

    vertex mapped onto it, the DRB process stops for this subdomain, and the vertex

    is assigned to it.

    The moment when to stop the DRB process for a specific subgraph can be con-

    trolled by defining a bipartitioning strategy that tests for the validity of a criterion

    at each bipartitioning step, and maps all of the subgraph vertices to one of the

    subdomains when it becomes false.

    3.2 Parallel sparse matrix ordering by hybrid incomplete

    nested dissection

    When solving large sparse linear systems of the form Ax = b, it is common to

    precede the numerical factorization by a symmetric reordering. This reordering is

    chosen in such a way that pivoting down the diagonal in order on the resulting

permuted matrix PAPᵀ produces much less fill-in and work than computing the

    factors of A by pivoting down the diagonal in the original order (the fill-in is the

    set of zero entries in A that become non-zero in the factored matrix).

    3.2.1 Hybrid incomplete nested dissection

    The minimum degree and nested dissection algorithms are the two most popular

    reordering schemes used to reduce fill-in and operation count when factoring and

solving sparse matrices.

    The minimum degree algorithm [42] is a local heuristic that performs its pivot

    selection by iteratively selecting from the graph a node of minimum degree. It is

    known to be a very fast and general purpose algorithm, and has received much

    attention over the last three decades (see for example [1, 13, 31]). However, the

    algorithm is intrinsically sequential, and very little can be theoretically proved

    about its efficiency.

    The nested dissection algorithm [14] is a global, recursive heuristic algorithm

    which computes a vertex set S that separates the graph into two parts A and B, or-

    dering S with the highest remaining indices. It then proceeds recursively on parts A

    and B until their sizes become smaller than some threshold value. This ordering

guarantees that, at each step, no non-zero term can appear in the factorization

    process between unknowns of A and unknowns of B.
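
As an illustration, this recursive scheme can be sketched in the same pseudocode style as the mapping() routine of section 3.1.3; separate(), order() and order_sequentially() are hypothetical placeholders, not libScotch routines. The separator S receives the highest remaining indices before the two parts are ordered recursively:

nested_dissection (G, imin, imax)
Graph  G;                                /* Subgraph to order                     */
int    imin, imax;                       /* Range of indices available for G      */
{
  Graph  A, B, S;

  if (|V(G)| <= threshold) {             /* If subgraph is small enough,          */
    order_sequentially (G, imin, imax);  /* switch to another ordering method.    */
    return;
  }
  (A, B, S) = separate (G);              /* Compute a small vertex separator S.   */
  order (S, imax - |V(S)| + 1, imax);    /* S gets the highest remaining indices. */
  nested_dissection (A, imin, imin + |V(A)| - 1);
  nested_dissection (B, imin + |V(A)|, imax - |V(S)|);
}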

    Many theoretical results have been obtained on nested dissection order-

    ing [5, 30], and its divide and conquer nature makes it easily parallelizable.

    The main issue of the nested dissection ordering algorithm is thus to find small

    vertex separators that balance the remaining subgraphs as evenly as possible.

    Provided that good vertex separators are found, the nested dissection algorithm

    produces orderings which, both in terms of fill-in and operation count, compare

    favorably [15, 25, 38] to the ones obtained with the minimum degree algorithm [31].

    Moreover, the elimination trees induced by nested dissection are broader, shorter,

    and better balanced, and therefore exhibit much more concurrency in the con-

    text of parallel Cholesky factorization [2, 11, 12, 15, 38, 41, and included references].

    Due to their complementary nature, several schemes have been proposed to

    hybridize the two methods [24, 27, 38]. Our implementation is based on a tight

    coupling of the nested dissection and minimum degree algorithms, that allows each

    of them to take advantage of the information computed by the other [39].

    However, because we do not provide a parallel implementation of the minimum

    degree algorithm, this hybridization scheme can only take place after enough steps

    of parallel nested dissection have been performed, such that the subgraphs to be

    ordered by minimum degree are centralized on individual processors.

    3.2.2 Parallel ordering

    The parallel computation of orderings in PT-Scotch involves three different levels

    of concurrency, corresponding to three key steps of the nested dissection process:

    the nested dissection algorithm itself, the multi-level coarsening algorithm used to

    compute separators at each step of the nested dissection process, and the refinement

    of the obtained separators. Each of these steps is described below.

    Nested dissection As said above, the first level of concurrency relates to the

    parallelization of the nested dissection method itself, which is straightforward thanks

    to the intrinsically concurrent nature of the algorithm. Starting from the initial

    graph, arbitrarily distributed across p processors but preferably balanced in terms

    of vertices, the algorithm proceeds as illustrated in Figure 3 : once a separator

    has been computed in parallel, by means of a method described below, each of

    the p processors participates in the building of the distributed induced subgraph

    corresponding to the first separated part (even if some processors do not have any

vertex of it). This induced subgraph is then folded onto the first ⌈p/2⌉ processors,

such that the average number of vertices per processor, which guarantees efficiency

as it allows the shadowing of communications by a subsequent amount of computation,

remains constant. During the folding process, vertices and adjacency lists owned

by the ⌊p/2⌋ sender processors are redistributed to the ⌈p/2⌉ receiver processors

so as to evenly balance their loads.

Figure 3: Diagram of a nested dissection step for a (sub-)graph distributed across
four processors. Once the separator is known, the two induced subgraphs are built
and folded (this can be done in parallel for both subgraphs), yielding two subgraphs,
each of them distributed across two processors.

The same procedure is used to build, on the ⌊p/2⌋ remaining processors, the

    folded induced subgraph corresponding to the second part. These two constructions

    being completely independent, the computations of the two induced subgraphs and

    their folding can be performed in parallel, thanks to the temporary creation of an

    extra thread per processor. When the vertices of the separated graph are evenly

    distributed across the processors, this feature favors load balancing in the subgraph

    building phase, because processors which do not have many vertices of one part

    will have the rest of their vertices in the other part, thus yielding the same overall

    workload to create both graphs in the same time. This feature can be disabled

    when the communication system of the target machine is not thread-safe.

    At the end of the folding process, every processor has a folded subgraph fragment

of one of the two folded subgraphs, and the nested dissection process can recursively

proceed independently on each subgroup of p/2 (then p/4, p/8, etc.) processors, until

    each subgroup is reduced to a single processor. From then on, the nested dissection

    process will go on sequentially on every processor, using the nested dissection rou-

    tines of the Scotch library, eventually ending in a coupling with minimum degree

    methods [39], as described in the previous section.
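
The recursive halving of the processor set can be expressed with a standard MPI communicator split, as in the hedged sketch below; it only illustrates the process organization described above, not the actual internals of PT-Scotch:

#include <mpi.h>

/* Split the current communicator into two halves, one per folded induced     */
/* subgraph; each process then recurses within its own subgroup. Sketch only. */
void
foldAndRecurse (MPI_Comm comm)
{
  MPI_Comm  subcomm;
  int       procglbnbr, proclocnum, color;

  MPI_Comm_size (comm, &procglbnbr);
  MPI_Comm_rank (comm, &proclocnum);

  if (procglbnbr == 1)                      /* Single processor left:             */
    return;                                 /* go on with sequential ordering.    */

  color = (proclocnum < (procglbnbr + 1) / 2) ? 0 : 1; /* First ceil(p/2) processes */
  MPI_Comm_split (comm, color, proclocnum, &subcomm);  /* handle the first part.    */

  foldAndRecurse (subcomm);                 /* Recurse on p/2, p/4, etc. processes. */

  MPI_Comm_free (&subcomm);
}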

    Graph coarsening The second level of concurrency concerns the computation

    of separators. The approach we have chosen is the now classical multi-level one [3,

    22, 27]. It consists in repeatedly computing a set of increasingly coarser albeit

    topologically similar versions of the graph to separate, by finding matchings which

    collapse vertices and edges, until the coarsest graph obtained is no larger than a

    few hundreds of vertices, then computing a separator on this coarsest graph, and

    prolonging back this separator, from coarser to finer graphs, up to the original graph.

    Most often, a local optimization algorithm, such as Kernighan-Lin [28] or Fiduccia-

    Mattheyses [9] (FM), is used in the uncoarsening phase to refine the partition that

    is prolonged back at every level, such that the granularity of the solution is the one

    of the original graph and not the one of the coarsest graph.
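
A minimal sequential sketch of such a matching step is given below: every unmatched vertex is paired with its first unmatched neighbor, which is one simple matching rule among others (heavy-edge matching is a common alternative). The compressed adjacency arrays and names used here are illustrative, not the libScotch data structures:

/* Assign a coarse vertex number coartab[v] to every fine vertex v; two      */
/* matched fine vertices share the same coarse number. verttab/edgetab is a  */
/* compressed adjacency representation with base 0.                          */
int
coarsenMatch (int vertnbr, const int * verttab, const int * edgetab, int * coartab)
{
  int  vertnum;
  int  coarnbr = 0;                               /* Number of coarse vertices    */

  for (vertnum = 0; vertnum < vertnbr; vertnum ++)
    coartab[vertnum] = -1;                        /* All vertices start unmatched */

  for (vertnum = 0; vertnum < vertnbr; vertnum ++) {
    int  edgenum;

    if (coartab[vertnum] != -1)                   /* Skip already matched vertices */
      continue;
    coartab[vertnum] = coarnbr;                   /* Vertex mates with itself...   */
    for (edgenum = verttab[vertnum]; edgenum < verttab[vertnum + 1]; edgenum ++) {
      int  vertend = edgetab[edgenum];
      if (coartab[vertend] == -1) {               /* ... or with its first         */
        coartab[vertend] = coarnbr;               /* unmatched neighbor.           */
        break;
      }
    }
    coarnbr ++;
  }
  return (coarnbr);
}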

    The main features of our implementation are outlined in Figure 4. Once the

matching phase is complete, the coarsened subgraph building phase takes place.

    It can be parametrized so as to allow one to choose between two options. Either

    all coarsened vertices are kept on their local processors (that is, processors that

    hold at least one of the ends of the coarsened edges), as shown in the first steps

    of Figure 4, which decreases the number of vertices owned by every processor and

    speeds-up future computations, or else coarsened graphs are folded and duplicated,

    as shown in the next steps of Figure 4, which increases the number of working copies

    of the graph and can thus reduce communication and increase the final quality of

    the separators.

    As a matter of fact, separator computation algorithms, which are local heuristics,

    heavily depend on the quality of the coarsened graphs, and we have observed with

the sequential version of Scotch that always keeping the best of two partitions,

obtained from two fully independent multi-level runs, usually improved

    overall ordering quality. By enabling the folding-with-duplication routine (which

    will be referred to as “fold-dup” in the following) in the first coarsening levels, one

    can implement this approach in parallel, every subgroup of processors that hold a

    working copy of the graph being able to perform an almost-complete independent

    multi-level computation, save for the very first level which is shared by all subgroups,

    for the second one which is shared by half of the subgroups, and so on.

    The problem with the fold-dup approach is that it consumes a lot of memory.

    Consequently, a good strategy can be to resort to folding only when the number

    of vertices of the graph to be considered reaches some minimum threshold. This

threshold allows one to set a trade-off between the level of completeness of the

    independent multi-level runs which result from the early stages of the fold-dup

    process, which impact partitioning quality, and the amount of memory to be used

    in the process.

    Once all working copies of the coarsened graphs are folded on individual pro-

    cessors, the algorithm enters a multi-sequential phase, illustrated at the bottom of

    Figure 4: the routines of the sequential Scotch library are used on every processor

    to complete the coarsening process, compute an initial partition, and prolong it

    back up to the largest centralized coarsened graph stored on the processor. Then,

    the partitions are prolonged back in parallel to the finer distributed graphs, select-

    ing the best partition between the two available when prolonging to a level where

    fold-dup had been performed. This distributed prolongation process is repeated

    until we obtain a partition of the original graph.

    Band refinement The third level of concurrency concerns the refinement heuris-

    tics which are used to improve the prolonged separators. At the coarsest levels of

    the multi-level algorithm, when computations are restricted to individual proces-

    sors, the sequential FM algorithm of Scotch is used, but this class of algorithms

    does not parallelize well.

    This problem can be solved in two ways: either by developing scalable and

    efficient local optimization algorithms, or by being able to use the existing sequential

FM algorithm on very large graphs. A solution which enables both approaches has

been proposed in [6]; it is based on the following reasoning. Since every

    refinement is performed by means of a local algorithm, which perturbs only in a

    limited way the position of the prolonged separator, local refinement algorithms

    need only to be passed a subgraph that contains the vertices that are very close to

    the prolonged separator.

Figure 4: Diagram of the parallel computation of the separator of a graph distributed
across four processors, by parallel coarsening with folding-with-duplication in the
last stages, multi-sequential computation of initial partitions that are locally
prolonged back and refined on every processor, and then parallel uncoarsening of
the best partition encountered.

Figure 5: Creation of a distributed band graph. Only vertices closest to the separator
are kept. Other vertices are replaced by anchor vertices of equivalent total weight,
linked to band vertices of the last layer. There are two anchor vertices per processor,
to reduce communication. Once the separator has been refined on the band graph using
some local optimization algorithm, the new separator is prolonged back to the original
distributed graph.

The computation and use of distributed band graphs is outlined in Figure 5.

Given a distributed graph and an initial separator, which can be spread across

    several processors, vertices that are closer to separator vertices than some small

    user-defined distance are selected by spreading distance information from all of

    the separator vertices, using our halo exchange routine. Then, the distributed

    band graph is created, by adding on every processor two anchor vertices, which are

    connected to the last layers of vertices of each of the parts. The vertex weight of

    the anchor vertices is equal to the sum of the vertex weights of all of the vertices

    they replace, to preserve the balance of the two band parts. Once the separator of

    the band graph has been refined using some local optimization algorithm, the new

    separator is prolonged back to the original distributed graph.
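
The selection of band vertices can be illustrated by a small sequential breadth-first traversal from the separator, sketched below in C; in the actual distributed implementation, distance information is spread by halo exchanges, and the array names used here are illustrative only:

#include <stdlib.h>

/* Mark all vertices at distance at most distmax from the separator.        */
/* parttab[v] == 2 is assumed to flag separator vertices; flagtab[v] is set */
/* to the distance of v to the separator, or -1 if v is not in the band.    */
void
bandMark (int vertnbr, const int * verttab, const int * edgetab,
          const int * parttab, int distmax, int * flagtab)
{
  int * queutab = malloc (vertnbr * sizeof (int));
  int   queuhead = 0, queutail = 0, vertnum;

  for (vertnum = 0; vertnum < vertnbr; vertnum ++) {
    flagtab[vertnum] = (parttab[vertnum] == 2) ? 0 : -1; /* Separator: distance 0 */
    if (flagtab[vertnum] == 0)
      queutab[queutail ++] = vertnum;                    /* Seed the traversal    */
  }

  while (queuhead < queutail) {                          /* Breadth-first search  */
    int  vertcur = queutab[queuhead ++];
    int  edgenum;

    if (flagtab[vertcur] >= distmax)                     /* Band width reached    */
      continue;
    for (edgenum = verttab[vertcur]; edgenum < verttab[vertcur + 1]; edgenum ++) {
      int  vertend = edgetab[edgenum];
      if (flagtab[vertend] == -1) {                      /* Not yet reached       */
        flagtab[vertend] = flagtab[vertcur] + 1;
        queutab[queutail ++] = vertend;
      }
    }
  }
  free (queutab);
}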

Building on these band graphs, we have implemented a multi-sequential refine-

    ment algorithm, outlined in Figure 6. At every distributed uncoarsening step, a

    distributed band graph is created. Centralized copies of this band graph are then

    gathered on every participating processor, which serve to run fully independent in-

    stances of our sequential FM algorithm. The perturbation of the initial state of the

    sequential FM algorithm on every processor allows us to explore slightly different

    solution spaces, and thus to improve refinement quality. Finally, the best refined

    band separator is prolonged back to the distributed graph, and the uncoarsening

    process goes on.


    Figure 6: Diagram of the multi-sequential refinement of a separator prolonged back

    from a coarser graph distributed across four processors to its finer distributed graph.

    Once the distributed band graph is built from the finer graph, a centralized version

    of it is gathered on every participating processor. A sequential FM optimization

    can then be run independently on every copy, and the best improved separator is

    then distributed back to the finer graph.

    3.2.3 Performance criteria

    The quality of orderings is evaluated with respect to several criteria. The first

    one, NNZ, is the number of non-zero terms in the factored reordered matrix. The

    second one, OPC, is the operation count, that is the number of arithmetic operations

    required to factor the matrix. The operation count that we have considered takes

    into consideration all operations (additions, subtractions, multiplications, divisions)

required by Cholesky factorization, except square roots; it is equal to $\sum_c n_c^2$, where

$n_c$ is the number of non-zeros of column c of the factored matrix, diagonal included.
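
Given the per-column counts of the factored matrix, both figures can be computed with a trivial loop, as in the following hedged C sketch (the nctab array is an assumed input produced by a symbolic factorization, not something returned by libScotch itself):

/* Compute NNZ and OPC from the number of non-zeros of every column of the  */
/* factored matrix, diagonal included.                                       */
void
factorStats (int colnbr, const int * nctab, long * nnzptr, double * opcptr)
{
  long    nnz = 0;
  double  opc = 0.0;
  int     colnum;

  for (colnum = 0; colnum < colnbr; colnum ++) {
    nnz += nctab[colnum];                            /* NNZ is the sum of n_c   */
    opc += (double) nctab[colnum] * nctab[colnum];   /* OPC is the sum of n_c^2 */
  }
  *nnzptr = nnz;
  *opcptr = opc;
}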

    A third criterion for quality is the shape of the elimination tree; concurrency

    in parallel solving is all the higher as the elimination tree is broad and short. To

    measure its quality, several parameters can be defined: hmin, hmax, and havg denote

    the minimum, maximum, and average heights of the tree2, respectively, and hdltis the variance, expressed as a percentage of havg. Since small separators result in

    small chains in the elimination tree, havg should also indirectly reflect the quality

    of separators.

    3.3 Changes from version 5.0

    PT-Scotch now provides routines to compute in parallel partitions of distributed

    graphs.

    A new integer index type has been created in the Fortran interface, to address

    array indices larger than the maximum value which can be stored in a regular

    integer. Please refer to Section 7.3 for more information.

    A new set of routines has been designed, to ease the use of the libScotch as

a dynamic library. The SCOTCH_version routine returns the version, release and

patchlevel numbers of the library being used. The SCOTCH_*Alloc routines, which

    are only available in the C interface at the time being, dynamically allocate storage

    space for the opaque API Scotch structures, which frees application programs from

    the need to be systematically recompiled because of possible changes of Scotch

structure sizes.

² We do not consider as leaves the disconnected vertices that are present in some
meshes, since they do not participate in the solving process.

    4 Files and data structures

    For the sake of portability and readability, all the data files shared by the differ-

    ent programs of the Scotch project are coded in plain ASCII text exclusively.

    Although we may speak of “lines” when describing file formats, text-formatting

    characters such as newlines or tabulations are not mandatory, and are not taken

    into account when files are read. They are only used to provide better readabil-

    ity and understanding. Whenever numbers are used to label objects, and unless

explicitly stated, numberings always start from zero, not one.

    4.1 Distributed graph files

    Because even very large graphs are most often stored in the form of centralized

    files, the distributed graph loading routine of the PT-Scotch package, as well as

    all parallel programs which handle distributed graphs, are able to read centralized

    graph files in the Scotch format and to scatter them on the fly across the

    available processors (the format of centralized Scotch graph files is described

    in the Scotch User’s Guide [35]). However, in order to reduce loading time, a

    distributed graph format has been designed, so that the different file fragments

    which comprise distributed graph files can be read in parallel and be stored on

    local disks on the nodes of a parallel or grid cluster.

    Distributed graph files, which usually end in “.dgr”, describe fragments of val-

    uated graphs, which can be valuated process graphs to be mapped onto target

    architectures, or graphs representing the adjacency structures of matrices to order.

    In Scotch, graphs are represented by means of adjacency lists: the definition

    of each vertex is accompanied by the list of all of its neighbors, i.e. all of its

    adjacent arcs. Therefore, the overall number of edge data is twice the number of

    edges. Distributed graphs are stored as a set of files which contain each a subset

    of graph vertices and their adjacencies. The purpose of this format is to speed-up

    the loading and saving of large graphs when working for some time with the same

    number of processors: the distributed graph loading routine will allow each of the

    processors to read in parallel from a different file. Consequently, the number of

    files must be equal to the number of processors involved in the parallel loading phase.

    The first line of a distributed graph file holds the distributed graph file version

    number, which is currently 2. The second line holds the number of files across

    which the graph data is distributed (referred to as procglbnbr in libScotch; see

    for instance Figure 8, page 32, for a detailed example), followed by the number of

    this file in the sequence (ranging from 0 to (procglbnbr − 1), and analogous to

    proclocnum in Figure 8). The third line holds the global number of graph vertices

    (referred to as vertglbnbr), followed by the global number of arcs (inappropriately

    called edgeglbnbr, as it is in fact equal to twice the actual number of edges). The

    fourth line holds the number of vertices contained in this graph fragment (analogous

    to vertlocnbr), followed by its local number of arcs (analogous to edgelocnbr).

    The fifth line holds two figures: the graph base index value (baseval) and a numeric

    flag.

    The graph base index value records the value of the starting index used to

    describe the graph; it is usually 0 when the graph has been output by C programs,

and 1 for Fortran programs. Its purpose is to ease the manipulation of graphs within

    each of these two environments, while providing compatibility between them.

    The numeric flag, similar to the one used by the Chaco graph format [19], is

    made of three decimal digits. A non-zero value in the units indicates that vertex

    weights are provided. A non-zero value in the tenths indicates that edge weights

    are provided. A non-zero value in the hundredths indicates that vertex labels are

    provided; if it is the case, vertices can be stored in any order in the file; else, natural

    order is assumed, starting from the starting global index of each fragment.

    This header data is then followed by as many lines as there are vertices in the

    graph fragment, that is, vertlocnbr lines. Each of these lines begins with the vertex

    label, if necessary, the vertex load, if necessary, and the vertex degree, followed by

    the description of the arcs. An arc is defined by the load of the edge, if necessary,

    and by the label of its other end vertex. The arcs of a given vertex can be provided

    in any order in its neighbor list. If vertex labels are provided, vertices can also be

    stored in any order in the file.

    Figure 7 shows the contents of two complementary distributed graph files mod-

    eling a cube with unity vertex and edge weights and base 0, distributed across two

    processors.

File for process 0:

2
2 0
8 24
4 12
0 000
3 4 2 1
3 5 3 0
3 6 0 3
3 7 1 2

File for process 1:

2
2 1
8 24
4 12
0 000
3 0 6 5
3 1 7 4
3 2 4 7
3 3 5 6

    Figure 7: Two complementary distributed graph files representing a cube dis-

    tributed across two processors.
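
The header of such a fragment can be parsed with a few fscanf calls, as in the hedged C sketch below; it merely checks the fields described above and is in no way the libScotch loader itself (real applications should call SCOTCH_dgraphLoad instead):

#include <stdio.h>

/* Read and print the header of a distributed graph file fragment.          */
/* Returns 0 on success. Format illustration only.                          */
int
dgrHeaderRead (FILE * stream)
{
  long  versval, procglbnbr, proclocnum;
  long  vertglbnbr, edgeglbnbr, vertlocnbr, edgelocnbr;
  long  baseval, flagval;

  if (fscanf (stream, "%ld", &versval) != 1 || versval != 2)
    return (1);                                 /* Version number must be 2   */
  if (fscanf (stream, "%ld %ld", &procglbnbr, &proclocnum) != 2 ||
      fscanf (stream, "%ld %ld", &vertglbnbr, &edgeglbnbr) != 2 ||
      fscanf (stream, "%ld %ld", &vertlocnbr, &edgelocnbr) != 2 ||
      fscanf (stream, "%ld %ld", &baseval,    &flagval)    != 2)
    return (1);

  printf ("fragment %ld of %ld: %ld/%ld vertices, %ld/%ld arcs, base %ld, flags %03ld\n",
          proclocnum, procglbnbr, vertlocnbr, vertglbnbr,
          edgelocnbr, edgeglbnbr, baseval, flagval);
  return (0);
}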

    5 Programs

    5.1 Invocation

    All of the programs comprised in the Scotch and PT-Scotch distributions have

    been designed to run in command-line mode without any interactive prompting,

    so that they can be called easily from other programs by means of “system ()”

    or “popen ()” system calls, or be piped together on a single shell command line.

    In order to facilitate this, whenever a stream name is asked for (either on input

    or output), the user may put a single “-” to indicate standard input or output.

    Moreover, programs read their input in the same order as stream names are given

    in the command line. It allows them to read all their data from a single stream

    (usually the standard input), provided that these data are ordered properly.

    A brief on-line help is provided with all the programs. To get this help, use the

    “-h” option after the program name. The case of option letters is not significant,

    except when both the lower and upper cases of a letter have different meanings.

    When passing parameters to the programs, only the order of file names is significant;

    options can be put anywhere in the command line, in any order. Examples of use

    of the different programs of the PT-Scotch project are provided in section 8.

Error messages are standardized, but may not be fully explanatory. However,

    most of the errors you may run into should be related to file formats, and located in

    “...Load” routines. In this case, compare your data formats with the definitions

    given in section 4, and use the dgtst program of the PT-Scotch distribution to

    check the consistency of your distributed source graphs.

Depending on your MPI environment, you may either run the programs directly,

    or else have to invoke them by means of a command such as mpirun. Check your

    local MPI documentation to see how to specify the number of processors on which

    to run them.

    5.2 File names

    5.2.1 Sequential and parallel file opening

    The programs of the PT-Scotch distribution can handle either the classical cen-

    tralized Scotch graph files, or the distributed PT-Scotch graph files described

    in section 4.1.

    In order to tell whether programs should read from, or write to, a single file

    located on only one processor, or to multiple instances of the same file on all of

    the processors, or else to distinct files on each of the processors, a special grammar

    has been designed, which is based on the “%” escape character. Four such escape

    sequences are defined, which are interpreted independently on every processor, prior

    to file opening. By default, when a filename is provided, it is assumed that the file

    is to be opened on only one of the processors, called the root processor, which is

    usually process 0 of the communicator within which the program is run. Using any

    of the first three escape sequences below will instruct programs to open in parallel

    a file of name equal to the interpreted filename, on every processor on which they

    are run.

    %p Replaced by the number of processes in the global communicator in which the

    program is run. Leads to parallel opening.

    %r Replaced on each process running the program by the rank of this process in

    the global communicator. Leads to parallel opening.

%- Discarded, but leads to parallel opening. This sequence is mainly used to
instruct programs to open on every processor a file of identical name. Depending
on whether the given path leads to a shared directory or to directories that are
local to each processor, this results either in the opening of multiple instances of
the same file, or in the opening of distinct files which may each have different
contents (in this latter case, it is strongly recommended to identify files by means
of the “%r” sequence).

    %% Replaced by a single “%” character. File names using this escape sequence are

    not considered for parallel opening, unless one or several of the three other

    escape sequences are also present.

    For instance, filename “brol” will lead to the opening of file “brol” on the root

    processor only, filename “%-brol” (or even “br%-ol”) will lead to the parallel open-

    ing of files called “brol” on every processor, and filename “brol%p-%r” will lead

    to the opening of files “brol2-0” and “brol2-1”, respectively, on each of the two

processors on which a program of the PT-Scotch distribution would run.
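For instance, assuming that the two fragment files of the above example have been
created as "brol2-0" and "brol2-1", and that the MPI environment uses a launcher
such as mpirun (see Section 5.1), a hypothetical command such as
"mpirun -np 2 dgtst brol%p-%r" would make each of the two processes open its own
fragment and check the consistency of the distributed graph.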

5.2.2 Using compressed files

    Starting from version 5.0.6, Scotch allows users to provide and retrieve data in

    compressed form. Since this feature requires that the compression and decompres-

sion tasks run at the same time as data is read or written, it can only be done

    on systems which support multi-threading (Posix threads) or multi-processing (by

    means of fork system calls).

    To determine if a stream has to be handled in compressed form, Scotch checks

    its extension. If it is “.gz” (gzip format), “.bz2” (bzip2 format) or “.lzma” (lzma

    format), the stream is assumed to be compressed according to the corresponding

    format. A filter task will then be used to process it accordingly if the format is

    implemented in Scotch and enabled on your system.

    To date, data can be read and written in bzip2 and gzip formats, and can

    also be read in the lzma format. Since the compression ratio of lzma on Scotch

    graphs is 30% better than the one of gzip and bzip2 (which are almost equivalent

    in this case), the lzma format is a very good choice for handling very large graphs.

    To see how to enable compressed data handling in Scotch, please refer to Section 7.

    When the compressed format allows it, several files can be provided on

    the same stream, and be uncompressed on the fly. For instance, the

    command “cat brol.grf.gz brol.xyz.gz | gout -.gz -.gz -Mn - brol.iv”

concatenates the topology and geometry data of some graph brol and feeds them
as a single compressed stream to the standard input of program gout, hence the
“-.gz” notation to indicate a compressed standard stream.

    5.3 Description

    5.3.1 dgmap / dgpart

    Synopsis

    dgmap [input graph file [input target file [output mapping file [output log

    file]]]] options

    dgpart number of parts [input graph file [output mapping file [output

    log file]]] options

    Description

    The dgmap program is the parallel static mapper. It uses a static mapping

    strategy to compute a mapping of the given source graph to the given target

    architecture. The implemented algorithms aim at assigning source graph ver-

    tices to target vertices such that every target vertex receives a set of source

    vertices of summed weight proportional to the relative weight of the target

    vertex in the target architecture, and such that the communication cost func-

    tion fC is minimized (see Section 3.1.2 for the definition and rationale of this

    cost function).

    Since its main purpose is to provide mappings that exhibit high concurrency

    for communication minimization in the mapped application, it comprises a

    parallel implementation of the dual recursive bipartitioning algorithm [33], as

    well as all of the sequential static mapping methods used by its sequential

    counterpart gmap, to be used on subgraphs located on single processors.

dgpart is a simplified interface to dgmap, which performs graph partitioning

    instead of static mapping. Consequently, the desired number of parts has to

    be provided, in lieu of the target architecture.

    The -b and -c options allow the user to set preferences on the behavior of the

    mapping strategy which is used by default. The -m option allows the user to

    define a custom mapping strategy.

    The input graph file filename can refer either to a centralized or to a dis-

    tributed graph, according to the semantics defined in Section 5.2. The map-

    ping file must be a centralized file.
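As an illustration, with hypothetical file names and process count, a command such
as "mpirun -np 4 dgpart 8 brol.grf brol.map" would partition graph brol.grf into
8 parts on 4 processes and write the result to the centralized mapping file brol.map,
while "mpirun -np 4 dgmap brol.grf arch.tgt brol.map" would map it onto the
target architecture described in file arch.tgt. Actual, complete examples are given
in Section 8.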

    Options

    Since the program is devoted to experimental studies, it has many optional

    parameters, used to test various execution modes. Values set by default will

    give best results in most cases.

    -brat

    Set the maximum load imbalance ratio to rat, which should be a value

between 0 and 1. This option can be used in conjunction with

    option -c, but is incompatible with option -m.

    -cflags

    Tune the default mapping strategy according to the given preference

    flags. Some of these flags are antagonistic, while others can be combined.

    See Section 6.4.1 for more information. The currently available flags are

    the following.

    b Enforce load balance as much as possible.

    q Privilege quality over speed. This is the default behavior.

    s Privilege speed over quality.

    t Use only safe methods in the strategy.

    x Favor scalability.

    This option can be used in conjunction with option -b, but is incompat-

    ible with option -m. The resulting strategy string can be displayed by

    means of the -vs option.

    -h Display the program synopsis.

    -mstrat

    Apply parallel static mapping strategy strat. The format of parallel

    mapping strategies is defined in section 6.4.2. This option is incompatible

    with options -b and -c.

    -rnum

    Set the number of the root process which will be used for centralized file

    accesses. Set to 0 by default.

    -sobj

    Mask source edge and vertex weights. This option allows the user to “un-

    weight” weighted source graphs by removing weights from edges and ver-

    tices at loading time. obj may contain several of the following switches.

    e Remove edge weights, if any.

    v Remove vertex weights, if any.

    -V Print the program version and copyright.

-vverb

    Set verbose mode to verb, which may contain several of the following

    switches.

    a Memory allocation information.

    m Mapping information, similar to the one displayed by the gmtst

    program of the sequential Scotch distribution.

    s Strategy information. This parameter displays the default mapping

strategy used by dgmap.

    t Timing information.

    5.3.2 dgord

    Synopsis

    dgord [input graph file [output ordering file [output log file]]] options

    Description

    The dgord program is the parallel sparse matrix block orderer. It uses an

    ordering strategy to compute block orderings of sparse matrices represented

    as source graphs, whose vertex weights indicate the number of DOFs per node

(if this number is non-homogeneous) and whose edges are unweighted, in order

    to minimize fill-in and operation count.

    Since its main purpose is to provide orderings that exhibit high concur-

    rency for parallel block factorization, it comprises a parallel nested dissection

    method [14], but sequential classical [31] and state-of-the-art [39] minimum

    degree algorithms are implemented as well, to be used on subgraphs located

    on single processors.

    Ordering methods can be combined by means of selection, grouping, and

    condition operators, so as to define ordering strategies, which can be passed

    to the program by means of the -o option. The -c option allows the user

    to set preferences on the behavior of the ordering strategy which is used by

    default.

    The input graph file filename can refer either to a centralized or to a dis-

    tributed graph, according to the semantics defined in Section 5.2. The order-

    ing file must be a centralized file.
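As an illustration, with hypothetical file names, a command such as
"mpirun -np 4 dgord brol.grf brol.ord -vt" would compute on 4 processes an
ordering of the graph stored in brol.grf, write it to the centralized ordering file
brol.ord, and display timing information, as per the -v option described below.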

    Options

    Since the program is devoted to experimental studies, it has many optional

    parameters, used to test various execution modes. Values set by default will

    give best results in most cases.

    -cflags

    Tune the default ordering strategy according to the given preference flags.

    Some of these flags are antagonistic, while others can be combined. See

    Section 6.4.1 for more information. The resulting strategy string can be

    displayed by means of the -vs option.

    b Enforce load balance as much as possible.

    q Privilege quality over speed. This is the default behavior.

    s Privilege speed over quality.

t Use only safe methods in the strategy.

    x Favor scalability.

    -h Display the program synopsis.

    -moutput mapping file

    Write to output mapping file the mapping of graph vertices to column

    blocks. All of the separators and leaves produced by the nested dissection

    method are considered as distinct column blocks, which may be in turn

    split by the ordering methods that are applied to them. Distinct integer

    numbers are associated with each of the column blocks, such that the

    number of a block is always greater than the ones of its predecessors

    in the elimination process, that is, its descendants in the elimination

    tree. The structure of mapping files is described in detail in the relevant

    section of the Scotch User’s Guide [35].

    When the geometry of the graph is available, this mapping file may be

    processed by program gout to display the vertex separators and super-

    variable amalgamations that have been computed.

    -ostrat

    Apply parallel ordering strategy strat. The format of parallel ordering

    strategies is defined in section 6.4.4.

    -rnum

    Set the number of the root process which will be used for centralized file

    accesses. Set to 0 by default.

    -toutput tree file

    Write to output tree file the structure of the separator tree. The data

that is written closely resembles that of a mapping file: after a first

    line that contains the number of lines to follow, there are that many lines

    of mapping pairs, which associate an integer number with every graph

    vertex index. This integer number is the number of the column block

    which is the parent of the column block to which the vertex belongs,

    or −1 if the column block to which the vertex belongs is a root of the

    separator tree (there can be several roots, if the graph is disconnected).

Combined with the column block mapping data produced by option -m, the
tree structure allows one to rebuild the separator tree (a small reading sketch
is given after this option list).

    -V Print the program version and copyright.

    -vverb

    Set verbose mode to verb, which may contain several of the following

    switches.

    a Memory allocation information.

    s Strategy information. This parameter displays the default parallel

    ordering strategy used by dgord.

    t Timing information.
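As announced in the description of the -t option above, the following C sketch
illustrates how a separator tree file could be traversed; it is a hypothetical example
which assumes that every mapping pair comprises a vertex index followed by the
number of the parent column block, and it performs no error recovery.

#include <stdio.h>

/* Sketch: read a separator tree file produced by "dgord -t" and print  */
/* the vertices that belong to root column blocks (parent set to -1).   */
/* The pair layout (vertex index, then parent block number) is assumed. */
int
printTreeRoots (FILE * const stream)
{
  long linenbr, linenum;                     /* Number of mapping pairs, loop index */
  long vertnum, parenum;                     /* Vertex index and parent block number */

  if (fscanf (stream, "%ld", &linenbr) != 1)
    return (1);
  for (linenum = 0; linenum < linenbr; linenum ++) {
    if (fscanf (stream, "%ld %ld", &vertnum, &parenum) != 2)
      return (1);
    if (parenum == -1)                       /* Root of the separator tree */
      printf ("vertex %ld belongs to a root column block\n", vertnum);
  }
  return (0);
}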

    5.3.3 dgpart

    Synopsis

    dgpart [number of parts [input graph file [output mapping file [output log

    file]]]] options

Description

    The dgpart program is the parallel graph partitioner. It is in fact a shortcut

    for the dgmap program, where the number of parts is turned into a complete

graph with the same number of vertices, which is then passed to the static mapping

    routine.

    Save for the number of parts parameter which replaces the input target file,

    the parameters of dgpart are identical to the ones of dgmap. Please refer

    to its manual page, in Section 5.3.1, for a description of all of the available

    options.

    5.3.4 dgscat

    Synopsis

    dgscat [input graph file [output graph file]] options

    Description

    The dgscat program creates a distributed source graph, in the Scotch dis-

    tributed graph format, from the given centralized source graph file.

    The input graph file filename should therefore refer to a centralized graph,

    while output graph file must refer to a distributed graph, according to the

    semantics defined in Section 5.2.
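For instance, with hypothetical file names, a command such as
"mpirun -np 4 dgscat -c brol.grf brol%p-%r.dgr" would spread the centralized
graph brol.grf across 4 processes, write one fragment file per process according to
the naming scheme of Section 5.2.1, and check the consistency of the resulting
distributed graph.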

    Options

    -c Check the consistency of the distributed graph at the end of the graph

    loading phase.

    -h Display the program synopsis.

    -rnum

    Set the number of the root process which will be used for centralized file

    accesses. Set to 0 by default.

    -V Print the program version and copyright.

    5.3.5 dgtst

    Synopsis

    dgtst [input graph file [output data file]] options

    Description

    The program dgtst is the source graph tester. It checks the consistency of

    the input source graph structure (matching of arcs, number of vertices and

    edges, etc.), and gives some statistics regarding edge weights, vertex weights,

    and vertex degrees.

    It produces the same results as the gtst program of the Scotch sequential

    distribution.

Options

    -h Display the program synopsis.

    -rnum

    Set the number of the root process which will be used for centralized file

    accesses. Set to 0 by default.

    -V Print the program version and copyright.

    6 Library

    All of the features provided by the programs of the PT-Scotch distribution may

    be directly accessed by calling the appropriate functions of the libScotch library,

archived in files libptscotch.a and libptscotcherr.a. All of the existing parallel

    routines belong to four distinct classes:

    • distributed source graph handling routines, which serve to declare, build, load,

    save, and check the consistency of distributed source graphs;

    • strategy handling routines, which allow the user to declare and build parallel

    mapping and ordering strategies;

    • parallel graph partitioning and static mapping routines, which allow the user

    to declare, compute, and save distributed static mappings of distributed source

    graphs;

    • parallel ordering routines, which allow the user to declare, compute, and save

    distributed orderings of distributed source graphs.

    Error handling is performed using the existing sequential routines of the Scotch

    distribution, which are described in the Scotch User’s Guide [35]. Their use is

    recalled in Section 6.11.

    A ParMeTiS compatibility library, called libptscotchparmetis.a, is also

    available. It allows users who were previously using ParMeTiS in their software to

take advantage of the efficiency of PT-Scotch without having to modify their code.

    The services provided by this library are described in Section 6.13.

    6.1 Running at proper thread level

Since PT-Scotch is based on the MPI API, all processes must call some flavor
of MPI_Init before using any routine of the library that performs communication.
The thread support level of MPI has to be set in accordance with the level required
by the library.

If PT-Scotch has been compiled without the -DSCOTCH_PTHREAD flag, a call to
the simple MPI_Init routine will suffice, because no concurrent MPI calls will be
performed by library routines. Else, the extended MPI_Init_thread initialization
routine has to be used to request the MPI_THREAD_MULTIPLE level, and the thread
support level value returned by the routine must be checked carefully. If your
MPI implementation does not provide the MPI_THREAD_MULTIPLE level, you will have
to recompile PT-Scotch without the -DSCOTCH_PTHREAD flag; otherwise, library calls
may cause random bugs in the communication subsystem, resulting in program
crashes.
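The following minimal C sketch (illustrative only) shows this initialization for a
PT-Scotch library compiled with the -DSCOTCH_PTHREAD flag: it requests the
MPI_THREAD_MULTIPLE level and aborts if that level is not granted.

#include <stdio.h>
#include <mpi.h>
#include "ptscotch.h"

int
main (int argc, char * argv[])
{
  int thrdval;                                /* Thread support level provided by MPI */

  MPI_Init_thread (&argc, &argv, MPI_THREAD_MULTIPLE, &thrdval);
  if (thrdval < MPI_THREAD_MULTIPLE) {        /* Check the provided level carefully   */
    fprintf (stderr, "MPI_THREAD_MULTIPLE not granted\n");
    MPI_Finalize ();
    return (1);
  }

  /* ... calls to the parallel routines of libScotch ... */

  MPI_Finalize ();
  return (0);
}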

6.2 Calling the routines of libScotch

    6.2.1 Calling from C

All of the C routines of the libScotch library are prefixed with “SCOTCH_”. The

    remainder of the function names is made of the name of the type of object to which

    the functions apply (e.g. “dgraph”, “dorder”, etc.), followed by the type of action

    performed on this object: “Init” for the initialization of the object, “Exit” for the

    freeing of its internal structures, “Load” for loading the object from one or several

    streams, and so on.

    Typically, functions that return an error code return zero if the function suc-

    ceeds, and a non-zero value in case of error.

For instance, the SCOTCH_dgraphInit and SCOTCH_dgraphLoad routines, de-

    scribed in section 6.5, can be called from C by using the following code.

#include <stdio.h>
#include <mpi.h>

    #include "ptscotch.h"

    ...

SCOTCH_Dgraph     grafdat;
FILE *            fileptr;

if (SCOTCH_dgraphInit (&grafdat) != 0) {
  ... /* Error handling */
}
if ((fileptr = fopen ("brol.grf", "r")) == NULL) {
  ... /* Error handling */
}
if (SCOTCH_dgraphLoad (&grafdat, fileptr, -1, 0) != 0) {
  ... /* Error handling */
}
...

    Since “ptscotch.h” uses several system and communication objects which are

    declared in “stdio.h” and “mpi.h”, respectively, these latter files must be included

    beforehand in your application code.

    Although the “scotch.h” and “ptscotch.h” files may look very similar on

    your system, never mistake them, and always use the “ptscotch.h” file as the

    right include file for compiling a program which uses the parallel routines of the

    libScotch library, whether it also calls sequential routines or not.

    6.2.2 Calling from Fortran

    The routines of the libScotch library can also be called from Fortran. For any C

function named SCOTCH_typeAction() which is documented in this manual, there

    exists a SCOTCHFTYPEACTION () Fortran counterpart, in which the separating

    underscore character is replaced by an “F”. In most cases, the Fortran routines

    have exactly the same parameters as the C functions, save for an added trailing

    INTEGER argument to store the return value yielded by the function when the

    return type of the C function is not void.

    Since all the data structures used in libScotch are opaque, equivalent declara-

    tions for these structures must be provided in Fortran. These structures must there-

fore be defined as arrays of DOUBLEPRECISIONs, of sizes given in file ptscotchf.h,

    which must be included whenever necessary.

    For routines that read or write data using a FILE * stream in C, the Fortran

counterpart uses an INTEGER parameter which is the number of the Unix file descrip-

    tor corresponding to the logical unit from which to read or write. In most Unix

    implementations of Fortran, standard descriptors 0 for standard input (logical unit

    5), 1 for standard output (logical unit 6) and 2 for standard error are opened by

    default. However, for files that are opened using OPEN statements, an additional

    function must be used to obtain the number of the Unix file descriptor from the

    number of the logical unit. This function is called PXFFILENO in the normalized

    POSIX Fortran API, and files which use it should include the USE IFPOSIX direc-

    tive whenever necessary. An alternate, non normalized, function also exists in most

    Unix implementations of Fortran, and is called FNUM.

For instance, the SCOTCH_dgraphInit and SCOTCH_dgraphLoad routines, de-

    scribed in sections 6.5.2 and 6.5.5, respectively, can be called from Fortran by using

    the following code.

INCLUDE "ptscotchf.h"
DOUBLEPRECISION GRAFDAT(SCOTCH_DGRAPHDIM)
INTEGER         RETVAL
...
CALL SCOTCHFDGRAPHINIT (GRAFDAT (1), RETVAL)
IF (RETVAL .NE. 0) THEN
...
OPEN (10, FILE='brol.grf')
CALL SCOTCHFDGRAPHLOAD (GRAFDAT (1), FNUM (10), 1, 0, RETVAL)
CLOSE (10)
IF (RETVAL .NE. 0) THEN
...

    Although the “scotchf.h” and “ptscotchf.h” files may look very similar on

    your system, never mistake them, and always use the “ptscotchf.h” file as the

    include file for compiling a Fortran program that uses the parallel routines of the

    libScotch library, whether it also calls sequential routines or not.

    All of the Fortran routines of the libScotch library are stubs which call their C

    counterpart. While this poses no problem for the usual integer and double precision

    data types, some conflicts may occur at compile or run time if your MPI implemen-

tation does not represent the MPI_Comm type in the same way in C and in Fortran.
Please check in the mpi.h include file on your platform whether the MPI_Comm data
type is represented as an int. If this is the case, there should be no problem in using

    the Fortran routines of the PT-Scotch library.

    6.2.3 Compiling and linking

    The compilation of C or Fortran routines which use parallel routines of the lib-

    Scotch library requires that either ptscotch.h or ptscotchf.h be included, re-

    spectively. Since some of the parallel routines of the libScotch library must be

    passed MPI communicators, it is necessary to include MPI files mpi.h or mpif.h,

    respectively, before the relevant PT-Scotch include files, such that prototypes of

    the parallel libScotch routines are properly defined.

The parallel routines of the libScotch library, along with tailored versions of

    the sequential routines, are grouped in a library file called libptscotch.a. Default

error routines that print an error message and exit are provided in the classical

    Scotch library file libptscotcherr.a.

    Therefore, the linking of applications that make use of the libScotch li-

    brary with standard error handling is carried out by using the following options:

    “-lptscotch -lptscotcherr -lmpi -lm”. The “-lmpi” option is most often not

    necessary, as the MPI library is automatically considered when compiling with com-

    mands such as mpicc.

    If you want to handle errors by yourself, you should not link with library file

libptscotcherr.a, but rather provide a SCOTCH_errorPrint() routine. Please

    refer to Section 6.11 for more information on error handling.
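For instance, assuming an MPI compiler wrapper such as mpicc and a hypothetical
source file brol.c, compilation and linking could read
"mpicc brol.c -o brol -lptscotch -lptscotcherr -lm", with additional -I and -L
flags pointing to the Scotch include and library directories whenever these are not
installed in default locations.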

    6.2.4 Machine word size issues

Graph indices are represented in Scotch as integer values of type SCOTCH_Num. By

    default, this type equates to the int C type, that is, an integer type of size equal

    to the one of the machine word. However, it can represent any other integer type.

Indeed, the size of the SCOTCH_Num integer type can be coerced to 32 or 64 bits

    by using the “-DINTSIZE32” or “-DINTSIZE64” compilation flags, respectively, or

    else by using the “-DINT=” definition (see Section 7.3 for more information on the

    setting of these compilation flags).

    This may, however, pose a problem with MPI, the interface of which is based

    on the regular int type. PT-Scotch has been coded so as to avoid typecast bugs,

    but overflow errors may result from the conversion of values of a larger integer type

    into ints when handling communication buffer indices.

    Consequently, the C interface of Scotch uses two types of integers. Graph-

related quantities are passed as SCOTCH_Nums, while system-related values such as

    file handles, as well as return values of libScotch routines, are always passed as

    ints.

    Because of the variability of library integer type sizes, one must be careful when

    using the Fortran interface of Scotch, as it does not provide any prototyping

    information, and consequently cannot produce any warning at link time. In the

    manual pages of the libScotch routines, Fortran prototypes are written using

    three types of INTEGERs. As for the C interface, the regular INTEGER type is used

    for system-based values, such as file handles and MPI communicators, as well as for

    return values of the libScotch routines, while the INTEGER*num type should be

used for all graph-related values, in accordance with the size of the SCOTCH_Num type,
as set by the “-DINTSIZEx” compilation flags. Also, the INTEGER*idx type represents
an integer type of a size equivalent to that of a SCOTCH_Idx, as set by the

    “-DIDXSIZEx” compilation flags. Values of this type are used in the Fortran interface

    to represent arbitrary array indices which can span across the whole address space,

    and consequently deserve special treatment.

    In practice, when Scotch is compiled on a 32-bit architecture so as

to use 64-bit SCOTCH_Nums, graph indices should be declared as INTEGER*8,
while error return values should still be declared as plain INTEGER (that is,
INTEGER*4) values. On a 64-bit architecture, irrespective of whether SCOTCH_Nums
are defined as INTEGER*4 or INTEGER*8 quantities, the SCOTCH_Idx type
should always be defined as a 64-bit quantity, that is, an INTEGER*8, because
it stores differences between memory addresses, which are represented by 64-bit
values. The above is no longer a problem if Scotch is compiled such that ints
equate to 64-bit integers. In this case, there is no need to use any type coercing

    definition.

Also, the MeTiS compatibility library provided by Scotch will not work when

SCOTCH_Nums are not ints, since the interface of MeTiS uses regular ints to represent

    graph indices. In addition to compile-time warnings, an error message will be issued

    when one of these routines is called.
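As a quick, purely illustrative way of checking the integer sizes with which an
installation was built, one may compile and run a small program such as the
following sketch.

#include <stdio.h>
#include <mpi.h>
#include "ptscotch.h"

int
main (void)
{
  /* Print the sizes of the libScotch integer types against that of a plain int */
  printf ("SCOTCH_Num: %d bytes, SCOTCH_Idx: %d bytes, int: %d bytes\n",
          (int) sizeof (SCOTCH_Num), (int) sizeof (SCOTCH_Idx), (int) sizeof (int));
  return (0);
}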

    6.3 Data formats

All of the data used in the libScotch interface are of integer type SCOTCH_Num. To

    hide the internals of PT-Scotch to callers, all of the data structures are opaque,

    that is, declared within ptscotch.h as dummy arrays of double precision values,

    for the sake of data alignment. Accessor routines, the names of which end in

    “Size” and “Data”, allow callers to retrieve information from opaque structures.

    In all of the following, whenever arrays are defined, passed, and accessed, it is

    assumed that the first element of these arrays is always labeled as baseval, whether

    baseval is set to 0 (for C-style arrays) or 1 (for Fortran-style arrays). PT-Scotch

internally handles base values and array pointers so as to process these arrays

    accordingly.

    6.3.1 Distributed graph format

    In PT-Scotch, distributed source graphs are represented so as to distribute graph

    data without any information duplication which could hinder scalability. The only

    data which are replicated on every process are of a size linear in the number of pro-

    cesses and small. Apart from these, the sum across all processes of all of the vertex

    data is in O(v + p), where v is the overall number of vertices in the distributed

    graph and p the number of processes, and the sum of all of the edge data is in O(e),

    where e is the overall number of arcs (that is, twice the number of edges) in the

    distributed graph. When graphs are ill-distributed, the overall halo vertex infor-

    mation may also be in o(e) at worst, which makes the distributed graph structure

    fully scalable.

    Distributed source graphs are described by means of adjacency lists. The de-

scription of a distributed graph requires several SCOTCH_Num scalars and arrays, as

    shown for instance in Figures 8 and 9. Some of these data are said to be global,

    and are duplicated on every process that holds part of the distributed graph; their

    names contain the “glb” infix. Others are local, that is, their value may differ for

    each process; their names contain the “loc” or “gst” infix. Global data have the

    following meaning:

    baseval

    Base value for all array indexings.

    vertglbnbr

    Overall number of vertices in the distributed graph.

    edgeglbnbr

    Overall number of arcs in the distributed graph. Since edges are represented

    by both of their ends, the number of edge data in the graph is twice the

    number of edges.

    procglbnbr

    Overall number of processes that share distributed graph data.

proccnttab

    Array holding the current number of local vertices borne by every process.

    procvrttab

    Array holding the global indices from which the vertices of every process are

    numbered. For optimization purposes, this array has an extra slot which

    stores a number which must be greater than all of the assigned global in-

    dices. For each process p, it must be ensured that procvrttab[p + 1] ≥

    (procvrttab[p] + proccnttab[p]), that is, that no process can have more

    local vertices than allowed by its range of global indices. When the global

    numbering of vertices is continuous, for each process p, procvrttab[p +1] =

    (procvrttab[p] + proccnttab[p]).
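For instance, for the cube of Figure 7, distributed across two processes with base
value 0 and a continuous global numbering, these two arrays could be initialized as
in the following C fragment (which assumes that ptscotch.h has been included).

SCOTCH_Num proccnttab[2] = { 4, 4 };    /* Four local vertices on each of the two processes      */
SCOTCH_Num procvrttab[3] = { 0, 4, 8 }; /* Starting global indices, plus the extra trailing slot */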

    Local data have the following meaning:

    vertlocnbr

    Number of local vertices borne by the given process. In fact, on every process

    p, vertlocnbr is equal to proccnttab[p