h-relation personalized communication strategy for … · h-relation personalized communication...
Post on 11-May-2020
1 Views
Preview:
TRANSCRIPT
h-RELATION PERSONALIZED
COMMUNICATION STRATEGY FOR
HP-ADAPTIVE COMPUTATIONS
M. Paszynski(1),(2), K. Milfeld(3)
(1) Institute for Computational Engineering and Sciences (ICES)
The University of Texas at Austin
e-mail: maciek@ices.utexas.edu(2) AGH University of Science and Technology,
Department of Computer Methods in Metallurgy, Cracow, Poland(3) Texas Advanced Computing Center (TACC)
Abstract
In an h-relation personalized communication, each processor can have a variable number of datasets of non-fixed size, that must be communicated with a subset of other processors, such thateach processor is the origin or destination of at most h messages. For the case where nothingcan be assumed about the regularity of the communication pattern, the h-relation personalizedcommunication strategy problem has been solved using two transpose primitives.In HP -adaptive finite element methods, the domain is frequently repartitioned between
processors representing adjacent sub-domains in a geometrical sense. We show that all thecommunications in this case can be described as h-relation personalized communications, withh equal to the maximum number of neighboring sub-domains. We also demonstrate that thecommunications problem can be solved by scheduling over bipartite graphs.Specifically, we show that when the communication pattern arises from the partition of a 2D
geometric area, the planar graph representation of the computational domain can be partitionedinto not more than two bipartite graphs and a third graph with maximum vertex valency 2, bymeans of a simple algorithm. In the general case (e.g., for computations over partitioned 3Dgeometric areas) our simple algorithm finds h−1 or fewer bipartite graphs. After the graphs areconstructed, the task of message scheduling can be reduced to a set of independent schedulingproblems over the bipartite graphs.The algorithm is supported by a theoretical discussion on its correctness and efficiency,
together with performance measurements. We also present measurements for the concurrentpoint-to-point communication used.
Keywords: Parallel Algorithms, Routing h-Relation Personalized Communication, Concurrent
Point-to-Point Communication, HP -Adaptive Finite Element Method.
1
Acknowledgment
The computations reported in this work were done through the National Science Foundation’s
National Partnership for Advanced Computational Infrastructure. The authors are greatly indebted
to Anna Paszynska, Leszek Demkowicz, Robert A. van de Geijn, and Greg Plaxton for numerous
discussions on the subject. We are greatfull to the Texas Advanced Computing Center and the
University of Texas at Austin for the use of their resources.
1 Introduction
Main results of the paper. In the h-relation personalized communication problem, each pro-
cessor has possibly different amounts of data to share with some subset of the other processors,
such that each processor is the origin or destination of at most h messages [2]. In the general case,
where nothing can be assumed about the regularity of the communication pattern, the h-relation
personalized communication strategy problem can be solved by using two transpose primitives
[2] (all-to-all communications where each processor has to send a unique block of data to every
processor, and all the blocks are of the same size).
The paper presents an alternative strategy for h-relation personalized communications (h-rpc),
by employing bipartite graphs derived from the communication graph. Through this reformulation,
the task of message scheduling is reduced to a set of independent problems of scheduling over the
bipartite graphs.
We present a simple algorithm for partitioning of the planar graph into two bipartite graphs
and another graph with maximum vertex valency 2. In a general case, when the communication
pattern is a non-planar graph, our algorithm partitions the graph into h − 1 bipartite graphs.
We also analyze the communication patterns that appear in parallel H- and HP -adaptive finite
element method applications [28]. We show that an efficient implementation of message passing
can be represented as an h-rpc strategy with h equal to the maximum number of neighboring
sub-domains over the graph representation of the partitioned computational domain.
We include a theoretical discussion of the algorithm’s correctness, an analysis of its complexities,
and performance measurements of our proposed communication strategy.
Routing h-relation. The h-rpc problem can be solved using a simple fixed-pattern algorithm
[35]. However, if there is a large variation in the message size, the algorithm becomes inefficient.
A scalable two-phase algorithm for the h-rpc was proposed by Bader, Helman, and JaJa in [2] to
mitigate this inefficiency. The authors also noted that the problem can be solved by using two
transpose primitives.
Some basic work with MPI primitive implementations was done by Johnsson [27], [26], [19],
[20]. Efficient implementation of the MPI primitives is described in papers of Bader [1], Sundar
2
et al. [33]; and the approach presented in [2] is acclaimed to be the optimal strategy for routing
h-relation.
The authors of [2] showed that the overall complexity of the algorithm isO(
τ + h+(
h+ np+ p2
)
σ)
,
where τ is the message initialization time, n is total amount of data assumed to be distributed innpportions over p processors, and σ is the time per byte at which processor can send or receive
data through the network ( 1σ
is the bandwidth of the network). In their paper, Goldman and
Trystram [15] noticed that when all the communication patterns are known by every processor, a
global knowledge algorithm can be used. By finding an optimal solution through a permutation on
the communication matrix, their solution becomes potentially better; however, they used a greedy
algorithm whose results can be far from the optimal schedule.
In our approach, we use a global knowledge communication strategy based on scheduling over
bipartite graphs, with an overall complexity O(
τh+ npσ)
, which scales favorably when the proces-
sor count is increased. Also, in H- and HP - adaptive finite element applications, the maximum
number h of neighboring sub-domains is usually constrained to a small value by domain decompo-
sition algorithms such as Zoltan [9]. This implies that our algorithm will scale better then other
proposed h-rpc algorithms.
Partitions of the planar graph. A bipartite graph is a graph whose set of vertices can be
decomposed into two disjoint sets such that no two graph vertices within the same set are adjacent.
A forest is an acyclic graph (set of possibly disconnected trees). Every forest is a bipartite graph,
but not every bipartite graph is a forest. A vertex valency d(x) is the number of edges incident to
it. The maximum vertex valency in a graph G is denoted by ∆(G).
The problem of partitioning planar graphs has been extensively investigated in graph theory,
using forests. An effective algorithm for finding 3 forest partitions of every planar graph is presented
in Chen et al. [37]. However, there are no theoretical results on the upper bound for the valency
of forests.
The most recent theoretical results are presented in He, Hou et al. [36], and can be summarized
as follows. Every planar graph G has an edge-partition in two forests T1 and T2 and another graph
H with ∆(H) ≤ 8. An exact evaluation of ∆(H) is still an open problem, according to He Hou et
al. [36]. There are examples of planar graphs whose graphs H have the property ∆(H) = 2, which
implies that 2 ≤ ∆(H) ≤ 8. The theoretical results presented in He, Hou et al. [36] do not imply
any simple graph partition algorithm, since their proofs are based on the mathematical induction
on the number of vertices and edges in G.
We present a simple linear time algorithm for partitioning a planar graph into not more than
three bipartite graphs {Bi}i=1,2,3 with ∆(B1) ≤ h, ∆(B2) ≤ h−1 and ∆(B3) ≤ 2. We use bipartite
graphs that may not be forests, because we do not need to map partitions into a set of forests, we
only need partitions consisting of bipartite graphs.
3
Scheduling over bipartite graphs. The usefulness of edge coloring for scheduling data transfers
in large parallel architectures has been demonstrated by several authors, and a comprehensive
survey of applications in various fields has been compiled by Jain et al. [18]. The application
of simple edge-coloring algorithms to scheduling problems and evaluation of their experimental
performance are presented by Marathe et al. [25], [30]. In general, when nothing can be assumed
about the regularity of a graph, the problem of edge coloring is NP-complete (see [3], [5]). However,
the problem of edge coloring of bipartite graphs can be solved by a polynomial-time algorithm.
The fastest known bipartite graph coloring algorithm was developed by Rizzi et al. [31], [21]. The
computational complexity of the algorithm is O(n logn log v + e log v) where n is the number of
nodes, e is the number of edges, and v is the maximum vertex degree in the graph.
Managing the communication problem in HP -adaptive applications. The problem of
managing the communication in parallel H- and HP -adaptive finite element simulations was inves-
tigated by several researchers. Crutchfield et al. [6] developed an object oriented library to support
adaptive mesh refinement for hyperbolic gas dynamic simulations running on vector supercomput-
ers. Simple geometric abstractions for expressing complicated communication patterns in dynamics
simulations were developed within the LPARX system [22] and used by the KeLP programming
system [13].
Patra et al. [23] enumerated data according to a Space Filing Curve (SFC) algorithm to simplify
communication patterns. In particular, load balancing was achieved in this case by periodically
adjusting the key space and migrating node- and element-objects to neighboring sub-domains.
Sandia National Laboratories has developed a finite element infrastructure package called SIERRA
([10], [12], [11], [34]) that supports parallel architectures. The SIERRA package uses the Interpro-
cessor Collective Communication Library (iCC) [17]. The collective communications in the library
are based on gather operations followed by broadcast within a subset of processors.
In H- and HP -adaptive applications, the communication pattern changes throughout the cal-
culation after each adaptation. It is impossible to predict the pattern at compile time. The concept
of creating an inspector and an executor part of the computational loop was introduced by Saltz et
al. [7]. The inspector produces sets of schedules and the executor gathers data using the schedules
and performs the calculations on them. Both inspector and executor are implemented using a set
of primitives within the PARTI library [7].
The CHAOS system subsumes the previous PARTI library and supports a larger class of ap-
plications, as described by Hwang et al. [16]. CHAOS can also be used by compilers to parallelize
irregular applications automatically. When data access patterns are known only at runtime, com-
piler support for irregular loops can create pre-processed code for loops that generates appropriate
communication calls at runtime and places off-processor data in a pre-determined order [16], [8].
Such compiler techniques transform the irregular loops into inspector / executor parts, and embeds
CHAOS remap procedures that redistribute data irregularities.
4
The actual communication is generated inside the inspector part by a Communication Pattern
Generator (CPG) from the communication graph, as described by Saltz et al. [8]. The commu-
nication pattern is optimized in the following ways: (1) simple software caching and incremental
scheduling eliminates duplicate off-processor data retrieval using a hash table, (2) communication
coalescing packs the maximum possible amount of data into a single message, (3) schedule merging
uses the same schedule for gather and scatter operations. In the CHAOS library there are no
global knowledge optimizers that use the geometrical properties of the planar graph to represent
communication patterns.
Our approach differs from the methods used in the above applications in the following ways:
• All essential communication appearing in HP -adaptive computations is reduced to an h-rpc.
• The h-relation algorithm does not rely on expensive, collective (global) all-to-all primitives,
which implies better scalability with increasing numbers of processors.
• A global knowledge optimization is applied (from a global understanding of the communica-
tion pattern at each processor).
• The communication optimizer synchronizes all local communication patterns by applying
graph partitions and edge-coloring over bipartite graphs.
2 Communication problems in HP -adaptive applications
This section describes the parallel, fully automatic HP -adaptive finite element method (FEM)
package, as detailed in Demkowicz et al. [28]. We show that all communication problems that
arise in H- and HP -adaptive FEM computations can be described as h-rpc, with h equal to the
maximum number of neighboring sub-domains.
Parallel H- and HP -Adaptive Finite Element Methods. In the H version of the FEM, ele-
ments may have a locally varying element size, H. In the more flexible HP version, the polynomial
order of approximation P is allowed to vary. High-order elements are usually necessary to minimize
the dispersion error, while small elements are necessary to capture real world geometrical details.
Only HP elements can economically combine small elements of lower order with large elements of
higher order in the same mesh.
A parallel version of the fully automatic HP -adaptive 2D FEM was developed at ICES [28].
Given an initial mesh, the code automatically produces a sequence of optimal meshes delivering
exponential convergence rates, using a general HP -refinement strategy based on the minimization
of a projection-based interpolation error. The parallelized components in each stage of the HP -
adaptivity algorithm are: the decomposition of the computational domain, load balancing and
5
data redistribution, a parallel frontal solver, and algorithms for parallel mesh refinement and mesh
reconciliation.
Flowchart for parallel execution. A flow chart of the iterations for the HP -adaptive method
is presented in Fig. 1. The input data are the initial mesh, and the equation defined over the
computational domain. The initial mesh is the coarse mesh for the first iteration. Load balancing
using the Zoltan library [9] is performed at the beginning of each iteration step, on the level of
initial mesh elements. The coarse mesh is globally refined in both element size H (for 2D each
rectangular element is divided into 4) and order of approximation P (raised uniformly by one) to
obtain a fine mesh, as illustrated in Fig.2. The problem is solved twice, once on the current coarse
mesh and once on a fine mesh. The automatic HP strategy requires estimation of the global error
obtained by comparison of the coarse grid and the fine grid solutions. The optimal mesh (in Fig.2)
in the current iteration is obtained on the basis of error estimations over each finite element. The
refinement trees grows vertically within the initial mesh elements. The optimal mesh obtained in
current step constitutes the coarse mesh for the next step in the algorithm. The iteration loop is
terminated when all element errors reach a (small) limit.
Communication tasks.
1. Data migration after load balancing.
A domain decomposition approach is the most natural way to parallelize a FEM application.
In this approach the computational domain is divided (partitioned) into sub-domains, and
each sub-domain is assigned to a processor.
In H- or HP -adaptive methods, the finite element mesh changes from one iteration to the
next, and the domain must be repartitioned to balance the load uniformly over all processors.
This is usually done by a load-balancing library like Zoltan [9]. The library uses informa-
tion about computational load estimation over each initial mesh element together with its
refinements tree, and it returns a repartition of the initial mesh elements. To balance the
calculations, some initial mesh elements, together with their refinement trees, must be sent
to other processors; likewise, elements must be received from some other processors. The
communication pattern for creating the new partition of the mesh is mainly between pro-
cessors representing adjacent sub-domains in the current partition of the mesh. Hence, the
migration of the initial mesh elements (together with refinement trees) is usually to adjacent
sub-domains.
Moreover, when an SFC load balancing algorithm is used, the load balance can be achieved
by periodically migrating node and element data to just neighboring processors, as was shown
by Patra et al. [23].
2. Parallel solver.
Both the coarse mesh and fine mesh problems are solved using a parallel frontal solver [28].
6
The frontal solver is a direct solver, an extension of the Gaussian elimination algorithm, where
assembly and elimination are performed together on the so-called frontal sub-matrix of the
global matrix. The use of direct solvers in H- and HP -adaptive applications is motivated by
the high accuracy required by the error-estimation routines. In the domain-decomposition ap-
proach the domain is partitioned into sub-domains, each sub-domain is assigned to separated
processor, and all degrees of freedom (dof) within the computational mesh are enumerated in
the following way:
• All internal dof from 1-st sub-domain
• All internal dof from 2-nd sub-domain
• ...
• All internal dof from last sub-domain
• All dof from the global interface
By the interface, we mean all common edges of neighboring finite elements located in adjacent
sub-domains. The following steps are performed:
(a) a forward elimination is executed over each sub-domain in parallel
(b) a wire-frame problem involving all dof from the interface is formulated and solved
(c) a solution of the wire-frame problem is broadcast
(d) a backward substitution is executed over each sub-domain in parallel.
The greatest communication cost is incurred in the formulation of the interface problem
and the broadcast of its solution. We will show how the data exchanges can be reduced to
communications between neighboring sub-domains only.
Parallel direct solvers, found in libraries such as PLAPACK [29], are used to solve the interface
problem in an optimal way. The interface problem matrix is assumed to be distributed by
rows, where each processor stores some consecutive rows of the matrix. The rows are assigned
to particular processors by following a simple algorithm, implemented in [28]:
(a) Each interface degree of freedom may be kept by many processors, since the interface
dof are shared between neighboring sub-domains. The list of processors assigned to the
dof is obtained from the Zoltan library after the load balancing step.
(b) Each interface degree is assigned to the processor with the smaller number in the list.
It is assumed that the row of the interface problem matrix associated with the dof will
be stored on that processor.
An example of the interface node assignment is presented in Fig. 3.
7
The forward elimination stage of the parallel frontal solver provides the partially assembled
rows of the interface dof [28]. The interface dof rows must be fully assembled by sending data
to the processor associated with the row.
It is clear that the communication is only between neighboring sub-domains, since the interface
dof are shared only between neighboring sub-domains. The communication pattern graph can
be assumed to be planar, if the adjacency is considered only through non-zero length edges
and not through vertices. In the planar graph case, the communication pattern has much
better properties, as shown in the following sections. Unfortunately the adjacency through
vertices must be taken into account during assembly of the interface problem matrix. However,
it has been shown that the load balancing can be optimally performed on the level of the
initial mesh elements only [28]. For the case of regular rectangular initial mesh elements
with curvilinear edges only on the boundary of the domain, the diagonal adjacency through
the vertex can be assumed to occur with only one sub-domain. In other words, there are
always no more than four sub-domains sharing a common vertex in 2D, and eight in 3D.
The amount of data to be exchanged between neighboring sub-domains is proportional to
the size of the edge shared by neighboring sub-domains. Thus, it is reasonable to split the
communication pattern into two planar graphs, one containing all adjacency through non-zero
length edges, and the other one containing all adjacency through vertices. Hence, we need
to schedule two communication patterns over two planar graphs. For the case of non-regular
initial mesh elements, the resulting graph may be non-planar, and the number of bipartite
graphs is bounded by h− 1.
3. Parallel mesh refinement and mesh reconciliation.
In the parallel mesh refinement step some finite elements are divided into multiple (new)
elements (each rectangular element is divided into 2 or 4 sons in 2D), new nodes are created,
and new orders of approximation are set (see [28] for more details). The parallel version of the
algorithm requires a global mesh-reconciliation step after each refinement has been performed
over each sub-domain. The mesh must be globally consistent, which involves
• The 1-irregularity rule: Edges of a given element can be broken only once, without break-
ing neighboring elements. This implies that refinement tree data describing newly created
nodes on the interface edges must be exchanged between neighboring sub-domains.
• When an interface edge is broken, then the polynomial order of newly created nodes
must be established. This is done by comparing coarse- and fine-grid solutions over
finite elements neighboring the edge. But some of the elements neighboring the interface
edge are present in adjacent sub-domains. We need to ask the neighboring sub-domain
for the order of new nodes.
The parallel mesh refinement and reconciliation algorithm has 3 stages:
8
(a) determination of the optimal refinements (essentially a parallel task)
(b) exchange of information about edge refinement trees and polynomial order along the
interface
(c) mesh reconciliation
Stages (b) and (c) are repeated if some of the interface edges are modified during the last
mesh reconciliation.
The communication problem in this case is just to exchange some data between neighboring
sub-domains. It should be noted that for the 2D case, data need only be exchanged between
sub-domains adjacent through nonzero length edges, and not through finite element vertices.
In the 3D case, data are only exchanged between sub-domains adjacent through non-empty-
area finite element faces and not through finite element edges and vertices, since the state of
edges is determined by the state of faces, and the state of vertices is determined by the state
of edges.
The amount of data to be exchanged between neighboring sub-domains is proportional to the
density of refinements over edges shared by adjacent sub-domains.
As noted, the communication strategy of 3D mesh refinement and reconciliation is more com-
plicated. Also, the h-rpc is solved over a non planar graph in 3D, which implies a larger number of
bipartite graphs.
3 Graph partition algorithm
Graph representation of the computational domain. Let us consider a computational do-
main divided into sub-domains. A graph representation of the domain can be created,
G = (V,E) (3.1)
where graph vertices V denote sub-domains, and edges E = {{u, v} : u, v ∈ V } denote their
adjacency. A graph G is assumed to be undirected, since the sub-domains adjacency relation is
symmetric.
Partition of the graph. The layer-connection algorithm can be applied to the graph represen-
tation of the domain. The goal of this algorithm is to partition the graph into a set of subgraphs
called layers and connections (Figs. 4 and 5). Any vertex representing the sub-domain located on
the boundary of the domain is selected and assigned to the first layer. Then all neighbors of the
previously selected vertex are assigned to the second layer. The procedure is repeated by assigning
the sequence of neighboring vertices to consecutive layers until all vertices have been assigned to
some layer.
9
More formally, we select one vertex u0 in the graph and create an attributing function (3.2) that
assigns a layer index to each graph vertex, such that within a minimal length path connecting the
vertex with the initial vertex u0, all vertices belong to a different layer.
u0 ∈ V, f : V → {1, ..., NL} :
∀v ∈ V ∃ < v1, ..., vk >: v1 = u0, vk = v,
f (vi) = i, k = min<w1,...,wl>:w1=u0,wl=v
l (3.2)
The path < v1, ..., vk > is defined here as a sequence of graph vertices v1, ..., vk ∈ V such that each
vertex is connected with its successor and ancestor {vi−1, vi} ∈ E ∀i = 1, ..., k. The path length k
is assumed to be minimal (it is the shortest path connecting vertices u0 and v). Such a path exists
for each vertex since the domain is assumed to be compact. NL denotes the resulting number of
layers.
Layers and connections. The i-th layer is defined as a subgraph containing all vertices attributed
(by the f attributing function) with value i, and all edges between these vertices present in the
original graph G.
Li =(
V Li , EL
i
)
⊂ G :
V Li = {u ∈ V : f(u) = i}
ELi = {{u, v} ∈ E : f(u) = f(v) = i}
i = 1, ..., NL (3.3)
The i-th connection is defined as a subgraph containing all vertices attributed (by the f at-
tributing function) with i and i + 1 values, and all edges between vertices assigned to layer i and
i+ 1.
Ii =(
V Ii , EI
i
)
⊂ G :
V Ii = {u ∈ V : f(u) ∈ {i, i+ 1}}
EIi = {{u, v} ∈ E, f(u) = i, f(v) = i+ 1}
i = 1, ..., NL− 1 (3.4)
Graph partition algorithm. A bipartite graph is a graph in which a set of vertices can be de-
composed into two disjoint sets such that no two graph vertices within the same set are adjacent.
A valency of a vertex is the number of edges incident to it. We denote by ∆(G) = h the maximum
vertex valency in the graph G representing the partitioned computational domain (i.e.,the maxi-
mum number of neighboring sub-domains). Here we describe the graph-partitioning algorithm for
partitioning the graph G into a set of bipartite subgraphs with disjoint edges. The input for the
algorithm is the graph G, and the output is the set of bipartite graphs {Bi}i=1,...,NB .
10
As shown in the algorithm description below, in a general case when the graph G is not planar,
the total number of bipartite graphs NB is no greater than h− 1, with
∆(BK) ≤ h−K + 1,K = 1, ..., h− 2; ∆(Bh−1) ≤ 2 (3.5)
Also, for a planar graph G representing a partitioned 2D area, the resulting number of bipartite
graphs is not greater than 3, with
∆(B1) ≤ h; ∆(B2) ≤ h− 1; ∆(B3) ≤ 2 (3.6)
An example of a planar graph that partitions into 2 bipartite graphs is illustrated in Fig. 4. A
planar graph with more complex partitioning, requiring sub-layering, is illustrated in Fig. 5.
Through the layer-connection algorithm, the task of message scheduling over the entire graph
G is reduced to scheduling over independent bipartite graphs, which are simple and can be solved
in polynomial time. The algorithm steps are:
• Step (1): Execute the layer-connection algorithm over entire graph G. The algorithm results
in a set of connections {Ii}i=1,...,LN−1 and disjoint layers {Li}i=1,...,LN sub-graphs.
• Step (2): Create the first bipartite graph as the union of all connection subgraphs
B1 =⋃
k=0,...,LN−1 Ik.
• Step (3): Let iteration index K = 0.
• Step (4): If the maximum vertex valency in layer subgraphs {Li}i=1,...,LN is less than or
equal to 2, Then Go To Step (9)
• Step (5): Let iteration index K = K + 1.
• Step (6): Execute recursively the layer-connection algorithm over each layer subgraph
{Li}i=1,...,LN . The algorithm results in set of K-th sub-connection {Im,Ki }
m=1,...,LNK
i−1
i=1,...,LN and
disjoint K-th sub-layers {Lm,Ki }
m=1,...,LNK
i
i=1,...,LN subgraphs. LNKi is the number of the K-th sub-
layer resulting from execution of the algorithm over the Li layer.
• Step (7): The K +1 bipartite graph is defined as the union of all K-th sub-connection sub-
graphs BK+1 =⋃m=0,...,LNK
i−1
i=1,...,LN Im,Ki resulting from recursive execution of the layer-connection
algorithm.
• Step (8): Denote by Li all K-th sub-layer subgraphs, and by Ii all K-th sub-connection
subgraphs. Denote LN as the total number of K-th sub-layer subgraphs. Go To Step (4).
• Step (9): The K + 2 bipartite graph is defined as the union of all layer subgraphs BK+2 =⋃
i=1,...,LN Li.
Stop
11
The correctness and efficiency analysis of the algorithm are presented in the Correctness of
graph partition algorithm and Efficiency of graph partition algorithm sections.
4 Scheduling over bipartite graphs
We assume that
• Each processor can only send or receive one message at one time.
• Each processor sends different messages to different neighbors.
• The communication cost between each discrete pair of communicating processors is not the
same.
The cost of each communication can be bounded as tcomm = τ + Lσ where τ is the message ini-
tialization time, L is the number of words in the message, and σ is the time per byte at which
processor can send or receive data through the network. In reality, σ may depends on the number
of processor pairs trying to send a message at the same time and the routes within the intercon-
nection. The measurements of the σ parameter on the TACC Cray-Dell Power-Edge Xeon Cluster
are presented in the section Performance measurements. The discussion on the architecture of
the cluster in the context of concurrent point-to-point communication is presented in the section
Multiple point-to-point communication on clusters.
If we assume the size of each message is the same, the scheduling problem is equivalent
to the problem of coloring of edges of a graph representation of the computational domain, where
each vertex denote processor and each edge denotes a message to be sent.
A k-edge-coloring of graph G is an assignment of k colors to the edges of G in such a way that
any two edges meeting at a common vertex are assigned different colors. If G can be k-edge colored,
then G is said to be k-edge colorable. The chromatic index of G, denoted by X‘(G), is the smallest
k for which G is k-edge-colorable. From Konig’s Theorem [32] it follows that if G is a bipartite
graph whose maximum vertex valency is d, then X‘(G) = d. The idea of the proof is mathematical
induction on the number of edge of G. Konig’s theorem tells us that every bipartite graph with
maximum vertex valency d can be edge-colored with just d colors. All trees are bipartite graphs
[32].
When a graph is not a representation of 2D mesh, then (3.5) implies that the entire graph G
can be colored by using not more than∑
K=1,h−2{h−K + 1}+ 2 = 12h
2 + 12h− 3 = O
(
h2)
colors,
in a pessimistic case. The total communication cost is then bounded by
tcomm = (τ + Lσ) (1
2h2 +
1
2h− 3)2 = O
(
τh2 + Lσh2)
(4.7)
12
When a graph to be partitioned is a representation of a 2D mesh, there are not more than
three bipartite subgraphs and equation (3.6) implies that the chromatic index (number of colors
required to color the entire graph G) is no greater then h+ h− 1 + 2 = 2h+ 1 = O (h). The total
communication cost is then bounded by
tcomm = (τ + Lσ) (2h+ 1)2 = O (τh+ Lσh) (4.8)
The presented evaluations consider a pessimistic case, and usually the algorithm deals better with
the input graph. For example, on a regular 2D mesh with maximum vertex valency 6, presented
in Fig. 4, there are only two sets of messages to be sent within 6 time units, and this result does
not depend on the number of processors represented by the graph vertices.
In the best known bipartite graph edge coloring algorithms, the complexity is
O (n logn log h+ e log h); where n is number of vertices, e is number of edges, and h is the maximum
vertex valency in the graph [31].
If we assume that the size of each message is not the same, the scheduling problem can
also be solved by the same strategy above. It is only necessary to select a minimum size message,
divide all large messages into a sequence of minimal-sized messages, and replicate the edges in the
bipartite graphs representing the larger messages. In the new graph, a single edge between a pair
of processors is a minimum size message, and multiple edges constitute a large-sized message.
5 h-relation personalized communication strategy
In the case when the communication pattern arises from partition of computational mesh and
different messages with different sizes must be exchanged between processors representing sub-
domains adjacent in a geometrical sense, the proposed communication strategy over a heterogeneous
interconnection network can be summarized in the following steps:
1. Create a graph G describing the partitioned mesh as presented in (3.1). The maximum vertex
valency in the graph is denoted by ∆(G) = h.
2. Execute the graph partitioning algorithm described in section Graph partition algorithm. The
algorithm results in a set of bipartite graphs {Bi}i=1,...,NB .
3. Schedule messages over each bipartite graph, as described in a section Scheduling over bipartite
graphs:
(a) Find the minimum and maximum size of messages represented by edges in the Bi graphs.
Split each message larger than the minimum into a sequence of messages with a length
equal to the minimum size message (not that the last one will usually contain a shorter
message). Replicate each edge by the number of messages in the sequence.
13
(b) Color the bipartite graph with multiple edges using an efficient algorithm e.g. presented
in [31].
(c) Schedule messages according to the edge coloring.
In the case of a partitioned 2D mesh, the graph partition algorithm produces no more than three
bipartite graphs, and in the general case there are no more than h− 1 bipartite graphs, as shown
in in the next section.
6 Correctness of the graph partition algorithm
Algorithm Correctness: general case. The analysis in this section applies to a simple geomet-
rical consideration over any graph representation of 3D partitioned domain, where vertices denote
sub-domains, and edges denote adjacency (neighbors that share a common border - a non-empty
2D area). A planar graph is a graph representation of a partitioned 2D domain. This section deals
with graph representation of partitioned 3D meshes, which are not planar graphs.
Lemma 1. The execution of the layer-connection algorithm over graph G with ∆(G) ≤ h
produces a connection subgraph I with ∆(I) ≤ h, and a layer sub-graph L with ∆(L) ≤ h− 1.
Proof: Execution of the algorithm results in {Li =(
V Li , EL
i
)
}i=1,...,LN and
{Ii =(
V Ii , EI
i
)
}i=1,...,LN−1 subgraphs. Connection subgraph I and layer subgraph L are defined
as unions I =⋃
k=0,...,LN−1 Ik and L =⋃
i=1,...,LN Li.
Each connection subgraph Ik, k = 1, ..., LN − 1 contains all vertices from layers Lk and Lk+1,
but does not contain edges between vertices within the same layer. If the vertex v = u0 ∈ V L1 has
valency h and all neighbors of v are in the next layer, then ∆(I1) = h. If there are neighbors of the
vertex v in graph G within the same layer Li as v, then ∆(Ii) < h. We conclude that ∆(I) ≤ h.
Any vertex v ∈ V Li has valency no greater then h− 1 in Li, since
• If i = 1, v ∈ L1, v is the initial vertex v = u0 then v has no edge in L1, because all of its edges
are assigned to the connection subgraph I1. It simply implies that the valency of v = u0 is 0
in L1.
• If i > 1, then each vertex within Li has an edge to some vertex from the previous layer
Li−1, from definition of layers-connections algorithm. If v ∈ Li has h neighbors in G, then at
least one of its neighbors belongs to the previous layer Li−1. This implies that v may have a
maximum of h− 1 neighbors in Li; in other words, the maximum valency of v in Li is h− 1.
We conclude that ∆(L) ≤ h− 1.
Lemma 2. Each connection subgraph Ii is a bipartite graph.
14
Proof: There are two disjoint sets of vertices in each Ii graph, the first one contains all
vertices from layer Li and the second one contains vertices from layer Li+1. From the definition
of the connection subgraph construction there are no edges between vertices inside each set; hence
the union of the two layer graphs is a bipartite graph.
Lemma 3. The union of all connection subgraphs I =⋃
k=0,..,LN−1 Ik is a bipartite graph.
Proof: Each connection subgraph Ii contains all vertices from layers Li and Li+1. Vertices
from the union graph I = (V I , EI) can be split into two disjoint sets V I = V I1 ∪ V I
2 , where V I1
is the union of all vertices from all odd layers V I1 =
⋃
k=0,...,bLN−1
2c V
L2k+1 and V I
2 is a union of all
vertices from all even layers V I2 =
⋃
k=1,..,bLN−1
2c V
L2k.
There are no edges between vertices from the same layer. There are also no edges between
vertices from layer Li and Li+2 and beyond. This implies that the union graph I is a bipartite
graph.
Theorem 1. Any graph G is a union of h− 1 bipartite graphs {Bi}i=1,...,h−1, where ∆(Bi) ≤
h− i+ 1, i = 1, ..., h− 2 and ∆(Bh−1) ≤ 2.
Proof: By executing the layer-connection algorithm over G we obtain a connection sub-graph
I1 with ∆(I1) ≤ h and layer sub-graphs L with ∆(L) ≤ h − 1, according to Lemma 1. By
executing the algorithm recursively h−2 times over layer sub-graphs, we obtain a set of connection
subgraphs Bk = Ik, k = 1, ..., h− 2 and the layer subgraph Bh−1 = Lh−1 with ∆(Lh−1) ≤ 2. Each
connection sub-graph Ik, k = 1, ..., h − 2 is a bipartite graph by Lemma 3. The terminal layer
subgraph, Bh−1, has a maximum vertex valency 2, which implies that it is also a bipartite graph.
The procedure is illustrated in Fig. 6.
The Theorem 1 establishes the correctness of the layer-connection algorithm in the general
case.
Algorithm correctness: graph representations of 2D meshes. The analysis in this sec-
tion applies to a simple geometrical consideration over planar graphs. Graph representations of
partitioned 2D meshes are planar graphs.
For clarity, we assume that the domain is convex without holes. However, the analysis can be
easily extended to the case of non-convex domains with holes. We make the following observations:
• All layers are created by passing of a wave from the initial sub-domain vertex u0, located on
the boundary of the domain.
• Each layer represents a sequence of connected sub-domains. The first and the last sub-domains
are located on the boundary of the domain.
• Within a layer there exists paths between each pair of vertices. Each edge represents adjacency
between sub-domains.
15
• We define an internal path as one that passes through the entire layer Li, and call edges
within this path internal edges. An internal path contains all vertices in the layer. We order
all vertices representing sub-domains in Li from “up” to “down”, where by the “upper” sub-
domain we mean that one located on the upper boundary of the domain. Fig. 7 illustrates
the ordering.
We enumerate vertices according to the order and denote them by < vLi
1 , ..., vLi
p > where p is
the length of the path.
• In general, all non-internal path edges may be the result of adjacency on the side of the pre-
vious layer, or on the side of the next layer, since edges represent geometrical 2D connections
between sub-domains.
• We distinguish two sides of a layer for assigning a “side” to connections between sub-domains
(that are not in the internal path defined above): the previous layer side and the next layer
side.
• Within a layer, there are three kinds of edges: internal edges, previous side edges and next
side edges, according to their geometrical location, as illustrated in Figs. 7,8.
Lemma 4. There are no previous side edges.
Proof: Let us suppose there is a previous side edge {vLi
k , vLi
k+l} with l ≥ 2, see Fig.8. From the
definition of the layer construction, each vertex vLi
k must be connected with some vertex from the
previous layer. Existence of a previous side edge {vLi
k , vLi
k+l} means that the sub-domain represented
by vertex vLi
k is adjacent to the sub-domain represented by vertex vLi
k+l through a common vertex
in the previous layer side. But previous-side adjacency of sub-domain vLi
k to sub-domain vLi
k+l for
l ≥ 2 means that sub-domains vLi
k+1, ..., vLi
k+l−1 are “covered” from any possible connection to the
previous layer. Hence, the existence of previous-side edges is a contradiction with the fact that the
covered vertices are adjacent to some vertex vLi−1m in the previous layer.
Lemma 5. Next side edges do not intersect themselves.
Proof: Let us suppose there is a next side edge {vLi
k , vLi
k+m} intersecting with a next side
edge {vLi
k+l, vLi
k+n} where m > 2, n − l > 2 and 1 < l < m < n. Adjacency of sub-domain vLi
k to
sub-domain vk+m means that sub-domains vLi
k+1, ...vLi
k+l, ..., vLi
k+m−1 are covered on the next layer
side by the {vLi
k , vLi
k+l} connection. Hence, this contradicts the fact that sub-domain vLi
k+l can also
be connected to sub-domain vLi
k+n on the next side.
Lemma 6. The second execution of the layer-connection algorithm over all layers Li with
∆(Li) = h − 1 results in one bipartite graph SI with ∆(SI) ≤ h − 1 and one graph SL with
∆(SL) ≤ 2.
Proof: By executing the layer-connection algorithm over each layer Li, we obtain a set of bipar-
tite sub-layers subgraphs {Lmi }
m=1,...,LNi−1i=1,...,LN−1 and a set of sub-connection subgraphs {Imi }
m=1,...,LNi−1i=1,...,LN−1 ,
16
where LNi denotes the number of sub-layers resulting from execution of the algorithm over the layer
Li.
The union of all sub-connection subgraphs is a bipartite subgraph, SI =⋃m=1,...,LNi−1
i=1,...,LN−1 Lmi , by
Lemma 3. Also, ∆(SI) ≤ h − 1, since ∆(Lmi ) ≤ h − 1, i = 1, ..., LN , for each layer according to
Lemma 1.
It is necessary to show that ∆(SL) ≤ 2, where SL is the union of sub-layers Lmi resulting from
the execution of the algorithm over all layers Li. We can do this by proving the same property
for each sub-layer, ∆(Lmi ) ≤ 2, i = 1, ..., LN , m = 1, ..., LNi, since all sub-layers are disjoint. In
other words, it is only necessary to show that each vertex vLm
i
k in any sub-layer Lmi has a valency
no greater then 2 in Lmi .
We proceed with the proof by considering all adjacency cases of the vertex vLm
i
k , and observations
about the cases (each sub-layer has all the properties of a layer):
• The internal path in the sub-layer is an ordered set of vertices along the path < vLm
i
1 , ..., vLm
ip >.
• We can assume that the initial vertex vL1
i
1 is the first vertex in the internal path, and is located
on the boundary of the domain.
• Each vertex vLm
i
k from sub-layer Lmi can have only two kind of edges: internal edges and next
side edges, since previous side edges are not possible, according to Lemma 4.
• Each vertex vLm
i
k in sub-layer Lmi for m > 1 must have an edge to some vertex v
Lm−1
i
l from
the previous sub-layer Lm−1i .
• Each vertex vLm
i
k in sub-layer Lmi can have neighbors in the layer Li only from layers Lm−1
i
or Lmi or Lm+1
i , from the properties of the layer-connection algorithm.
• We distinguish forward and backward directions along the internal path according to the order
of vertices on the internal path.
First, we consider the following observations for the possible adjacency cases in the backward
direction:
1. Each vertex vLm
i
k from sub-layer Lmi can have only one internal edge to the vertex v
Lm
i
k−1 and
many next side edges.
2. If there is a next side edge {vLm
i
l , vLm
i
k } between two vertices in sub-layer Lmi , then between
those vertices there can only be vertices from further sub-layers Lm+ni , n ≥ 1. Assume between
them there is a vertex vLm
i
k+n, n ∈ {1, ..., l − 1} from the same sub-layer, see Fig. 9(a). The
vertex vLm
i
k+n must be connected with some vertex from the previous sub-layer vLm−1
i , by the
definition of sub-layer. There is no possibility of connecting internal vertices vk+1, ..., vl−1
17
with any external vertex, because all the vertices are covered by the next side edge, and edges
do not intersect.
3. If the vertex vLm
i
k from sub-layer Lmi is connected by a next side edge with another vertex
vLm
i
l from the same layer in the backward direction, then the neighbor vL
m+1
i
j of the vertex
vLm
i
k through the internal edge in the backward direction is from the next layer Lm+1i , see
Fig. 9(b). Let us consider the execution of of the layer-connection algorithm over the layer
Li. The m-step assigns vertices vLm
i
k and vLm
i
l to sub-layer Lmi . In the next step vj will be
assigned into a next sub-layer Lm+1i .
4. Each vertex vLm
i
k from sub-layer Lmi can have only one neighbor v
Lm
i
l through the next side
edge from the same sub-layer in the backward direction. Suppose it has two such neighbors
vLm
i
l and vLm
i
l+1, see Fig, 9(c). Then, the neighbor vLm
i
l+1 from sub-layer Lmi is covered by the
edge {vLm
i
k , vLm
i
l }, which is a contradiction with observation 2.
5. Each vertex vLm
i
k from sub-layer Lmi can only have one neighbor within the same sub-layer
in the backward direction. The reason is the following: The vertex vLm
i
k cannot have two
neighbors from the same sub-layer through the next side edges in the backward direction,
from observation 4, see Fig, 9(c). If the vertex has only one neighbor from the same sub-
layer through next side edge in the backward direction, then its internal edge neighbor is
not from the same sub-layer, from observation 3, see Fig, 9(b). If the vertex vLm
i
k has an
internal path neighbor from the same sub-layer in the backward direction, then the vertex
vLm
i
k cannot have another neighbor through a next sub-layer edge, compare Fig. 9(d).
We conclude that the vertex vLm
i
k may have only one neighbor vLm
i
k−1 from the same sub-layer when
traversing the internal path in the backward direction. The same observations apply in the forward
direction. Since the vertex vLm
i
k from sub-layer Lmi can have only two neighbors v
Lm
i
k−1 and vLm
i
k+1 in
the same sub-layer as determined from a scan in both directions, ∆Lmi ≤ 2.
Theorem 2. If the graph G is planar and ∆(G) = h, then G is a union of no more than 3
bipartite graphs; that is, G =⋃
i=1,...,3 Bi, with ∆(B1) ≤ h, ∆(B2) ≤ h− 1 and ∆(B3) ≤ 2.
Proof: The theorem is true as a consequence of Lemma 1 and Lemma 6. The recursive
partitioning procedure in Fig. 6 is truncated at the second recursion level as illustrated in Fig. 10,
because the sub-graph layers have a maximum vertex valency of 2.
The theorem proves the implied correctness of the layer-connection algorithm in the case of
planar graph representing partition of 2D mesh.
18
7 Complexity of the graph partition algorithm
Algorithm complexity for the general case. In the general case the graph partition algorithm
produces h− 1 bipartite graphs, where h is the maximum vertex valency in the graph.
We assume that the graph is stored as a data structure where graph vertices and edges are
kept in a variable-length list. The operation of accessing all edges of all vertices in the list has a
complexity O (nh), where n denotes the number of vertices in the graph. The complexity of each
step (See Graph partition algorithm.) in the algorithm is described below:
1. Step(1) is O (nh). The layer-connection algorithm visits each vertex, browses all of its
neighbors, and assigns layer-connection indexes to each vertex.
2. Step(2) is O (nh). This step browses all vertices and edges of the graph, to form a bipartite
graph B1 and layer subgraph L1. It is performed by attributing graph edges and vertices.
3. Step(3) is O (1). Initializes a layer counter.
4. Step(4) is O (nh). This step searches all vertices of the graph, and all edges of particular
layers, to find the maximum valency of the graph vertices. All edges of all graph vertices are
browsed, to determine their assignment to layers. layer.
5. Step(5) is O (1). The layer counter is incremented.
6. Step(6) is O (nh). This step required browsing of all vertices of the graph and all edges of
those vertices assigned to particular layers. All edges of the graph vertices are checked to
determine their layer assignment.
7. Step(7) is O (nh). This step queries all vertices of the graph, and all edges of particular
layers, to group them into a bipartite graph Bk+1 and the sub-layers of subgraph Lk+1. This
is accomplished by attributing graph edges and vertices. All edges of all graph vertices are
searched for assignment to a sub-layer.
8. The Step(8) This is a pseudo-code description for defining a naming convention.
9. Step(9) is O (nh). This step requires browsing all vertices of the graph, and all edges of
particular sub-layers, for attributing them to the last bipartite graph Bk+2. All graph edges
of the graph vertices are searched for assignment to a sub-layer.
In the general case, Step(4) - Step(8) are repeated h − 2 times. The total computational com-
plexity contribution from the above list is
O (nh+ nh+ 1 + (h− 2) (nh+ 1 + nh+ nh) + nh) = O(nh2) (7.9)
19
Algorithm complexity for a planar graph. The case of planar graph differs from the general
case only in the number of iterations. The Step(4) - Step(7) loop is performed no more than one
time. Hence, the total computational complexity of the algorithm for a planar graph is
O (nh+ nh+ 1 + nh+ 1 + nh+ nh+ nh) = O(nh) (7.10)
8 Concurrent point-to-point communication on clusters
In today’s HPC clusters the processors are interconnected by high performance switches that pro-
vide point-to-point bandwidths from 0.1 to 10 Gb/s (gigabits per second). Every processor is
connected to a system bus, which in turn is bridged to an I/O bus (usually a PCI bus). The I/O
bus, in turn, is connected to the network switch through a “network adapter”, as shown in Figure
11.
Peak performance is usually quoted for a single direction, as derived with a “ping-pong” program
that measures the round-trip MPI send/receive message speeds between two processors. Some
switched networks are bidirectional: that is, they can send messages between the same two points
simultaneously in both directions. But even when the quoted bidirectional speed is much higher
than the unidirectional speed, practical MPI communication performance measurements show only
a marginal or no performance improvement in bidirectional communication [24], although there are
exceptions.
The physical connection to the switch from the network adapter is called a “port” of the switch.
There are usually between 4 and 128 ports on a switch. Often switches with large port counts have
layers of “crossbars” to interconnect groups of ports. To extend a cluster’s node interconnection
beyond the number of ports available on a single switch, it is necessary to create a hierarchy of
switches. A Clos topology [4] is implemented to obtain full bi-sectional bandwidth; that is, to
achieve the peak aggregate performance when any half of the nodes simultaneous communicates
with any other half. In essence, a Clos topology allows concurrent independent communications to
occur throughout a hierarchical switch fabric.
For high-speed switches the performance of each simultaneous communication is nearly the
same as the communication performance of a single communication through the switch. Figure
12(a) depicts 8 single-processor nodes connected to a high speed switch. A large message can be
sent between any two processors at a nominal bandwidth. For example, on a high-speed switch,
processors 1−4 can send messages to processors 5−8 concurrently, each communicating at or near
the nominal bandwidth.
The peak bandwidth for switches is determined by measuring the “effective” bandwidth, for a
very large message size, using the formula
20
BE =L
tcomm
(8.11)
where BE is the effective bandwidth, L is the message size, and tcomm is the communication
time.
The effective bandwidth is often measured over a wide range of message sizes to produce a
profile of the communication performance. At very large message sizes an asymptotic (peak) is
achieved. Developers use these profiles to quickly determine the time for sending a fixed-size
message. Fig. 13 shows the profiles for single and concurrent (explained below) communications.
(Note the logarithmic scale in the message size.)
The effective bandwidth does not account for the startup time, or latency, required to initialize
the communication, and hence only provides a complete bandwidth in the asymptotic region where
the latency has practically no contribution. The communication time is often modeled by a constant
bandwidth B and latency τ , with a linear dependence on the message size, by the formula
tcomm = τ +L
B(8.12)
The true bandwidth and latency in this formula are the asymptotic bandwidth plus the y
intercept time (message size / effective bandwidth for the smallest message size) in an effective
bandwidth profile.
In HPC networks, the latency and bandwidth are about 10−5 seconds and 100 (Megabytes
per second) respectively. For a message size that contributes an amount of time equivalent to
the latency (Latency = Message Size / Bandwidth), the system reaches half the potential of the
network. It is necessary to send 1 KB messages (10−5 sec. 100 MB/sec) to achieve the half-peak
performance. While the bandwidth, per se, doesn’t change in this model, the communication times
for short-message algorithms are significantly impacted by the latency.
In Fig. 14 the communication time through a Myrinet 2000 switch for a range of message sizes
illustrates that the model is correct for a single communication path. The graph also shows the
linear behavior for multiple (30) simultaneous communications through the switch fabric on the
system. While a large number of multiple simultaneous communications produces a degradation of
performance, the linear character implies that a constant bandwidth in equation 8.12 can be used
to model a set of simultaneous communications. (The degradation of the bandwidth is attributed
to non-dynamic routing through the switch fabric. This can be minimized by a dynamic routing
mechanism, soon to be released by Myricom [38].) Also, for messages below 10K there may be
different algorithms for handling normal and short messages. Nevertheless, a single linear equation
may be adequate to handle both cases.
Many of the newer HPC clusters have dual-processor nodes (e.g., Intel Xeon, Apple G5 X-
server, AMD MP/Opteron), and the IBM Power4 series has a 4-way platform (model P655) that
21
is used as a building block for clusters. Figures 12(b) and 12(c) show typical interconnects for
2-way and 4-way nodes, respectively. These systems complicate the communication analysis in
two ways. First, intra-node communication between processors on a node is usually much faster
than the inter-node communication, between processors in different nodes. (This might not be
the case if the MPI libraries were not compiled with the correct configuration switches.) Second,
the switch bandwidth must be shared with the other processors on the node. For instance, in
Fig. 12(c), a single message between processor 1 and 4 might experience a performance of 250
MB/s, while groups of simultaneous messages between nodes (e.g., between processors 1 − 4 and
processors 5 − 8) would experience a performance of less than 62.5 MB/s, one fourth of the peak
network speed. The characteristics of intra-node and node-shared communications are illustrated
in Fig. 15. Concurrent communications between nodes “share” the bandwidth equally. Intra-node
communication performance is significantly higher than inter-node performance. Also, for small
intra-node messages, cached message performance is higher. (Inter-node communications is not
affected by caching, results not shown.) Often the inter-node bandwidth is used in equation 8.12
because it is the bottleneck in the communications.
9 Performance measurements
The performance of our h-rpc algorithm of colored bipartite graphs was evaluated on the TACC
Linux Xeon cluster, using a planar graph representing a 2D rectangular domain partitioned into p
sub-domains. The domain was partitioned according to the pattern presented in Fig. 4, but for
different numbers of sub-domains p = 4, 8, 16, 32 and 64, equal to the number of processors. Here,
the maximum number of neighboring sub-domains is equal to ∆ (G) = h = 6. The graph-partition
algorithm produces 2 bipartite graphs (B1 and B2), where ∆(B1) = 4 and ∆(B2) = 2. The number
of colors required for edge-coloring is equal to 6. There are S = 6 sets of messages to be exchanged
concurrently. The number of messages, M , in the set is dependent on the number of processors and
is determined by the formula M = p1
2 + p. The theoretical communication cost of the algorithm in
this case is equal to tcomm = (τh+ Lσ (M)) 2S, where L is the size of each message within a set,
S is the number of sets, and the factor 2 accounts for the fact that each message is sent in both
directions. The time required to send a byte, σ (M), depends on the number of messages in the set
M .
The measurements of the effective bandwidth BE(M) performed on the Xeon cluster were
presented in Fig.13. The effective bandwidth BE (M) is related with the real bandwidth B (M) by
the relation BE(M) = L/ (τ + L/B(M)) -compare (8.11) and (8.12). This implies that
B(M) =L
LBE(M) − τ
(9.13)
22
If σ (M) is in units of bytes, then
σ(M) =106
B(M)=
106
BE(M)−
106τ
L(9.14)
where τ = 10−5 and 106τL
= 10L
is negligible for very large message sizes L. We can simply calculate
values of σ(M) by using the profiles of BE(M) in Fig. 13.
In our experiment the processor count was set and then the message size was varied. The
number of nodes in the graph G changes as well as the number of messages (M = p1
2 + p in each
of S = 6 sets), when the processor count is varied.
From the profiles in Fig. 13, (asymptotic) bandwidths were assigned for each processor count
value (9.14); and then the theoretical performance was calculated from: tcomm = (τh+ Lσ (M)) 2S
(message size = L, number of messages in a set = M = p1
2 + p, and number of sets S = 6, see
Fig. 17). A comparison of the measured and theoretical results in Figs. 17 and 16 shows good
agreement.
The above measurements were performed on partitioned domains with a maximum number of
neighboring sub-domains equal to 6 (i.e., h = 6). In HP adaptive computations, the maximum
number of neighboring sub-domains usually varies between 4 and 8. Hence our experimental case
represents a realistic test for HP -adaptive methods and predicts excellent scaling when a large
number of processors is required.
10 Conclusions
• We have shown that all communication problems arising in H- and HP -adaptive computa-
tions can be solved using a h-relation personalized communication (h-rpc), where h is the
maximum number of neighboring sub-domains.
• We propose an alternative h-rpc strategy for communication patterns arising from partition
of the computational domain into sub-domains.
• The strategy uses an algorithm that partitions a graph representation of the computational
domain into a set of bipartite graphs, edge-colors the bipartite graphs, and schedules com-
munication according to the assigned color.
• We have shown the correctness of proposed graph partitioning algorithm and analyzed its
computational complexity.
• We have derived the cost for scheduling our bipartite graphs edge-coloring algorithm.
• For planar graphs representing partitioned 2D areas, the number of resulting bipartite graphs
is not more that 3, and the number of colors required for edge-coloring is not greater then
2h+ 1.
23
• For the graph representation of partitioned 3D areas, the number of resulting bipartite graphs
is not more that h− 1, and the number of colors required for edge-coloring is no greater than12h
2 + 12h− 3.
• The performance of the proposed strategy was evaluated on a Linux cluster and agrees well
with the theoretical communication cost.
• The communication cost of the proposed strategy O(
τh+ npσ)
scales favorably when the
processor count p is increased, as opposed to the two-phase algorithm considered the optimal
solution for h-rpc.
References
[1] Bader D.A. “High-Performance Algorithms and Applications for SMP Clusters”. NASA High
Performance Computing and Communications Aerospace Workshop CAS 2000, NASA Ames
Research Center, Moffett Field, CA, February 15-17, (2000)
[2] Bader D.A., Helman D.R. JaJa J. “Practical Parallel Algorithms for Personalized Communica-
tion and Integer Sorting”. ACM Journal of Experimental Algorithmics, 1(3), (1996) pp.1-42
[3] Biggs N.L. Discrete Mathematics, Oxford University Press, (1985) pp.173
[4] Clos C. “A Study of Non-Blocking Switch Networks”, Interconnection Networks for Parallel and
Distributed Processing, Chuan-lin Wu and Tse-yun Feng (ed.), IEEE Computer Society Press,
(1984), pp.126-144
[5] Cormen T.H, Leiserson C.E, Rivest R.L. Introduction to Algorithms, MIT Press, (1999) pp.964
[6] Crutchfield W.Y., Welcome M.L. “Object oriented implementation of adaptive mesh refinement
algorithms”. Scientific Prog, 2, (1993) pp.145-156
[7] Das R., Hwang Y.-S., Uysal M., Saltz J., Sussman A. “Applying the CHAOS/PARTI Library
to Irregular Problems in Computational Chemistry and Computational Aerodynamics”. Pro-
ceedings of the Scalable Parallel Libraries Conference, Mississippi State University, Starkville,
(1993) pp.45-46
[8] Das R., Uysal M., Saltz J., Hwang Y.-S. “Communication Optimizations for Irregular Scien-
tific Computations on Distributed Memory Architectures”. UMIACS Technical Reports CS-TR-
3163, UMIACS-TR-93-109 University of Maryland, Department of Computer Science (1993)
[9] Devine K.D, Hendrickson B, Boman E.G, St.John M.M, Vaughan C. “Zoltan: A dynamic load-
balancing library for parallel applications - user’s guide”. Tech. Rep. SAND99-1377, Sandia
National Laboratories, (1999)
24
[10] Edwards H.C. “SIERRA Framework Version 3: Core Services Theory and Design”. SAND2002-
3616, Albuquerque, NM: Sandia National Laboratories, (2002)
[11] Edwards H.C, Stewart J.R. “SIERRA, A Software Environment for Developing Complex Mul-
tiphysics Applications”. Computational Fluid and Solid Mechanics Proc. First MIT Conf., Cam-
bridge MA, (2001)
[12] Edwards H.C, Stewart J.R, Zepper J.D. “Mathematical Abstractions of the SIERRA Compu-
tational Mechanics Framework”. Proceedings of the 5th World Congress Comp. Mech., Vienna
Austria, (2002)
[13] Fink S.J., Kohn S.R., Baden S.B. “Efficient run-time support for irregular block-structured
applications”. J. Parallel Distrib. Comput., 50(1,2), (1998) pp.61-82
[14] Foster I. “Designing and Building Parallel Programs”. Available at
http://www-unix.mcs.anl.gov/dbpp
[15] Goldman A., Trystram D. “Algorithms for the Message Exchange Problem”. Proceedings of
the International Conference on Parallel Computing in Electrical Engineering, Poland (1998)
pp.153-161
[16] Hwang Y.-S., Moon B., Sharma S.D. “Runtime and Language Support for Compile Adap-
tive Irregular Problems on Distributed Memory Machines”. Software-Practice and Experience,
25(6), (1995) pp.597-621
[17] iCC: Interprocessor Collective Communication Library. Available at
http://www.cs.utexas.edu/users/rvdg/intercom
[18] Jain R., Werth J., Browne J.C., Sasaki G. “A graph-theoretic model for scheduling problem
and its application to simultaneous resource scheduling”. Computer Science and Operations
Research: New Developments in Their Interfaces, Ed. by O. Balci, R. Schander, and S. Zerrick,
Penguin Press, (1992)
[19] Johnsson S.L., Ho. C.-T. “Optimal All-to-All Personalized Communication with Minimum
Span on Boolean Cubes”. Technical Report 18-91, Center for Research in Computing Technol-
ogy, Harvard University (1991)
[20] Johnsson S.L., C.T. Ho. “Optimal Broadcasting and Personalized Communication in Hyper-
cubes”. IEEE Transactions on Computers, 38(9), (1989) pp.1249-1268
[21] Kapoor A., Rizzi R. “Edge Coloring Bipartite Graphs”. Technical Report DIT-02-040, Infor-
matica e Telecomunicazioni, University of Trento (1999)
25
[22] Kohn S.R., Baden S.B. “A robust parallel programming model for dynamic non-uniform sci-
entific computations”. Proceeding of SHPCC-94, IEEE Computer Society Press, Silver Spring,
MD, (1993) pp.509-597
[23] Laszloffy A., Long J., Patra A. “Simple data management, scheduling and solution strate-
gies for managing the irregularities in parallel adaptive hp finite element simulations”. Parallel
Computing, 26, (2000) pp.1765-1788
[24] Liu J., Chandrasekaran B., Wu J., Jiang W., Kini S., Yu W., Buntinas D., Wyckoff P.,
Panda D. K. “Performance Comparison of MPI Implementations over InfiniBand, Myrinet and
Quadrics”, SuperComputing 2003 Conference, Pheonix, AZ, November, (2003)
[25] Marathe M.V., Panconesi A., Risinger L.D. “An experimental study of a simple, distributed
edge coloring algorithm”. Proceedings of the twelfth annual ACM symposium on Parallel algo-
rithms and architectures, Bar Harbor, Maine (2000) pp.166-175
[26] Mathur K.K, Johnsson S.L. “All-to-All Communication Algorithms for Distributed BLAS”.
Technical Report 07-93, Center for Research in Computing Technology, Harvard University
(1993)
[27] Mathur K.K, Johnsson S.L. “Communication Primitives for Unstructured Finite Element Sim-
ulations on Data Parallel Architectures”. Technical Report 23-92, Center for Research in Com-
puting Technology, Harvard University (1992)
[28] Paszynski M., Kurtz J., Demkowicz L. “Parallel Fully Automatic HP -Adaptive 2D Finite
Element Package”. ICES Report 04-07, University of Texas in Austin (2004)
[29] PLAPACK: Parallel Linear Algebra Package. Available at
http://www.cs.utexas.edu/users/plapack
[30] Risinger L.D., Marathe M.V., Panconesi A. “Edge Coloring Algorithms for Scheduling in
ASCI: Survey and Experimental Results”. LAUR-No-97-2341, Los Alamos National Laboratory,
(1997)
[31] Rizzi R. “Finding 1-Factors in Bipartite Regular Graphs and Edge-Coloring Bipartite Graphs”.
SIAM J. Discrete Math. 15(3) (2002) pp.283-288
[32] Skiena S. “Coloring Bipartite Graphs”, Implementing Discrete Mathematics: Combinatorics
and Graph Theory with Mathematica. Reading, MA: Addison-Wesley, (1990)
[33] Sundar N.S., Jayasimha D.N., Panda D.K., Sadayappan P. “Hybrid Algorithms for Complete
Exchange in 2D Meshes”. IEEE Transactions on Parallel and Distributed Systems, 12(12),
(2001) pp.1201-1218
26
[34] Stewart J.R, Edwards H.C. “SIERRA Framework Version 3: h-Adaptivity Design and Use”.
SAND2002-4016, Albuquerque, NM: Sandia National Laboratories, (2002)
[35] Wang J.-C., Lin T.-H., Ranka S. “Distributed Scheduling of Unstructured Collective Commu-
nication on the CM-5.” Parallel Processing Letters, special issue on Partitioning and Scheduling,
(1995)
[36] Wenjie He, Xiaoling Hou, Ko-Wei Lih, Jiating Shao, Weifan Wang, Xuding Zhu. “Edge-
Partitions of Planar Graphs and Their Game Coloring Numbers”. Journal of Graph Theory,
41(4), (2002) pp.307-317
[37] Zhi-Zhong Chen, Xin He. “Parallel complexity of partitioning a planar graph into vertex-
induced forests”. Discrete Applied Mathematics 69, (1996) pp.183-198
[38] Personal communication with Myricom, Myricom help #29246
27
Solve global interface problemSolve local problems over each subdomain
Coarse grid solve:
Solve local problems over each subdomain
Compute the difference between fine grid and coarse grid solution
Is the difference smallYES
STOP
NO
Perform optimal mesh refinements over each subdomain separately (full parallel)
Perform Schur complement of the global problem
Solve global interface problemPerform Schur complement of the global problem
of computational domainWe assign some weights with finite elements
(computational complexity over the element)
Load balancing using ZOLTAN library
Fine grid solve:
(frontal solver over each subdomain)
(frontal solver over each subdomain)
(over each subdomain − full parallel)
Mesh reconciliation (inter−subdomain communication required)
Each processor reads some initial part
Some elements migrate to other subdomains
hp−refinement over each subdomain(full parallel)
Compute global maximum error
Figure 1: Flow chart for fully automatic, parallel HP -adaptive finite element algorithm.
L, p 4 8 16 32 64
10 0.0001 0.0001 0.0001 0.0001 0.0001
100 0.0002 0.0002 0.0013 0.0005 0.0072
1000 0.0004 0.0002 0.0015 0.0014 0.0033
10000 0.0005 0.0007 0.0016 0.0014 0.0035
100000 0.0033 0.0036 0.0087 0.0097 0.0131
1000000 0.0342 0.0352 0.0868 0.0892 0.1311
10000000 0.3255 0.3421 0.7915 0.9243 1.2204
Table 1: Communication time (in seconds) for scheduling with bipartite graphs edge-coloring algo-rithm. p is the number of processors and L is the message size (in bytes).
28
Figure 2: Two iterations of automatic hp-adaptive strategy: Figure (a): Initial mesh = coarsemesh. Figure (b): Fine mesh in the first iteration. Figure (c): Optimal mesh after the firstiteration. Figure (d): Optimal mesh from previous iteration = coarse mesh for the next iteration.Figure (e): Fine mesh in the second iteration. Figure (f): Optimal mesh after the seconditeration. Figure (g): Different colors denote different orders of approximation over a finiteelement. The brighten colors denote higher orders of approximation.
Figure 3: Assignment of nodes from the global interface to processors.
29
Figure 4: Example of a nice message scheduling for partitioned 2D domain: Figure (a): 2Ddomain partitioned into sub-domains. Figure (b): Planar graph representation of partitioneddomain. Figure (c): Layer and connection subgraphs. Figure (d): Partition of the planar graphinto two bipartite graphs: union of all connection subgraphs B1; union of all layer subgraphs B2.Figure (e): Edge-coloring of bipartite graphs. Figure (f): Six sets of messages to be sent.
30
Figure 5: Example of the worst case of message scheduling for 2D partitioned domain: Figure (a):2D domain partitioned into sub-domains. Figure (b): Planar graph representation of partitioneddomain. Figure (c): Layer and connection subgraphs. Figure (d): Second execution of layer-connection algorithm over layers. Figure (e): Sub-layer and sub-connection subgraphs. Figure(f): Partition of the planar graph into three bipartite graphs: union of all connection subgraphsB1; union of all sub-connection sub-graphs B2; union of all sub-layer subgraphs B3. Figure (g):Edge-coloring of bipartite graphs. Figure (h): Nine messages to be sent.
31
Figure 6: Partition of any graph G with maximum vertex valency h into h− 1 bipartite graphs.
Figure 7: Example of internal path and next layer edges.
32
Figure 8: Covering of the sub-domain vLi
k+1 by the previous side edge connections.
Figure 9: Adjacency cases for vertex from sub-layer: Figure (a): Vertices covered by the edge
{vLm
i
k , vLm
i
l }. Figure (b): Internal edges are from the next sub-layer. Figure (c): Only one edgefrom the same sub-layer through the next sub-layer edge. Figure (d): Only one neighbor fromthe same sub-layer through the internal edge.
33
Figure 10: Partition of planar graph G with maximum vertex valency h into 3 bipartite graphs.
Figure 11: Cluster communication architecture: processors, buses, network adapter cards, andswitch.
Figure 12: Network connectivity for single- and multi- processors nodes.
34
Figure 13: Effective bandwidth profiles for concurrent communication through a Myrinet 2000switch hierarchy.
Figure 14: Point-to-point communication times for message sizes up to 8 MB. Single (1) com-munication and concurrent (30 simultaneous messages) communication are measured in a switchhierarchy.
35
Figure 15: Performance profiles for dual-processor communications. Intra-processor: On Node(Non-Cached removes data from cache before timing). Inter-processor: Between Nodes (1 and 2simultaneous communications).
Figure 16: Measured communication time for scheduling obtained from the bipartite graphs edgecoloring algorithm. Both processor count and message size are varied.
36
Figure 17: Theoretical communication time over various p processors (sub-domains, partitionedaccording to Fig. 4), as determined by tcomm = (τh+ Lσ (M)) 2S, where σ (M) values are deter-
mined from the asymptotes in Fig. 13, M = p1
2 + p is the number of messages in each set, S = 6number of sets, h = 6 maximum number of neighboring sub-domains.
37
top related