

Parallel Computing 37 (2011) 820–845


Parallel algorithms for bipartite matching problems on distributed memory computers

Johannes Langguth, Md. Mostofa Ali Patwary, Fredrik Manne
University of Bergen, Department of Informatics, Thormøhlensgate 55, N-5008 Bergen, Norway


Article history: Available online 5 October 2011

Keywords: Bipartite graphs; Parallel algorithms; Matching

doi:10.1016/j.parco.2011.09.004


We present a new parallel algorithm for computing a maximum cardinality matching in a bipartite graph suitable for distributed memory computers.

The presented algorithm is based on the PUSH-RELABEL algorithm which is known to be one of the fastest algorithms for the bipartite matching problem. Previous attempts at developing parallel implementations of it have focused on shared memory computers using only a limited number of processors.

We first present a straightforward adaptation of these shared memory algorithms to distributed memory computers. However, this is not a viable approach as it requires too much communication. We then develop our new algorithm by modifying the previous approach through a sequence of steps with the main goal being to reduce the amount of communication and to increase load balance. The first goal is achieved by changing the algorithm so that many push and relabel operations can be performed locally between communication rounds and also by selecting augmenting paths that cross processor boundaries infrequently. To achieve good load balance, we limit the speed at which global relabelings traverse the graph. In several experiments on a large number of instances, we study weak and strong scalability of our algorithm using up to 128 processors.

The algorithm can also be used to find ε-approximate matchings quickly.
© 2011 Elsevier B.V. All rights reserved.

1. Introduction

The bipartite cardinality matching problem is defined as follows: Given an undirected, bipartite graph G = (V1, V2, E), E ⊆ {{v1, v2} : v1 ∈ V1, v2 ∈ V2}, find a maximum subset M* ⊆ E of pairwise nonadjacent edges. A set M* is called a perfect matching iff |V1| = |V2| = |M*|. Clearly, not all bipartite graphs have a perfect matching.

Bipartite matching is a classical topic in combinatorial optimization; it has been studied for almost a century and has many applications. We are especially interested in finding maximum transversals, i.e., obtaining the maximum number of nonzeros in the diagonal of a sparse matrix by permuting its rows and columns. This problem can be solved via bipartite cardinality matching algorithms [1]. Therefore, bipartite matching is an important problem in combinatorial scientific computing. Other applications can be found in many fields such as bioinformatics [2], statistical mechanics [3], and chemical structure analysis [4].

For a given n × n′ matrix A we define GA = (V1, V2, E), where |V1| = n, |V2| = n′, and E = {{vi, vj} : vi ∈ V1, vj ∈ V2, aij ≠ 0}, as the graph derived from A. Assuming A belongs to a specified system of linear equations, A is a square matrix having full structural rank [5] and GA has a perfect matching. As any edge included in a matching on GA corresponds to an entry of A being permuted to the main diagonal, a perfect matching corresponds to a transversal for A.
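As an illustration of this correspondence, the following sketch (assuming SciPy is available; the function name and adjacency-list layout are ours, not taken from the paper) builds GA from a sparse matrix by creating one edge per stored nonzero.

import scipy.sparse as sp

def graph_from_matrix(A):
    """Return adjacency lists of GA: left vertex i (row) -> list of right vertices j (columns)."""
    A = sp.coo_matrix(A)
    n, n_prime = A.shape
    adj = {i: [] for i in range(n)}      # left side V1 = rows
    for i, j in zip(A.row, A.col):       # one edge {v_i, v_j} per stored nonzero a_ij
        adj[i].append(j)                 # right side V2 = columns
    return adj, n, n_prime

# A square matrix with a zero-free diagonal, e.g. the identity, yields a graph
# with a perfect matching, i.e. a full transversal.
adj, n, n_prime = graph_from_matrix(sp.eye(3))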

Many sophisticated sequential implementations solving this problem, such as MC21 [1,6], are available, but for parallel linear solvers where A is typically distributed among several processors, it is desirable to use a matching algorithm that works directly on the distributed matrix to avoid the memory limitations of a single node and that scales reasonably well with the number of processors.

Bipartite matching is a special case of the maximum flow problem, for which sequential polynomial time algorithms have long been known [7], and many specialized algorithms for finding bipartite matchings have been proposed in the past [8,9]. Among these, the PUSH-RELABEL algorithm [10] by Goldberg and Tarjan has proven to be one of the fastest sequential algorithms [11–13], and it also exhibits a structure that makes it more amenable to parallelization than other matching algorithms. Parallelizations of the PUSH-RELABEL algorithm for maximum flow on shared memory computers have been presented by Bader and Sachdeva [14] and by Anderson and Setubal [15]. The latter work has been adapted to bipartite matching and studied in [16].

Other approaches to distributed parallel matching have used auction algorithms [17,18] to solve the weighted bipartite matching problem. Recently, in [19] an implementation of such an auction based algorithm for finding a perfect matching of ε-approximate maximum weight on a distributed parallel computer was presented. This algorithm is used as a subroutine for the Pspike [20] hybrid linear solver. It uses techniques similar to our parallel algorithm, but it is not designed to find a matching of maximum cardinality in sparse graphs. In [21,22] a linear time approximation algorithm for the same problem was presented. Even though this only guarantees a factor 2 approximation, experiments indicate that it is likely to provide approximations of high quality.

Even though the maximum cardinality matching problem can be solved using weighted matching, such algorithms are in general not well suited for finding maximum cardinality matchings in sparse graphs. The outbidding technique used in auction algorithms resembles pushes in the PUSH-RELABEL algorithm. However, auction algorithms do not employ global relabelings, and it is not clear whether an equivalent technique can be implemented efficiently in parallel. We therefore opt to pursue a different route.

In this paper we show how an efficient parallel algorithm suitable for distributed memory can be obtained from the parallel PUSH-RELABEL algorithm suggested in [10]. To do so, we first adapt ideas for shared memory computation from [15] to the distributed memory model, thereby obtaining a simple parallel bipartite matching algorithm labeled Algorithm 1. This is presented in Section 4. In Section 5 we show how to modify Algorithm 1 in order to obtain the more competitive Algorithm 2. The improvement is achieved by relaxing the requirement for labels to constitute a global lower bound on the distance to a free vertex, a central invariant of the PUSH-RELABEL algorithm. This allows us to execute a large number of pushes and relabeling operations outside of their standard order, thereby creating large batches of operations that can be processed locally. To modify the communication to computation ratio, the size of these batches is adjusted depending on the number of available processors. This allows us to control, and thereby optimize, load balancing. In Section 5.4 we show that the algorithm remains correct under this relaxation.

Section 6 describes the experimental setup and Section 7 the experimental results for Algorithm 2 on real world instances. Further experiments focusing on the study of weak scaling using generated graphs are discussed in Section 8. Finally, Section 9 presents conclusions and further work.

2. Preliminaries

In the following, V1 is designated as the left side and V2 as the right side. A vertex v ∈ V1 is a left vertex and u ∈ V2 is a right vertex. The set of matched edges, i.e., the matching, is denoted by M. In slight abuse of notation, we also use M as the set of vertices covered by the matching and, when needed, as the matching function. Thus, if v ∈ M, M(v) is the vertex matched to v. An unmatched vertex w ∉ M is called a free vertex with M(w) = ∅. In general, we will refer to free left vertices as active vertices. If M(v) = u then M(u) = v and the edge {u,v} is a matched edge. By the definition of a matching, any other edge incident on either u or v cannot be a matched edge, and is thus referred to as unmatched. A path P that alternates between matched and unmatched edges is called an alternating path. If both endpoints of P are free, it is also an augmenting path for M because switching all matched edges of P to unmatched and vice versa results in a matching of cardinality |M| + 1. For a vertex v ∈ V1, let Γ(v) ⊆ V2 be the neighborhood of v, i.e., the set of all vertices adjacent to v. For u ∈ V2, define Γ(u) ⊆ V1 analogously. W.l.o.g. we assume n ≥ n′. We also define m = |E|, i.e., the number of edges or nonzeros in the input.
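The effect of augmenting along such a path can be sketched as follows (a minimal illustration using an assumed dictionary representation of M; this is not code from the paper): flipping the roles of matched and unmatched edges along the path increases the cardinality by one.

def augment(match, path):
    """Flip matched/unmatched edges along an augmenting path given as a vertex
    sequence (v0, u0, v1, u1, ..., vk, uk) whose endpoints are free."""
    assert match.get(path[0]) is None and match.get(path[-1]) is None
    # Edges (path[0], path[1]), (path[2], path[3]), ... are unmatched before the
    # flip; rematching their endpoints implicitly unmatches the old matched edges.
    for i in range(0, len(path) - 1, 2):
        v, u = path[i], path[i + 1]
        match[v] = u
        match[u] = v

# Example: matching {1: 'b', 'b': 1} and augmenting path 0-'b'-1-'c'.
M = {1: 'b', 'b': 1}
augment(M, [0, 'b', 1, 'c'])   # M now matches 0-'b' and 1-'c': |M| grew by one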

2.1. Edge-based partitioning

Both our algorithms assume that the input matrices are partitioned using a 2-D partitioner, such as Mondriaan [23] or Zoltan [24]. They allow a more fine-grained partitioning than 1-D, i.e., column or row-based partitioning, and have proven useful for reducing communication [25]. In graph terms, 2-D matrix partitioning amounts to edge-based partitioning, which means that a single edge is owned by only one processor, but a vertex v may be shared among multiple processors. In this case, the processor on which v has maximum degree treats v as an original vertex, while all other processors that have an instance of v treat it as a ghost vertex. We will refer to such a split vertex, along with all its ghost instances, as a connector, denoted as C(v). For a connector that is split among k vertices, C0(v) will refer to the original vertex in the connector, i.e., v, while C1(v), …, Ck−1(v) refers to the k − 1 ghost instances that are part of the connector. Communication during the algorithms is performed only between processors that share connectors. We will distinguish between left connectors, i.e., vertices v such that C0(v) ∈ V1, and right connectors, i.e., vertices u such that C0(u) ∈ V2. To simplify notation, for a vertex v that is not part of a connector, let C0(v) = v. Such a vertex is designated as local. An edge {C0(v), Ci>0(u)} is called a crossing edge, while an edge {C0(v), C0(u)} is called local. In our experiments, we used Mondriaan to partition the test instances. For very large instances, the input will be parallel and thus a parallel partitioner such as Zoltan is needed.
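One possible per-processor representation of this vertex classification is sketched below (the class and field names are our assumptions, not the authors' data structures); it records only what the algorithms later need: which copy is the original, which processor owns it, and the locally stored part of the adjacency.

from dataclasses import dataclass, field

@dataclass
class LocalVertex:
    gid: int                 # global vertex id
    side: int                # 1 = left (V1), 2 = right (V2)
    owner: int               # rank of the processor holding the original C0(v)
    is_ghost: bool           # True for a ghost instance Ci>0(v), False for C0(v)
    adj: list = field(default_factory=list)  # locally stored edges (local and crossing)
    label: int = 0           # distance label; kept on C0 for left connectors and
                             # replicated on every instance for right connectors

# A vertex that is not part of any connector is "local": no other processor
# stores an instance of it, and C0(v) = v by convention.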

3. The PUSH-RELABEL algorithm for bipartite matching

The PUSH-RELABEL algorithm by Goldberg and Tarjan [10] was originally designed for the maximum flow problem. Since bipartite matching is a special case of maximum flow, it can be solved using the PUSH-RELABEL algorithm. In fact, it is one of the fastest known algorithms for bipartite matching, as was shown in [11].

Unlike PUSH-RELABEL, most algorithms for bipartite matching are based on repeated searches for augmenting paths [8,9]. Every time an augmenting path is found, the cardinality of M is increased as described in Section 2. However, this assumes that the matching along the entire path has remained unchanged since the search started, and while this causes no difficulty for sequential algorithms, it introduces the need for an extensive locking mechanism in parallel algorithms. The PUSH-RELABEL algorithm, on the other hand, has no such constraints, as it repeatedly performs a set of independent operations that only affect the neighborhood of a vertex. Therefore, we find it more amenable to parallelization.

PUSH-RELABEL works by defining a distance labeling ψ : V1 ∪ V2 → ℕ which constitutes a lower bound on the length of an alternating path from a vertex v to the next free right vertex. Note that if v is a free left vertex, such a path is also an augmenting path. We initialize ψ(v) = 0 ∀v ∈ V1 ∪ V2. Now, as long as there are free left vertices, we pick one of these v and search its neighborhood for a vertex u with minimum ψ(u). If ψ(u) = 0 then u must be unmatched, and matching it to v increases the size of M by one. Otherwise, let u be matched to w. In that case we match v and u, rendering w unmatched. Because now any augmenting path from a free right vertex to u must contain v, we update the distance labels by setting ψ(v) = ψ(u) + 1 and increasing ψ(u) by 2. If ψ(v) > 2n, we instead mark v as unmatchable and cease to consider it any further.

In maximum flow terms, during this operation a unit of flow entering v from the source has been transferred from v to u. If u was unmatched, it now has one unit of incoming flow which is routed to the sink. Otherwise, u is left with one unit of excess flow which has to be pushed back to a left vertex from which it currently receives flow, i.e., v or w. Since the algorithm cannot progress by immediately undoing the push from v to u, flow is pushed back from u to w, making w active immediately. The entire operation is referred to as a double push if M(u) ≠ ∅, otherwise it is called a single push. Since G is bipartite, this technique ensures that matched right vertices can never become active again because excess flow will immediately be transferred back due to the double push [11]. Therefore, implementation of the PUSH-RELABEL algorithm for bipartite matching is significantly easier and more efficient than using the standard PUSH-RELABEL algorithm for maximum flow.

In the bipartite matching context, a push from v to u means matching the edge {v,u}. If M(u) ≠ ∅ was true prior to the push, for M(u) = w we also unmatch the edge {w,u}. Thus, u is added to M while w is removed. We also set M(v) = u, M(u) = v, and M(w) = ∅, thereby making w active instead of v. Updating ψ ensures that the now free vertex w does not push to u again unless there is no better alternative.

The push operation is repeated until there are no further active vertices. It is easy to show that in this case M is a maximum matching. As observed in [10], pushes can be performed in arbitrary order, making the algorithm a prime candidate for parallelization. In the sequential case the order of pushes seems to influence performance strongly [11]. However, the algorithm can be initialized by starting with a matching M of high cardinality which can be obtained easily using various heuristics [26]. Doing so can level the difference between various orders of operations.
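A compact sequential sketch of this double-push scheme is given below (our own illustration under the stated 2n label cutoff; left vertices are numbered 0..n1-1 and right vertices n1..n1+n2-1, an assumed id layout). It is not the implementation used in the paper.

def push_relabel_matching(adj, n1, n2):
    """adj[v]: right neighbours (ids in [n1, n1+n2)) of left vertex v in [0, n1)."""
    limit = 2 * max(n1, n2)                    # label cutoff: psi > 2n means unmatchable
    psi = [0] * (n1 + n2)                      # distance labels, initialized to 0
    match = [None] * (n1 + n2)                 # match[x] = partner of x, or None
    active = [v for v in range(n1) if adj[v]]  # free left vertices
    while active:
        v = active.pop()
        u = min(adj[v], key=lambda x: psi[x])  # neighbour of minimum label
        if psi[u] + 1 > limit:
            continue                           # v is unmatchable; drop it for good
        w = match[u]                           # previous partner of u, if any
        match[v], match[u] = u, v              # push: match the edge {v, u}
        psi[v] = psi[u] + 1                    # relabel v ...
        psi[u] += 2                            # ... and u
        if w is not None:                      # double push: w lost its partner
            match[w] = None
            active.append(w)
    return [(v, match[v]) for v in range(n1) if match[v] is not None]

# Example: left vertices 0, 1, 2 and right vertices 3, 4; a maximum matching has size 2.
print(push_relabel_matching({0: [3], 1: [3, 4], 2: [4]}, 3, 2))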

Although not necessary for the algorithm, performance is enhanced greatly by periodic updates of the distance labels. This is achieved by starting a breadth first search along alternating paths from each unmatched right vertex to set the labels ψ to the actual alternating path distance from the next free right vertex. This operation is called a global relabeling. After performing such a global relabeling, augmenting paths consisting of edges {u,w} with ψ(w) ≥ ψ(u) + 1 from the free right to all reachable free left vertices exist. These edges are called admissible, since they are correctly aligned with the distance labeling. Assuming ψ is a valid distance labeling, ψ(w) > ψ(u) + 1 cannot occur. In contrast to the original PUSH-RELABEL algorithm for the general maximum flow problem, the behavior of the above algorithm for bipartite matching does not depend on the labels of left vertices since a push can only reach them through a matched edge which will always be admissible.
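The global relabeling itself can be sketched as an alternating BFS (again our own illustration, assuming a right-to-left adjacency list is available, as the memory discussion in Section 5.4 also presumes):

from collections import deque

def global_relabel(adj_right, match, free_right):
    """Exact alternating-path distances from the free right vertices.
    adj_right[u]: left neighbours of right vertex u (assumed layout);
    match: dict mapping matched vertices to their partners."""
    psi, queue = {}, deque()
    for u in free_right:                       # wave sources: free right vertices
        psi[u] = 0
        queue.append(u)
    while queue:
        u = queue.popleft()                    # u is a right vertex
        for v in adj_right[u]:
            if v in psi or match.get(v) == u:  # skip visited vertices and the matched edge
                continue
            psi[v] = psi[u] + 1                # reach left vertex v via an unmatched edge
            w = match.get(v)
            if w is not None and w not in psi:
                psi[w] = psi[v] + 1            # follow the matched edge back to the right side
                queue.append(w)
    return psi                                 # vertices absent from psi are unreachable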

Like other matching algorithms, the PUSH-RELABEL algorithm with global relabeling searches for augmenting paths. However, it does not augment the matching immediately along paths so discovered. Instead, it uses the distance labels to simultaneously guide the "flow", i.e., the unmatchedness of all free left vertices, towards the unmatched right vertices, increasing the size of the matching when a free left vertex and an adjacent free right vertex are matched.

4. The PUSH-RELABEL algorithm for distributed memory

In [10] the framework for a synchronous parallel algorithm relying on repeated application of a pulse operation was given. Such a pulse operation consists of each active vertex v attempting to push to a neighbor u with minimum ψ(u). Due to the distributed nature of the algorithm, this can result in multiple active vertices pushing to the same right vertex. If this happens, only one of these pushes is considered successful, and all other pushes are considered to have failed. Thus the active vertices that initiated the failed pushes remain active and must try to push again during the next pulse, possibly to a different vertex. Similar to the sequential PUSH-RELABEL algorithm, the pulse operation is repeated until no active vertices remain.

From this framework, we derive a simple parallel distributed memory algorithm called Algorithm 1. In [10], following the PRAM model, one processor per vertex is available. In the distributed memory setting, we instead assume an edge based partitioning of G among p processors, where p is several orders of magnitude smaller than n.

As was suggested in [10], Algorithm 1 works in rounds. During each round, for each active vertex v, the algorithm queries for the lowest labeled neighbor u of v, attempts to match v with u and, if successful, updates labels and matching edges. Rounds are repeated until a maximum matching is found. We refer to the two parallel operations in this algorithm as QUERY and PULSE. In shared memory models such as PRAM, the procedure QUERY does not appear explicitly since data access is trivial. In the following, pseudocodes for procedures belonging to Algorithm 1 are listed with Algorithm 1 in their names.

Algorithm 1. Procedure QUERY

1: for each active vertex v do
2:   Find local neighbor u0(v) of minimum ψ(u0) where ω(u0) = ω(v)
3:   Send Request signal to each ghost vertex Ck>0(v)
4: Exchange Request messages with other processors
5: for each ghost vertex Ck>0(v) receiving a Request signal do
6:   Find local neighbor uk(v) of minimum ψ(uk) where ω(uk) = ω(v)
7:   Send Response signal (uk(v), ψ(uk(v))) to C0(v)
8: Exchange Response messages with other processors
9: for each active vertex v do
10:  Set u*(v) = arg min_{0 ≤ i ≤ p} ψ(ui(v))

To find a neighbor of minimum label, every original left vertex in a left connector needs to query all its neighbors, including those on other processors. Only vertices in the same stage of a global relabeling can be considered as possible matching partners (see Global Relabeling below for more information on wave numbers ω). Queries to other processors are initiated by sending request signals. For efficient communication, procedure QUERY gathers all request signals that refer to ghost vertices and thus require communication. It then transfers the request signals using point to point messages. Thus, in every round all signals from processor pi to pj are bundled in one message for every pair of processors between which communication takes place. Throughout this paper, all communication will follow this paradigm. Conceptually, this resembles the BSP model (see [27] for further details), although the implementation is geared towards higher flexibility in superimposing communication and computation.
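A minimal sketch of this bundling paradigm, assuming mpi4py and using a collective alltoall for brevity (the paper exchanges point-to-point messages only between processors that actually share connectors), could look as follows:

from mpi4py import MPI

def exchange_bundled(signals_by_rank, comm=MPI.COMM_WORLD):
    """signals_by_rank: destination rank -> list of signals produced this round.
    Returns the flat list of signals received from all other processors."""
    size = comm.Get_size()
    outgoing = [signals_by_rank.get(r, []) for r in range(size)]  # one bundle per rank
    incoming = comm.alltoall(outgoing)                            # single exchange per round
    return [sig for bundle in incoming for sig in bundle]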

Requests are answered by returning a local neighbor of minimum label. All such responses are bundled and transferred in the same manner as the queries. After receiving the responses, each active vertex v selects a neighbor u with minimum ψ(u) among all the responses received. Clearly, u is a minimum labeled neighbor of the entire connector C(v). Vertex u is then designated as the push target u* for v.

For a left connector C(v), only the original vertex C0(v) carries a label ψ(v). The ghost vertices Ck>0(v) in a left connector do not carry any information except for a link to C0(v) and their respective adjacency lists. In a right connector C(u), all vertices Ck≥0(u) carry a label ψ(Ck≥0(u)) which is updated along with other information as described below. Thus, every vertex in a left connector can obtain the minimum label in its local neighborhood without further communication. Fig. 1 illustrates this structure.

In procedure PULSE every active vertex v sends a match signal to its push target u* with messages exchanged as described above. A push target u will always accept the first push received, sending a success signal to the source and, if necessary, an unmatch signal to its previous matching partner. Following a successful push, the processor owning vertex v sets ψ(v) = ψ(u*) + 1, where ψ(u*) is the local value of the minimum neighboring label determined in procedure QUERY. Furthermore, the owner of a push target u locally sets ψ(u) ← ψ(u) + 2 and, assuming u is part of a connector C(u), transfers the updated ψ(u) to all vertices Ck(u) in the connector.

If further push requests for u* arrive during the same round, they are denied and a reject signal is sent back to the source of the push. Again, all these signals are gathered so that at most one message is exchanged between each pair of processors.

In the next round, previously active vertices whose pushes were rejected, along with vertices that received unmatch signals, become active, while those that pushed successfully become inactive. If no active vertices remain on any processor, the algorithm terminates, returning a perfect matching. To terminate the algorithm in graphs without a perfect matching, we remove any left vertex v where ψ(u) > 2n ∀u ∈ Γ(v) from the set of active vertices. This technicality is omitted in the pseudocode. As there is no way to increase the size of the matching using such a vertex v, the end result will always be a maximum cardinality bipartite matching [10].


Fig. 1. Sequential and distributed graph. In the sequential graph (a) each vertex v carries its own label ψ(v). In the distributed graph (b), only the original vertex v of a left connector carries a label. The ghost vertex v′ merely queries its neighbors' labels ψ(x) and ψ(y) and reports them to v. In a right connector, both the original vertex y and the ghost vertex y′ carry the label ψ(y), which is updated every time its value changes.


Algorithm 1. Procedure PULSE

1: for each active vertex v do
2:   Send Match signal (v) to u*(v)
3:   Make v inactive
4: Exchange Match messages with other processors
5: for each vertex u receiving a Match signal do
6:   if u has not received a push in the current round then
7:     if M(u) ≠ ∅ then
8:       Send Unmatch signal to M(u)
9:     M(u) ← v
10:    Send Accept signal to v
11:    ψ(u) ← ψ(u) + 2
12:    Set a flag indicating that u received a push in the current round
13:  else
14:    Send Reject signal to v
15: Exchange Response messages with other processors
16: for each vertex v receiving an Accept signal do
17:   M(v) ← u*(v)
18:   ψ(v) ← ψ(u*(v)) + 1
19: for each vertex v receiving a Reject or an Unmatch signal do
20:   Make v active
21: Remote update Ci>0(u) ← C0(u) for all u ∈ V2

Assuming G is edge-partitioned and distributed among p processors, we initialize ψ(v) := 0 ∀v ∈ V1 ∪ V2 and make all v ∈ V1 active. Repeatedly calling the procedure QUERY and then PULSE on each processor until no processor has any active vertices left yields a maximum matching. However, the performance is weak. Even in the sequential case, the PUSH-RELABEL algorithm requires global relabelings in order to make it competitive with other sequential matching algorithms, and the parallel case is no different.


4.1. Global relabeling

The application of parallel global relabelings is necessary in order to obtain a scalable parallel algorithm as described in [15]. Performing such a relabeling in parallel is relatively simple. We start from each free right vertex u and run a parallel alternating breadth first search, i.e., a BFS which alternately uses unmatched and matched edges, through the entire graph. Each time we follow an edge {u,v} to an unvisited vertex, its label ψ(v) is set to ψ(u) + 1, the minimum BFS distance from a free right vertex. Because the graph is distributed among the processors, every time the search reaches a connector C(w), a signal carrying the label ψ(w) is sent to other vertices in C(w). This communication is done in a manner similar to the communication in QUERY and PULSE. To ensure that ψ(v) is actually a lower bound on the distance between v and the nearest free right vertex, the relabeling must progress at constant speed in all directions in the same way any BFS does.

While it is possible to mimic the sequential algorithm, i.e., stop the QUERY and PULSE rounds, perform such a global relabeling and then resume, it was pointed out in [15] that in order to obtain good scaling in the shared memory model, global relabeling should be interleaved with pushes. In the distributed memory model we face similar challenges. Stopping the QUERY and PULSE rounds to perform a complete global relabeling yields only mediocre load balance and poor scaling. There are two main reasons for this. The first is due to the fact that communication costs are likely to increase with the number of processors. The other reason is that all processors that have relabeled their vertices, as well as those that have not been reached by the relabeling yet, are necessarily idle and no processor can resume QUERY and PULSE rounds until all vertices have been relabeled. Thus, we implement global relabeling in such a way that it can be processed interleaved with the local computation of the QUERY and PULSE procedures.

This strategy is implemented using two new procedures, RELABELWAVE and PROPAGATE. The procedure RELABELWAVE is called periodically. It increments the current wave number ω(u) for each free right vertex u and puts these vertices in a local propagate queue W. When starting the algorithm, all values of ω are initialized to 0.

Algorithm 1. Procedure RELABELWAVE

1: Initialize local queue of vertices to relabel W ← ∅ when first called
2: for each free right vertex w do
3:   ω(w) ← ω(w) + 1
4:   push w onto W

The procedure PROPAGATE is called every round following QUERY and PULSE. For every vertex u in the relabeling queue it relabels each neighbor v ∈ Γ(u) with ω(v) < ω(u), setting ψ(v) = ψ(u) + 1. For matched left vertices v, ψ(M(v)) is set to ψ(u) + 2 and M(v) is put into a temporary queue. When the propagate queue is empty, the current relabeling process stops and vertices in the temporary queue are transferred to the propagate queue for use in the next round. If a left ghost vertex Cj(v) would be relabeled from some vertex u, a Propagate signal with the wave number ω(u) and the new label ψ(v) is sent to its owner C0(v) instead. If ω(u) > ω(v), C0(v) is relabeled and M(v), assuming it exists, is relabeled as described above and enqueued in the propagate queue by the owner of C0(v).

Algorithm 1. Procedure PROPAGATE

1: Initialize temporary local queue Q ← ∅
2: while W not empty do
3:   w ← pop(W)
4:   for each v in Γ(w) with ω(v) < ω(w) do
5:     Send Left Propagate signal (v, ψ(w), ω(w)) to C0(v)
6:   for each remote processor k where a Ci(w) exists do
7:     Send a Right Propagate signal (w, ψ(Ci(w)), ω(Ci(w))) to k
8: Exchange Left Propagate messages with other processors
9: for each Left Propagate signal (v, ψ(w), ω(w)) received do
10:  ψ(v) ← ψ(w) + 1
11:  ψ(M(v)) ← ψ(v) + 1
12:  ω(v) ← ω(w) + 1
13:  ω(M(v)) ← ω(v) + 1
14:  push M(v) onto Q
15: Exchange Right Propagate messages with other processors
16: for each Right Propagate signal (w, ψ(Ci(w)), ω(Ci(w))) received do
17:  if ω(w) < ω(Ci(w)) then
18:    push w onto Q
19:    ψ(w) ← ψ(Ci(w))
20:    ω(w) ← ω(Ci(w))
21: W ← Q


If a member of a right connector Cj(u) is relabeled, again a Propagate signal with the new label ψ(Cj(u)) and wave number is sent to its owner C0(u), which, assuming it has not been relabeled in the current wave, broadcasts the new label to all ghost vertices Ck>0(u). Their labels ψ(Ck(u)) are updated, and they are enqueued in the propagate queue. In both cases, relabeling from the remotely relabeled right vertices starts in the same manner as that from the locally relabeled right vertices in the next round when PROPAGATE is called again. In any case, a vertex is never relabeled twice during the same relabeling wave. Correctness of the algorithm is ensured by restricting pushes in PULSE to edges (v,u) where ω(v) = ω(u). See [15] for a proof which also applies to Algorithm 1 since the push and relabel strategies in both algorithms are essentially equivalent.

The sequential algorithm executes a global relabeling after performing Θ(n) local relabelings. This turns out to be inadequate for the parallel implementation. Instead, RELABELWAVE is called after every r rounds of QUERY and PULSE to start a new global relabeling wave while PROPAGATE is called every round to spread out existing waves. This allows the algorithm to interleave push operations with relabelings, although in any given round for some processors there will not be any vertices to push or relabel. Combining procedures QUERY, PULSE, RELABELWAVE, and PROPAGATE we obtain Algorithm 1. Note that we do not use the gap-relabeling heuristic described in [14], as this requires maintaining a global structure which is unsuitable for distributed memory computations.
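Structurally, the resulting round loop could be organized as sketched below (an assumed skeleton, not the authors' code; the callables stand for the procedures listed above and for a global termination test such as the one described at the end of Section 5.3):

def algorithm1_driver(state, r, query, pulse, relabel_wave, propagate, globally_done):
    """One possible round structure: RELABELWAVE every r rounds, the other
    procedures every round, until no processor has active vertices left."""
    rounds = 0
    while not globally_done(state):
        if rounds % r == 0:
            relabel_wave(state)   # start a new global relabeling wave
        query(state)              # locate minimum-labeled neighbours (communicates)
        pulse(state)              # attempt pushes and resolve conflicts (communicates)
        propagate(state)          # advance all pending relabeling waves (communicates)
        rounds += 1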

5. A new algorithm

As described above, Algorithm 1 is a distributed memory version of the parallel algorithm described in [15]. However, its performance is neither competitive with sequential nor with the shared memory parallel code. The reason for this lies in the fact that in most graphs, after matching the majority of vertices, the few remaining unmatched vertices are connected by relatively long augmenting paths. Even after suitable labels for such a path have been assigned, i.e., all edges on the path are admissible, augmenting along this path still takes a large number of consecutive pushes. On a sequential or shared memory machine, a single push can be performed quickly, but in Algorithm 1 procedures QUERY and PULSE must be called once per consecutive push. Since these procedures require communication, they are slower than a single push operation by several orders of magnitude. This, along with an equally slow global relabeling routine, makes Algorithm 1 incompetitive.

In this section we describe how some of these weaknesses of Algorithm 1 can be overcome by introducing the following three modifications. First by performing many push and relabel operations locally between communication rounds, then by routing augmenting paths in such a way that the number of processor jumps is minimized, and finally by balancing the amount of work across processors. In the remainder of this section, we show how to use these techniques in order to derive a new Algorithm 2 from Algorithm 1.

5.1. Local work

One of the main drawbacks of Algorithm 1 is the high number of pulse operations. However, assuming the input graph is well partitioned, almost all of these pulses move a vertex on the same processor, which suggests that communication is not required. Thus, by processing these pushes locally, it should be possible to speed up the algorithm significantly. However, in order to preserve correctness of the algorithm in the PUSH-RELABEL scheme, one must ensure that one always pushes to a neighbor with a globally minimum label. To achieve this without resorting to communication for querying all neighbors, we only perform local push operations along admissible edges, i.e., edges {v,u} where ψ(v) ≥ ψ(u) + 1 (see Section 5.2). Thus, unlike in the PULSE procedure where an active vertex v simply sets its label ψ(v) to ψ(u) + 1, we only push locally if ψ(v) ≥ ψ(u) + 1 was already the case before the push, thereby following an assumed alternating path towards a free right vertex.

Assuming ψ is a valid distance labeling, we have ψ(v) ≤ ψ(u) + 1 ∀v ∈ V1, u ∈ Γ(v) (see [28] for a complete proof). Thus, if an edge (v,u) is admissible, u is always a neighbor of minimum ψ(u) for v. Therefore a push along an admissible edge is always legal for the PUSH-RELABEL algorithm. Since the PUSH-RELABEL algorithm remains correct for any ordering of legal pushes, it follows that we preserve a correct PUSH-RELABEL strategy by applying local pushes. However, if such a push targets the original vertex in a right connector C0(u), ψ(C0(u)) is increased. Consequently, labels of the ghost vertices ψ(Ci>0(u)) need to be updated. But since ghost vertices cannot be the target of a local push, it is not necessary to perform these remote updates until after all local pushes have been executed.

Procedure LOCALWORK implements this idea. It is called during each round before QUERY and PULSE. Even though we will relax the notion of a valid relabeling in the next section, LOCALWORK will remain applicable.

Algorithm 2. Procedure LOCALWORK

1: while ∃ an active vertex v that has an edge {v, C0(u)}
2:     with ψ(C0(u)) < ψ(v) and ω(v) ≤ ω(C0(u)) do
3:   if M(u) ≠ ∅ then
4:     Make M(u) active
5:     M(M(u)) ← ∅
6:   M(u) ← v
7:   M(v) ← u
8:   Make v inactive
9:   ψ(v) ← ψ(C0(u)) + 1
10:  ψ(C0(u)) ← ψ(C0(u)) + 2
11: Remote update Ci>0(u) ← C0(u) for all u ∈ V2

5.2. Fast relabeling

Performance can be increased by the addition of local work as described above, but the global relabeling as described in Section 4 is still very slow because it progresses by a single vertex per communication round. Also, it finds augmenting paths having a minimum number of edges, but with the introduction of LOCALWORK such paths are no longer necessarily optimal as it is now desirable to have augmenting paths containing a minimum number of processor crossing edges. As LOCALWORK allows pushes along paths on a single processor to be performed quickly, while pushes in PULSE that cross processors remain expensive, optimal augmenting paths will cross processors as infrequently as possible to minimize the number of communication rounds required for augmentation. Even on a distributed memory machine with fast interconnects, we determined the actual difference in cost between local and processor crossing pushes to be at least a factor of 10³. Thus, in the following we assume that an optimal augmenting path has a minimum number of edges incident to at least one ghost vertex. We call this number the processor distance of an augmenting path.

We now show how to modify the global relabeling presented in Section 4 to deal with both of the above issues. We do this by changing the PROPAGATE procedure so that all right vertices that have been relabeled are immediately enqueued in the local propagate queue W again, which renders the temporary queue Q obsolete. This allows relabeling of all reachable vertices on a given processor in a single round without further communication. The transmission of Propagate signals is performed as described in Section 4, but not until after all local propagate queues are empty. Thus, in every round a global relabeling wave can progress by one processor, while relabeling all vertices it can reach on that processor that have not yet been relabeled in the current wave. As before, no vertex is relabeled more than once by the same wave.

In effect, the global relabeling has now become a parallel BFS traversal of a graph G′ consisting of metanodes of vertices from G. All vertices reachable by local alternating paths from the unmatched right vertices on a processor form a single metanode, and all vertices on one processor that are reachable via processor crossing edges from an existing metanode V are assigned to a new metanode U. The metanodes V and U are connected by an edge in G′. There are no edges between metanodes on the same processor. Once assigned to a metanode, a vertex is not assigned again. Note that given G, G′ is not unique.

Clearly, such a traversal generates paths of minimum processor length, i.e., containing a minimum number of nonlocal edges. Since all vertices in a metanode are relabeled during the same communication round, this relabeling can be performed much faster as long as metanodes contain a large number of vertices.

As a side effect of this technique, we have to account for the possibility that the relabeling may not be valid anymore, i.e., there could be unmatched edges (v,u) with v ∈ V1 such that ψ(v) > ψ(u) + 1. These edges are treated as admissible edges. We refer to a directed path from an active vertex consisting entirely of edges that are admissible using the labels assigned by relabeling wave k as a k-admissible path. PULSE and LOCALWORK can push along such edges, but labels of left vertices are never reduced as a result of such a push, since this might cause cyclic behavior in the algorithm. The existence of edges where ψ(v) > ψ(u) + 1 implies that a push in LOCALWORK might no longer target a neighbor of minimum label. However, as we show in Section 5.4 we still retain correctness of the algorithm.

As a further modification, we add the setting of backpointers in the global relabeling. This means every time a vertex v is relabeled, we store the vertex from which v was reached in the variable b(v). We refer to the edge (v, b(v)) as the back edge of v. In procedure LOCALWORK, an active vertex v will now always push to b(v) as long as ψ(v) > ψ(b(v)) and ω(v) ≤ ω(b(v)). Doing so avoids the search for a neighbor of minimum label (see Section 5.3 for modified wave restrictions). After a push from v, we set b(v) = ∅.
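The backpointer variant of LOCALWORK can be sketched as follows (our own illustration with dictionaries psi, omega, match, and back keyed by local vertex ids, an assumed layout; the remote update of ghost labels in line 11 of LOCALWORK is omitted):

def local_work_backpointers(active, match, psi, omega, back):
    """Push each active vertex along its stored back edge while the edge is
    admissible; vertices without a usable back edge are deferred to QUERY/PULSE."""
    deferred = []
    while active:
        v = active.pop()
        u = back.get(v)
        if u is None or psi[v] <= psi[u] or omega[v] > omega[u]:
            deferred.append(v)            # no admissible back edge locally
            continue
        w = match.get(u)                  # previous partner of u, if any
        match[v], match[u] = u, v         # local push along the back edge (v, b(v))
        psi[v] = psi[u] + 1
        psi[u] += 2
        back[v] = None                    # the back edge is consumed by the push
        if w is not None:
            del match[w]
            active.append(w)              # w becomes active and follows its own back edge
    return deferred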

Algorithm 2. Procedure PROPAGATE

1: while W not empty do
2:   w ← pop(W)
3:   for each v in Γ(w) with ω(v) < ω(w) do
4:     if v is original vertex C0(v) then
5:       ψ(v) ← ψ(w) + 1
6:       ψ(M(v)) ← ψ(v) + 1
7:       ω(v) ← ω(w) + 1
8:       ω(M(v)) ← ω(v) + 1
9:       push M(v) onto W
10:    else
11:      Send Left Propagate signal (v, ψ(w), ω(w)) to C0(v)
12:  for each remote processor k where a Ci(w) exists do
13:    Send a Right Propagate signal (w, ψ(Ci(w)), ω(Ci(w))) to k
14: Exchange Left Propagate messages with other processors
15: for each Left Propagate signal (v, ψ(w), ω(w)) received do
16:   ψ(v) ← ψ(w) + 1
17:   ψ(M(v)) ← ψ(v) + 1
18:   ω(v) ← ω(w) + 1
19:   ω(M(v)) ← ω(v) + 1
20:   push M(v) onto W
21: Exchange Right Propagate messages with other processors
22: for each Right Propagate signal (w, ψ(Ci(w)), ω(Ci(w))) received do
23:   if ω(w) < ω(Ci(w)) then
24:     push w onto W
25:     ψ(w) ← ψ(Ci(w))
26:     ω(w) ← ω(Ci(w))

We now show that this modified relabeling procedure indeed finds augmenting paths of minimum processor distance. To do so, let b(v,k) denote the vertex from which relabeling wave k reached v and let f(v) be the free right vertex from which v was relabeled. We then call a path P = (v0, v1, v2, …, vl), where v0 = v, vi+1 = b(vi,k), and vl = f(v), i.e., the path actually taken by wave k from f(v) to v, a k-relabeling path.

Also, let |C| be the number of connectors in the partitioned graph G. Define d(v,k) = ∞ for all vertices v that have not been relabeled by wave k. For a given set of free right vertices, the minimum processor distance of v is the minimum processor length of all alternating paths that connect v to some free right vertex.

Lemma 5.1. For each vertex v relabeled by global relabeling wave k, there is no k-admissible path P′(v,u) for any free right vertex u with lower processor distance than the k-relabeling path.

Proof. The proof is by induction on the number of rounds l following the start of wave k. If l = 0 then every vertex v of minimum processor distance 0 will be relabeled during the first call to PROPAGATE after the start of wave k, thus producing an augmenting path P(v, f(v)) consisting entirely of local edges. It also follows that no vertex w of minimum processor distance greater than 0 can be relabeled during the current call to PROPAGATE as this would require a Propagate signal to be sent to the processor owning w.

Now assume that l > 0 and that each vertex w of processor distance below l has been relabeled correctly by wave k while no vertex of processor distance at least l has been relabeled so far. Let v be a vertex on processor a with a processor distance of l. There must exist a connector C(u) on a minimum processor distance path from v such that an instance Cb(u) of C(u) has minimum processor distance l − 1 on some processor b. In slight abuse of notation, let Ca(u) be the instance of C(u) on processor a. By induction it then follows that Cb(u) has been relabeled during round l − 1 and that prior to round l a Propagate signal is sent to Ca(u).

Thus, since there exists an augmenting path from Ca(u) to v, it follows that v will be relabeled during the lth round. Moreover, no vertex of processor length greater than l can have received a Propagate signal from wave k during round l as this would have required that wave k had already reached some vertex of minimum processor distance at least l prior to round l. □

Since a vertex is only relabeled once per wave, the global relabeling assigns an alternating path of minimum processor length, but not necessarily of minimum edge length, to each vertex. As the relabeling is interleaved with pushes, the path so created might be changed by pushes and updates of ψ before it is complete.

5.3. Load balance

The introduction of LOCALWORK and the modified global relabeling increase performance by performing as much work as possible locally and without communication, but it might lead to poor load balance when only a few processors are able to perform local computations. However, the load balance of the interleaved global relabeling presented in Section 5.2 can be improved by modifying the ratio between communication and local relabeling work.

To do so, we introduce a relabeling propagation speed s. Again every r rounds, a new wave is sent out from every free right vertex, but every processor relabels only a 1/s fraction of local vertices per round. It is easy to see that this strategy allows far better load distribution. However, for values of s > 1, this might require an increased number of communication rounds.


Also, the paths found by the global relabeling are no longer guaranteed to be of minimum processor length because only part of a processor's vertices can be relabeled. Thus the algorithm faces a tradeoff here. Effects of various values of s on performance are studied in Section 7.
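One way to implement the speed parameter is to cap the number of vertices a processor relabels per call of PROPAGATE, as in the following sketch (our reading of the 1/s fraction; relabel_one is a placeholder for the per-vertex relabeling work and is not defined in the paper):

import math

def propagate_throttled(queue, relabel_one, n_local, s):
    """Relabel at most ceil(n_local / s) vertices this round; leftover queue
    entries carry over, so a wave may need up to s rounds to cross the processor."""
    budget = math.ceil(n_local / s)
    done = 0
    while queue and done < budget:
        v = queue.pop()
        queue.extend(relabel_one(v))   # relabel v and enqueue newly reached vertices
        done += 1
    return queue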

To prevent relabeling waves from locking out pushes towards the free right vertices, we allow a push along an edge {v,u} if ω(v) ≤ ω(u). The reason for this is that wave ω(u) is more recent than ω(v), so u and M(u) are likely to be closer to a free right vertex; otherwise v would have been relabeled by wave ω(u) already. In Section 5.4 we show that the algorithm remains correct nonetheless.

Algorithm 2 implements the presented ideas. It calls LOCALWORK, QUERY, PULSE, and PROPAGATE every round, and RELABELWAVE every r rounds. Procedure QUERY now allows pushes along edges {v,u} with ω(v) ≤ ω(u), but otherwise procedures QUERY, PULSE, and RELABELWAVE are identical to the versions used in Algorithm 1. Therefore, we do not present their pseudocode again. Procedure PROPAGATE was modified as described in Section 5.2. Just like in the sequential PUSH-RELABEL algorithm where finding a good relabeling frequency was crucial, setting a correct propagation speed s and relabeling frequency r is paramount for performance. See Section 7 for a detailed evaluation of the effects the parameters have on performance.

Like the sequential PUSH-RELABEL algorithm, our Algorithm 2 terminates when no active vertices remain in the graph. Since active vertices are kept in a queue, detecting this locally is trivial. To check the termination condition globally, the number of active vertices is exchanged along with other data in Procedure PROPAGATE. If the sum of all active vertices is zero, all processors stop.
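The global test itself amounts to a sum reduction of the local counts; in the paper the counts travel inside the PROPAGATE messages, but as a standalone sketch (assuming mpi4py) it reduces to:

from mpi4py import MPI

def should_terminate(num_active_local, comm=MPI.COMM_WORLD):
    # Every processor contributes its local count; all stop once the global sum is zero.
    return comm.allreduce(num_active_local, op=MPI.SUM) == 0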

5.4. Correctness of Algorithm 2

It remains to show that Algorithm 2 remains correct, even though the main invariant of PUSH-RELABEL algorithms, i.e., maintaining a valid relabeling, can be violated. In the following we show that Algorithm 2 in fact terminates. The worst case running time we obtain in this way is weak compared to other algorithms because the proof only considers augmenting paths found directly by global relabelings, not those found by labels alone. To show correctness, we first consider the effect of pushes on augmenting paths. For simplicity, we assume s = 1, i.e., no special load balancing. Our first step is to show that once we have an augmenting path, pushes and relabels to the vertices on this path will always result in a new and possibly shorter augmenting path. To count processor jumps, let d(v,k) be the processor distance between v when it is relabeled by wave k and the source of the relabeling in wave k, and let d(v) = d(v,k*) for the latest relabeling wave k* that relabeled v.

Lemma 5.2. Let v be an active vertex and let P(v, f(v)) be an augmenting path connecting v to some free right vertex f(v). Then if u ∈ P, u ≠ f(v) is the target of a successful push, we obtain a suffix path P′(v′, f(v)) ⊂ P where v′ is active.

Proof. Since P can span multiple processors, several such pushes can happen at the same time. Given multiple possibilities, let u be the vertex closest to f(v) on P. Since u is the target of a push originating at some active vertex w, u is a right vertex, and because we assumed u ≠ f(v), by the definition of an augmenting path u is matched to some M(u) ∈ P. Let M(u) = v′. Since v′ is a left vertex, the matching edge (u,v′) enters v′ when traversing P from v to f(v). Since the push operation changes M(u) to w, it renders v′ unmatched. Now, as the only edges affected by the push are (w,u) and (v′,u), all edges of the suffix path P′(v′, f(v)) remain unaffected by the push. Thus, since v′ is active and f(v) is a free right vertex, P′ is an augmenting path. □

The next step is to show that for k-admissible paths, the suffix path cannot increase in processor length.

Lemma 5.3. Let v be some active vertex, let P(v, f(v)) be a k-relabeling path connecting v to some free right vertex f(v), and let u ∈ P, u ≠ f(v) be the target of a successful push. Then we obtain a suffix path P′(v′, f(v)) ⊂ P with d(v′,k) ≤ d(v,k) and a processor length smaller than or equal to that of P.

Proof. Since the global relabeling progresses only along alternating paths, P(v, f(v)) must be an augmenting path and thus we invoke Lemma 5.2 to show existence. From Lemma 5.1 it is clear that there is no shorter k-admissible path from v to f(v) w.r.t. processor distance than P. And as processor distance cannot increase when following P from v to f(v), we obtain d(v′,k) ≤ d(v,k). □

The third step establishes existence and minimality of such paths during the algorithm:

Lemma 5.4. Let v be the first active vertex to be relabeled by relabeling wave k, and let round R1 be the round in which v is so relabeled. Also, let R2 be the round in which the first free right vertex has been matched after round R1. At any time during the algorithm between rounds R1 and R2, there is an active vertex v′ of minimum d(v′) in G that is the endpoint of an admissible augmenting path.

Proof. Consider a global relabeling wave k after it relabels v in round R1. It has created a k-relabeling path that links the active vertex v to some free right vertex f(v). Note that v has minimum d(v,k) among all active vertices.

Now, if the k-relabeling path to v is an augmenting path we only need to show that it remains admissible. In that case, let v′ = v. Otherwise, by Lemma 5.3, there is some v′ on an augmenting suffix path P′(v′, f(v)) with d(v′,k) ≤ d(v,k).


To see that there is a P′(v′, f(v)) which is admissible, remember that P′ was a part of the k-relabeling path to v′. Since no pushes can have affected vertices on P′(v′, f(v)) and it was marked by relabeling wave k with backpointers, it must be admissible unless the backpointers were changed by some global relabeling wave k′ > k. But if this has happened, v′ must be reachable via a k′-relabeling path which again is admissible.

Since f(v) was not matched, d(v′,k′) ≤ d(v′,k) because otherwise the k′-relabeling path to v′ would have started from f(v). □

Now we need to bound the time between R1 and R2.

Lemma 5.5. Let v be the first active vertex to be relabeled by relabeling wave k, and let round R1 be the round in which v is so relabeled. After round R1 + d(v,k) + 1, at least one free right vertex must have been matched since the start of relabeling wave k.

Proof. By Lemma 5.4, at any time between R1 and R2, there is some active vertex v′ of minimum d(v′,k) that is the endpoint of an augmenting path consisting entirely of admissible edges marked with backpointers.

During every round, procedure LOCALWORK will perform a series of pushes along such a path until it reaches a connector. Thus, after LOCALWORK finishes, there is some active vertex w that is a member of a left connector. It will either be pushed along its back edge in procedure PULSE, or b(w) was the target of a successful push after it was relabeled by wave k. In both cases, there must now be some active vertex v′ ∈ P′ with d(v′,k) < d(w,k).

Again v′ will be the endpoint of an augmenting path marked with backpointers. Thus, after at most d(v,k) rounds, there will be such a vertex v′ on an augmenting admissible path P′(v′, f(v′)) with d(v′,k) = 0. The next time LOCALWORK is called, augmentation along P′ causes v′ and f(v′) to be matched. Since f(v′) was a free right vertex, this concludes the proof of the lemma. □

We call a round in which |M| has increased by at least one |M|-increasing. Let k be the first global relabeling wave after some |M|-increasing round, and let k′ > k be the first global relabeling wave following the first |M|-increasing round after the start of k. We call the rounds between k and k′ a phase. Clearly, the number of phases is bounded by n. Using Lemma 5.5, it is easy to show that each phase terminates.

Lemma 5.6. A phase takes at most 2|C| + 2 + r rounds.

Proof. Global relabel wave k takes at most |C| + 1 rounds to relabel its first active vertex v, since |C| is an upper bound on all finite processor distances. By Lemma 5.5, at most d(v,k) + 1 < |C| + 1 rounds later a free right vertex is matched, thus increasing |M|. A new global relabel wave happens at most r rounds after |M| is increased. Thus, a phase takes at most 2|C| + 2 + r rounds. □

Lemma 5.6 shows that the modified algorithm is guaranteed to terminate, and since moving free vertices along paths of minimum processor length maximizes the opportunity for local work while minimizing the necessity for communication, the modified wave based relabeling should improve performance. Note that the relabeling might jump back to unreached vertices on a processor that was reached by the same wave earlier, possibly resulting in paths of processor length up to |C|. If a perfect matching does not exist, the algorithm terminates after a complete wave failed to relabel any active vertex. Clearly, this cannot take more time than a single phase.

Using Lemma 5.6, we can derive the running time of Algorithm 2. To this end, we assume an evenly balanced partitioning, i.e., the number of edges on each processor is equal up to a constant.

Theorem 5.7. Given an evenly balanced partitioning and parameters s = 1 and r < n, Algorithm 2 runs in O(n²m/p) time on p processors.

Proof. By Lemma 5.6, a phase takes at most 2|C| + 2 + r rounds. In any round, the procedures LOCALWORK, QUERY, PULSE, and PROPAGATE access every edge only a constant number of times. Due to edge partitioning, the number of edges does not increase in the distributed graph. All operations in these procedures incorporate edges. Therefore, only O(m) operations are performed per round, and they are distributed asymptotically evenly due to the assumption of evenly balanced partitioning. Thus, a round takes O(m/p) time. Since |C| ≤ n and r < n by assumption, this means that a phase takes at most O(nm/p) time. Every phase increases |M| by at least 1. Therefore, the number of phases is bounded by n, yielding the total running time of O(n²m/p). □

Of course, Theorem 5.7 assumes n ∈ O(m), which is no restriction since singletons can be trivially removed. Clearly, in a well-partitioned instance where the maximum processor length of augmenting paths |C| is bounded by p, we obtain a worst case running time of O(nm) which is on par with many simple yet effective sequential algorithms. To obtain a better running time, we would require an extended analysis showing that either the amortized number of augmentations per phase is in Ω(1) or that the amortized length of augmenting paths is in ω(p). However, doing so would likely require changes to the algorithm that incorporate ideas from the Hopcroft–Karp algorithm [9]. It is currently not clear whether this can be done efficiently. If the partitioning is not evenly balanced, running time increases by a factor equal to the unbalance in the partitioning.


For values of s > 1, the proof has to be modified slightly. Since it is not specified which 1/s fraction of the vertices of a given processor are relabeled in each round, a global relabeling can take up to s rounds to traverse one processor. Thus, the processor distance of the k-relabeling path P(v, f(v)) can be s times greater than that of a minimum augmenting path from v to f(v). However, with slight modification Lemma 5.4 still applies since an augmenting path exists. Lemma 5.5 can be extended to show that longer paths are also shortened by LOCALWORK. Thus, it is possible to derive a 2s(|C| + 1) + r time bound for each phase. Note that Lemma 5.5 holds even if free vertices can decrease in processor distance due to a new relabeling wave, which is likely to happen for s > 1.

Using the BSP model, we can further analyze the running time by incorporating communication cost. We have already established an O(n²) bound on the number of rounds given s = 1 and r < n. Following the notation in [29], we use l to denote the global synchronization cost and g the communication cost per data word, both measured in compute operations of the parallel system. We have O(m/p) compute operations per round, each of which can result in at most one message containing one data word. A processor can only receive one message per edge. Thus, given an evenly balanced partitioning, a processor can only receive O(m/p) messages per round. We assume that the parallel system is capable of concurrent all-to-all communication, and thus we obtain a running time bound of O(n²((m/p)g + l)) compute operations. Due to the overlapping of communication and computation, it is harder to estimate the actual effect of the global synchronization cost l compared to a pure BSP implementation. Note however that the cost for detecting termination can be subsumed in l.
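For concreteness, the per-round BSP cost and the O(n²) round bound combine as sketched below (assuming g ≥ 1; this merely restates the estimate above in superstep form):

```latex
T_{\mathrm{round}} \;=\; \underbrace{O\!\left(\tfrac{m}{p}\right)}_{\text{local work}}
  \;+\; \underbrace{O\!\left(\tfrac{m}{p}\right) g}_{\text{communication}}
  \;+\; \underbrace{l}_{\text{synchronization}},
\qquad
T_{\mathrm{total}} \;=\; O(n^{2})\, T_{\mathrm{round}}
  \;=\; O\!\left(n^{2}\left(\tfrac{m}{p}\, g + l\right)\right).
```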

The sequential PUSH-RELABEL algorithm requires O(m + n) memory to store the input graph, plus some O(n) overhead to maintain labels and matched edges. Note that in order to perform global relabelings efficiently, adjacency lists for both the left and the right side must be stored, thereby doubling the memory required to store the graph structure.

In the parallel case, extra memory is required to store the connectors. Every connector takes at most O(p) memory. Since |C| ∈ O(n), this amounts to a space requirement of O(pn + m). For sparse graphs, we have m ∈ O(n). However, a connector can only take up memory proportional to the degree of its underlying vertex. Thus, the overall extra memory for the connectors is O(m), yielding a total space requirement of O(n + m). The extra memory for each connector must be available on the owner of that connector. However, given an evenly balanced edge partitioning and the way in which connector ownership is assigned, connector ownership is evenly distributed. This means that no processor spends more than O(m/p) memory for its connectors. Furthermore, the edge partitioning guarantees O(m/p) edges per processor. Therefore, the parallel algorithm requires only O(m/p) memory per processor, which means perfect memory scalability.

From a practical point of view, since most connectors contain only one or two vertices, the extra memory incurred by keeping track of the connectors is negligible. However, there is a noticeable overhead due to message passing and parallel control.

6. Experimental setup

In the following we describe the experiments that were used to test the performance of Algorithm 2, to compare it to sequential performance, and to measure performance scaling for an increasing number of processors.

All experiments were performed on a Cray XT4 equipped with AMD Opteron quad-core 2.3 GHz processors. Codes are written in C++ with MPI using the MPICH2 based xt-mpt module version 5.0.2 and the PathScale Compiler Suite 3.2.99. Tests were performed on configurations of 1, 2, 4, 8, 16, 32, 64, and 128 compute nodes. Since we intend to test distributed memory performance, we run only one MPI thread per node. Thus, a processor is a single core with exclusive access to all of its node's memory. Computers with multiple sockets per node should run one MPI thread per socket.

The test set consists of a group of large square matrices from the University of Florida Sparse Matrix Collection [5]. Matrices were partitioned using the Mondriaan matrix partitioning package version 2.0 [23]. The main program reads the partitioned matrix as input and distributes it among the processors to build the parallel data structures. When measuring running time, this part was not included in the measurement since the algorithm is intended for inputs that are distributed to begin with.

The graphs are derived from the matrices as follows: For a given square matrix A of size n × n, its bipartite graph G = (V₁ ∪ V₂, E) is obtained by setting |V₁| = |V₂| = n and E = {{vᵢ, vⱼ} : vᵢ ∈ V₁, vⱼ ∈ V₂, aᵢⱼ ≠ 0}.
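A minimal sketch of this construction, assuming the matrix is stored in CSR form (the struct and function names are illustrative and not taken from the paper's code):

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Sketch: derive the bipartite graph G = (V1 ∪ V2, E) from an n x n matrix A
// in CSR form. Row i becomes left vertex i in V1, column j becomes right
// vertex j in V2, and every stored nonzero a_ij yields the edge {v_i, v_j}.
struct CsrMatrix {
  int64_t n;                     // number of rows and columns
  std::vector<int64_t> row_ptr;  // size n + 1
  std::vector<int64_t> col_idx;  // column index of each stored nonzero
};

std::vector<std::pair<int64_t, int64_t>> bipartite_edges(const CsrMatrix& A) {
  std::vector<std::pair<int64_t, int64_t>> edges;
  edges.reserve(A.col_idx.size());
  for (int64_t i = 0; i < A.n; ++i)
    for (int64_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
      edges.emplace_back(i, A.col_idx[k]);  // left vertex i, right vertex col_idx[k]
  return edges;
}
```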

The parallel computation starts by initializing the matching using a sequential KARP–SIPSER style heuristic [30] locally on each processor. The heuristic maintains a priority queue of local vertices, sorted by degree. Vertices of degree one are matched immediately. Each time a vertex is matched, it is removed along with its incident edges, thereby reducing the degrees of some remaining vertices and thus changing their position in the priority queue. If no vertices of degree one are available, the heuristic selects a random local edge and matches the incident vertices. If no edge remains, the heuristic terminates. Ghost vertices and their incident edges are not considered in the heuristic.
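The following is a simplified sketch of such an initialization, not the authors' implementation: it uses a plain queue of degree-one vertices instead of a full priority queue, and it takes the first remaining edge instead of a random one.

```cpp
#include <queue>
#include <vector>

// Karp–Sipser style greedy matching on the local (bipartite) adjacency lists.
// match[v] == -1 means v is unmatched. degree[v] counts still-unmatched neighbors.
std::vector<int> karp_sipser_init(const std::vector<std::vector<int>>& adj) {
  const int n = static_cast<int>(adj.size());
  std::vector<int> match(n, -1);
  std::vector<int> degree(n);
  std::queue<int> degree_one;  // stands in for the priority queue of the text
  for (int v = 0; v < n; ++v) {
    degree[v] = static_cast<int>(adj[v].size());
    if (degree[v] == 1) degree_one.push(v);
  }
  // When x gets matched, each unmatched neighbor of x loses one live neighbor.
  auto remove_vertex = [&](int x) {
    for (int w : adj[x])
      if (match[w] == -1 && --degree[w] == 1) degree_one.push(w);
  };
  auto match_pair = [&](int u, int v) {
    match[u] = v;
    match[v] = u;
    remove_vertex(u);
    remove_vertex(v);
  };
  for (int u = 0; u < n; ++u) {
    // Vertices of degree one can be matched greedily without losing optimality.
    while (!degree_one.empty()) {
      int v = degree_one.front();
      degree_one.pop();
      if (match[v] != -1 || degree[v] == 0) continue;
      for (int w : adj[v])
        if (match[w] == -1) { match_pair(v, w); break; }
    }
    // Otherwise fall back to an arbitrary remaining edge (the paper picks a
    // random local edge; here we simply take the first one found).
    if (match[u] == -1)
      for (int w : adj[u])
        if (match[w] == -1) { match_pair(u, w); break; }
  }
  return match;
}
```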

For sequential computations, this heuristic can speed up matching algorithms greatly [13,26], usually producing a matching within 99.5% of the optimum. Sometimes it finds a perfect matching immediately, resulting in very fast sequential running times. If an input matrix admits a trivial perfect matching, the initialization will always find it. This does not hold for the distributed graph, because ghost vertices cannot be considered. Therefore the parallel initialization will often be of lower cardinality, which decreases with an increasing number of processors.

However, using this initialization does not always increase performance. We will study the effect of such initialization more closely in Section 7.3 and in Section 8. After the initialization, Algorithm 2 is executed as described in Section 5.


Optimal values for the two relabeling parameters, i.e., the frequency of global relabelings r and s, the fraction of local vertices that can be relabeled in each round, are not known. In a first set of experiments we determine good values for r and s and use the values so obtained in a second set of experiments to evaluate performance and scaling of Algorithm 2. For comparison with sequential performance, we implemented a PUSH-RELABEL algorithm as described in Section 3. The implementation uses a FIFO order of pushes, and a KARP–SIPSER heuristic initialization as described above. We also ran Algorithm 2 using only one processor. This is essentially a sequential computation, but our code is not designed to adapt to this. Therefore, it still calls all the normal communication routines and follows a pattern optimized for parallel computations, making it inferior to the purely sequential code. It does keep the advantage derived from the sequential initialization though.

7. Results on real-world instances

In this section we describe the results of our experiments on real-world matrices.

7.1. Parameter experiment

Our first set of experiments aims to explore the effect that the relabeling frequency r and the relabeling speed s (see Section 5.3) have on the overall performance. To do so, we selected the seven large matrices Hamrle3, cage14, ldoor, kkt_power, parabolic_fem, av41092 and rajat31 from the University of Florida Sparse Matrix Collection [5]. The sizes of these matrices are listed in Table 3. We then ran the algorithm on each matrix for all combinations (s, r) ∈ {1, 2, 4, 8, 16, 32, 64, 128}² and all configurations p ∈ {1, 2, 4, 8, 16, 32, 64, 128}, obtaining 7 · 8³ individual timings. The range of s and r was chosen after preliminary experiments identified it to contain the most interesting cases. Since presenting 3584 numbers would yield little insight, we aggregate these results in order to present our conclusions.

[Fig. 2 consists of eight heat-map panels, one per processor configuration p = 1, 2, 4, 8, 16, 32, 64, and 128, each showing r on the X-axis and s on the Y-axis over the range 1–128, with the diagonal s = r marked.]

Fig. 2. Relative performance for varying parameters and processor numbers. High performance is shown in red. Colder colors indicate lower performance. r is shown on the X-axis, s on the Y-axis. With increasing values of p, high performance moves from high to low values of s and r. For all plots except p = 128 a sharp drop in performance below the s = r line is visible. The underlying numerical values for the plots are given in Tables 11–18.


We first note that the matrix rajat31 behaved in a different way from the other instances. Here the KARP–SIPSER initialization matched either all or, for values of p above 4, almost all vertices. Thus, running times were very low and completely independent of the relabeling parameters. Therefore, rajat31 is not considered in the analysis of parameter effects. When using a single processor, ldoor and cage14 showed similar behavior.

Since running times vary widely over instances and processor configurations, they need to be normalized for comparison. For each of the 48 combinations of the 6 remaining instances and the 8 different p values, we do this by dividing the best running time for this (instance, p) combination by the running time of each (r, s) combination. This gives the best settings of r and s in each (instance, p) combination a performance of 1. For other (r, s) combinations, the normalized performance is a number between 0 and 1. Experiments that failed because of memory shortage were assigned a performance of 0.
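In formula form, with t denoting the measured running time (this merely restates the normalization just described):

```latex
\mathrm{perf}(r, s, \mathrm{inst}, p) \;=\;
  \frac{\min_{(r', s')} t(r', s', \mathrm{inst}, p)}{t(r, s, \mathrm{inst}, p)}
  \;\in\; (0, 1].
```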

To obtain parameter dependent performance over multiple test instances, we averaged the performance values for each combination of r, s, and p over all matrices except rajat31. The results are shown in Fig. 2. Each chart in the figure is a map of the effect of s and r for a particular processor configuration. We observe that high performance, i.e., dark red areas, can usually be found along and above the s = r line, with optimal values of r decreasing with increasing processor numbers. It is easy to see that setting s > r yields weak performance.

Thus, to obtain maximum performance when increasing p, the value of r should be halved for each doubling of p, and s should be modified accordingly. Doing so will ensure that the relabel work per processor per round remains roughly constant, since the partition of the input assigned to a single processor decreases. This allows the algorithm to maintain a good computation to communication ratio. Since relabelings become more frequent, the expected number of rounds decreases, thereby allowing the algorithm some performance gain with increasing p. This also suggests a natural scaling limit, which is reached at s = 1 and r = 1. For some matrices this can clearly be observed. For example, parabolic_fem at p = 64 performs best for s = 1 and r = 1, and does not scale beyond this number of processors.
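Purely as an illustration of this rule of thumb, and not part of the paper's code, starting values reproducing the average optima of Table 1 could be derived from p as follows; the constant 256 and the clamping range are assumptions taken from the examined parameter range:

```cpp
#include <algorithm>

// Illustrative only: r (and s = r) roughly halves whenever p doubles, which
// matches the average optima of Table 1 for p >= 2 (256 / p, clamped to [1, 128]).
void choose_relabel_parameters(int p, int& r, int& s) {
  r = std::max(1, std::min(128, 256 / p));
  s = r;  // setting s = r was the most robust choice in the experiments
}
```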

From these results, we obtained good guidelines for setting parameter values, but the actual optimal values for a given instance are not clear. For the second experiment, we used the optimum values of r and s which we estimated in Experiment 1 and ran Algorithm 2 without further input parameters. The average optimum values for (r, s) are shown in Table 1.


Table 1
Optimum parameters by number of processors and instance. The table shows the combinations of (r, s) at which performance in the individual experiments was maximum. Average optimum shows the optimum (r, s) values for normalized running times averaged over the test instances. An entry of all indicates that the parameter is irrelevant for performance. Parameter influence ratio is the performance averaged over all (r, s) combinations divided by the performance of the average optimum.

Matrix                      p = 1     p = 2     p = 4    p = 8   p = 16  p = 32  p = 64  p = 128

Optimum (r, s) by instance
Hamrle3                     128,16    128,128   128,8    64,64   16,16   2,4     1,2     1,2
ldoor                       all,all   64,1      128,8    128,4   64,8    16,16   2,2     2,2
cage14                      all,all   128,128   128,64   64,16   64,32   32,8    16,4    1,2
kkt_power                   4,128     128,64    128,32   64,32   1,64    16,8    1,16    1,2
parabolic_fem               128,16    64,32     128,16   2,4     2,4     1,2     1,1     1,1
av41092                     128,1     64,64     4,64     2,16    16,16   8,8     4,2     4,2
Average optimum (r, s)      128,64    128,128   64,64    32,32   16,16   8,8     4,4     2,2
Parameter influence ratio   0.867     0.540     0.533    0.552   0.509   0.441   0.361   0.270

Table 2
Best running times over all values of s and r in the parametrized experiment compared to running times in the non-parametrized experiment with fixed s and r for each instance and processor number. Each cell shows parametrized / non-parametrized time. Times are in seconds.

Matrix          p = 1        p = 2        p = 4        p = 8        p = 16       p = 32       p = 64       p = 128

Hamrle3         4.55 / 4.61  13.1 / 13.1  8.01 / 12.8  6.96 / 7.32  5.88 / 5.88  2.87 / 3.02  3.21 / 3.43  2.75 / 3.55
ldoor           1.98 / 1.98  8.09 / 8.99  4.75 / 5.97  2.82 / 3.68  1.96 / 2.39  1.33 / 1.33  1.13 / 1.13  1.17 / 1.35
cage14          2.68 / 2.68  7.76 / 7.76  6.18 / 9.40  5.03 / 5.91  3.20 / 4.31  2.39 / 3.32  2.20 / 2.34  1.49 / 2.19
kkt_power       4.08 / 4.51  7.43 / 8.52  6.49 / 8.93  3.47 / 6.87  3.16 / 5.23  1.69 / 2.10  1.53 / 1.98  1.49 / 1.85
parabolic_fem   0.59 / 0.59  3.26 / 3.73  1.72 / 2.30  1.04 / 1.48  0.84 / 1.14  0.38 / 0.74  0.25 / 0.59  0.37 / 0.52
av41092         0.25 / 0.25  0.47 / 0.62  0.46 / 0.56  0.34 / 0.43  0.41 / 0.41  0.57 / 0.57  0.90 / 0.92  1.50 / 1.76


Table 1 also shows the optimal values of r and s for each individual (instance, p) combination in Experiment 1, while Table 2 shows the timing for these (r, s) combinations compared to non-parametrized running times from Experiment 2. On average, the optimal running times are 83% of the non-parametrized running times. From these values we computed the Parameter influence ratio, which is also shown in Table 1. It is the performance averaged over all (r, s) combinations divided by the performance of the average optimum. It measures the inverse of the influence parameter settings have on performance. For p = 1 the influence is quite low, but it gradually grows, and for p = 128 it is very high.

7.2. Performance experiment

For Experiment 2 the test set consists of 22 matrices from the University of Florida Sparse Matrix Collection [5]. Table 3 gives sequential and parallel running times. Due to the difficulties incurred by the distributed instances, the results show little strong scaling in the traditional sense. Therefore, we study performance compared to running the algorithm on 2 processors for increasing values of p.

The first group consists mostly of smaller instances with up to 10⁶ nonzeros. The algorithm shows little or no performance gains for increasing p. Matrix av41092, the largest in this group, reacts poorly to increasing values of p. This is consistent with the behavior of other parallel algorithms [19], although it is far less drastic here.

The second group contains medium sized instances with more than 10⁶ nonzeros. In this group most instances show reasonable performance gains which usually peak at p = 16 or p = 32. Running times increase noticeably at p = 64. The largest matrix in this group, Hamrle3, is a very difficult instance even for sequential algorithms, i.e., it requires a high running time relative to its size. Unlike other instances in this group, it shows a good performance increase that peaks at p = 32.

For the two remaining groups, good performance gains can be observed. In the last group, running time drops by approximately 25% on average for each doubling of p. From the size of these instances, we expect Algorithm 2 to perform reasonably well on average matrices of at least 10⁷ nonzeros. However, since the performance gains from increasing p already decline at p = 128, we believe that even larger matrices are necessary to obtain any performance improvement when increasing p beyond 512. The largest matrix in the fourth group, Audikw_1, could not be partitioned using Mondriaan due to lack of memory and is therefore only block partitioned. Performance compared to the sequential algorithm would most likely be better if it was properly partitioned (see Section 7.3).

We observe that the sequential algorithm is about three times faster than Algorithm 2 using one processor. There are three distinct reasons for this. First, the structure of the sequential algorithm is far simpler, as it does not call MPI routines. Second, the sequential algorithm can automatically adapt the relabeling frequency, and finally, the sequential algorithm uses the FIFO push order, which was shown to be superior in [13]. Since the parallel algorithm is designed to push all free vertices during the same round, its actual push order is based on the vertex numbering instead of being based on labels.


Table 3
Results for the performance experiment. Instances are grouped by parallel performance increase and ordered by number of nonzeros. Running times are in seconds.

Matrix          #Rows      #Nonzeros   Seq.   p = 1  p = 2  p = 4  p = 8  p = 16  p = 32  p = 64  p = 128

No increase
Zhao2           33,861     166,453     0.01   0.03   0.09   0.12   0.13   0.15    0.16    0.21    0.40
ncvxqp5         62,500     237,483     0.03   0.08   0.26   0.21   0.22   0.22    0.23    0.22    0.48
c-71            76,638     859,554     0.03   0.08   0.25   0.27   0.27   0.25    0.24    0.27    0.42
kim1            38,415     933,195     0.02   0.09   0.05   0.07   0.09   0.12    0.24    0.16    0.40
twotone         120,750    1,224,224   0.04   0.11   0.29   0.32   0.24   0.21    0.24    0.23    0.55
av41092         41,092     1,683,902   0.10   0.25   0.62   0.56   0.43   0.41    0.57    0.92    1.76

Increase up to 32
scircuit        170,998    958,936     0.04   0.14   0.44   0.29   0.25   0.15    0.17    0.23    0.51
ibm_matrix_2    51,448     1,056,610   0.02   0.06   0.38   0.33   0.33   0.25    0.27    0.46    0.53
crashbasis      160,000    1,750,416   0.04   0.16   0.31   0.14   0.1    0.04    0.05    0.23    0.17
matrix_9        103,430    2,121,550   0.02   0.13   1.13   0.38   0.41   0.33    0.30    0.41    0.61
ASIC_680ks      682,712    2,329,176   0.15   0.41   0.21   0.14   0.15   0.09    0.06    0.08    0.17
poisson3Db      85,623     2,374,949   0.09   0.27   0.36   0.37   0.26   0.19    0.19    0.26    0.48
barrier2-4      113,076    3,805,068   0.05   0.16   0.56   0.46   0.35   0.41    0.30    0.52    0.76
Hamrle3         1,347,360  5,514,242   2.43   4.61   13.12  12.81  7.32   5.88    3.02    3.43    3.55

Increase up to 64
bone010_M       986,703    23,888,775  0.43   1.61   10.5   6.32   3.68   2.53    1.23    0.95    1.56
ldoor           952,202    42,493,817  0.73   1.98   9.38   5.97   3.68   2.39    1.33    1.13    1.35

Increase at 128
parabolic_fem   525,825    3,674,625   0.16   0.59   3.73   2.30   1.48   1.14    0.74    0.59    0.52
kkt_power       2,063,494  12,771,361  1.68   4.51   8.52   8.93   6.87   5.23    2.10    1.98    1.85
af_shell2       504,855    17,588,875  0.27   0.77   5.00   3.12   2.06   2.07    1.10    0.95    0.78
rajat31         4,690,002  20,316,253  0.87   2.96   1.54   0.85   0.99   0.57    0.30    0.18    0.12
cage14          1,505,785  27,130,349  0.69   2.68   7.76   9.40   5.91   4.31    3.32    2.34    2.19
Audikw_1        943,695    39,297,771  1.17   2.82   14.24  8.95   4.11   3.90    3.73    2.64    1.96


Furthermore, using a single processor, Algorithm 2 is again approximately twice as fast as using two processors. The reason for this lies in the fact that the sequential initialization is far superior. As shown in [12,13,26], initialization can have a strong effect on running time.

The difference between sequential and parallel performance decreases for larger matrices, but due to the inherent sequentiality of matching a small number of remaining vertices, scaling will most likely be limited for any parallel algorithm. For the test matrices ASIC_680ks, kkt_power, and rajat31, the parallel algorithm delivers performance that is superior to that of the sequential algorithm. For rajat31, this happens because even in the parallel case the initialization matches most vertices, and the remaining vertices are matched without starting a global relabeling.

It is interesting to note that for difficult matrices such as Hamrle3 the performance of the sequential algorithm is only slightly better than that of Algorithm 2 using at least 32 processors. This is most likely due to the fact that both algorithms perform a large number of global relabelings here, but the parallel algorithm can perform a single relabeling faster. Thus, Algorithm 2 becomes more competitive for very easy and for very difficult instances.

7.3. Stability testing

In Experiments 1 and 2, we always used KARP–SIPSER initialization and Mondriaan partitioning on unmodified matrices. In this section, we check these assumptions by testing how much the results differ if the above parameters are modified.

Table 4
Results for the stability test on av41092. KS initialization affects running times in an unpredictable fashion. The transposed instance is more difficult than the original, but only slightly so. Mondriaan partitioning often improves running time noticeably.

Instance     Init.   Partition   p = 1   p = 2   p = 4   p = 8   p = 16   p = 32   p = 64   p = 128

Original     KS      Block       0.59    0.78    0.96    0.58    0.60     0.62     0.85     1.64
             NONE    Block       0.55    0.68    0.79    0.62    0.54     0.51     0.81     1.50
             KS      Mondriaan   0.59    0.78    0.56    0.43    0.41     0.57     0.92     1.76
             NONE    Mondriaan   0.55    0.48    0.53    0.49    0.48     0.63     1.24     1.95

Transposed   KS      Block       0.83    1.04    1.13    1.26    0.83     0.78     0.83     1.10
             NONE    Block       0.85    1.08    1.43    1.45    1.16     0.85     0.85     1.23
             KS      Mondriaan   0.83    1.00    0.77    2.69    0.32     1.45     2.51     1.49
             NONE    Mondriaan   0.85    1.09    1.01    0.57    0.38     0.46     0.57     0.99


To estimate the effect of KARP–SIPSER initialization, we performed an experiment by running the uninitialized algorithm on the test matrices used in the parametrized experiment. On average, running times were 39% slower than for the initialized version. Detailed results can be found in Table 19 in the Appendix.

We estimate the effects of partitioning and transposition using the matrix av41092. It proved to be difficult in the performance experiment, taking long running times relative to its size. Furthermore, for parallel auction algorithms in the weighted case, it has proven to be exceedingly difficult in its unmodified form, and even more so after taking the transpose [18,19]. In Table 4, we give running times for this instance and its transpose using a varying number of processors and compare them with the running times on the Mondriaan partitioned version from Table 3. We also test to what extent the KARP–SIPSER heuristic influences total running time here.

We observe that the running times on the transposed matrix are slightly longer. However, we do not observe the extreme increases in running time described in [18] for the weighted case. The effect of KARP–SIPSER initialization seems to be erratic here, sometimes increasing and sometimes decreasing running time. Thus, it is unclear under which circumstances the KARP–SIPSER initialization improves performance here.

Concerning the effect of the partitioning, we observe a strong positive effect for small processor counts and a slight detrimental effect for large processor counts. However, this might be due to the fact that Mondriaan is unable to find a good partitioning for 64 or even 128 processors on a matrix of this size. For practical instance sizes and processor counts, we do not assume that this effect plays a large role. For the transposed partitioned matrix, running times were very good without KARP–SIPSER initialization. The initialized runs showed erratic timings, however. Still, we can conclude that the 2-D matrix partitioning is useful here, although the computational cost of partitioning is high.

Parallel partitioners such as Zoltan [24] can reduce the computational cost. Since Algorithm 2 strongly exploits opportunities for local computations, a high quality partitioning can be expected to increase performance.

7.4. Analysis of runtime spending

In Experiments 1 and 2 we observed that running times vary strongly among instances of comparable size. To gain further insight as to where running time is spent, we studied the round-by-round behavior of the algorithm. While it is not possible to present such an in-depth analysis for all the matrices used in the experiments, we observed very similar behavior in all instances, although the number of rounds varies greatly.

We present the round-by-round analysis of the algorithm on av41092 here. We select the 16 processor case, using KARP–SIPSER initialization and Mondriaan partitioning. This instance shows a typical behavior which we encountered in almost all instances, with the exception of those instances where the initialization found a perfect or almost maximum matching.

Fig. 3. Round by round progress on instance av41092 using 16 processors. In (a), the time per round is shown. Clearly, the first rounds take much more time than the others. After that, when smoothing over small regular spikes induced by global relabelings, the progress in rounds is relatively even, as shown in (b). In (c) the remaining unmatched vertices are plotted against the rounds. Clearly, almost all vertices are matched in the beginning. The remaining unmatched vertices are plotted against the time spent in the algorithm in (d). More than half of the running time is spent matching a very small number of vertices.


Results are shown in Fig. 3. Initialization took 0.002 s and matched 23,380 out of 41,092 vertices per side. The first round then took 0.01056 s and matched another 9567 vertices, leaving 8145 unmatched. The entire run took 215 rounds. While these numbers sound impressive, in a test without KARP–SIPSER initialization the first round took 0.011 s, leaving only 8143 vertices unmatched. All numbers of vertices are given per side. This suggests that KARP–SIPSER initialization has next to no effect on this instance. Still, the uninitialized run took 250 rounds and about 5% more time in total.

The rounds following the first quickly match most of the remaining vertices. The time taken per round decreases with the number of active vertices, but due to the time required for communication, it never falls below 0.00075 s, with most rounds taking about 0.001 s and approximately twice as long for rounds in which a global relabeling starts. These rounds form fairly regular peaks in the otherwise smooth curve. For a higher value of the relabeling frequency r, the distance between the peaks would increase. A lower value of s would increase the amplitude.

The size of the matching increases slowly in the later rounds. Between rounds 42 and 215, an average of one vertex per round is matched. Thus, for about 0.4% of the total vertices, or 2% of the vertices unmatched after round 1, the algorithm takes about 75% of its running time. This long tail can be curbed by setting a higher relabeling frequency, but this increases the time required for each round and thereby the total running time. However, these findings suggest that our algorithm can be used to provide ε-approximate cardinality matchings. Doing so would simply require termination as soon as a 1 − ε fraction of the vertices is matched, thereby accelerating the algorithm immensely.
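A minimal sketch of such an early-termination test, assuming each processor keeps a count of its locally matched owned vertices (the function and variable names are illustrative, not taken from the paper's implementation):

```cpp
#include <mpi.h>

// Returns true on every rank once a (1 - eps) fraction of all vertices is
// matched, so that the main loop of the matching algorithm can stop early.
bool reached_eps_approximation(long local_matched, long n_total, double eps,
                               MPI_Comm comm) {
  long global_matched = 0;
  // Sum the locally matched vertices over all ranks; every rank gets the result.
  MPI_Allreduce(&local_matched, &global_matched, 1, MPI_LONG, MPI_SUM, comm);
  return global_matched >= static_cast<long>((1.0 - eps) * n_total);
}
```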

Almost all matching algorithms that successively increase the matching size allow a similar technique, although, unlike the Hopcroft–Karp algorithm [9], PUSH-RELABEL does not provide successively improving approximation guarantees. However, due to communication latency, the cost of matching the remaining vertices is comparatively higher for a parallel algorithm. Thus, this technique is especially effective here.

8. Results on artificial instances

In addition to the experiments on real-world instances, we used generators capable of creating similarly structured graphs of different sizes in order to test the weak scaling behavior of the algorithm by running it with an increasing number of processors on instances of increasing size. We use two generators, one for Erdős–Rényi style random graphs and one for regular grid graphs.

8.1. Erdős–Rényi graphs

We present the first set of experiments for studying the weak scaling behavior of our algorithm. To measure weak scaling, we require instances of increasing size in order to obtain a constant workload per processor. Therefore, we generate Erdős–Rényi random graphs of varying size using the generator provided in the Stanford Network Analysis Library (SNAP) [31]. However, the actual work required for solving a given class of instances is not generally known. Thus, we need to consider the worst case running times of O(nm) or O(√n · m) for the sequential algorithms and the parallel running time given in Theorem 5.7 and estimate actual changes in running time. We study algorithmic behavior for scaling n, m, and the combination of n and m. To this end we incorporate experience from the above experiments and from sequential studies [11,26,32].

The worst case running times suggest that doubling the number of edges doubles the computational cost. Therefore, we examine weak scaling on a series of graphs where the number of edges per processor and the total number of vertices remain constant, thereby increasing the average degree d̄ and the relative density ρ. For both sequential and parallel algorithms, this makes the problem potentially easier since short augmenting paths are more likely to exist. On the other hand, for the parallel algorithm this increases the ratio of connectors among the vertices, thereby increasing communication requirements. It also limits the effectiveness of the KARP–SIPSER initialization. To estimate the latter effect, we run the algorithm both with and without KARP–SIPSER initialization in this experiment. We study graphs with 2⁴ ≤ d̄ ≤ 2⁹, which suggests 2¹⁶ as a suitable number for n. Thus, m ranges between 2²⁰ and 2²⁵.

Table 5
Weak scaling of edges in the Erdős–Rényi graphs experiment. Number of edges grows proportional to the number of processors. Sizes are given as powers of 2. All running times are in seconds.

Edges                    18      19      20      21      22      23      24      25
Avg. degree              2       3       4       5       6       7       8       9

Sequential time, KS      0.080   0.110   0.160   0.260   0.460   0.830   1.220   2.370
Sequential time, no KS   0.140   0.090   0.100   0.150   0.220   0.400   1.690   1.260

Processors               1       2       4       8       16      32      64      128
Parallel time, KS        0.446   0.824   1.620   1.610   2.078   3.657   2.571   2.715
Parallel time, no KS     0.295   0.634   1.559   0.912   0.857   1.224   1.537   1.871
Efficiency, KS           0.179   0.067   0.025   0.020   0.014   0.007   0.007   0.007
Efficiency, no KS        0.474   0.071   0.016   0.021   0.016   0.010   0.017   0.005


The results of this experiment are given in Table 5. Interestingly, using KARP–SIPSER initialization seems to be clearly detrimental to performance. And although running times vary considerably, there is no clear strong trend of an increasing running time with increasing instance size. To quantify weak scaling, we computed efficiency values. The efficiency for a parallel experiment using p processors is defined as:

E = Ts / (p · Tp),

where Ts is the sequential and Tp the parallel running time. We always compare the sequential algorithm with KARP–SIPSER initialization to the parallel algorithm using KARP–SIPSER, and the uninitialized sequential with the uninitialized parallel algorithm. Due to the overhead incurred by the parallel algorithm, the efficiency is small at p = 1 and initially drops quite rapidly. This indicates that the increased communication requirements incurred by using more processors roughly cancel out the fact that denser instances are proportionally easier due to the increased number of augmenting paths, a fact from which the sequential algorithm profits. Efficiency stabilizes at p = 4 though. From this point, we see a decrease in efficiency by about 0.25 per scaling step, i.e., per doubling of m, which shows that the parallel algorithm exhibits reasonably good weak scaling on m.
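As a concrete check of this definition against Table 5 (the KS-initialized entry at p = 128):

```latex
E \;=\; \frac{T_s}{p\,T_p} \;=\; \frac{2.370}{128 \cdot 2.715} \;\approx\; 0.007 .
```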

From the sequential times, we observe almost a doubling of the running time for each doubling of the number of edges. This indicates that for these relatively dense graphs, m seems to be a suitable scaling parameter. KARP–SIPSER initialization seems to have little effect here.

The weak scaling results above were obtained by using m as a scaling parameter. This prompts the question of how the algorithm behaves when scaling n as well. To test this, we generate a series of four random graphs by doubling the number of vertices n and also doubling d̄, thereby quadrupling the number of edges in each scaling step. The objective of this scaling is to keep ρ constant. For an O(nm) sequential algorithm, this scaling increases the expected running time by a factor of eight. For an O(√n · m) algorithm, this reduces to 4√2 ≈ 5.66. Of course, we expect the algorithms to perform better than their worst-case running time. Therefore, we quadruple the number of processors for each scaling step, giving us a set of four datapoints for 2, 8, 32, and 128 processors and a second set for 1, 4, 16, and 64 processors. Both are grouped together in Table 6. The test instances range from 2¹³ vertices and 2¹⁹ edges to 2¹⁶ vertices and 2²⁵ edges.

For the sequential algorithm, we observe an average increase in running time by a factor of more than four per scaling step. Thus, we cannot hope to obtain constant running time over these instances. The drop in efficiency is only slightly higher compared to the edge scaling experiment because here every scaling step quadruples the number of processors. This indicates that the algorithm scales reasonably well w.r.t. increasing values of m and n. At 128 processors, we obtain a small speedup compared to the sequential algorithm. Still, the scaling is far from perfect, as evidenced by the decreasing parallel efficiency.

Table 6
Weak scaling of edges and vertices in the Erdős–Rényi graphs experiment. Number of edges grows proportional to the number of processors and number of vertices increases at half that rate. Sizes are given as powers of 2. All running times are in seconds.

Vertices                 13      14      15      16      13      14      15      16
Edges                    19      21      23      25      19      21      23      25
Avg. degree              6       7       8       9       6       7       8       9

Sequential time, KS      0.030   0.120   0.430   2.370   0.030   0.120   0.430   2.370
Sequential time, no KS   0.010   0.400   0.430   1.260   0.010   0.400   0.430   1.260

Processors               2       8       32      128     1       4       16      64
Parallel time, KS        0.157   0.381   1.458   2.715   0.179   0.592   1.223   2.571
Parallel time, no KS     0.075   0.306   0.717   1.871   0.043   0.266   0.932   1.537
Efficiency, KS           0.095   0.039   0.009   0.007   0.167   0.051   0.022   0.014
Efficiency, no KS        0.066   0.163   0.019   0.005   0.232   0.376   0.029   0.013

Table 7
Weak scaling of vertices in the Erdős–Rényi graphs experiment. Number of edges grows proportional to the number of vertices and the number of processors increases at the same rate. Sizes are given as powers of 2. All running times are in seconds. For p = 1, the graph is nearly complete.

Vertices                 9       10      11      12      13      14      15      16
Edges                    18      19      20      21      22      23      24      25

Sequential time, KS      0.005   0.010   0.020   0.070   0.150   0.360   1.120   2.370
Sequential time, no KS   0.005   0.010   0.010   0.020   0.060   0.160   0.410   1.260

Processors               1       2       4       8       16      32      64      128
Parallel time, KS        0.015   0.031   0.220   0.323   0.436   0.940   1.281   2.715
Parallel time, no KS     0.005   0.040   0.070   0.113   0.216   0.601   1.138   1.871
Efficiency, KS           0.329   0.162   0.023   0.027   0.022   0.012   0.014   0.007
Efficiency, no KS        0.954   0.124   0.036   0.022   0.017   0.008   0.006   0.005


A third possibility of scaling the random instances consists of increasing n while keeping d̄ constant. Doing so increases m along with n and decreases ρ, making longer augmenting paths more likely. In this scaling model, a doubling of n would quadruple the running time of an O(nm) algorithm. For an O(√n · m) running time, the expected worst-case increase would be 2√2 ≈ 2.83. Since the graph becomes sparser by this scaling, instances should increase in difficulty more quickly than in the constant density case discussed above. We experiment on this type of scaling by starting with a graph of 2¹¹ vertices on 4 processors and go up to 2¹⁶ vertices on 128 processors. Vertices in all graphs have an average degree of 2⁹. Results are shown in Table 7.

As expected, the sparser graphs tend to be difficult for both the sequential and the parallel algorithm. On average, sequential time increases by a factor of 2.48 per doubling of n and m. Compared to the factor of about 4.8 for a doubling of n and d̄ obtained in the previous scaling experiment, the corresponding increase of 2.48² = 6.15 we obtain here is significantly higher.

For the parallel algorithm, we observe a similar effect as in Table 5. After a sharp dropoff for the first few entries, the decrease in parallel efficiency is relatively stable, at about 0.3 per scaling step. Therefore, the algorithm scales slightly better for m than for n. Note that performance without KARP–SIPSER initialization is clearly superior.

We have seen that for Erdős–Rényi random graphs, the parallel algorithm scales reasonably well for all three different methods of increasing instance size. Interestingly, running the algorithm with KARP–SIPSER initialization usually yields lower performance than running without. Since the local KARP–SIPSER heuristic employed in the parallel algorithm is very fast, these differences cannot be explained by the extra time the heuristic takes. Instead, it is likely that after matching most vertices, only a few long augmenting paths remain, which causes the parallel algorithm to take comparatively long running times.

8.2. Grid graphs

To estimate the effects of graph structure on the scaling behavior, we used instances generated by a custom grid graph generator. The generator creates grid graphs of k × k vertices. Each vertex v in such a grid graph is connected to 4 neighbors (3 for boundary vertices and 2 for corners), for a total of 2(k² − k) edges. The grid graph is partitioned for a √p × √p processor grid such that each of the p processors is assigned an exclusive contiguous block forming a (k + 1)/√p × (k + 1)/√p subgrid. Between every horizontally adjacent pair of grid blocks is a grid column consisting of shared vertices, and between every vertically adjacent pair is a grid row of shared vertices. Edges incident to these vertices are divided evenly among the processors owning the adjacent blocks. Clearly, p must be a square number. A similar generator was used in [22].

In order to obtain blocks of equal size for each processor, √p must divide k + 1. Since k will always be significantly larger than √p, this guarantees an almost identical workload for all processors. Clearly, there are no odd cycles in such a graph, which means that it is always bipartite.
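A compact sketch of such a generator (not the authors' code): it enumerates the 2(k² − k) grid edges and assigns each vertex to a block owner on a √p × √p processor grid; the even splitting of edges incident to shared rows and columns is omitted.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Vertex (i, j) of the k x k grid has index i * k + j. Connecting each vertex
// to its right and lower neighbor produces every grid edge exactly once,
// i.e. 2(k^2 - k) edges in total.
std::vector<std::pair<int64_t, int64_t>> grid_edges(int64_t k) {
  std::vector<std::pair<int64_t, int64_t>> edges;
  edges.reserve(2 * (k * k - k));
  for (int64_t i = 0; i < k; ++i)
    for (int64_t j = 0; j < k; ++j) {
      if (j + 1 < k) edges.emplace_back(i * k + j, i * k + j + 1);   // right
      if (i + 1 < k) edges.emplace_back(i * k + j, (i + 1) * k + j); // down
    }
  return edges;
}

// Block owner of vertex (i, j): contiguous blocks of width (k + 1) / sqrt_p,
// following the partitioning described in the text.
int owner(int64_t i, int64_t j, int64_t k, int sqrt_p) {
  int64_t block = (k + 1) / sqrt_p;
  return static_cast<int>((i / block) * sqrt_p + (j / block));
}
```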

To test the performance of our parallel algorithm on such instances we generate grid graphs of moderate size. We set 5 ≤ √p ≤ 10 and k ≈ 160√p. The actual value of k varies slightly since √p must divide k + 1 in order to obtain exclusive blocks of equal size. Since the KARP–SIPSER initialization is likely to have a strong effect on such sparse graphs, we study its influence by running the code with and without it. For the relabeling parameters r and s, we approximate the average optimum values from the results in Table 1. The experimental results are given in Table 8.

Clearly, using KARP–SIPSER initialization is extremely helpful in the grid instances. With the exception of the slight outlier at √p = 7, the running times and efficiency are identical except for experimental random variations, which indicates that the code shows near-perfect weak scaling, as is to be expected for such instances. We also observe strong scaling, and a speedup of more than 10 at p = 100. Still, the high overhead prevents perfect strong scaling. Without KARP–SIPSER initialization, running times vary widely, but we can still observe some weak scaling, although the running times are much higher. Due to the good running times of the sequential algorithm, efficiency is very low, indicating that KARP–SIPSER initialization is much more beneficial for the parallel algorithm than for the sequential one.

The near-perfect weak scaling results suggest that the running time using KARP–SIPSER initialization is proportional to the number of edges per processor in the grid experiment. To test this, we also performed a strong scaling experiment by generating grid graphs with 1 ≤ √p ≤ 10, k ≈ 1600, and exclusive blocks of equal and even size. Table 9 lists the results. Using KARP–SIPSER initialization, we attain strong scaling, i.e., the efficiency remains constant except for random variations in the measurements under an increasing number of processors.

Table 8
Weak scaling in the grid experiment. Instance size grows proportional to the number of processors, rounded for equal block size. Using KARP–SIPSER initialization, running time is almost constant, indicating perfect weak scaling. Running times are in seconds.

√p                       5           6           7           8           9           10
k                        799         959         1119        1279        1439        1599

Edges                    1,275,204   1,837,444   2,502,084   3,269,124   4,138,564   5,110,404
Vertices                 638,400     919,680     1,252,160   1,635,840   2,070,720   2,556,800

Sequential time, KS      0.050       0.070       0.080       0.110       0.140       0.170
Sequential time, no KS   0.030       0.040       0.060       0.080       0.090       0.110

Processors               25          36          49          64          81          100
Parallel time, KS        0.012       0.012       0.015       0.012       0.012       0.012
Parallel time, no KS     0.978       1.347       0.794       1.707       1.233       1.200
Efficiency, KS           0.165       0.161       0.112       0.141       0.142       0.138
Efficiency, no KS        0.001       0.001       0.002       0.001       0.001       0.001


Table 9
Strong scaling in the grid experiment. Instance size is approximately 2,560,000 vertices and 5,116,800 edges for all processor configurations, rounded for equal block size. Using KARP–SIPSER initialization, running time decreases linearly with the number of processors. As efficiency is almost constant we have strong scaling here. Without initialization, running times are significantly higher. All running times are in seconds.

√p                     1       2       3       4       5       6       7       8       9       10
k                      1600    1598    1607    1599    1599    1595    1595    1599    1601    1599

Processors             1       4       9       16      25      36      49      64      81      100
Parallel time, KS      1.296   0.321   0.144   0.080   0.051   0.036   0.026   0.020   0.016   0.012
Parallel time, no KS   5.276   6.244   5.314   5.069   4.084   3.557   2.722   2.392   2.230   1.744
Efficiency, KS         0.131   0.133   0.132   0.133   0.133   0.132   0.133   0.132   0.134   0.138
Efficiency, no KS      0.021   0.004   0.002   0.001   0.001   0.001   0.001   0.001   0.001   0.001

Table 10
Difficult instances in the grid experiment. With KARP–SIPSER initialization, the sequential code is significantly faster than the parallel code using 1 processor, and much slower without. The other columns show the effect small changes to the input have on the KARP–SIPSER initialization. By going from odd to even size of the exclusive blocks, running time increases by up to a factor of 32. Running times are in seconds.

k                     1600          1600          1607        1598        1595        1601        1595        1602
√p                    seq           1             3           3           6           6           7           7
Block size            1600 × 1600   1600 × 1600   535 × 535   532 × 532   265 × 265   266 × 266   227 × 227   228 × 228

Vertices              2,560,000     2,560,000     2,582,449   2,553,604   2,544,024   2,563,200   2,544,024   2,566,403
Edges                 5,116,800     5,116,800     5,161,684   5,104,012   5,084,858   5,123,198   5,084,858   5,129,602

Processors            1             1             9           9           36          36          49          49
Running time, KS      0.180         1.296         0.144       0.577       0.036       0.725       0.026       0.857
Running time, no KS   0.120         5.276         5.314       3.735       3.557       2.376       2.722       1.816
Efficiency, KS        1.000         0.139         0.139       0.035       0.140       0.007       0.140       0.004
Efficiency, no KS     1.000         0.023         0.003       0.004       0.001       0.001       0.001       0.001


Still, the parallel algorithm incurs a significant overhead, which means that the efficiency cannot come close to 1. The sequential baseline runtime was given in the k = 1599 column of Table 8. Since all instances are nearly of identical size, using only one value is sufficient.

On the other hand, without KARP–SIPSER initialization the running time increases significantly, and only slight scaling can be observed. Just like the running times of the performance experiment shown in Table 3, these values are highly dependent on the selection of the relabeling parameters r and s.

A comparison between the sequential and the parallel algorithm running on one processor is given in the first columns of Table 10. The sequential code is significantly faster than the parallel code using 1 processor. Still, it takes at least 0.12 s, while the parallel code using 100 processors took 0.012 s, giving a speedup of 10. On the other hand, the fact that the sequential code is faster than the parallel code using only 1 processor by a factor of 10 suggests that a more efficient implementation of the parallel algorithm is possible.

To highlight the volatility of the running times in the grid experiment, we changed the instance sizes from Table 9 slightly. Results are given in the remaining columns of Table 10. These results show how much the KARP–SIPSER initialization depends on the partitioning. For exclusive blocks of odd dimensions, it yields a maximum matching using local computations only because, in addition to its exclusive block, each processor owns the connectors adjacent to its block, and thus has local access to a grid of even size. If the total number of vertices in the grid is odd, the processor in the lower right corner has a single unmatched vertex remaining. The algorithm then detects this and terminates. If the total number of vertices was even, a perfect matching has been found. Of course, this means that grid graphs with odd exclusive blocks are extremely easy instances, which explains the excellent scaling results above.

On the other hand, due to the connectors, even block dimensions result in an odd number of local original vertices, making it impossible to find a perfect matching via local computations and thus causing a significant amount of parallel matching work. This shows that after partitioning a given matrix among a variable number of processors, as it was done in Experiment 2 (see Table 3), the success of the KARP–SIPSER initialization can vary widely, with practically unpredictable effects on total running time. This fact poses a fundamental difficulty in the performance analysis of our algorithm.

9. Conclusions and further work

From the results of the experiments, we can conclude that:

• On the real-world instances, the algorithm does not show strong scaling or speedup. However, parallel performance improves with increasing p to some degree.
• For low values of p, global relabels should be infrequent. Since the processor length of augmenting paths will be low, the need for global relabelings is likely to be lower compared to experiments with higher processor counts. Also, the number of operations per processor in a global relabeling is high, thus making it costly. For high p, the opposite is the case, which explains the better performance of low values of r there.


• Using low relabeling speeds, i.e., high values of s, generally increases performance by improving load balancing. However, setting s > r yields weak performance. This is to be expected because in this case a relabeling wave might not be able to relabel all vertices on a processor before a new wave starts. For s = 128, running times were in many instances slower than the best running times by three orders of magnitude. Thus, setting s = r becomes optimal since this provides the strongest load balance possible that avoids the above problem. However, this is somewhat instance dependent. Often, lower values of s provide performance similar to that for s = r, but the performance when setting s = r is rarely exceeded significantly, which suggests using this relation for further experiments.
• The optimum number of global relabels depends to some degree on the instance, but this cannot be analyzed beforehand and thus Algorithm 2 cannot take this into account. However, on difficult instances Algorithm 2 performs reasonably well and shows performance comparable to the sequential algorithm.
• As a rule of thumb, doubling the processor number halves the optimal values of r and s.
• With a larger number of processors the effect of parameters on performance increases (see Table 1). For p = 128, average performance over all values of s and r examined is only 27% of the optimum performance.
• Performance of the sequential algorithm is usually superior to performance of Algorithm 2, unless instances are very large, very difficult, or very easy.
• The performance of the algorithm could be improved by about 20% by automatically adapting s and r towards the optimum for the current instance. The sequential algorithm does this by starting a global relabeling after O(n) local relabels. Attempts to introduce a similar mechanism in the parallel algorithm were not successful (see Table 2). It is possible that varying s and r during a run of the algorithm results in even larger performance gains.
• The effect of the KARP–SIPSER heuristic on the parallel algorithm is unpredictable. In some cases it finds perfect matchings. In other cases it increases running time substantially. Still, our findings suggest that using it pays off on average.
• The parallel algorithm matches most vertices in the beginning and then takes a comparatively long time to match the remaining vertices. Therefore, it produces an ε-approximation quickly.
• For instances that do not suffer from long augmenting paths, such as grid graphs, the algorithm shows perfect weak scaling.
• The transpose experiment on av41092 indicates that the algorithm is quite stable. It does not suffer from the same problems associated with auction algorithms.
• Algorithm 2 can be transformed into an ε-approximation algorithm simply by stopping as soon as a 1 − ε fraction of the vertices is matched. The round-by-round analysis in Section 7.4 indicates that such an algorithm would be significantly faster than Algorithm 2.
• The experiments on Erdős–Rényi random graphs indicate that the algorithm has reasonably good weak scaling. Therefore, we expect it to be suitable for very large instances.

We have seen that Algorithm 2 showed reasonably good weak scaling in the experiments. However, for the relatively small real world matrices, almost no speedup compared to the sequential PUSH-RELABEL algorithm was measured. Strong scaling w.r.t. the one processor case was found, although the effect was relatively weak in most cases. The main usefulness of the algorithm is in parallel applications where the data is already partitioned and distributed on the processors. In this setting the alternative to a parallel algorithm would be to gather the data on one processor before applying a sequential algorithm and then again distributing the solution, although this might not be possible because of memory limitations on a single compute node. Even if it is possible, doing so would increase the sequential running time by a small factor. Preliminary experiments suggest that this factor lies between 2 and 3, making the parallel algorithm more competitive on large instances.

When comparing experimental results for Algorithm 2 to those for the shared memory maximum flow algorithms that Algorithm 2 is based on, we note that relative to processor speed, memory access on the shared memory machines is still faster than using the interconnects on a distributed memory supercomputer. Furthermore, bipartite matching tends to exhibit a smaller amount of parallelism than maximum flow [15]. Still, with the exception of the one processor case, the scaling behavior of Algorithm 2 compares favorably to the results presented in [14]. However, Algorithm 2 was unable to provide a speedup comparable to that reported in [16], but this might be due to the fact that sequential algorithms profit disproportionately from recent advances in processor technology.

Furthermore, as [11,13] point out, selecting the optimum order of pushes is very important for performance in the sequential PUSH-RELABEL algorithm. Since pushes are grouped together during the communication rounds, their order cannot be controlled in the same way in the parallel algorithm, which might be an obstacle for improved parallel performance.

If one were to improve further on the parallel running time of Algorithm 2, there are different options that one could pursue. The main weakness of Algorithm 2 is the high cost of performing a global relabeling. A relabeling wave touches every reachable vertex independent of whether it is on an alternating path to an active vertex or not. Also, a vertex will be continuously relabeled even if its path to a free right vertex has not changed. One way to speed up global relabelings might be to use local graph compression techniques such as those discussed in [26,33,34]. In that case, it is feasible to increase the relabeling frequency, thereby drastically decreasing the number of rounds.

We note that the ideas of local work leading up to Algorithm 2 might also be applied in the shared memory model, giving an algorithm where more work could be performed by each processor in between synchronizations.


A closely related problem in combinatorial scientific computing is finding a matching of maximum weight in addition to maximum cardinality. One way to obtain an approximation for this problem would be to first apply Algorithm 2 to compute a maximum matching without regard to the weights of the edges. Based on this solution one can then search for weight augmenting cycles and augment along these. Depending on the length of the longest cycle one searches for and whether such cycles can span several processors or not, one obtains algorithms with different solution quality and timing properties. Exploring such strategies is a topic for further study, but we note that preliminary results indicate that this is a promising approach.

Acknowledgements

The second author of this work is supported in part by NSF award numbers OCI-0724599, CNS-0830927, CCF-0621443, CCF-0833131, CCF-0938000, CCF-1029166, and CCF-1043085 and in part by DOE Grants DE-FC02-07ER25808, DE-FG02-08ER25848, DE-SC0001283, DE-SC0005309, and DE-SC0005340.

Appendix A

Tables 11–19.

Table 11
Table of the base values in Fig. 2, p = 1. Values are averages over the ratios of optimum performance obtained for a given combination of r and s on the test matrices from Table 1 at p = 1.

r \ s   1      2      4      8      16     32     64     128
128     0.93   0.93   0.91   0.96   0.98   0.97   0.98   0.96
64      0.92   0.92   0.90   0.95   0.95   0.96   0.97   0.94
32      0.89   0.89   0.89   0.92   0.93   0.95   0.95   0.93
16      0.89   0.89   0.88   0.92   0.94   0.94   0.90   0.88
8       0.84   0.84   0.83   0.87   0.86   0.82   0.79   0.80
4       0.84   0.84   0.83   0.85   0.82   0.77   0.79   0.77
2       0.82   0.82   0.83   0.81   0.79   0.77   0.77   0.74
1       0.82   0.82   0.80   0.80   0.78   0.77   0.75   0.73

Table 12
Table of the base values in Fig. 2, p = 2. Values are averages over the ratios of optimum performance obtained for a given combination of r and s on the test matrices from Table 1 at p = 2.

r \ s   1      2      4      8      16     32     64     128
128     0.71   0.74   0.82   0.87   0.92   0.94   0.94   0.90
64      0.65   0.65   0.71   0.79   0.82   0.87   0.87   0.75
32      0.54   0.55   0.69   0.72   0.75   0.76   0.72   0.35
16      0.53   0.51   0.59   0.64   0.70   0.71   0.36   0.39
8       0.48   0.50   0.54   0.62   0.64   0.28   0.36   0.30
4       0.44   0.45   0.51   0.58   0.30   0.35   0.32   0.29
2       0.40   0.39   0.48   0.27   0.30   0.33   0.22   0.22
1       0.35   0.36   0.52   0.24   0.30   0.32   0.24   0.20

Table 13
Table of the base values in Fig. 2, p = 4. Values are averages over the ratios of optimum performance obtained for a given combination of r and s on the test matrices from Table 1 at p = 4.

r \ s      1      2      4      8     16     32     64    128
  128   0.62   0.63   0.72   0.85   0.91   0.87   0.85   0.76
   64   0.56   0.62   0.71   0.81   0.80   0.84   0.73   0.64
   32   0.54   0.57   0.65   0.76   0.76   0.74   0.61   0.40
   16   0.47   0.51   0.62   0.68   0.75   0.62   0.38   0.36
    8   0.42   0.47   0.56   0.64   0.57   0.37   0.39   0.34
    4   0.42   0.45   0.56   0.50   0.34   0.38   0.35   0.35
    2   0.39   0.39   0.48   0.29   0.40   0.44   0.31   0.23
    1   0.36   0.40   0.31   0.30   0.41   0.35   0.35   0.28


Table 14
Table of the base values in Fig. 2, p = 8. Values are averages over the ratios of optimum performance obtained for a given combination of r and s on the test matrices from Table 1 at p = 8.

r \ s      1      2      4      8     16     32     64    128
  128   0.62   0.66   0.76   0.76   0.80   0.73   0.74   0.78
   64   0.59   0.63   0.70   0.76   0.76   0.80   0.81   0.49
   32   0.58   0.60   0.70   0.74   0.76   0.75   0.55   0.45
   16   0.53   0.53   0.61   0.63   0.72   0.54   0.42   0.41
    8   0.49   0.53   0.65   0.67   0.55   0.46   0.39   0.32
    4   0.47   0.48   0.62   0.68   0.44   0.43   0.41   0.27
    2   0.42   0.47   0.59   0.40   0.47   0.47   0.36   0.23
    1   0.42   0.50   0.38   0.39   0.46   0.46   0.33   0.28

Table 15
Table of the base values in Fig. 2, p = 16. Values are averages over the ratios of optimum performance obtained for a given combination of r and s on the test matrices from Table 1 at p = 16.

r \ s      1      2      4      8     16     32     64    128
  128   0.50   0.57   0.61   0.67   0.69   0.67   0.61   0.59
   64   0.49   0.55   0.61   0.65   0.64   0.64   0.54   0.28
   32   0.55   0.54   0.63   0.73   0.75   0.76   0.32   0.31
   16   0.49   0.52   0.59   0.68   0.80   0.42   0.36   0.26
    8   0.47   0.54   0.60   0.75   0.43   0.44   0.33   0.26
    4   0.45   0.50   0.65   0.59   0.53   0.44   0.31   0.25
    2   0.48   0.53   0.66   0.42   0.47   0.47   0.36   0.24
    1   0.46   0.57   0.48   0.46   0.53   0.42   0.30   0.19

Table 16
Table of the base values in Fig. 2, p = 32. Values are averages over the ratios of optimum performance obtained for a given combination of r and s on the test matrices from Table 1 at p = 32.

r \ s      1      2      4      8     16     32     64    128
  128   0.36   0.35   0.36   0.41   0.40   0.38   0.39   0.35
   64   0.39   0.41   0.52   0.55   0.51   0.49   0.41   0.16
   32   0.47   0.51   0.58   0.71   0.69   0.60   0.27   0.16
   16   0.57   0.64   0.64   0.76   0.76   0.23   0.24   0.16
    8   0.54   0.64   0.70   0.83   0.41   0.29   0.24   0.16
    4   0.52   0.61   0.80   0.52   0.46   0.30   0.24   0.16
    2   0.54   0.63   0.73   0.47   0.47   0.35   0.25   0.16
    1   0.54   0.69   0.40   0.50   0.40   0.31   0.23   0.11

Table 17
Table of the base values in Fig. 2, p = 64. Values are averages over the ratios of optimum performance obtained for a given combination of r and s on the test matrices from Table 1 at p = 64.

r \ s      1      2      4      8     16     32     64    128
  128   0.22   0.21   0.25   0.26   0.26   0.23   0.19   0.15
   64   0.29   0.34   0.35   0.39   0.35   0.29   0.23   0.08
   32   0.41   0.41   0.46   0.51   0.42   0.37   0.11   0.08
   16   0.49   0.51   0.63   0.62   0.48   0.16   0.13   0.10
    8   0.59   0.61   0.69   0.72   0.24   0.23   0.12   0.09
    4   0.60   0.68   0.82   0.48   0.30   0.21   0.12   0.08
    2   0.65   0.74   0.56   0.49   0.36   0.23   0.14   0.09
    1   0.69   0.73   0.49   0.51   0.41   0.26   0.16   0.09


Table 18
Table of the base values in Fig. 2, p = 128. Values are averages over the ratios of optimum performance obtained for a given combination of r and s on the test matrices from Table 1 at p = 128.

r \ s      1      2      4      8     16     32     64    128
  128   0.05   0.06   0.06   0.06   0.06   0.06   0.05   0.04
   64   0.16   0.16   0.18   0.17   0.17   0.16   0.11   0.05
   32   0.28   0.27   0.27   0.29   0.27   0.19   0.09   0.05
   16   0.41   0.40   0.41   0.44   0.33   0.15   0.08   0.05
    8   0.50   0.53   0.57   0.54   0.24   0.14   0.08   0.05
    4   0.57   0.70   0.69   0.41   0.25   0.15   0.10   0.05
    2   0.66   0.78   0.53   0.41   0.26   0.16   0.10   0.04
    1   0.78   0.89   0.53   0.44   0.29   0.18   0.10   0.01

Table 19
Running times of the parallel algorithm using KARP–SIPSER initialization (KS) and no initialization (NONE). On average, the uninitialized algorithm is 39% slower.

Matrix          Initialization                 Processors
                                   4      8     16     32     64    128
Hamrle3         NONE           14.77  10.67   7.17   4.56   2.95   4.02
                KS             12.81   7.32   5.88   3.02   3.43   3.55
ldoor           NONE            8.25   5.22   4.32   2.94   2.12   1.84
                KS              5.97   3.68   2.39   1.20   1.06   1.35
cage14          NONE           15.84   7.96   6.22   4.22   3.51   2.97
                KS              9.40   5.91   4.31   3.32   2.34   2.19
kkt_power       NONE           14.46   7.68   8.05   3.99   4.41   3.26
                KS              8.93   6.87   5.23   2.10   1.98   1.85
parabolic_fem   NONE            2.94   1.86   1.08   0.78   0.67   0.93
                KS              2.30   1.48   1.14   0.74   0.59   0.52
av41092         NONE            0.79   0.62   0.54   0.51   0.81   1.50
                KS              0.56   0.43   0.41   0.57   0.92   1.76
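For reference, the KARP–SIPSER initialization compared in Table 19 can be sketched sequentially as follows (in Python; this simplified version is our own, treats only degree-one left vertices specially, and differs from the parallel variant actually used).

```python
import random

def karp_sipser(adj, n_right):
    """Simplified sequential sketch of the KARP-SIPSER heuristic [30]:
    while some vertex of degree one remains, match it with its unique
    neighbour; otherwise match the endpoints of an arbitrary edge.
    (Our own sketch: only degree-one left vertices are treated specially
    here, and the paper's parallel variant differs.)

    adj[u] lists the right neighbours of left vertex u.
    Returns (match_left, match_right); None marks an unmatched vertex.
    """
    n_left = len(adj)
    nbr_left = [set(vs) for vs in adj]
    nbr_right = [set() for _ in range(n_right)]
    for u, vs in enumerate(adj):
        for v in vs:
            nbr_right[v].add(u)

    match_left = [None] * n_left
    match_right = [None] * n_right

    def drop(u, v):
        # remove the freshly matched pair (u, v) and their incident edges
        for w in nbr_left[u]:
            nbr_right[w].discard(u)
        for x in nbr_right[v]:
            nbr_left[x].discard(v)
        nbr_left[u].clear()
        nbr_right[v].clear()

    while True:
        live = [u for u in range(n_left) if nbr_left[u]]
        if not live:
            break
        # prefer a degree-one vertex: matching it with its unique neighbour
        # never prevents finding a maximum matching later
        u = next((x for x in live if len(nbr_left[x]) == 1), None)
        if u is None:
            u = random.choice(live)     # otherwise pick an arbitrary edge
        v = next(iter(nbr_left[u]))
        match_left[u], match_right[v] = v, u
        drop(u, v)

    return match_left, match_right
```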


References

[1] I.S. Duff, On algorithms for obtaining a maximum transversal, ACM Transactions on Mathematical Software 7 (3) (1981) 315–330.
[2] A. Azad, J. Langguth, Y. Fang, A. Qi, A. Pothen, Identifying rare cell populations in comparative flow cytometry, in: V. Moulton, M. Singh (Eds.), Workshop on Algorithms in Bioinformatics, Lecture Notes in Computer Science, vol. 6293, Springer, Berlin/Heidelberg, 2010, pp. 162–175.
[3] R.J. Baxter, Exactly Solved Models in Statistical Mechanics, Academic Press, 1982.
[4] J.R. Dias, G.W.A. Milne, Chemical applications of graph theory, Journal of Chemical Information and Computer Sciences 32 (1) (1992) 1.
[5] T.A. Davis, Y. Hu, The University of Florida sparse matrix collection, ACM Transactions on Mathematical Software, <http://www.cise.ufl.edu/research/sparse>, in press.
[6] I.S. Duff, Algorithm 575: Permutations for a zero-free diagonal [F1], ACM Transactions on Mathematical Software 7 (3) (1981) 387–390.
[7] J. Edmonds, R.M. Karp, Theoretical improvements in algorithmic efficiency for network flow problems, Journal of the ACM 19 (2) (1972) 248–264.
[8] H. Alt, N. Blum, K. Mehlhorn, M. Paul, Computing a maximum cardinality matching in a bipartite graph in time O(n^{1.5} sqrt(m/log n)), Information Processing Letters 37 (4) (1991) 237–240.
[9] J.E. Hopcroft, R.M. Karp, An n^{5/2} algorithm for maximum matchings in bipartite graphs, SIAM Journal on Computing 2 (4) (1973) 225–231.
[10] A.V. Goldberg, R.E. Tarjan, A new approach to the maximum flow problem, in: Proceedings of the 18th Annual ACM Symposium on Theory of Computing, 1986, pp. 136–146.
[11] B.V. Cherkassky, A.V. Goldberg, P. Martin, J.C. Setubal, J. Stolfi, Augment or push: A computational study of bipartite matching and unit-capacity flow algorithms, ACM Journal of Experimental Algorithmics 3 (1999).
[12] I.S. Duff, K. Kaya, B. Uçar, Design, implementation, and analysis of maximum transversal algorithms, Tech. Rep. TR/PA/10/76, CERFACS, Toulouse, France, 2010. URL <http://www.cerfacs.fr/algor/reports/2010/TR_PA_10_76.pdf>.
[13] K. Kaya, J. Langguth, F. Manne, B. Uçar, Experiments on push-relabel based maximum cardinality matching algorithms for bipartite graphs, Tech. Rep. TR/PA/11/33, CERFACS, Toulouse, France, 2011. URL <http://www.cerfacs.fr/algor/reports/2011/TR_PA_11_33.pdf>.
[14] D.A. Bader, V. Sachdeva, A cache-aware parallel implementation of the push-relabel network flow algorithm and experimental evaluation of the gap relabeling heuristic, in: Proceedings of the 18th International Conference on Parallel and Distributed Computing Systems, ICPDCS 2005.
[15] R. Anderson, J.C. Setubal, A parallel implementation of the push-relabel algorithm for the maximum flow problem, Journal of Parallel and Distributed Computing 29 (1) (1995) 17–26.
[16] J.C. Setubal, New experimental results for bipartite matching, in: Proceedings of Network Optimization, Theory and Practice, NETFLOW 1993.
[17] L. Bus, P. Tvrdík, Distributed memory auction algorithms for the linear assignment problem, in: IASTED Parallel and Distributed Computing and Systems, IDCS 2001, 2002, pp. 137–142.
[18] J. Riedy, Making static pivoting scalable and dependable, Ph.D. thesis, EECS Department, University of California, Berkeley, Dec 2010.
[19] O. Schenk, M. Manguoglu, A. Sameh, M. Christen, M. Sathe, Parallel scalable PDE-constrained optimization: antenna identification in hyperthermia cancer treatment planning, Computer Science – Research and Development 23 (2009) 177–183.
[20] M. Manguoglu, A.H. Sameh, O. Schenk, PSPIKE: A parallel hybrid sparse linear system solver, in: Proceedings of the 15th International European Conference on Parallel Processing, Euro-Par 2009.
[21] F. Manne, R.H. Bisseling, A parallel approximation algorithm for the weighted maximum matching problem, in: Proceedings of the 7th International Conference on Parallel Processing and Applied Mathematics, PPAM 2007, vol. 4967, Springer, Berlin/Heidelberg, 2007, pp. 708–717.


[22] Ümit V. Çatalyürek, F. Dobrian, A. Gebremedhin, M. Halappanavar, A. Pothen, Distributed-memory parallel algorithms for matching and coloring, in: Proceedings of the IPDPS Workshop on Parallel Computing and Optimization, PCO 2011, 2011, pp. 1966–1975.
[23] B. Vastenhouw, R.H. Bisseling, A two-dimensional data distribution method for parallel sparse matrix-vector multiplication, SIAM Review 47 (1) (2005) 67–95.
[24] Ümit V. Çatalyürek, E. Boman, K. Devine, D. Bozdag, R. Heaphy, L. Riesen, Hypergraph-based dynamic load balancing for adaptive scientific computations, in: Proceedings of the 21st International Parallel and Distributed Processing Symposium, IPDPS 2007, IEEE, 2007.
[25] M.M.A. Patwary, R.H. Bisseling, F. Manne, Parallel greedy graph matching using an edge partitioning approach, in: Proceedings of the 4th ACM SIGPLAN Workshop on High-level Parallel Programming and Applications, HLPP 2010, 2010, pp. 45–54.
[26] J. Langguth, F. Manne, P. Sanders, Heuristic initialization for bipartite matching problems, ACM Journal of Experimental Algorithmics 15 (2010) 1.3:1–1.3:22.
[27] J.M.D. Hill, B. McColl, D.C. Stefanescu, M.W. Goudreau, K. Lang, S.B. Rao, T. Suel, T. Tsantilas, R.H. Bisseling, BSPlib: the BSP programming library, Parallel Computing 24 (14) (1998) 1947–1980.
[28] B.H. Korte, J. Vygen, Combinatorial Optimization: Theory and Algorithms, Birkhäuser, 2006.
[29] R.H. Bisseling, Parallel Scientific Computation: A Structured Approach Using BSP and MPI, Oxford University Press, 2004.
[30] R.M. Karp, M. Sipser, Maximum matchings in sparse random graphs, in: Proceedings of the 22nd Annual Symposium on Foundations of Computer Science, FOCS '81, IEEE, 1981, pp. 364–375.
[31] J. Leskovec, Stanford network analysis platform, <http://snap.stanford.edu>, July 2009.
[32] J. Magun, Greedy matching algorithms, an experimental study, ACM Journal on Experimental Algorithmics 3 (1997).
[33] T. Feder, R. Motwani, Clique partitions, graph compression and speeding-up algorithms, Journal of Computer and System Sciences 51 (2) (1995) 261–272.
[34] M. Löhnertz, Algorithmen für Matchingprobleme in speziellen Graphklassen (Algorithms for matching problems in special graph classes), Ph.D. thesis, Universität Bonn, 2010.