parallel detection of strongly connected components with

Vrije Universiteit Amsterdam Universiteit van Amsterdam

Master Thesis

Parallel Detection of StronglyConnected Components with

Prioritized Vertices

Author: Benjamin Guicherit (s2562694)

1st supervisor: W.J. Fokkink2nd reader: F. van Raamsdonk

A thesis submitted in fulfillment of the requirements for

the joint UvA-VU Master of Science degree in Computer Science

July 12, 2019

2

Abstract

This work describes the development of a new parallel Strongly Con-nected Component (SCC) decomposition algorithm by adapting the ex-isting MultiStep algorithm by including a logic that prioritizes verticeswith large numbers of edges. The new algorithm is implemented along-side the FW-BW, Hong, and MultiStep algorithms and benchmarked onrandomly generated graphs. To facilitate meaningful results, a randomgraph generation algorithm was developed that generates random graphswith different layouts and clique prevalence.

1 Introduction

In graph theory a strongly connected component (SCC) is a maximal subsetof vertices whereof each vertex can reach every other vertex in that subset.From the definition of an SCC follows that a graph can be fully decomposedinto disjoint SCCs. Authors from related work mention several fields in whichSCC decomposition finds application. [8] mentions reinforcement learning, 3Dmesh element refinement and complex food web analysis, [11] lists compilerconstruction and bio-informatics, [12] brings up data mining, scientific comput-ing, computer-aided design and model checking (see section 3.2), [15] namesweb graph and social network analysis, and [7] and [9] (see section 3.1) dis-cuss radiation transport using the discrete ordinates method. Computers canassist by algorithmically finding the SCC decomposition of a graph. In 1972,Robert Tarjan proposed a linear-time SCC decomposition algorithm that relieson depth-first search (DFS) (see [16]). Although DFS is a difficult to parallelizesubroutine by nature, contributors were not deterred from developing parallelDFS-based algorithms [13][10][4][5]. These algorithms do not parallelize DFSitself however, but rather deploy parallel instances of a DFS-based routine. Thehigher scalability requirements of modern applications, ever growing graph sizes,and the poor parallelizability of DFS however, has raised the need for different,more efficiently parallelizable algorithmic techniques.

In the year 2000, [7] introduced Forward-Backward (FW-BW): a recursivebreadth-first search (BFS) and divide-and-conquer based algorithm. SCCs arefound by computing the forward reachability and backward reachability setsfrom a pivot vertex. The intersection of these two sets is an SCC, while theremaining vertices can be split into independent subgraphs (they share no ver-tices of any SCC) that can be used as input for recursive calls. Several authorshave contributed to the parallelized SCC detection problem since. [9] enhancedFW-BW with trimming logic. It quickly decomposes size 1 components, im-proving runtimes by preventing recursive instances from being created for suchcomponents, which are numerous in typical graphs. [14] created the MultiPivotalgorithm, which outputs SCCs in topological ordering. This is achieved bycomputing the forward reachability set of a set of pivots, rather than a singlepivot. Then five different sets of vertices are identified with properties thatensure the topological ordering of the SCCs output by the recursive calls done

3

on those sets. [6] Proposed optimizations to [14], and also adapted the Mul-tiPivot algorithm to find terminal SCCs. [3] developed the OBF algorithm,which decomposes the graph into disjoint sets called slices that are defined bythe forward reachability sets of random starting vertices. These slices are thenassigned to different worker threads. The worker threads then run in parallel tothe main thread that continues to decompose the graph into slices. [4] proposedthe UF-SCC algorithm. As a DFS-based algorithm, it specializes in on-the-flySCC decomposition which means it can decompose graphs as the graph is beinggenerated, and not explicitly stored in memory. The Union-Find data structureis used to efficiently communicate descendant sets between workers, and mergethe partial SCCs discovered by them. UF-SCC outperforms Tarjan’s algorithmat on-the-fly SCC decomposition on graphs that do not contain a giant SCC.The algorithm was later enhanced in [5] by combining the Union-Find datastructure with a cyclic list. This allows the algorithm to additionally performwell on graphs with a large component, resulting in a 10 to 30 times decom-position speedup on such graphs. [8] noted that the algorithm suggested by[9] performs poorly on real-world graphs, which typically have the small worldproperty. Small world graphs typically have one huge SCC and several smallerSCC connecting to the large one in only one direction. The authors proposemethods to efficiently decompose a large SCC. Novel concepts by this author aresize two component trimming, and WCC decomposition preprocessing to facili-tate better parallelism in the subsequent recursive FW-BW phase. [15] designedthe current most generally applicable and efficient algorithm called MultiStep.It employs several techniques from other contributors to eliminate inefficienciesduring different stages of the SCC decomposition. The efficiency of decompo-sition procedures depends among other things on graph density (the numberof edges relative to the number of vertices). For coloring, graph diameter (thelongest shortest path between any two vertices) also plays a role. Among thesetechniques are FW-BW, trimming ([9]), coloring ([12]), and Tarjan’s algorithm.The resulting algorithm performs well on all types of graphs.

This work attempts to improve upon the MultiStep algorithm with an al-gorithmic adaptation. The coloring subroutine is changed to assign the highestcolors to vertices with the highest product of degrees (product of the numberof in-neighbors and out-neighbors) during color initialization. This increasesthe likelihood that larger components will be detected during the first color-ing iteration. The new algorithm (MultiStep+) as well as the FW-BW, Hong,and MultiStep algorithms are all implemented and benchmarked on randomlygenerated graphs. A random graph generation algorithm is developed to gen-erate both sparse and dense graphs, as well as graphs with various numbersof SCCs, and SCCs of various sizes. Naive random graph generation producesgraphs with either one large SCC and the rest of the SCCs orders of magnitudesmaller, or graphs with a vast majority of trivial components. The developedgraph generation algorithm allows for interesting graph layouts. All of bench-marked algorithms are implemented and tested on these generated graphs.

Section 2 explains terminology, section 3 describes the different SCC detec-tion algorithms as well as some of their applications, section 4 describes the

4

implementation details and the random graph generation algorithm, section 5describes the experiments, section 6 discusses results, section 7 discusses imple-mentation and methodological issues, section 8 concludes, and section 9 suggestsfuture work.

2 Preliminaries

2.1 Graphs

Let G = (V,E) be a graph, where V denotes the set of vertices, and E the setof edges. The graph has |V | = n vertices and |E| = m edges. Edges are ofthe form (x, y) with x, y ∈ V and represent an edge from vertex x to vertexy. In directed graphs, the existence of edge (x, y) does not imply the existenceof edge (y, x) and can thus be considered a one-way edge. One can travelfrom a vertex to another vertex if there exists an edge between the startingand the ending vertex. Vertex x can reach vertex y if there exists a series ofedges such that one can (indirectly) travel from x to y in one or more steps.i.e. ∃i1, i2, ..., ik ∈ V : {(x, i1), (i1, i2), ..., (ik−1, ik), (ik, y)} ⊆ E where k is aninteger ≥ 0. Moreover, a vertex is always able to reach itself. An SCC is amaximal subset of vertices where each vertex can reach all other vertices inthe subset. Every vertex in a graph is in exactly one SCC. If vertex x werein multiple SCCs, then the two SCCs are in fact one and the same SCC, aseach of them can reach and be reached from x. As such, graphs can be fullydecomposed into a set of disjoint SCCs. An SCC of size 1, is called a trivialSCC. ”Component” may also be used to reference an SCC.

Weakly connected components (WCC) are like SCCs, but have a less restric-tive definition. If we consider all edges to be two-way edges, and then find anSCC, that SCC is a WCC in the original graph with unchanged edges. In otherwords, if vertices may additionally reach other vertices over incoming edges(as well as over outgoing ones), i.e. we may travel to our vertex’ in-neighbors,then SCCs found using that logic are a WCC using regular connectivity logic.In graphs where edges do not have a direction, but can be travelled over inboth directions (an undirected graph), we do not speak of strongly or weaklyconnected components, but merely of connected components.

Let x ∈ V be a vertex, and O ⊆ V the set of vertices that x has an outgoingedge towards. i.e. O = {u ∈ V | (x, u) ∈ E}. We call all vertices in O out-neighbors of x. Now let I ⊆ V be the set of vertices that have an edge leadingto x. i.e. I = {u ∈ V | (u, x) ∈ E}. We call a vertex in I an in-neighbor ofx. Vertices that are reachable from x are called descendants of x. Vertices thatcan reach x are called ancestors of x.

A subgraph induced by the subset of vertices S ⊆ V is the graph G′ =(V ′, E′), where V ′ = S and E′ = {(a, b) | a, b ∈ S}. The subgraph has onlyvertices that are in S, and only edges that connect two vertices in S. The restof the vertices and edges is discarded.

A topological ordering of a graph is a labeling of vertices with the numbers

5

1 through n, such that each vertex gets a unique number. This is done suchthat each vertex is assigned a lower number than all of its out-neighbors. Let1 ≤ L(x) ≤ n be a mapping of a vertex x to a number. For a topologicalordering, the statement ∀(a, b) ∈ E : L(a) < L(b) must hold. A graph withcycles does not have a topological ordering, as there will inevitably be at leastone edge that breaks the rule.

2.2 Random Graph Generation

The terms in this subsection are only used inside the scope of this work.When generating a random number to determine the success of n indepen-

dent events, the relative probability determines the average number of successesamong n rolls. When generating an Erdos graph with n vertices where theprobability p for any edge to exist equals 1

n , the relative probability for thatgraph is 1. On average, each vertex in this graph will have 1 out-neighbor.

In a graph, the relative size of an SCC is its size divided by the size of thelargest SCC. The relative size representation sequence (RSRS) of a graph is asorted sequence from large to small of the relative sizes of each SCC in thatgraph. For example, a graph with SCCs with the following sizes: 10, 3, 1 willhave an RSRS of (1, 0.3, 0.1).

2.3 Parallelism

Parallelism refers to a task’s ability to be worked on efficiently by different work-ers at the same time. The aim of developing parallel algorithms is to use multipleprocessors to solve a problem faster than a single processor could. As such, eachworker should contribute unique progress towards the task’s completion, uniqueimplying non-duplicated work. To properly support parallelism, a task must beable to be split into independent pieces of work, often called subtasks. Eachsubtask should be completable without requiring the completion of another sub-task’s work. Different workers may exchange information. When two or moreworkers (depending on the type of algorithm) have finished their subtasks, amaster thread combines their solutions into an overall solution. When the so-lutions of all subtasks are combined, the solution of the to-be-parallelized taskhas been found. A non-parallel procedure running on a single processor is calledsequential or serial.

3 Parallel SCC Algorithms

In this section, the papers that originally presented the algorithms and sub-routines implemented in this article are summarized, and the new algorithm isdescribed.

6

3.1 Fleischer et al. (2000) [7] and McLendon III et al.(2005) [9]

The radiation transport problem deals with radiative energy and its absorption,transport, and scattering through different types of objects. Solving this prob-lem can help predicting what will happen in case of an accident releasing dan-gerous radiation such as radioactive. [7] and [9] describe the discrete ordinatesmethod of solving radiation transport. This method involves the following: Ob-jects through which radiation travels are represented by a set of 3-dimensionalpolyhedral bodies made up out of polygons. Each of these bodies is adjacentto its neighbors, with which it shares at least one polygon. This scene of 3Dbodies can now be turned into a graph, where each body is a vertex, and edgesare placed between each pair of bodies that share a polygon. There is a globalvector that indicates the direction of the flow of radiation. Edges are directedsuch that their corresponding polygons align with this vector. Specifically, thedirection corresponds to the polygon’s normal vector that makes the smallestangle (out of the two normal vectors) with the radiation vector. To computethe behavior of radiation within a body or vertex, all of that vertex’ ancestorsmust have finished their computation. This requires the vertices to be treatedin topological order. As a topological ordering cannot be found if the graphhas cycles, SCC detection algorithms are required to find and eliminate thesecycles.

[7] develops the ”divide-and-conquer strong components” (DCSC) proce-dure. Only [7] and [9] refer to the technique in this way. Authors of latercontributions refer to it with ”forward-backward” (FW-BW). The FW-BW al-gorithm manages to find SCCs while avoiding depth-first search (DFS), anddivides the graph into disjoint subsets, that cannot share vertices of the sameSCC, thus supporting parallelism.

The algorithm works as follows: Select a random pivot vertex, and computeits ancestor and descendant sets using breadth-first search (BFS). Now we canidentify four sets of vertices.

1. Vertices that are in both the ancestor and descendant sets.

2. Vertices that are only in the ancestors, but not in the descendants.

3. Vertices that are only in the descendants, but not in the ancestors.

4. Vertices that are in neither the ancestors nor in the descendants.

Output the subgraph induced by set 1 as an SCC, and remove it from the graph.Now recursively call our algorithm on the subgraphs induced by sets 2, 3, and4 in parallel.

The intersection of the ancestor and descendant sets naturally makes anSCC. This SCC is a maximal connected subgraph, because vertices that donot lie in either ancestors or descendants, cannot be in the SCC. The reasonthis parallelism works, is because the remaining SCCs are enclosed within theirrespective sets. SCCs cannot partially lie in one set and have their remainder lie

7

in another. Set 2 has no edges incoming from sets 3 or 4. Set 3 has no outgoingedges to sets 2 or 4. Set 2 and 3 cannot share an SCC with set 1, because set1, being an SCC, is already a maximal subset of connected vertices, and SCCsare disjoint. The resulting algorithm runs in O(m log n) time. Fleischer et al.do not cover the implementation of their algorithm, nor run experiments. Bothof these are done by McLendon III et al.

McLendon et al. make few changes to the algorithm proposed by [7]. Theirmain contributions are implementation, addition of a trim step, and testing.They implement additional parallelism using the discrete ordinates method. Agraph can be created for many different angles of radiation flow. The SCCdetection algorithm can then run in parallel on each graph. This parallelism ishowever not SCC detection related. The trimming step is the most significantchange to the DCSC algorithm suggested by [7]. It involves removing verticeswith only a single edge. These vertices can take up an entire recursive call inthe DCSC algorithm. Removing them can only remove trivial one-sized SCCsfrom the graph, and thus lets the algorithm get to the important work faster.Trimming works as follows: Find all vertices with no in-neighbor, remove themfrom the graph and delete their edges, check whether their removal causes theirout-neighbors to have no in-neighbor, and remove them in turn. This processcontinues until no more of such vertices can be found. This type of trim is calleda complete forward trim. A complete backward trim works in the same way,but targets vertices with no out-neighbor.

This paper’s main algorithm, which they call ModifiedDCSC, works as fol-lows: Do a complete forward trim, then a backward trim. Select a randompivot vertex and compute its ancestor and descendant sets using BFS. Outputtheir intersection as an SCC, and perform parallel recursive calls on the samethree sets as described in DCSC. Their tests were done on the synthetic graphsgenerated by using the discrete ordinates method on a cube and a cylinderwith different levels of geometric distortion. The results indicate an order ofmagnitude speedup resulting from the trimming operation.

3.2 Orzan (2004) [12]

The field of model checking is concerned with formally proving that softwarewill work the way it is expected. Programs can be represented by a labelledtransition system (LTS) in which states are connected by transitions. Thesetransitions represent events in a running computer program, which lead to dif-ferent (memory) states. The link to graph theory is easily made. Suppose thereis a procedure that must check whether the program will terminate. It is im-portant to eliminate cycles from the LTS before such a procedure would be run.SCC detection algorithms are vital here. Orzan mentions that the need for aparallel SCC detection method arose during the development of a state spacereduction algorithm. An LTS describing the state space of a computer programmay contain sections that are duplicated several times in different areas of thegraph. In model checking theory, two states (or graph vertices) are consideredbisimilar if in each state, there are steps (edges) with the same label, leading to

8

a bisimilar state with the same label. There are also special transitions called τ -steps that may, under some conditions, be ignored in what is called a branchingbisimulation. State space graphs can be reduced in size if we detect branchingbisimilar sections, and remove the duplicates. A smaller state space graph canthen facilitate faster execution of other algorithms.

Parallelism is achieved by distributing vertices across machines. Cross tran-sitions are edges that connect vertices owned by different workers. The authornotes that performance of parallel algorithms will largely depend on the num-ber of cross transitions, but concedes that finding a distribution that minimizestheir number is hard, and settles for a random assignment of vertices to workers.

The article proposes several preprocessing steps. Trimming of atomic com-ponents (size 1 SCCs alike [9]), partial SCC detection, coloring, and reflexiveand multiple transition elimination. Each of these can simplify the graph, byeliminating vertices part of a trivial SCC, or by collapsing SCC subcomponents.Partial SCC detection involves workers finding SCCs using Tarjan’s sequentialalgorithm on their owned vertices. Reflexive transitions are edges that havethe same starting and ending node, i.e. (x, x). Multiple transitions are dupli-cate edges, i.e. given the existence of an edge (y, z), another edge (y, z) exists.Reflexive and duplicate transitions can be removed without affecting the SCClayout of the graph.

Coloring is a method that splits the graph into segments that cannot sharevertices of the same SCC. Additionally, it leaves the graph in a state where onecan easily compute SCCs. The procedure starts by assigning a unique color(or number) to each of the vertices. Now for each vertex x ∈ V , check itsout-neighbors’ colors. If the color of an out-neighbor is lower than col(x), setits color equal to col(x). Continue this process until no more vertices changetheir color. The graph is now partitioned into different sets of vertices, wherethe vertices in each set are of the same color. The vertices that never changedtheir color are called roots and have the property that all vertices reachablefrom them (their descendants) have their color. SCCs are contained within acolor, and can be found by finding the vertices that can reach the root of theircolor. This can be done by reversing all the edges, and rerunning the coloringalgorithm where only the roots of the previous run are assigned a color, or bystarting a BFS from each root that only visits vertices with the color of thatroot. When the SCC of each root is found, report those SCCs and start anothercoloring iteration by resetting the color of remaining vertices equal to theirvertex identifier. Note that a single coloring iteration does not necessarily findall SCCs in a graph except in special cases. So several coloring iterations arenecessary to decompose the entire graph. In a sense, this algorithm performsforward and backward searches step-wise, and instead of picking a pivot vertex,the entire graph is processed in parallel, at once.

The partial SCC detection method lets each worker find the SCCs enclosedwithin their assigned graph partition using Tarjan. The workers do not lookoutside of their boundaries and thus, the connected components found will notbe maximal subsets of connected vertices in the whole graph. This method ismerely used to collapse connected subcomponents into one vertex. This process

9

requires the removal and replacement of appropriate vertices and edges. LetS be a subcomponent. The vertices that are found as a subcomponent areremoved and replaced by a single vertex named SCC(x), where x is any vertex∈ S. Now replace all edges (a, b) where a ∈ S with the edge (SCC(x), b), andreplace all edges (a, b) where b ∈ S with (a, SCC(x)).

The author proposes and tests two algorithms. One utilizes the concurrentTarjan algorithm employed by the partial SCC detection method (CE1), an-other utilizes trimming and coloring (CE2). Each of these are tested on ninestate spaces varying in size between a few million states to tens of millions ofstates. The run time for each of these aforementioned cycle elimination (CE )algorithms to decompose SCCs is measured, as well as the run time for a branch-ing bisimulation (BB) reduction algorithm to complete with and without usinga cycle elimination algorithm as a preprocessing step. A notable speed-up of1.74x in favor of CE2 over CE1 is recorded on the so-called screen.1 state space.Thus proving the potential of the coloring method. With the CE algorithmspreprocessing this graph, a speedup of over 1000x is observed vs not prepro-cessing when running the BB algorithm. This state space has very few trivialcomponents (only 1.2%) and many components of size 50 and under. On statespaces with 100% trivial components, the CE algorithms provide no speedupwhen running the BB algorithm.

3.3 Hong et al. (2013) [8]

This author picks up where [7] and [9] left off, and gives brief mention to [3].The author notes that these algorithms were designed for synthetic graphs (suchas those used in the discrete ordinates method for radiation transport, or LTSgraphs used in [12] and [3]), and perform rather poorly on real-world graphs,that have fundamentally different properties. The aim is to develop adaptationsto the FW-BW-Trim algorithm (referred to as ModifiedDCSC by its originaldeveloper [9]) to improve its performance on real-world graphs. The paperfocuses heavily on implementation details and experimentation results, and isthe first to create a parallel SCC detection algorithm that outperforms Tarjan’stried and tested sequential solution.

Real-world graphs (e.g. social networks, citation networks, web graphs) dif-fer from synthetic graphs in multiple ways. The graph diameter (longest short-est path between any two vertices) is much smaller for real-world graphs thansynthetic ones. This positively affects the worst-case runtime of reachabilityqueries and BFS searches. Where synthetic graphs can have a large numberof medium-sized SCCs, real-world graphs often have one giant SCC, possiblytaking up more than half of the vertices, and many small SCCs. This giantSCC can cause major worker load imbalance in the FW-BW-Trim algorithm.Each recursive call only finds one SCC, and the largest SCC being of size O(n)means that the majority of work is done on only a single worker thread while theothers are idle. To alleviate this problem, they add another phase to the algo-rithm. That phase is dedicated to finding the large SCC using multiple workers.This is achieved by creating parallelism in BFSs done during forward and back-

10

ward reachability searches. Trimming is also parallelized in the implementationsuggested by the authors, gaining performance over the implementation of [9]which performs trimming sequentially. In addition to this advanced paralleliza-tion (called data-level parallelism), a data structure is set up so that the graphdoes not have to be physically modified upon vertex deletion. Instead, eachvertex has a ”marked” label, which is set to true if the vertex is trimmed, orfound as part of an SCC.

Apart from these implementation details, the article also introduces newtechniques. Another property of real-world graphs is that they have manysmaller SCCs. These SCCs have very small FW and BW sets, too small forthe recursion of the FW-BW algorithm to create parallelism. Some of thesesmall SCCs have a one-way path to others. Thus they are part of the sameweakly connected component (WCC). Finding WCCs is computationally lessexpensive and can be done as a preprocessing step. Finding the WCCs allowsus to cluster the small SCCs into WCCs (that each consist of one or more SCCs).When the large SCC is found, each WCC is now assigned a different color, suchthat different worker threads will deal with each one. This greatly improvesparallelism. Another technique is the fast detection of size-2 SCCs, which helpsto speed up the WCC detection step.

The article presents several algorithms. Everything comes together in theirfinal algorithm called Method2, which is described below. The procedure startswith a complete forward and backward trim of size-1 SCCs. Parallelism isemployed here. The next step is to execute BFS in parallel in search for the giantSCC. Parallel here points at BFS being parallelized, not at parallel recursiveFW-BW calls. When an SCC is found that contains more than 1% of the graph’svertices, the ”giant” SCC is considered found and this step ends. The next stepis named trimprime and contains three trimming phases: A complete forwardand backward trim, followed by a single iteration of size-2 SCC trimming, andlastly another complete regular iterative trim. The size-2 trim is not appliediteratively like the size-1 trim because it is more computationally expensive.Now the algorithm proceeds to the WCC detection step, where it marks eachWCC with a color. Finally, each WCC is given its own top level recursion ofFleischer et al.’s FW-BW on a different processor. When all of these threadsfinish, the SCC decomposition has completed.

The algorithm applies parallelism in every step. Testing the algorithm’sperformance on various real-world graphs versus the performance of Tarjan72,results in a speedup factor ranging between 5.0 and 29.4. Remarkable is themassive speedup caused by the WCC detection and size-2 SCC detection steps.

3.4 Slota et al. (2014) [15]

This paper proposes an algorithmic framework called the MultiStep method.This framework can be adapted to solve several graph-related problem whileexploiting parallelism. Graph-related tasks are SCC detection, WCC detection,and biconnected component (BiCC) detection. The article also proposes an allnew articulation vertex detection algorithm.

11

An articulation vertex is a vertex in an SCC that, were it to be removed,would cause the SCC to lose members other than the removed vertex. A bicon-nected component is an SCC that has no articulation vertices. Biconnectivitycan be relevant to networks with redundancy. Having a biconnected networktopology ensures that whenever one of the nodes crashes, no connectivity is lostother than to the crashed node.

The authors start out by giving brief summaries of earlier algorithms: Tar-jan, Fleischer et al.’s FW-BW, and Orzan’s coloring methods. Noted is that,in spite of being parallel, the latter two have a worse performance on socialnetworks and webgraphs than Tarjan’s serial algorithm. FW-BW performs wellif the graph has few and large SCCs, as this divides the problem well dur-ing a recursion step. Coloring performs well on graphs with many small anddisconnected SCCs, as the number of coloring iterations required until no morevertices change their color remains low in such a graph. Real-world graphs how-ever, typically have one very large SCC, and many smaller one-way connectedSCCs.

The MultiStep method aims to combine the strengths of the aforementionedprocedures and apply them to most efficiently tackle real-world graphs. Whereone procedure is inefficient, the other can be used efficiently. In order, trimmingis applied, then a single FW-BW search with no recursion, then coloring untila set number of vertices remains in the graph. Finally Tarjan’s algorithm isapplied to finish the remaining vertices.

Trimming can be done for only a single iteration or recursively by repeatedlytrimming new vertices if earlier trimming caused them to have no in- or out-neighbors (complete trimming). The authors decide to go for a single iterationtrim. This is because experiments indicate that coloring and Tarjan phasesare more efficient at eliminating these vertices than complete trimming. Thistrim can be applied in a data-level parallel fashion. To maximize the chancethat the single FW-BW search finds the largest SCC, the pivot vertex with thehighest product of in- and out-degrees is selected. Although this method doesnot guarantee the selection of a pivot that is in the largest SCC, it has provento work for most real-world graphs. Since FW-BW isn’t applied recursively,the backward BFS only has to consider vertices that are part of the forwardBFS. No effort has to be wasted on finding the three independent subgraphs ifthere are no recursive calls. This lets us spend less time during the backwardsearch. Data-level parallelism can be applied for the BFS operations in thisFW-BW step. In the next step, coloring is applied until 100 K vertices remainin the graph. This cutoff is chosen based on experiments. The authors concedehowever, that certain graphs would benefit from applying coloring all the wayto the end, while others would benefit from switching to Tarjan earlier. Thecache size of the hardware can also affect the optimal cutoff. When the cutoffvalue is reached, Tarjan’s algorithm is used to find the remaining SCCs.

The experimental results section heavily focuses on comparing the perfor-mance of MultiStep with Hong et al.’s Method2. Test graphs are several differentgraph types including real-world such as Twitter, LiveJournal, WikiLinks, andFriendster, as well as synthetic graphs. It is found that the MultiStep methods

12

is on average 1.9 times faster than Method2 from [8]. Additionally executiontimes are compared for simple trimming, no trimming, and complete trimming.It is found that for most graphs, simple trimming gives the shortest executiontime.

The article also proposes an algorithm for biconnected component detection,and reports on experiments with it. This is however, outside of the scope of thisarticle.

3.5 MultiStep+

The MultiStep+ algorithm is developed in this work. It aims to address afundamental inefficiency in the coloring algorithm. Even if coloring is imple-mented efficiently (see section 4) and not naively, there is still a small risk oflarge amounts of duplicate and wasted work. Consider the following scenario:A string of a vertices connect to each other with a single directed edge andend in a large SCC. Vertex K is at the start of the string, and vertex K − aat the end, being a vertex part of the large SCC. Vertex K has the highestvertex identifier among vertices in the string and the SCC and thus the higheststarting color. Each next vertex in that string has a vertex identifier that is onelower than its immediate predecessor in the string (see figure 1). Let us runa single coloring iteration on this graph. The vertex at the start of the stringwill have the highest color and will need K color propagation steps to reach thelarge SCC, and then another number of color propagation steps equal to thediameter of the large SCC. This also involves a number of comparisons in theorder of the number of edges in the SCC. When color propagation terminatesand the graph is scanned for roots, the only vertex that is a root is vertex K.Then after the reverse search is performed from each root, the only SCC that isfound is a trivial one containing vertex K. A large number of CPU cycles wereperformed to find a single trivial SCC. The next coloring iteration then startsby re-initializing vertex colors to their vertex identifier, and repeats the sameamount of work to again find a single trivial SCC. A color is propagated throughthe entire large SCC a times, while reporting only a single trivial componenteach time. The large SCC is found only after there are no longer any verticeswith a higher identifier than the highest identifier in the SCC that connect tothe SCC from outside. Even when a = 1, a large number of CPU cycles iswasted to find a mere trivial component.

To combat this potential inefficiency, a technique that is also applied be-fore MultiStep’s single FW-BW pass is used. We identify vertices with a highproduct of in- and out-degrees. At the start of the very first coloring iteration,instead of initializing the color of each vertex equal to their vertex identifier, wefirst find the vertices with the top t highest products of degrees. Those verticesare then given the highest t colors of all vertices in the graph (n ≤ color < n+t).The colors of other vertices are still initialized to their vertex identifier. Whencolor propagation ends, the probability that a root is found inside a large SCCis extremely high, as large SCCs are likely to contain the vertices with a highproduct of degrees. This product of degree counting is only done for the first

13

Figure 1: String of vertices connecting to a large SCC.

coloring iteration. Subsequent iterations are done normally.

4 Implementation

This section describes the implementation details of different algorithms andsubroutines of the implemented algorithms. Facets such as data structures andlogic are described. All algorithms including the random graph generator wereimplemented using Java.

4.1 Sets

The code avoids the use of set lookups, set intersections, and set-theoretic dif-ference (A \B) operations wherever possible. These operations can be compu-tationally expensive, and can be avoided by using an extra integer data field inthe vertex data carrier class as suggested by [8] and [11]. This integer is namedthe color of the vertex. When a vertex is added to a certain set, its color is setto the color of that set. When looking up whether a vertex belongs to a certainset, read its color variable instead of doing a set lookup.

In the FW-BW algorithm, the SCC isn’t in practice found by doing a setintersection operation of the forward and backward reachability sets. Instead,during the backward BFS, vertices that have the color of the forward set willimmediately be placed in the SCC set, and removed from the forward set. Asthe code can check a vertex’ set membership at virtually no cost, the non-recursive FW-BW done in both MultiStep versions and Hong is also able torestrict its backward search to only those vertices that appeared in the forwardsearch, gaining performance in the process. Backward searches done from rootssucceeding color stabilization also apply this technique.

4.2 Trimming

Trimming is done in some form in all algorithms. FW-BW does a completeiterative trim in each recursive step. At the end of each iteration the neighbors

14

of removed vertices are added to the set of vertices that will be examined fortrimmability in the next iteration. This prevents each iteration from checkingtrimmability of vertices that had no neighbors removed in the last iteration.These vertices are naturally never trimmable. Instead, only those vertices thathad a neighbor removed in the last trim iteration will be examined. The in-and out-neighbor counts are also updated in this step such that the next trimiteration doesn’t have to count the neighbors of each vertex, and instead onlyhas to check if any of the neighbor counts is 0. This approach however, forcesupdating the neighbor counts of each vertex adjacent to a removed SCC. Thisis done by either recounting the neighbors of all vertices in the graph not inthe SCC, or by decrementing the neighbor counts of neighbors of vertices in theSCC. Whichever method is used is determined by the size of the SCC versusthe size of the graph remainder. If the decomposed SCC is larger than theremainder, the former method is used, else the latter.

MultiStep (and MultiStep+) does a single trim iteration at the beginningof the algorithm. This is not iterative like trimming done in FW-BW. Instead,vertices are considered for trimmabilty for only a single iteration. No set for anext iteration is populated. After trimming, before vertex v (the vertex withthe highest product of degrees) is identified, neighbor counts are recounted forall vertices that remain after the trim. MultiStep+ does this a second time afterthe SCC containing v is decomposed. This will result in a more accurate top tvertices with their products of degrees.

Hong starts out with a complete iterative trim. This is done in the same wayas the iterative trim done in FW-BW. When the non-recursive parallel FW-BWphase ends after having found the ”large” SCC that contains more than 1%of the graph’s vertices, the trimprime step contains an iterative trim, a singleiteration of trim2 (size-2 component trimming), and another iterative trim. Thelast iterative trim is supplied a set of vertices as a starting point by the trim2step. Neighbors of vertices removed by the trim2 step are collected into this set.This once again, prevents the first iteration of the final iterative trim step fromhaving to scan all of the remaining vertices for trimmability.

Trim2 decides whether a vertex v is part of a size 2 SCC using the followinglogic: If the in-neighbor count of v is equal to 1, let u be that lone in-neighbor.Check whether u is also an out-neighbor of v. Then if the in-neighbor count ofu is equal to 1, report {v, u} as a size 2 component. If the previous logic hasn’treported v as part of a size 2 component, repeat that logic with every instanceof ”in-neighbor” replaced with ”out-neighbor” and vice versa.

Parallelized trimming of size 2 components is, unlike the trimming of size 1components, liable to race conditions. If let’s say, vertices v and u are a size 2component, and there is a thread examining v and another thread examiningu, they will both report {v, u} as a size 2 component, doubly counting thisSCC. This race condition is addressed by making the threads synchronize onthe Vertex object of the vertex with the lowest identifier in the pair that is beingreported as a size 2 component. Inside the synchronized block, the thread checksif the color of the vertices has already been set to the trimcolor of size-2 trimmedcomponents. If so, this thread returns from the method without taking actions.

15

Else, set the color of the pair to the trimcolor. data structures are updated andthe SCC is reported after this synchronized block. This ensures that the pair isreported as a size 2 component only once.

4.3 Breadth-First Search

All algorithms implemented use BFS. Although they take different actions pervisited vertex, they use the same queue-based BFS implementation. FW-BWuses BFS in the forward and backward steps, Hong uses it in the ”parallel FW-BW” step where the algorithm searches for the SCC that is larger than 1% ofthe graph’s size (see section 3.3), as well as during the recursive FW-BW callson each identified WCC. Multistep uses BFS during the single FW-BW passthat is started from the vertex with the highest product of in- and out-degrees,as well as during coloring, after color propagation stabilizes, where a backwardBFS must be performed from each root.

BFS is implemented using a splittable queue data structure. The Splitta-bleSinglyLinkedQueue class is a singly-linked list that implements queue func-tionality, as well as an O(n) split() function that splits the queue in half, cuttingoff the second half from the data structure and returning that half as a newSplittableSinglyLinkedQueue object. The starting vertex of a BFS is placed onthe queue, and then until the queue is empty, dequeue a vertex and enqueueits out-neighbors. When the BFS queue size exceeds a certain threshold size,the queue is split, and a new thread is started operating on the second half ofthe queue. This new thread may in turn spawn more threads if its queue sizeexceeds the threshold and the available thread pool isn’t exhausted. [8] suggeststhat the queue splitting threshold size should be at 100, where [11] suggests thatthis number should be at 128, and that a higher value could be set for largergraphs. In this work, 128 is chosen. With such a parallelized BFS implemen-tation, execution of a large BFS benefits from parallelism while execution of asmall BFS does not suffer from parallelization overhead.

Visited vertices are tracked by setting their color data field in the color ofthe BFS. If a vertex is already colored in that color, it will not be enqueued.It is possible for multiple threads to enqueue the same item if they all readthe uncolored value, and then concurrently write the BFS color to that vertex’color data field. Such race conditions do however not put the functionality orperformance of the BFS at risk. The HashSet data structure that stores visitedvertices automatically resolves duplicate items in O(1) time. Performance wise,suppose that different threads enqueue the same element (vertex v) into theirqueues this way. The first thread to dequeue v will enqueue its appropriateneighbors and set their color. Threads that dequeue v after that will see that thecolor of its neighbors is already set to the BFS color and thus not enqueue them.The probability that two threads dequeue v at the same time is astronomicallylow, as that requires that each of those threads spent the exact same amountof time dequeueing the elements before v. Furthermore, the probability thateach of those threads read the same color value for each of the neighbors of v iseven lower. Doubly enqueued elements such as v are therefore not realistically

16

leading to more doubly enqueued elements and as such not causing significantperformance losses or duplicate work.

4.4 Coloring and WCC Decomposition

Color propagation ends when the vertex colors stabilize. Stabilization meaningthat no vertex changed its colors at the end of a color propagation iteration.Color propagation can take up to a number of propagation iterations equal tothe diameter of the remaining unassigned vertices. (Unassigned meaning thosevertices that are not yet reported as part of some SCC.) We say that a vertexis included in a propagation iteration when we check each of that vertex’ out-neighbors to see if the color of the current vertex can be propagated to thatneighbor.

One could naively implement color propagation by including all unassignedvertices in each propagation iteration until stabilization. This, however, usesmany more CPU cycles than needed. Similarly to the implementation of trim-ming, there is a way to predict which vertices must be included in the nextpropagation iteration. (Or rather, a way to know which vertices may be ex-cluded.) When a thread overwrites the color of a vertex, that vertex is addedto the queue for the next propagation iteration. A global array of boolean flagsis checked to prevent a vertex from being doubly marked for the next iteration.Both the child and the parent whose color was passed onto the child are addedto the next iteration. When the current propagation iteration ends, the mas-ter thread combines the nextIterationQueue of each thread into a single queue,which is used as input for the next iteration. Each thread resets the flags in theboolean array that it set. This implementation is as described in the coloringpseudocode in [15].

When color propagation stabilizes, we scan the unassigned vertices for roots.(Vertices that retained their initial color.) These vertices are added to a rootsqueue. In parallel, threads take an element from this queue and do a backwardBFS. When the queue is empty and each BFS has finished, the current coloringiteration (not to be confused with color propagation iteration) ends, and anew iteration could be started if the MultiStep algorithm still has a number ofremaining vertices above nCutOff.

The Hong algorithm does a WCC decomposition step after the trimprimephase. This is done to increase the benefit of parallelism proceeding into therecursive FW-BW phase. WCC decomposition is implemented using nearlyidentical logic to color propagation in the coloring algorithm. The WCC al-gorithm propagates colors not only to out-neighbors, but also to in-neighbors.When this WCC coloring stabilizes, the graph has been divided into WCCs.This method works because a WCC in a graph would be an SCC if all edgeswere considered bi-directional. After color stabilization, we must still get eachWCC in a Set data structure such that they can be used as input for recursiveFW-BW. This is done by creating a HashMap⟨Integer, HashSet⟨Integer⟩⟩. AHashMap maps a key onto a value. When a key is provided as input, the valueassociated with the key is returned. The keys are the distinct colors among the

17

remaining unassigned vertices, and the values are sets containing all verticeswith the key color. We sweep over all unassigned vertices an add their vertexidentifiers to the appropriate set in the HashMap. The KeySet() of the map isthen put into a queue that is dequeued by multiple threads that each start arecursive FW-BW instance in the WCC of the dequeued element.

4.5 Parallel Iteration

Many of the subroutines across all implemented algorithms iterate over poten-tially large Set or Queue data structures. Some action has to be performed foreach element in such a data structure in each of these subroutines. To facilitateeasy parallelization of all of these subroutines, the ParallelIterator class was de-veloped. It distributes elements of an Iterable over different threads, and passesa callback object to these threads to instruct them what to do with each vis-ited element. Each thread receives a separate Iterator instance originating fromthe same Iterable object. (The collection in question to be iterated over.) Thismeans that one thread progressing through its Iterator while invoking the next()method does not affect the Iterators of the other threads. Threads grab parti-tions of elements from their Iterator. The partition size is set to 20 during theexperiments in this work. (See section 4.6.) Partitions are grabbed by threadsby performing a getAndIncrement() on a shared AtomicInteger. When a threadhas acquired its next partition, it skips through its local Iterator until it reachesthe first element of that partition. Then one by one it invokes the processEle-ment(T element) method of the callback object on each of the elements in thepartition. After finishing, the thread once again does a getAndIncrement() toacquire the next partition. This method involves little to no wait time for slowthreads, similarly to queue-based workload distribution. Larger partition sizesreduce the contention on the AtomicInteger, but increase the chance that a slowthread causes wait time.

Some parallel processes will unavoidably be done on Set data structures. Itwould be possible to first add each element in such a set to a queue, but thiswould use more memory, as well as restrain parallelization by overhead from thesynchronization of the dequeue() method. The parallel iterator avoids the needto copy each element of a set into a queue, and instead uses an AtomicIntegerto divide elements between threads.

The following subroutines are parallelized using the ParallelIterator class:Complete trimming iterations, size 2 trimming iterations, simple (non-iterative)trimming, neighbor recounting update steps post SCC detection, neighbor countupdating in between trim iterations, and finding the vertices with the top tproducts of degrees in MultiStep+. Some other subroutines sweep over a queue,and are parallelized using a queue. Such subroutines include color propagation,including the WCC color propagation, and BFS-based subroutines.

18

4.6 Parallelization Thresholds

Not everything that can be run in parallel should also be run in parallel. Thestartup cost of creating a thread might cost more time than is gained by paral-lelizing a procedure. (The code does not utilize a persistent thread framework,see section 6.) The decision whether to parallelize a subroutine should de-pend on the size of the input data that is operated on by the subroutine. Foreach parallelized subroutine, a parallelization threshold constant Cthreshold isestablished. The number of additional threads (worker threads) invoked to aidparallelization is defined as dNcollection

Cthresholde − 1, where Ncollection is the number of

elements in the input data. For example, a subroutine with Cthreshold = 1000will run sequentially in the main thread, creating zero worker threads for allNcollection ≤ 1000. If 1000 < Ncollection ≤ 2000, a single worker thread willbe created, running the subroutine with two total threads. Cthreshold can in-tuitively be understood as the maximum number of elements that should behandled by a single thread. Below each threshold is listed and explained along-side some other program parameters. The default values were obtained throughmanual testing on the hardware of a DAS4 compute node. (See section 5.1.)For various values, the average run time of 14 runs was taken, and the valuewith the best average run time determined the default value.

parallelTrimThreshold (default: 50,000) : Minimum number of verticesto spawn a worker thread in trimming iterations. Governs iterative trimmingand also size 2 component trimming. The input data is the collection of verticeswhich are to be evaluated for trimmability.

parallelRecountThreshold (default: 10,000) : Minimum number of ver-tices to spawn a worker neighbor recounting thread. After certain graph opera-tions, the neighbor counts of unassigned vertices have to be recounted. This isdone after SCC removal in FW-BW in case the just decomposed SCC is largerthan the remaining graph. In MultiStep and MultiStep+, neighbor counts arerecounted after the simple trimming step, and after the single FW-BW pass, incase the size of the SCC found by the pass is larger than the remaining unas-signed vertices. In Hong, after the parallel FW-BW phase that ends when a”large” SCC is found, neighbor counts are recounted before starting the trim-prime step. The input data is the collection of vertices whose neighbors have tobe recounted.

parallelNeighborDecrementingThreshold (default: 2,500) : Minimumnumber of vertices to spawn worker neighbor decrementor thread. Functional-ity serves a similar purpose as the subroutine associated with parallelRecount-Threshold, except instead of recounting the neighbors of unassigned vertices, itdecrements the neighbor counts of neighbors of vertices in a decomposed SCC.This is done in both FW-BW and MultiStep(+) if the size of a decomposedSCC is smaller than the size of the remaining graph. The input data is thecollection of vertices in a recently decomposed SCC.

parallelNextIterationSetPopulationThreshold (default: 75,000) : Min-imum number of vertices to spawn a worker nextIterationSet populator thread.In between iterative trim iterations, the neighbors of trimmed vertices form the

19

set to be evaluated for trimmability in the next iteration. Different threads col-lect the neighbors of trimmed vertices they evaluate in a thread-local collection.Neighbors of these vertices also have their neighbor counts decremented. Themain thread merges their collections and begins another trim iteration with themerged collection as input. The input data is the collection of vertices that weretrimmed in a single trim iteration.

BFSQueueSplittingThreshold (default: 128) : While technically not be-ing a threshold like the other numbers in this list, as they govern functions thatdecide how many threads to process a collection with before starting to execute,this number is still listed here. For any BFS-based function, split the BFS queuewhen it exceeds this number, and create a worker thread to continue the BFSon the second half. See section 4.3 for uses of BFS.

nPerGrab (default: 20) : Not a parallelization threshold. Decides howmany data elements are inside one partition for a thread invoked by the Par-allelIterator class. Small values might increase AtomicInteger contention whilehigher values might induce wait time for slow threads. See section 4.5.

parallelColoringThreshold (default: 10,000) : Minimum number of ver-tices to spawn a worker color propagation thread. This governs the coloringdone in both MultiStep versions as well as the WCC decomposition in Hong.The input data is the collection of vertices that are to be evaluated for colorpropagatability.

parallelTopNThreshold (default: 75,000) : Minimum number of verticesto start a worker TopNCollector thread. Governs the search for vertices with thehighest product of in- and out-degrees. In both MultiStep versions, the vertexfrom which the single FW-BW iteration is done is the top 1 vertex with thehighest product of degrees, and in MultiStep+, the top t vertices with highestproducts of degrees are found using the same function. The input data is thecollection of remaining vertices from which the highest product of degrees vertexor vertices must be obtained.

parallelSimpleTrimThreshold (default: 200,000) : Minimum number ofvertices to spawn a worker thread for simple trimming. A different class is usedfor simple trimming than for iterative trimming, as this type of trimming doesnot worry about collecting a nextIterationSet. As such it does not store trimmedvertices in a collection. The input data is the collection of vertices which are tobe evaluated for trimmability.

parallelWccThreshold (default: 1,562) : Minimum number of verticesto start a worker WCC decomposition thread. After the trimprime step inHong, WCC decomposition begins to facilitate good parallelism in the followingrecursive FW-BW step. The input data is the collection of remaining verticesthat must be decomposed into WCCs.

4.7 Random Graph Generation

The Hong algorithm is targeting ”small-world” property graphs. They typicallyhave one large SCC with more than half the vertices, and several smaller SCCsthat are weakly connected to the large one. Naively random graph generation

20

does not result in graphs with interesting layouts. They either have very littleand small SCCs, or one large SCC, but no smaller ones. Naively randomlygenerated graphs are unlikely to have a large amount of medium-sized SCCs.They do not resemble the graph type targeted by the Hong algorithm, andas such wouldn’t produce a meaningful performance comparison between thealgorithms. If Hong were to be run on a naively randomly generated graph,the majority of its runtime would be either in the parallel FW-BW phase, orin the trimming phase. There wouldn’t be any meaningful parallelism in theFW-BW instances that are run on each identified WCC. The imbalance affectsthe MultiStep algorithm in a similar way. A single huge SCC is highly likelyto be completely decomposed by the large SCC detection step that follows thesimple trim (the single FW-BW search from the vertex with highest productof degrees), and little to no work will be done in the coloring or Tarjan steps.Conversely, a sparse naively randomly generated graph will have all of the workdone during the simple trim. To properly compare the performance of multi-staged algorithms raises the need for a graph generator that can generate graphswhere a more balanced number of SCC decompositions occurs in each of thevarious stages. For this purpose a dedicated random graph generation algorithmwas developed.

Naively generating a random directed graph involves doing a random rollfor each of the n ∗ (n− 1) possible edges. If the roll for a pair of vertices (a, b)succeeds, an edge is placed between those vertices and (a, b) is added to E.Each edge has the exact same probability p of success. Such graphs are alsocalled Erdos graphs. With a relative probability k < 1 (i.e. p < 1

n , see section2.2) the majority of vertices will be trivial components (more than 99%). Fork > 1 the percentage of trivial components drops quickly. For these values of p,each vertex will on average have more than 1 out-neighbor, and also more than1 in-neighbor. The majority of vertices that are in a non-trivial component,however, are in the largest component (more than 99%). The graph’s RSRS(see preliminaries) will not have any number between 0.99 and 0.01. For largervalues of k the largest SCC in the graph will quickly take up over 99% of thegraph. The percentage of vertices in the largest SCC grows faster with k forsmaller graphs. No matter what value is picked for k, naive graph generationdoes not result in interesting graphs. It does not generate graphs that have avariety of SCC sizes in them, and neither have several significantly sized non-trivial SCCs.

The goal of the new graph generation system is to generate graphs with aninteresting SCC layout. For example, we might want to have a large number of”islands”, or we might want a large ”mainland” surrounded by smaller compo-nents. To generate a many-island layout, clustering vertices into ”buckets” bytaking their vertex identifier modulo the number of desired islands is a simplestrategy. Instead of having an equal chance for each edge to exist, we give edgesbetween vertices in the same bucket a bonus to their probability of placement.The probability for edge e to exist is pe = bbucket where bbucket = C when thevertices connected by e are in the same bucket and bbucket = 0 if not. Thisway of generating graphs will result in most SCCs being approximately equal in

21

size. The following system generates graphs with both large and small islands:Vertices can be placed in several levels of buckets. After the first-level buckethas been determined by taking the vertex identifier modulo Nb1 (the number ofbuckets in bucketing level 1), the index of the second level bucket is determinedby taking the index of the first level modulo Nb2. Pairs of vertices that fall in thesame level 2 bucket get a further increased probability of having an edge placedin between them. An intuitive way to think of this is as if the islands createdby the first bucketing level are treated as single vertices, and then those verticesgo through another bucketing process. At each bucketing level, the probabilitybonus is normalized to account for the fact that the ”island” from the last buck-eting level will have multiple members that are eligible for the probability bonusin the current bucketing level. An arbitrary number of bucketing levels can beperformed this way. When the parameters are set up correctly (the number ofbuckets at each bucketing level as well as the probability bonuses), this methodproduces graphs where most - if not all - of the ten 0.1 sized domains between 0and 1 (0.0− 0.1, 0.1− 0.2..0.9− 1.0) are represented in that graph’s RSRS. Toreference this method, type 1 bucketing is used from this point.

Real-world social network graphs often have vertices with an extremely highin- or out-degree. These vertices can be simulated with an increasing probabilitybonus function. The probability bonus depends on the vertex identifier, is 0 forvertex 0, and is a constant for vertex n. The probability bonus assigned toother vertices increases linearly with their vertex identifier. The probabilitybonus for an edge (a, b) is then decided by the highest vertex identifier amonga and b. This way, the linearly increasing bonus gives the same vertices anincreased chance to get out-neighbors as well as in-neighbors. The resultingprobability for an edge e is then pe = bbucket + blinear. Where bbucket is the totalprobability bonus of the different bucketing levels, and blinear is the linearlyincreasing probability bonus. Combining these bonuses allows for the generatedgraphs to resemble small-world property graphs. The RSRS for such generatedgraphs generally resembles the following form: (1, x, ..., 0). There is typically alarge gap between 1 and x. The relative sizes of the SCCs other than the largesttwo are uniformly distributed between x and 0.

We define a bucketing process with y bucketing levels using two arrays oflength y. The bucket definition array determines the number of buckets ateach bucketing level. The index of each element in the array corresponds tothe bucketing level that that element defines the number of buckets for. Themagnitude definition array determines the magnitude of the probability bonusfor edges between vertices that fall in the same bucket. Once again, the indicesof elements in this array match the bucketing level that the element definesthe probability bonus for. As mentioned before, numbers in the magnitudedefinition array are normalized to account for the fact that multiple membersexist within the supervertex that is defined by the last bucketing level. Arelative probability of 1 will on average result in one extra edge placed as aresult of the probability bonus of that bucketing level. For illustration, anexample of a 2-leveled bucketing graph of size 30,000 with bucket definitionarray {100, 20} and magnitude definition array {5, 1.5} is given below. This

22

example has the following SCC sizes in descending order: 1178, 877, 875, 584,583. The remainder is 86 SCCs of size around 300, and 705 trivial components.Notice how the approximate sizes of the first 5 SCCs are respectively 4, 3, 3, 2,and 2 in multiples of N

100 . Also notice how 86 + 4 + 3 + 3 + 2 + 2 = 100. At thefirst bucketing level the graph was divided into 100 islands. Given the extremelyhigh relative probability bonus of 5 at that bucketing level, most of the graphconsists of near N

100 -sized islands. Each vertex has an average of 5 outgoingedges to other vertices in the same bucket. This explains the low number oftrivial components. The second bucketing level divides the 100 islands into 20”continents”. Islands that are members of the same continent on average areconnected to 1.5 other such members. The lower relative probability of 1.5means that not all members of the same continent will be in the same SCC.The largest SCC turns out to be the result of 4 islands that were part of thesame continent being successfully connected by the random number generator.

4.8 Other Graph Generation Variants

Type 2 bucketing : During a multi-leveled bucketing process, at each level, in-stead of giving each pair of vertices with matching bucket indices the sameprobability bonus, scale the probability bonus based on the index of the bucketthat the pair is in linearly. ((bucketIndex+ 1)/nBuckets), where bucketIndexis the vertex identifier modulo nBuckets if this is the first bucketing level, orbucketIndex of the last level modulo nBuckets if this is a later bucketing level,and nBuckets is the number of buckets at this level. This gives pairs of verticeswith their bucketIndex 0 a bonus of 1

nBuckets and pairs with their bucketIndexequal to nBuckets − 1 (the highest bucket) a maximum bonus. This methodcreates graphs with a smoothly distributed RSRS, but not graphs of which onewould say that they possess the small-world property.

Type 3 bucketing : This variant scales the probability bonus that verticesin matching buckets receive by the vertex identifier of the highest of the twovertices. Vertex 0 will have a modifier of 0, and vertex N − 1 will have amaximum modifier of 1.0, resulting in the maximum bonus as defined by themagnitudes definition array. Notice how the scaling of the probability bonusis the same as for the blinear bonus. This method can mimic the blinear bonusby putting 1 as the last element in the bucket definition array. All vertices willbe considered part of that one bucket and thus be eligible for the bucketingbonus at that bucketing level. This method is still different from using type 1bucketing in combination with blinear, as the bonuses for vertices in matchingbuckets are a constant and not scaled to the vertex identifier using that method.Type 3 bucketing is capable of generating small-world graphs this way withoutthe use of the blinear bonus. Furthermore, it can generate n-world graphs bysetting the last element in the bucket definition array to n, and then assigningthat bucketing stage a high relative probability bonus. The resulting graph isone with n large SCCs surrounded by smaller islands if other bucketing levelswere defined.

23

5 Experiments

5.1 Hardware

Experiments were run on DAS-4 compute nodes (see [2] and [1]). A computenodes has two Intel E5620 quad core CPUs running at 2.4Ghz. As this chip iscapable of hyperthreading, the maximum number of logical threads supportedby the node is 16. 24 GB of main memory is present on the node. The numberof nodes used by experimental runs does not exceed one. (See section 6.) Allexperiments were also run on exactly the same node (node 23), eliminating thepotential of silicon and other material differences to influence results.

5.2 Test Cases

Tests were done using randomly generated graphs using type 3 bucketing. Eachgraph was run 28 times by each algorithm for each thread count. The resultsshow the average run time in milliseconds of the remaining 24 runs after trim-ming the two highest and two lowest values.

5.2.1 Solo Giant

Vertices Non-trivial Trivial

200K 34 41170

One large and twenty-seven medium components constitute the majority ofthis graph’s SCC scape. It was generated using the type 3 bucketing parametersgiven below (see section 4.7). For MultiStep(+), nCutoff was chosen at 5000.

Bucket Definition Array: {100, 25, 1}Magnitudes Array: {4.0, 5.0, 9.25}

The non-trivial SCC sizes from large to small are listed below. Each entryis a unique SCC.

92358, 6389, 6338, 6319, 4760, 4733, 3187, 3182, 1634, 1621, 1606, 1605, 1591,1587, 1581, 1581, 1581, 1580, 1578, 1578, 1575, 1572, 1571, 1556, 1549, 1543,

1539, 1524, 2, 2, 2, 2, 2, 2.

Below, a table listing the run times (milliseconds) for each of the algorithmsfor different numbers of threads.

Algorithm/nThreads 4 8

FW-BW 602 670MultiStep 588 625

Hong 693 757MultiStep+ (top t 25) 619 661MultiStep+ (top t 50) 624 663

24

5.2.2 Twin Giants


200K 25 39984

Two large and eighteen medium components constitute the majority of thisgraph’s SCC scape. It was generated using the type 3 bucketing parametersgiven below. For both MultiStep versions, nCutoff was chosen at 5000.

Bucket Definition Array: {100, 20, 2}Magnitudes Array: {4.0, 5.0, 9.0}


64013, 57707, 6367, 6359, 1641, 1633, 1618, 1614, 1614, 1605, 1604, 1603, 1598,1598, 1596, 1582, 1581, 1568, 1567, 1538, 2, 2, 2, 2, 2.

Below, a table listing the run times for each of the algorithms for differentnumbers of threads.



Hong 812 860MultiStep+ (top t 25) 719 745MultiStep+ (top t 50) 719 749

5.2.3 Islands


200K 52 50951

The significant components in this graph are made up of 50 rougly equally sizedislands. For MultiStep(+), nCutoff was chosen at 2500.

Bucket Definition Array: {50}Magnitudes Array: {3.75}


3060, 3052, 3050, 3042, 3040, 3037, 3032, 3021, 3013, 3011, 3009, 3005, 3000,3000, 3000, 2996, 2996, 2995, 2993, 2992, 2991, 2991, 2988, 2986, 2986, 2985,2982, 2980, 2976, 2973, 2973, 2969, 2966, 2966, 2963, 2963, 2960, 2959, 2957,

2955, 2953, 2951, 2935, 2932, 2927, 2925, 2921, 2920, 2919, 2849, 2, 2.


25



Hong 670 657MultiStep+ (top t 25) 641 611MultiStep+ (top t 50) 629 611MultiStep+ (top t 100) 648 632

5.2.4 Z-Islands


200K 54 48235

Just as the previous graph, the significant components in this graph are 50roughly equally sized islands. The graph is also generated with the same buck-eting parameters. This graph was generated to better compare the performancedifference for MultiStep and MultiStep+. The graph generation logic for thisgraph is different than for the other graphs. Type 3 bucketing involves a simplelinearly scaling function that scales the probability bonus for an edge to exist.This function scales from 0 to 1 for a highest vertex identifier between 0 and n.The function used for generating this graph, however, is defined as follows: forvertex identifiers between 0 and n/2, the function grows linearly from 0.5 to 1.Then for vertex identifiers between n/2 and n, the function grows from 0 to 0.5,making a Z shape that’s rotated counter-clockwise. The vertex identifier that ischosen from among the two vertices that are involved in an edge is determinedby whichever one would create the largest modifier. Graphs generated by thisfunction are more neutral to the run times of MultiStep vs. MultiStep+ thangraphs generated by the normal type 3 bucketing function. (Type 3 bucketingis implicitly holding MultiStep+ back, see section 6.) NCutoff was chosen at2500.

Bucket Definition Array: {50}Magnitudes Array: {3.75}


3148, 3124, 3116, 3115, 3090, 3089, 3073, 3063, 3062, 3061, 3060, 3059, 3058,3057, 3052, 3052, 3047, 3046, 3045, 3043, 3042, 3041, 3036, 3033, 3029, 3029,3027, 3027, 3025, 3025, 3023, 3022, 3022, 3019, 3019, 3017, 3015, 3015, 3013,2998, 2998, 2994, 2993, 2992, 2988, 2983, 2975, 2975, 2971, 2949, 4, 2, 2, 2.


26


MultiStep 705 674MultiStep+ (top t 25) 724 688MultiStep+ (top t 50) 683 661MultiStep+ (top t 100) 647 647MultiStep+ (top t 200) 648 650

6 Discussion

Some programming oversights were encountered late in development. Notably,the code creates Thread objects and assigns it a Runnable whenever it needsone or more worker threads. When that thread is finished, the garbage collectorcleans up the object. Creating and tearing down threads in this way has amuch larger overhead cost than using a thread pool with persistent threads. Ina thread pool, the main thread would submit Runnable tasks to a task queue.Idle threads take Runnable from this task queue and execute them. A threadthat finishes a task isn’t destroyed, but checks the task queue for more tasksand goes back to idle state if there are no tasks. This method doesn’t incur thecost of creating and registering the thread in the host operating system everytime a worker thread is needed to execute a task. The parallelization constantsin section 4.6 are expected to be much lower had the code employed persistentthreads, allowing a greater speedup from parallelism.

Another such oversight was encountered when entering the testing stage.The code does not support a shared memory architecture, and therefore themaximum number of threads that can be experimented with is the maximumallowed by the hardware of a single compute node. This limits the ability to testthe performance gains from scaling to large thread counts. The random graphgenerator was also limited in this way, making a graph size of 200K vertices thelargest practically feasible size. The larger a graph, the better one can measurethe speedup of parallelization. Larger graphs also have greater potential forwasteful color propagation (see fig. 1 in section 3.5), possibly sporting betterrelative run times for MultiStep+.

When doing runtime tests for the ”Islands” graph in section 5.2.3, one wouldexpect MultiStep+ to perform at its best in relation to the other algorithmscompared to other graphs. But as it turns out, even on a graph whose SCClayout should favor MultiStep+ the most, MultiStep+ does not outperformMultiStep on 8 threads. This is when I realized that the continuously increasinglinear function in type 3 bucketing creates a bias against MultiStep+. Coloringassigns each vertex an initial color equal to its vertex identifier. During coloringpropagation steps, vertices propagate their color to their out-neighbors witha lower color. Let us consider the roots after color propagation. These rootsmaintained their initial color as their color wasn’t overwritten, which impliesthat in their backward closure no vertex exists that had a higher initial colorthan theirs. This means that that vertex is likely to have either of the followingproperties:

27

1. the vertex has few in-neighbors and/or a small backward closure.

2. the vertex has a high vertex identifier.

If the root does not have the second property, it is unlikely that it will wastefullypropagate its color through a large component. Let’s consider a root that hasthe second property. Wasteful color propagation occurs when this vertex hasa large forward closure that isn’t part of its SCC. The chance that a largepart of its forward closure isn’t part of its SCC diminishes if the vertex has alarge backward closure. So if this root does not have the first property, it isless likely to be the root of a wasteful propagation sequence. Hence, wastefulpropagation is most likely to occur when roots have both the first and thesecond property. Type 3 bucketing assigns the highest probability modifier toedges between a pair of vertices with at least one high vertex identifier. Thus,type 3 bucketing assigns a large number of in- and out-neighbors to vertices withhigh identifiers. Vertices with a large forward closure are thus also likely to havea large backward closure using type 3 bucketing, forfeiting the need to insureagainst wasteful color propagation, the very thing MultiStep+ attempts to dowith its top t logic. A simpler argument one could make is that there is littlevirtue to manually giving vertices with a high product of degrees the highestcolors when vertices that receive high colors during normal color initializationare already likely to be high in product of degrees due to type 3 bucketing.The ”Z-Islands” graph was generated using a function that gives an averagenumber of incoming and outgoing edges to vertices with the highest identifiers(a modifier of 0.5), thus creating minimal bias against MultiStep+. The Z shapedoes assign a probability modifier less than 0.5 to vertices with identifiers belowthe highest (but higher than n/2). So one could argue that there is a slight biasin favor of MultiStep+ for this graph.

The FW-BW and Hong algorithms in the literature select a random pivotvertex. The code in this implementation selects the first vertex given by theIterator of the collection of remaining vertices, which is a HashSet. The runtimes of Hong specifically may be adversely affected by this, as Hong stops itsparallel FW-BW phase after finding a ”large” SCC that contains at least 1% ofthe graph’s vertices. The Hong algorithm will likely stop this phase at the exactsame component every run, with no guarantee that this SCC is the largest, asall of the test graphs contain several SCCs that match the ”large” criterium.The other significantly sized components will then be clustered into WCCs andlater processed by a recursive FW-BW thread. Hong is either lucky or unluckyon a specific graph, and doing multiple runs will not likely change its pivotseach run.

While the parallelization thresholds in section 4.6 aim to prevent threadsfrom being started for tasks with input sizes that are not worth parallelizingfor, not all algorithmic subroutines and phases are inhibited this way. Somesimply consume all available thread slots. The input sizes for some tasks aren’tnecessarily a clear indicator of the amount of work to be done for those inputs.For example, when dequeueing WCCs from the WCC queue in the recursive FW-BW phase of Hong, the length of the WCC queue will not say anything about

28

the size of the WCCs themselves. Similarly, when MultiStep does backwardBFS from the roots found after stabilization of coloring propagation, the lengthof the queue of roots will not predict how much time is spent on the backwardBFS from each of those roots. Lastly, FW-BW creates recursive instances foreach disjoint subgraph identified. Thus, in these cases, the main thread simplyrequests the maximum number of workers that is available. This in part explainswhy the run times are longer on 8 threaded runs than on 4 threaded runs forspecific test graphs.

One can observe that the run times for the different algorithms are ratherclose together. In fact, on all graphs the run time of the slowest algorithm isless than twice the run time of the fastest algorithm. Part of the explanationfor this are the graph properties that are associated with efficient FW-BW andcoloring performance. FW-BW performs well when the recursive calls creategood parallelism. When a FW-BW instance has identified its SCC and also hasidentified the three sets (forward, backward, remainder) that will be used asinput for recursive calls, if a significant portion of unassigned vertices is presentin each of the three sets, good parallelism will be achieved. Given the highnumber of significantly sized SCCs in all graphs, FW-BW performs decentlyon each of them. Since the run time of the coloring subroutine is related tothe diameter of the graph, coloring can perform poorly if large componentsare decomposed by it. Both MultiStep versions however, remove a large SCCbefore starting the coloring subroutine. If the remaining SCCs are boundedin diameter, coloring will run efficiently. Notice how the relative performanceof MultiStep is at its worst on the Twin Giants graph, where a second largeSCC remains after the removal of the first. The coloring subroutine is takinglong to reach color stabilization while propagating through the remaining largecomponent. Lastly, with graphs containing several medium sized components,good parallelism can be expected from BFS, which is employed by all algorithms.

Solo Giant: MultiStep is the fastest, MultiStep+ does not create speedupover MultiStep on this graph. A larger top t also incurs a performance penalty.One would think that regardless of the size of t, all vertices must be evaluatedfor their product of degrees, and therefore there should be no slowdown whenincreasing t. The main thread however, is slowed when it has to merge theindividual top t of each thread into one global top t. The main thread compareseach element from the top t of each other thread to the lowest element in thetop t of the main, then evicts the lowest element if it is lower. Each of thesesets is a TreeSet, thus is already sorted, but does incur O(log(t)) execution timefor each replacement. When all elements from all sets have been evaluated, themain thread now has the global top t in its set. (This could have been donemore efficiently, see section 8.)

Twin Giants: FW-BW is the fastest by a large margin. Both MultiStepversions and Hong attempt to target a giant SCC early on, but fail to elim-inate all giant SCCs this way, as the graph has two. When MultiStep+ hasdecomposed its first large graph, the other one remains, and likely contains themajority of the top t vertices, possibly rendering top t logic ineffective.

Islands: MultiStep+ outperforms MultiStep on 4 threads. Notice how t =

29

50 gives better results than t = 25. Understandable, as the graph contains50 medium-sized components. One would think that picking t higher than thenumber of medium components in the graph leads to better run times, as someof those top product of degrees vertices may be in the same component, but itseems that 50 is the optimal value for this graph.

Z-Islands: MultiStep+ is the fastest here, with t = 100 being the optimalvalue. Remarkable when one keeps in mind that the ”Islands” graph is nearlyidentical in layout to this graph. It seems that the Z curve on the probabil-ity modifier has indeed increased the importance of the top t logic, allowing itto prevent more wasteful color propagation with a higher t. Another remark-able observation one can make when comparing this graph and ”Islands”, isthe poorer run time of MultiStep, which further reinforces the theory that theprobability modifying function in type 3 bucketing is preventing wasteful colorpropagation. In this graph, wasteful propagation is not being prevented thisway, and thus has more of it, impacting the run time of MultiStep.

7 Conclusion

The MultiStep+ algorithm effectively prioritizes vertices with its alternate colorinitialization logic that gives high colors to vertices with many edges. TheFW-BW, MultiStep, Hong and MultiStep+ SCC detection algorithms were allsuccessfully implemented, albeit with some implementation oversights. Bench-marks were carried out and from them can be concluded that MultiStep+ haspotential on specific graph types. Graphs that are highly ”islandy” in SCC lay-out benefit from the enhancements in MultiStep+. A t should be chosen that islarger than the number of expected islands in the graph for optimal benefit, aslarge components may contain multiple top t vertices in them. Graphs with twoor more large components and the rest islands will still contain one large com-ponent after the removal of the other. This remaining large component likelyholds a large number of the top t vertices, making MultiStep+ less efficient.Further algorithmic adaptations may solve this weakness however. (See section8.)

The random graph generation algorithm was developed and implemented. Itis able to generate graphs with various layouts: Dense, sparse, single giant, multigiant, and islandy are all achievable graph topologies. Furthermore, the graphsgenerated by it were efficaciously used in the benchmarks of the implementedSCC detection algorithms to provide meaningful results.

Find the source code of this project at https://drive.google.com/drive/folders/1_8xGEbuJ-LIMf5Ir91kkU7ttOkqP_lQW?usp=sharing

8 Future Work and Ideas

In MultiStep+ the top t may be absorbed by a remaining giant component afterthe single FW-BW pass. In order to combat this, the algorithm could do several

30

https://drive.google.com/drive/folders/1_8xGEbuJ-LIMf5Ir91kkU7ttOkqP_lQW?usp=sharing

https://drive.google.com/drive/folders/1_8xGEbuJ-LIMf5Ir91kkU7ttOkqP_lQW?usp=sharing

FW-BW passes, and instead of only from the vertex with the highest product ofdegrees, do so from the top u. If a to be decomposed graph is suspected to have3 giant components and 50 islands, a u could be chosen such that it catches allgiants in FW-BW passes, and a t could be chosen such that it catches all islandsin the first coloring iteration.

Iterative trimming in FW-BW in every recursive instance could be improvedby supplying a set of vertices to start with. Instead of rescanning the entiregraph of this instance for trimmable vertices, only scan the vertices in the sup-plied set. The set is supplied by the SCC decomposition step in the FW-BWinstance that created the current. That instance can pass information to recur-sive instances it creates through the constructor. The non-SCC vertices that areneighbors of vertices in the SCC are the only vertices that lost a neighbor andthus the only ones that could potentially be trimmed. In essence this is similarto how iterative trimming populates a set of vertices potentially eligible for trim-ming in the next iteration by collecting neighbors of trimmed vertices in thisset. The same concept can be applied to decomposed SCCs. This would signif-icantly reduce the number of cpu cycles spent on counting how many neighborsa vertex still has, since massively fewer vertices are analyzed this way.

Combining the top t of different threads together can be done more efficiently.Let the main thread request the descendingIterator of the top t of each of the kworker threads. A descendingIterator iterates through the elements of the sortedcollection in order from the highest to the lowest. Track a boolean ”canSkip”flag for each each of these iterators and initialize to false. Initialize variable”goalSet” to the top t of the main thread. Then while at least one of thecanSkip flags is false, for all iterators, if the canSkip flag of this iterator isfalse, compare the next element from it to the lowest element in goalSet. If theelement is higher, insert it to goalSet and evict the old number, else set thecanSkip flag of this iterator to true. This method inserts the highest elementof each iterator first, minimizing the chance that an element that is insertedneeds to be evicted later. When an element of an iterator isn’t inserted, we nolonger have to look at the rest of the elements in this iterator, as they are alllower than the last evaluated element. The lowest element in goalSet can onlyever get higher, meaning that it is impossible that subsequent elements of thisiterator would be inserted. We continue until the canSkip flag of all iteratorsis set. Now goalSet should have the highest t elements out of the (k + 1) ∗ telements in all sets. This method could reduce the overhead of running highervalues of t as well as reduce the overhead of using a higher number of threadsto find the top t, allowing for a lower parallelization threshold.

References

[1] Das-4: Distributed ASCI supercomputer, 2019. URL:https://www.cs.vu.nl/das4/clusters.shtml.

[2] Henri E. Bal, Dick H. J. Epema, Cees de Laat, Rob van Nieuwpoort,

31

John W. Romein, Frank J. Seinstra, Cees Snoek, and Harry A. G. Wi-jshoff. A medium-scale distributed system for computer science research:Infrastructure for the long term. IEEE Computer, 49:54–63, 2016.

[3] Jirı Barnat, Jakub Chaloupka, and Jaco van de Pol. Improved distributedalgorithms for SCC decomposition. Electronic Notes in Theoretical Com-puter Science, 198(1):63–77, 2008.

[4] Vincent Bloemen. On-the-fly parallel decomposition of strongly connectedcomponents. Master’s thesis, University of Twente, 2015.

[5] Vincent Bloemen, Alfons Laarman, and Jaco van de Pol. Multi-core on-the-fly scc decomposition. PPoPP, 8:1–12, 2016.

[6] Reggie Ebendal. Divide-and-conquer algorithm for parallel computation ofterminal strongly connected components. 2015. Bachelor’s Thesis, VrijeUniversiteit Amsterdam.

[7] Lisa K. Fleischer, Bruce Hendrickson, and Ali Pinar. On identifyingstrongly connected components in parallel. In International Parallel andDistributed Processing Symposium, volume 1800 of LNCS, pages 505–511,2000.

[8] Sungpack Hong, Nicole C. Rodia, and Kunle Olukotun. On fast paralleldetection of strongly connected components (SCC) in small-world graphs.In proceedings of the International Conference on High Performance Com-puting, Networking, Storage and Analysis, pages 92–102, 2013.

[9] William McLendon III, Bruce Hendrickson, Steven J Plimpton, andLawrence Rauchwerger. Finding strongly connected components in dis-tributed graphs. Journal of Parallel and Distributed Computing, 65(8):901–910, 2005.

[10] Gavin Lowe. Concurrent depth-first search algorithms based on tarjan’salgorithm. STTT, 18(2):129–147, 2016.

[11] Vera Matei. Parallel algorithms for detecting strongly connected compo-nents. Master’s thesis, Vrije Universiteit Amsterdam, 2016.

[12] Simona Orzan. Detecting Strongly Connected Components. Chapter 5 of”On Distributed Verification and Verified Distribution”. PhD thesis, VrijeUniversiteit Amsterdam, 2004. pages 65-83.

[13] Etienne Renault, Alexandre Duret-Lutz, Fabrice Kordon, and Denis Poitre-naud. Parallel explicit model checking for generalized buchi automata.Lecture Notes in Computer Science, 9035:613–627, 2015.

[14] Warren Schudy. Finding strongly connected components in parallel us-ing o(log2(n)) reachability queries. In Proceedings of the twentieth annualSymposium on Parallelism in Algorithms and Architectures, pages 146–151.ACM, 2008.

32

[15] George M. Slota, Sivasankaran Rajamanickam, and Kamesh Madduri. BFSand coloring-based parallel algorithms for strongly connected componentsand related problems. In Proceedings of the IEEE 28th International Par-allel and Distributed Processing Symposium, pages 550–559, 2014.

[16] Robert E. Tarjan. Depth-first search and linear graph algorithms. SIAMJournal on Computing, 1(2):146–160, 1972.

33

parallel detection of strongly connected components with

Documents