
Elixir: A System for Synthesizing Concurrent Graph Programs ∗

Dimitrios Prountzos    Roman Manevich    Keshav Pingali
The University of Texas at Austin, Texas, USA

[email protected],[email protected],[email protected]

Abstract

Algorithms in new application areas like machine learning and network analysis use “irregular” data structures such as graphs, trees and sets. Writing efficient parallel code in these problem domains is very challenging because it requires the programmer to make many choices: a given problem can usually be solved by several algorithms, each algorithm may have many implementations, and the best choice of algorithm and implementation can depend not only on the characteristics of the parallel platform but also on properties of the input data such as the structure of the graph.

One solution is to permit the application programmer to experiment with different algorithms and implementations without writing every variant from scratch. Auto-tuning to find the best variant is a more ambitious solution. These solutions require a system for automatically producing efficient parallel implementations from high-level specifications. Elixir, the system described in this paper, is the first step towards this ambitious goal. Application programmers write specifications that consist of an operator, which describes the computations to be performed, and a schedule for performing these computations. Elixir uses sophisticated inference techniques to produce efficient parallel code from such specifications.

We used Elixir to automatically generate many parallel implementations for three irregular problems: breadth-first search, single-source shortest path, and betweenness-centrality computation. Our experiments show that the best generated variants can be competitive with handwritten code for these problems from other research groups; for some inputs, they even outperform the handwritten versions.

∗ This work was supported by NSF grants 0833162, 1062335, and 1111766 as well as grants from the IBM, Intel and Qualcomm Corporations.

[Copyright notice will appear here once ’preprint’ option is removed.]

Categories and Subject Descriptors D.1.3 [Programming Techniques]: Concurrent Programming—Parallel Programming; I.2.2 [Artificial Intelligence]: Automatic Programming—Program synthesis

General Terms Algorithms, Languages, Performance, Verification

Keywords Synthesis, Compiler Optimization, Concurrency, Parallelism, Amorphous Data-parallelism, Irregular Programs, Optimistic Parallelization.

1. Introduction

New problem domains such as machine learning and social network analysis are giving rise to applications with “irregular” data structures like graphs, trees, and sets. Writing portable parallel programs for such applications is challenging for many reasons.

The first reason is that programmers usually have a choice of many algorithms for solving a given problem: even a relatively simple problem like the single-source shortest path (SSSP) problem in a directed graph can be solved using Dijkstra’s algorithm [11], the Bellman-Ford algorithm [11], the label-correcting algorithm [22], and delta-stepping [22], among others. These algorithms are described in more detail in Sec. 2, but what is important here is to note that which algorithm is best depends on many complex factors.

• There are complicated trade-offs between parallelism and work-efficiency in these algorithms; for example, Dijkstra’s algorithm is very work-efficient but it has relatively little parallelism, whereas the Bellman-Ford and label-correcting algorithms can exhibit a lot of parallelism but may be less work-efficient. Therefore, the best algorithmic choice may depend on the number of cores that are available to solve the problem.

• The amount of parallelism in irregular graph algorithms is usually dependent also on the structure of the input graph. For example, regardless of which algorithm is used, there is little parallelism in the SSSP problem if the graph is a long chain (more generally, if the diameter of the graph is large); conversely, for graphs that have a small diameter, such as those that arise in social network applications, there may be a lot of parallelism that can be exploited by the Bellman-Ford and label-correcting algorithms. Therefore, the best algorithmic choice may depend on the size and structure of the input graph.

• The best algorithmic choice may depend on the core architecture. If SIMD-style execution is supported efficiently by the cores, as is the case with GPUs, the Bellman-Ford algorithm may be preferable to the label-correcting or delta-stepping algorithms. Conversely, for MIMD-style execution, label-correcting or delta-stepping may be preferable.

Another reason for the difficulty of parallel programming of irregular applications is that even for a given algorithm, there are usually a large number of implementation choices that must be made by the performance programmer. Each of the SSSP algorithms listed above has a host of implementations; for example, label corrections in the label-correcting algorithm can be scheduled in FIFO, LIFO and other orders, and the scheduling policy can make a big difference in the overall performance of the algorithm, as we show experimentally in Sec. 6. Similarly, delta-stepping has a parameter that can be tuned to increase parallelism at the cost of performing extra computation. As in other parallel programs, synchronization can be implemented using spin-locks, abstract locks or CAS operations. These choices can affect performance substantially, but even expert programmers cannot always make the right choices.

1.1 Synthesis of Irregular Programs

One promising approach for addressing these problems is program synthesis. Instead of writing programs in a high-level language like C++ or Java, the programmer writes a higher-level specification of what needs to be computed, leaving it to an automatic system to synthesize efficient parallel code for a particular platform from that specification. This approach has been used successfully in domains like signal processing, where mathematics can be used as a specification language [29]. However, irregular graph problems do not have mathematical structure that can be exploited to generate implementation variants.

In this paper, we describe a system called Elixir that synthesizes parallel programs for shared-memory multicore processors, starting from irregular algorithm specifications based on the operator formulation of algorithms [25]. The operator formulation is a data-centric description of algorithms in which algorithms are expressed in terms of their action on data structures rather than in terms of program-centric constructs like loops. There are three key concepts: active elements, operator, and ordering.

Active elements are sites in the graph where there is computation to be done. For example, in SSSP algorithms, each node has a label that is the length of the shortest known path from the source to that node; if the label of a node is updated, it becomes an active node since its immediate neighbors must be examined to see if their labels can be updated as well.

The operator is a description of the computation that is done at an active element. Applying the operator to an active element creates an activity. In general, an operator reads and writes graph elements in some small region containing the active element. These elements are said to constitute the neighborhood of this activity.

The ordering specifies constraints on the processing of active elements. In unordered algorithms, it is semantically correct to process active elements in any order, although different orders may have different work-efficiency and parallelism. A parallel implementation may process active elements in parallel provided the neighborhoods do not overlap. The non-overlapping criterion can be relaxed by using commutativity conditions, but we do not consider these in this paper. The preflow-push algorithm for maxflow computation, Boruvka’s minimal spanning tree algorithm, and Delaunay mesh refinement are examples of unordered algorithms. In ordered algorithms, on the other hand, there may be application-specific constraints on the order in which active elements are processed. Discrete-event simulation is an example: any node with an incoming message is an active element, and messages must be processed in time order.

The specification language described in this paper permits application programmers to specify (i) the operator, and (ii) the schedule for processing active elements; Elixir takes care of the rest of the process of generating parallel implementations. Elixir addresses the following major challenges.

• How do we design a language that permits operators and scheduling policies to be defined concisely by application programmers?

• The execution of an activity can create new active elements in general. How can newly created active elements be discovered incrementally, without having to re-scan the entire graph?

• How should synchronization be introduced to make activities atomic?

Currently, there are two main restrictions in the specification language. First, Elixir supports only operators for which neighborhoods contain a fixed number of nodes and edges. Second, Elixir does not support mutations of the graph structure, so algorithms such as Delaunay mesh refinement cannot currently be expressed in Elixir. We believe Elixir can be extended to handle such algorithms, but we leave this for future work.

The rest of this paper is organized as follows. Sec. 2 presents the key ideas and challenges, using a number of algorithms for the SSSP problem. Sec. 3 formally presents the Elixir graph programming language and its semantics. Sec. 4 describes our synthesis techniques. Sec. 5 describes our auto-tuning procedure for automatically exploring implementations. Sec. 6 describes our empirical evaluation of Elixir. Sec. 7 discusses related work and Sec. 8 concludes the paper.

2. Overview

In this section, we present the main ideas behind Elixir, using the SSSP problem as a running example. In Sec. 2.1 we discuss the issues that arise when SSSP algorithms are written in a conventional programming language. In Sec. 2.2, we describe how operators can be specified in Elixir independently of the schedule, and how a large number of different scheduling policies can be specified abstractly by the programmer. In particular, we show how the Dijkstra, Bellman-Ford, label-correcting and delta-stepping algorithms can be specified in Elixir just by changing the scheduling specification. Finally, in Sec. 2.3 we sketch how Elixir addresses the two most important synthesis challenges: how to synthesize efficient implementations, and how to insert synchronization.

2.1 SSSP Algorithms and the Relaxation Operator

Given a weighted graph and a node called the source, the SSSP problem is to compute the distance of the shortest path from the source node to every other node (we assume the absence of negative-weight cycles). As mentioned in Sec. 1, there are many sequential and parallel algorithms for solving the SSSP problem, such as Dijkstra’s algorithm [11], the Bellman-Ford algorithm [11], the label-correcting algorithm [22], and delta-stepping [22]. In all algorithms, each node a has an integer attribute ad that holds the length of the shortest known path to that node from the source. This attribute is initialized to ∞ for all nodes other than the source, where it is set to 0, and is then lowered to its final value using iterative edge relaxation: if ad is lowered and (i) there is an edge (a, b) of length wab, and (ii) bd > ad + wab, the value of bd is lowered to ad + wab. However, the order in which edge relaxations are performed can be different for different algorithms, as we discuss next.

In Fig. 1 we present Java-like pseudocode for the sequential label-correcting and Bellman-Ford SSSP algorithms, with the edge relaxation highlighted in grey. Although both algorithms use relaxations, they may perform them in different orders and different numbers of times. We call this order the schedule for the relaxations. The label-correcting algorithm maintains a worklist of edges for relaxation. Initially, all edges connected to the source are placed on this worklist. At each step, an edge (a, b) is removed from the worklist and relaxed; if there are several edges on the worklist, the edge to be relaxed is chosen heuristically. If the value of bd is lowered, all edges connected to b are placed on the worklist for relaxation. The algorithm terminates when the worklist is empty. The Bellman-Ford algorithm performs edge relaxations in rounds. In each round, all the graph edges are relaxed in some order. A total of |V| − 1 rounds are performed, where |V| is the number of graph nodes.

Label Correcting

INITIALIZATION:
for each node a in V {
  if (a == Src) ad = 0;
  else ad = INFINITY;
}
RELAXATION:
Wl = new worklist(); // init worklist
for each e = (Src, _, _) { Wl.add(e); }
while (!Wl.empty()) {
  (a, b, w) = Wl.get();
  if (ad + w < bd) {
    bd = ad + w;
    for each e : outEdg(b)
      Wl.add(e);
  }
}

Bellman-Ford

INITIALIZATION:
for each node a in V {
  if (a == Src) ad = 0;
  else ad = INFINITY;
}
RELAXATION:
for i = 1 .. |V| − 1 {
  for each e = (a, b, w) {
    if (ad + w < bd) {
      bd = ad + w;
    }
  }
}

Figure 1: Pseudocode for label-correcting and Bellman-Ford SSSP algorithms.

Although both algorithms are built using the same basic ingredient, as Fig. 1 shows, it is not easy to change from one to another. This is because the code for the relaxation operator is intertwined intimately with the code for maintaining worklists, which is an artifact of how the relaxations are scheduled by a particular algorithm. In a concurrent setting, the code for synchronization makes the programs even more complicated.
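To make the intertwining concrete, below is a minimal runnable C++ sketch of the label-correcting loop of Fig. 1 (the names Edge, label_correcting, and the adjacency-list encoding are illustrative, not from the paper). Note how the relaxation operator cannot be separated from the worklist code that encodes its schedule, here a FIFO order.

#include <cstdint>
#include <deque>
#include <limits>
#include <tuple>
#include <vector>

using Edge = std::tuple<int, int, int64_t>;              // (src, dst, weight)
using AdjList = std::vector<std::vector<std::pair<int, int64_t>>>;

// Label-correcting SSSP: the relaxation operator is woven into the
// worklist management, which encodes the schedule (here: FIFO).
std::vector<int64_t> label_correcting(const AdjList& out, int src) {
    const int64_t INF = std::numeric_limits<int64_t>::max() / 2;
    std::vector<int64_t> dist(out.size(), INF);
    dist[src] = 0;
    std::deque<Edge> wl;                                 // worklist of edges
    for (auto [b, w] : out[src]) wl.push_back({src, b, w});
    while (!wl.empty()) {
        auto [a, b, w] = wl.front(); wl.pop_front();
        if (dist[a] + w < dist[b]) {                     // edge relaxation
            dist[b] = dist[a] + w;
            for (auto [c, w2] : out[b])                  // enable b's out-edges
                wl.push_back({b, c, w2});
        }
    }
    return dist;
}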

2.2 SSSP in Elixir

Fig. 2 shows several SSSP algorithms written in Elixir. The major components of the specification are the following.

2.2.1 Operator Specification

In Fig. 2, lines 1–2 define the graph. Nodes and edges are represented abstractly by relations that have certain attributes. Each node has a unique ID and an integer attribute dist; during the execution of the algorithm, the dist attribute of a node keeps track of the shortest known path to that node from the source. Edges have a source node, a destination node, and an integer attribute wt, which is the length of that edge. Line 4 defines the source node. SSSP algorithms use two operators, one called initDist to initialize the dist attribute of all nodes (lines 6–7), and another called relaxEdge to perform edge relaxations (lines 9–13).

Operators are described by rewrite rules in which the left-hand side is a predicated subgraph pattern, and the right-hand side is an update.

A predicated subgraph pattern has two parts, a shape constraint and a value constraint. A subgraph G′ of the graph is said to satisfy the shape constraint of an operator if there is a bijection between the nodes in the shape constraint and the nodes in G′ that preserves the edge relation. The shape constraint in the initDist operator is satisfied by every node in the graph, while the one in the relaxEdge operator is satisfied by every pair of nodes connected by an edge. A value constraint filters out some of the subgraphs that satisfy the shape constraint by imposing restrictions on the values of attributes; in the case of the relaxEdge operator, the conjunction of the shape and value constraints restricts attention to pairs of nodes (a, b) which have an edge between them, and whose dist attributes satisfy the constraint ad + wab < bd. A subgraph that satisfies both the shape and value constraints of an operator is said to match the predicated subgraph pattern of that operator, and will be referred to as a redex of that operator; it is a special case of the general notion of neighborhoods in the operator formulation [25].

The right-hand side of a rewrite rule specifies updates to some of the value attributes of nodes or edges in a subgraph matching the predicated subgraph pattern on the left-hand side of that rule. In this paper, we restrict attention to local computation algorithms [25], which are not allowed to morph the graph structure by adding or removing nodes and edges.

Elixir allows disjunctive operators of the form

op1 or ... or opk

where all operators opi share a common shape constraint. The Betweenness Centrality programs discussed in Sec. 6 use disjunctive operators with 2 disjuncts.

Statements define how operators are applied to the graph. A looping statement has one of the forms ‘foreach op’, ‘for i = low..high op’, or ‘iterate op’, where op is an operator. A foreach statement finds all matches of the given operator and applies the operator once to each matched subgraph, in some unspecified order. Line 15 defines the initialization statement to be the application of initDist once to each node. A for statement applies an operator once for each value of i between low and high. An iterate statement applies an operator ad infinitum by repeatedly finding some subgraph that matches the left-hand side of the operator and applying the operator there. The statement terminates when no subgraphs match the left-hand side of the operator. Line 16 expresses the essence of the SSSP computation as the repeated application of the relaxEdge operator (for now, ignore the text “>> sched”). It is the responsibility of the user to guarantee that iterate arrives at a fixed point after a finite number of steps by specifying meaningful value constraints. Finally, line 17 defines the entire computation to be the initialization followed by the distance computation.

Elixir programs can be executed sequentially by repeatedly searching the graph until a redex is found, and then applying the operator there. Three optimizations are needed to make this baseline, non-deterministic interpreter efficient.

1. Even in a sequential implementation, the order in whichredexes are executed can be important for work-efficiencyand locality. The best order may problem-dependent, soit is necessary to give the application programmer con-

1 Graph [ nodes(node : Node, dist : int )2 edges(src : Node, dst : Node, wt : int ) ]3

4 source : Node5

6 initDist = [ nodes(node a, dist d) ] →7 [ d = if (a == source) 0 else ∞]8

9 relaxEdge = [ nodes(node a, dist ad)10 nodes(node b, dist bd)11 edges(src a, dst b, wt w)12 ad + w< bd ] →13 [ bd = ad + w ]14

15 init = foreach initDist16 sssp = iterate relaxEdge � sched17 main = init ; sssp

Algorithm Schedule specificationDijkstra sched = metric ad � group bLabel-correcting sched = group b� approx metric ad � unroll 2

∆-stepping-style DELTA : unsigned intsched = metric (ad + w) / DELTA

Bellman-Ford

NUM NODES : unsigned int// override ssspsssp = for i =1..( NUM NODES−1)

stepstep = foreach relaxEdge

Figure 2: Elixir programs for SSSP algorithms.

trol over scheduling. Sec. 2.2.2 gives an overview of thescheduling constructs in Elixir.

2. To avoid scanning the graph repeatedly to find redexes, it is desirable to maintain a worklist of potential redexes in the graph. The application of an operator may enable and disable redexes, so the worklist needs to be updated incrementally whenever an operator is applied to the graph. The worklist can be allowed to contain a superset of the set of actual redexes in the graph, provided an item is tested when it is taken off the worklist for execution (see the sketch after this list). Sec. 2.3.1 gives a high-level description of how Elixir maintains worklists.

3. In a parallel implementation, each activity should appear to have executed atomically. Therefore, Elixir must insert appropriate synchronization. Sec. 2.3.2 describes some of the main issues in doing this.
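The following schematic C++ sketch (the names iterate, guard, apply, and delta are illustrative, not Elixir's generated API) shows the execution discipline implied by items 1–3: pop a potential redex, re-test its guard because the worklist may overapproximate the true redexes, apply the operator, and merge the newly enabled redexes.

#include <deque>

// Schematic engine for 'iterate op'. The worklist may hold a superset
// of the true redexes, so every popped item is re-validated against the
// operator's guard before the operator fires.
template <typename Graph, typename Match, typename Op, typename DeltaFn>
void iterate(Graph& g, std::deque<Match> wl, Op op, DeltaFn delta) {
    while (!wl.empty()) {
        Match mu = wl.front();
        wl.pop_front();
        if (!op.guard(g, mu)) continue;      // stale item: guard now disabled
        op.apply(g, mu);                     // must appear atomic in parallel runs
        for (const Match& m : delta(g, mu))  // incrementally enabled redexes
            wl.push_back(m);
    }
}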

2.2.2 Scheduling Constructs

Elixir provides a compositional language for specifying commonly used scheduling strategies declaratively, and automatically synthesizes efficient implementations of them.

We use Dijkstra-style SSSP computation to present the key ideas of our language. This algorithm maintains nodes in a priority queue, ordered by the distance attribute of the nodes. In each iteration, a node of minimal priority is removed from the priority queue, and relaxations are performed on all outgoing edges of this node. This is described by the composition of two basic scheduling policies.

1. Given a choice between relaxing edge e1 = (a1, b1) and edge e2 = (a2, b2) where ad1 < ad2, give e1 a higher priority for execution. In Elixir, this is expressed by the specification metric ad.

2. To improve spatial and temporal locality, it is desirable to co-schedule active edges that have the same source node a, in preference to interleaving the execution of edges of the same priority from different nodes. In Elixir this is expressed by the specification group b, which groups together relaxEdge applications on all neighbors b of node a. This can be viewed as a refinement of the metric ad specification, and the composition of these policies is expressed as metric ad >> group b.

These two policies exemplify two general scheduling schemes: dynamic and static scheduling. Scheduling strategies that bind the scheduling of redexes at runtime are called dynamic scheduling strategies, since they determine the priority of a redex using values known only at runtime. Typically, they are implemented via a dynamic worklist data structure that prioritizes its contents based on the specific policy. In contrast, static scheduling strategies, such as grouping, bind scheduling decisions at compile time and are reflected in the structure of the source code that implements composite operators out of combinations of basic ones. One of the contributions of this paper is the combination of static and dynamic scheduling strategies in a single system. The main scheduling strategies supported by Elixir are the following.

metric e The arithmetic expression e over the variables of the redex is the priority function. In practice, many algorithms use priorities heuristically, so they can tolerate some amount of priority inversion in scheduling. Exploiting this fact can lead to more efficient implementations, so Elixir supports a variant called approx metric e.

group V specifies that every redex pattern node v ∈ V should be matched in all possible ways. Thus, it applies the grouping refinement for every operator referring to v.

unroll k Some implementations of SSSP perform two-level relaxations: when an edge (a, b) is relaxed, the outgoing edges of b are co-scheduled for relaxation if they are active, since this improves spatial and temporal locality. This can be viewed as a form of loop unrolling. Elixir supports k-level unrolling, where k is under the control of the application programmer.

(op1 or op2) >> fuse specifies that instances of op1 and op2 working on the same redex should be fused into a new composite operator. Fusing improves locality and amortizes the cost of acquiring and releasing the locks necessary to guarantee atomic operator execution.

The group, unroll and fuse operations define static scheduling strategies. We use the language of Nguyen et al. [23] to define a series of dynamic scheduling policies that combine metric with LIFO and FIFO policies, and use implementations of these worklists from the Galois framework [2].

Fig. 2 shows the use of Elixir scheduling constructs to define a number of SSSP implementations. The label-correcting variant [22] is an unordered algorithm, which on each step starts from a node and performs relaxations on all incident edges, up to two “hops” away. The delta-stepping variant [22] operates on single edges and uses a ∆ parameter to partition redexes into equivalence classes. This heuristic achieves work-efficiency by processing nodes in order of increasing distance from the source, while also exposing parallelism by allowing redexes in the same equivalence class to be processed in parallel. Finally, Bellman-Ford [11] works in a SIMD style by performing a series of rounds in which it processes all edges in the graph.

2.3 Synthesis Challenges

This section gives a brief description of the main challenges that Elixir addresses. First, we discuss how Elixir optimizes worklist manipulation, and second, how it synchronizes code to ensure atomic operator execution.

2.3.1 Synthesizing Work-efficient Implementations

To avoid scanning the graph repeatedly for redexes, it is necessary to maintain a worklist of redexes, and to update this worklist incrementally when a redex is executed, since this might enable or disable other redexes. To understand the issues, consider the label-correcting implementation in Fig. 1, which iterates over all outgoing edges of b and inserts them into the worklist. Since the worklist can be allowed to contain a superset of the set of the redexes (as long as items are checked when they are taken from the worklist), another correct but less efficient solution is to insert all edges incident to either a or b into the worklist. However, the programmer manually reasoned that the only place where new “useful” work can be performed is at the outgoing edges of b, since only bd is updated. Additionally, the programmer could experiment with different heuristics to improve efficiency. For example, before inserting an edge (b, c) into the worklist, the programmer could eagerly check whether bd + wbc ≥ cd.
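A minimal C++ sketch of this incremental rule for relaxEdge, assuming an illustrative adjacency-list encoding (out[b] holds (target, weight) pairs) and a caller-supplied push function: after bd is lowered, only b's outgoing edges can become new redexes, and the eager guard test filters edges that cannot fire.

#include <cstdint>
#include <utility>
#include <vector>

// After relaxEdge lowers dist[b], only b's outgoing edges may become
// new redexes. The eager check skips edges whose guard cannot hold.
template <typename PushFn>
void delta_relax_edge(const std::vector<std::vector<std::pair<int, int64_t>>>& out,
                      const std::vector<int64_t>& dist, int b, PushFn push) {
    for (auto [c, w] : out[b])
        if (dist[b] + w < dist[c])   // eager guard test: bd + wbc < cd
            push(b, c, w);           // enqueue the potential redex
}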

In a general setting with disjunctive operators, different disjuncts may become enabled on different parts of the graph after an operator application. Manually reasoning about where to apply such incremental algorithmic steps can be daunting. Elixir frees the programmer from this task. In Fig. 2 there is no code dealing with that aspect of the computation; Elixir automatically synthesizes the worklist updates and also allows the programmer to easily experiment with heuristics like the above without having to write much code.

Another means of achieving work-efficiency is using a good priority function to schedule operator applications. In certain implementations of algorithms such as betweenness centrality and breadth-first search, the algorithm transitions through different priority levels in a very structured manner. Elixir can automatically identify such cases and synthesize customized dynamic schedulers that are optimized for the particular iteration patterns.

2.3.2 Synchronizing Operator Execution

To guarantee correctness in the context of concurrent execution, the programmer must make sure that operators execute atomically. Although it is not hard to insert synchronization code into the basic SSSP relaxation step, the problem becomes more complex once scheduling strategies like unroll and group are used, since the resulting “super-operator” code can be quite complex. There are also many synchronization strategies that could be used, such as abstract locks, concrete locks, and lock-free constructs like CAS instructions, and the trade-offs between them are not always clear even to expert programmers.

Elixir frees the programmer from having to worry about these issues because it automatically introduces appropriate fine-grained locking. This allows the programmer to focus on the creative parts of problem solving and still get the performance benefits of parallelism.

3. The Elixir Graph Programming Language

In this section, we formalize our language, whose grammar is shown in Fig. 3. Technically, a graph program defines graph transformations, or actions, that may be used within an application. A graph program first defines a graph type by listing the data attributes associated with its nodes and edges. Next, a program defines global variables, which actions may only read. They may be initialized by the larger application before invoking an action. The graph program then defines operators and actions. Operators define unit transformations that may be applied to a given subgraph. They are used as building blocks in statements that apply operators iteratively. An important limitation of operators is that they may only update data attributes, but not alter the graph structure. Actions compose statements and name them. They compile to C++ functions that take a single graph reference argument.

3.1 Graphs and Patterns

Let Attrs denote a finite set of attributes. An attribute denotes a subtype of one of the following types: the set of numeric values Nums (integers and reals), graph nodes Nodes, and sets of graph nodes ℘(Nodes). Let Vals def= Nums ∪ Nodes ∪ ℘(Nodes) stand for the union of those types.

Definition 3.1 (Graph).¹ A graph G = (V^G, E^G, Att^G), where V^G ⊂ Nodes are the graph nodes, E^G ⊆ V^G × V^G are the graph edges, and Att^G : ((Attrs × V^G) → Vals) ∪ ((Attrs × V^G × V^G) → Vals) associates values with nodes and edges. We denote the set of all graphs by Graph.

¹ Our formalization naturally extends to graphs with several node and edge relations, but for simplicity of the presentation we have just one of each.

attid    Graph attributes
acid     Action identifiers
opid     Operation identifiers
var      Operator variables and global variables
ctype    C++ type

program ::= graphDef global+ opDef+ actionDef+
graphDef ::= Graph [ nodes(attDef+) edges(attDef+) ]
attDef ::= attid : ctype | attid : set[ctype]
global ::= var : ctype
opDef ::= opid = opExp
opExp ::= [ tuple∗ (boolExp) ] → [ assign∗ ]
tuple ::= nodes(att∗) | edges(att∗)
boolExp ::= !boolExp | boolExp & boolExp | arithExp < arithExp
          | arithExp == arithExp | var in setExp
arithExp ::= number | var | arithExp + arithExp | arithExp - arithExp
          | if (boolExp) arithExp else arithExp
setExp ::= empty | {var} | setExp + setExp | setExp - setExp
assign ::= var = arithExp | var = setExp | var = boolExp
att ::= attid var
actionDef ::= acid = stmt
stmt ::= iterate schedExp | foreach schedExp
       | for var = arithExp .. arithExp stmt | acid
       | invariant? stmt invariant?
       | stmt; stmt
schedExp ::= ordered | unordered
unordered ::= disjuncts
ordered ::= opsExp fuseTerm? groupTerm? metricTerm
disjuncts ::= disjunctExp | disjunctExp or disjuncts
disjunctExp ::= statExp dynSched
opsExp ::= opid | opid or opsExp
statExp ::= opsExp fuseTerm? groupTerm? unrollTerm?
dynSched ::= approxMetricTerm? timeTerm?
fuseTerm ::= >> fuse
groupTerm ::= >> group var∗
unrollTerm ::= >> unroll number
metricTerm ::= >> metric arithExp
approxMetricTerm ::= >> approx metric arithExp
timeTerm ::= >> LIFO | >> FIFO

Figure 3: Elixir language grammar (EBNF). The notation e? means that e is optional.

Definition 3.2 (Pattern). A pattern P = (V^P, E^P, Att^P) is a connected graph over variables. Specifically, V^P ⊂ Vars are the pattern nodes, E^P ⊆ V^P × V^P are the pattern edges, and Att^P : ((Attrs × V^P) → Vars) ∪ ((Attrs × V^P × V^P) → Vars) associates a distinct variable (not in V^P) with each node and edge. We call the latter set of variables attribute variables. We refer to (V^P, E^P) as the shape of the pattern.

In the sequel, when no confusion is likely, we may drop superscripts denoting the association between a component and its containing compound type instance, e.g., G = (V, E).

Definition 3.3 (Matching). Let G be a graph and P be a pattern. We say that µ : V^P → V^G is a matching (of P in G), written (G, µ) |= P, if it is one-to-one, and for every edge (x, y) ∈ E^P there exists an edge (µ(x), µ(y)) ∈ E^G. We denote the set of all matchings by Match : Vars → Nodes.

We extend a matching µ : V^P → V^G to evaluate attribute variables, µ : Vars → Vals, as follows. For every attribute a, pattern nodes y, z ∈ V^P, and attribute variable x, we define:

µ(x) = Att^G(a, µ(y))        if Att^P(a, y) = x
µ(x) = Att^G(a, µ(y), µ(z))  if Att^P(a, y, z) = x .


Lastly, we extend µ to evaluate expressions over the variables of a pattern by structural induction over the natural definitions of the sub-expression types defined in Fig. 3.

3.2 Operators

We denote an operator by op = [R^op, Gd^op] → [Upd^op], where R^op is called the redex pattern; Gd^op is a Boolean-valued expression over the variables of the redex pattern, called the guard; and Upd^op : V^R → Exprs contains an assignment per attribute variable in the redex pattern, in terms of the variables of the redex pattern (for brevity, we omit identity assignments).

We now define the semantics of an operator as a function that transforms a graph for a given matching, [[·]] : opExp → (Graph × Match) → Graph. Let op = [R, Gd] → [Upd] be an operator and let µ : V^R → V^G be a matching (of R in G). We say that µ satisfies the shape constraint of op if (G, µ) |= R. We say that µ satisfies the value constraint of op (and shape constraint), written (G, µ) |= R, Gd, if µ(Gd) = True. In such a case, µ induces the subgraph D = µ(R), which we call a redex and define by:

V^D def= {µ(x) | x ∈ V^R}
E^D def= {(µ(x), µ(y)) | (x, y) ∈ E^R}
Att^D def= {(a, Att^G(a, u)), (b, Att^G(b, v, w)) | a, b ∈ Attrs, u ∈ V^D, (v, w) ∈ E^D} .

We define

[[op]](G, µ) = G′ = (V^G, E^G, Att′)   if (G, µ) |= R^op, Gd^op;
[[op]](G, µ) = G                       otherwise,

where D = µ(R^op) and the node and edge attributes in D are updated using the expressions in Upd^op:

Att′(a, v) = µ(Upd^op(y))   if v ∈ V^D, v = µ(x_v), and Att^R(a, x_v) = y;
Att′(a, v) = Att(a, v)      otherwise.

Att′(a, u, v) = µ(Upd^op(y))    if (u, v) ∈ E^D, u = µ(x_u), v = µ(x_v), and Att^R(a, x_u, x_v) = y;
Att′(a, u, v) = Att(a, u, v)    otherwise.
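As a concrete instance of these definitions (an illustration, not taken from the paper), apply relaxEdge with µ = {a ↦ n1, b ↦ n2} on a graph where Att(dist, n1) = 3, Att(dist, n2) = 9, and Att(wt, n1, n2) = 2. The guard evaluates to µ(ad + w < bd) = (3 + 2 < 9) = True, so D consists of n1, n2 and the edge (n1, n2), and the single non-identity update gives Att′(dist, n2) = µ(ad + w) = 5; all other attributes are copied unchanged.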

The remainder of this section defines the semantics of statements. iterate and foreach statements have two distinct flavors: unordered iteration and ordered iteration. We define them in that order. We do not define for statements, as their semantics is quite standard in all imperative languages.

3.3 Semantics of Unordered Statements

Unordered statements have the form ‘iterate unordExp’ or ‘foreach unordExp’, where unordExp uses the or operator, which we will refer to as disjunction, to combine expressions of the form

opsExp >> statExp >> dynSched .

Intuitively, a disjunction represents alternative graph transformations.

The expression opsExp is either a single operator op or a disjunction of operators op1 or ... or opk having the same shape (R^op1 = ... = R^opk). We define the shorthand op_i..j = op_i or ... or op_j.

The expression statExp, called a static schedule, is a possibly empty sequence of static scheduling terms, which may include fuse, group, and unroll. If opsExp is a disjunction then it must be followed by a fuse term. An expression of the form opsExp >> statExp defines a composite operator by grouping together operator applications in a statically-defined (i.e., determined at compile-time) way. We refer to such an expression as a static operator.

The expression dynSched, called a dynamic schedule, is a possibly empty sequence of dynamic scheduling terms, which may include approx metric, LIFO, and FIFO. A dynamic schedule determines the order in which static operators are selected for execution by associating a dynamic priority with each redex.

To simplify the exposition, in this paper we present the semantics under the simplifying assumption that statExp is empty. For the full technical treatment, the reader is referred to [27].

3.3.1 Preliminaries

Definition 3.4 (Active Element). An active element, denoted by elem⟨op, µ⟩, pairs an operator op with a matching µ ∈ Match. Intuitively, it means that op is applied on µ. We denote the set of all active elements by A.

We define the set of redexes for an operator and for a disjunction of operators, respectively, by

RDX[[op]]G def= {µ ∈ Match | (G, µ) |= R^op, Gd^op}
RDX[[op_1..k]]G def= RDX[[op_1]]G ∪ ... ∪ RDX[[op_k]]G .

We define the set of redexes of an operator op′ created by an application of an operator op by

DELTA[[op, op′]](G, µ) def= let G′ = [[op]](G, µ) in RDX[[op′]]G′ \ RDX[[op′]]G .

We lift the operation to disjunctions:

DELTA[[op_a, op_c..d]](G, µ) def= ⋃_{c ≤ i ≤ d} DELTA[[op_a, op_i]](G, µ) .

Let R = R^op be the pattern of an operator op and v ⊆ V^R be a subset of the pattern nodes. We require V^R \ v to induce a connected subgraph of R. We define the set of matchings V^R → V^G identifying with µ on the node variables v by

EXPAND[[op, v]](G, µ) def= {µ′ ∈ V^R → V^G | µ|_v = µ′|_v}

We use EXPAND to implement the group static scheduling term and to implement DELTA.


3.3.2 Defining Dynamic Schedulers

Let iterate exp be a statement and let op_1..k be the operators belonging to exp. An iterate statement executes by repeatedly finding a redex for an operator and applying that operator to the redex. We now describe how the metric, approx metric, LIFO, and FIFO scheduling terms define a quasi order over active elements by associating them with priorities as they are created. An execution of iterate results in a (possibly infinite) sequence G = G_0, ..., G_k, ... where each index represents the application of one operator. Let elem⟨op, µ⟩ be an active element created at G_i. For an arithmetic expression a, the terms metric a and approx metric a associate with it the priority µ(a) evaluated at G_i and the quasi order p ≤_{metric a} p′ iff p ≤ p′. The terms LIFO and FIFO associate the priority i and the quasi orders p ≤_{LIFO} p′ iff p ≥ p′ and p ≤_{FIFO} p′ iff p ≤ p′.

We denote the priority given by term t to an active element v constructed at graph G_i by p(t, v, i). We define the priority given by an expression d = t_1 >> ... >> t_k ∈ dynSched by p(d, v, i) = ⟨p(t_1, v, i), ..., p(t_k, v, i)⟩ and the lexicographic quasi order, which is also defined for vectors of different lengths. A prioritized active element elem⟨op, µ, p⟩ is an active element associated with a priority. We denote the set of all prioritized active elements by A_P. For two prioritized active elements v = (op_v, µ_v, p_v) and w = (op_w, µ_w, p_w), we define v ≤ w if p_v ≤ p_w.

We define the type of prioritized worklists by W_P def= A_P*. We say that a prioritized worklist ω = e_1, ..., e_k ∈ W_P is ordered according to a dynamic scheduling expression d ∈ dynSched if for every 1 ≤ i ≤ j ≤ k we have that e_i ≤ e_j, i.e., ω starts with the lowest-priority element. We define PRIORITY[[d]]ω = ω′ if ω′ is a permutation of ω preserving the quasi order induced by d ∈ dynSched. We define the following scheduler-related operations for a dynamic scheduling expression exp:

EMPTY def= ε
POP ω def= (head(ω), tail(ω))
MERGE[[exp]](ω, δ) def= PRIORITY[[exp]](ω ∪ δ)
INIT[[exp]]G def= MERGE[[exp]](ε, RDX[[op_1..k]]G)
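A compact C++ sketch (illustrative; not Elixir's actual generated scheduler) of these operations: priorities are lexicographically compared vectors, the worklist is a binary min-heap, pop returns a minimal element, and merge re-establishes the order via heap insertion.

#include <cstdint>
#include <queue>
#include <vector>

// A prioritized active element: an encoded (op, matching) pair plus a
// priority vector p(d, v, i) = <p(t1, v, i), ..., p(tk, v, i)>.
struct Item {
    std::vector<int64_t> prio;   // compared lexicographically
    int op;                      // operator index in the disjunction
    int a, b;                    // matched nodes (for a two-node pattern)
};

struct ByPrio {
    bool operator()(const Item& x, const Item& y) const {
        return x.prio > y.prio;  // invert: priority_queue pops the minimum
    }
};

using Worklist = std::priority_queue<Item, std::vector<Item>, ByPrio>;

// POP: remove and return a minimal-priority element.
Item pop(Worklist& wl) { Item e = wl.top(); wl.pop(); return e; }

// MERGE: add newly discovered redexes, preserving the quasi order.
void merge(Worklist& wl, const std::vector<Item>& delta) {
    for (const Item& e : delta) wl.push(e);
}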

3.3.3 Iteratively Executing Operators

We define the set of program states as Σ def= Graph ∪ (Graph × Wl). The meaning of statements is given in terms of a transition relation having one of the following forms:

1. ⟨S, σ⟩ =⇒ σ′ means that the statement S transforms the state σ into σ′ and finishes executing;

2. ⟨S, σ⟩ =⇒ ⟨S′, σ′⟩ means that the statement S transforms the state σ into σ′, to which the remaining statement S′ should be applied.

The definition of =⇒ is given by the rules in Fig. 4. The semantics induced by the transition relation yields (possibly infinite) sequences of states σ_1, ..., σ_k, .... A correct concurrent implementation gives the illusion that each transition occurs atomically, even though the executions of different transitions may interleave.

iterate_init starts executing iterate exp by initializing a scheduler with the set of redexes found in G:
⟨iterate exp, G⟩ =⇒ ⟨iterate exp, G + Wl⟩ if Wl = INIT[[exp]]G

iterate_step executes an operator:
⟨iterate exp, G + Wl⟩ =⇒ ⟨iterate exp, G′ + Wl″⟩ if
  (elem⟨op, µ, p⟩, Wl′) = POP Wl
  G′ = [[op]](G, µ)
  ∆ = DELTA[[op, op_1..k]](G, µ)
  Wl″ = MERGE[[exp]](Wl′, ∆)

iterate_done returns the graph when no more operators can be scheduled:
⟨iterate exp, G + EMPTY⟩ =⇒ G

foreach_init, foreach_done: same rules as for iterate.
foreach_step executes an operator:
⟨foreach exp, G + Wl⟩ =⇒ ⟨foreach exp, G′ + Wl′⟩ if
  (elem⟨op, µ, p⟩, Wl′) = POP Wl
  G′ = [[op]](G, µ)

Figure 4: An operational semantics for Elixir statements.


3.4 Semantics of Ordered Statements

Ordered statements have the form

iterate opsExp >> statExp >> metric exp .

The static scheduling expression statExp is the same as in the unordered case, except that we do not allow unroll. The expression opsExp is either a single operator op or a disjunction of operators op1 or ... or opk having the same shape. If opsExp is a disjunction then it is followed by a fuse term.

Prioritized active elements are dynamically partitioned into equivalence classes C_i based on the value of exp. The execution then proceeds as follows: we start by processing active elements from the equivalence class C_0, which has the lowest priority. Applying an operator to active elements from C_i can produce new active elements at other priority levels, e.g., C_j. Once the work at priority level i is done, we start processing work at the next level. We restrict our attention to the class of algorithms where the priority of new active elements is greater than or equal to the priority of existing active elements (i ≤ j). Under this restriction, we are guaranteed never to miss work as we process successive priority levels. The execution terminates when all work (at the highest priority level) is done. All the algorithms that we studied belong to this class. The above execution strategy admits a straightforward and efficient parallelization strategy: associate with each C_i a bucket B_i and have parallel threads process all work in bucket B_i before moving to B_{i+1}. This implements the so-called “level-by-level” parallel execution strategy.
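A sketch of this bucketed discipline (hypothetical types; the real implementation processes each bucket with parallel threads): work at level i is drained completely before level i+1 opens, and monotonicity (i ≤ j) guarantees that a finished bucket never has to be reopened.

#include <map>
#include <vector>

// Level-by-level execution: drain bucket B_i before moving to B_{i+1}.
// apply(w, i, buckets) may insert new work only at levels j >= i.
template <typename Work, typename ApplyFn>
void run_level_by_level(std::map<long, std::vector<Work>>& buckets, ApplyFn apply) {
    while (!buckets.empty()) {
        auto cur = buckets.begin();                // lowest priority level C_i
        long i = cur->first;
        std::vector<Work> level = std::move(cur->second);
        buckets.erase(cur);
        for (const Work& w : level)                // a parallel loop in practice
            apply(w, i, buckets);
    }
}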


3.5 Using Strong Guards for Fixed-Point Detection

Our language allows defining iterate actions that do not terminate for all inputs. It is the responsibility of the programmer to avoid defining such actions. When an action does terminate for a given input, it is the responsibility of the compiler to ensure that the emitted code detects the fixed point and stops.

Let µ be a matching and D = µ(R) be the matched subgraph, and let G′ = [[op]](G, µ). One way to check whether an operator application leads to a fixed point is to check whether the operator has made a change to the redex, i.e., whether Att′^D = Att^D. This requires comparing the result of the operator to a backup copy of the redex created prior to its application. However, this approach is rather expensive. We opt for a more efficient alternative by placing a requirement on the guards of operators, as explained next.

Definition 3.5 (Strong Guard). We say that an operator op has a strong guard if for every matching µ, applying the operator disables the guard. That is, if G′ = [[op]](G, µ) then (G′, µ) ⊭ Gd^op. For example, relaxEdge has a strong guard: applying it sets bd to ad + w, which falsifies the guard ad + w < bd.

A strong guard allows checking (G, µ) ⊭ Gd^op, which involves just reading the attributes of D and evaluating a Boolean expression.

Further, strong guards help us improve the precision of our incremental worklist maintenance by supplying more information to the automatic reasoning procedure, as explained in Sec. 4.3.

Our compiler checks for strong guards at compile time and signals an error to the programmer otherwise (see details in Sec. 4.3). In our experience, strong guards do not limit expressiveness. For efficiency, operators are usually written to act on a graph region in a single step, which leads to disabling their guard.

4. Synthesis

In this section, we explain how to emit code to implement Elixir statements. We use the notation Code(e) for the code fragment implementing the mathematical expression e in a high-level imperative language.

This section is organized as follows. First, we discuss our assumptions regarding the implementation language. Sec. 4.1 describes the synthesis of operator-related procedures. Sec. 4.2 describes the synthesis of the EXPAND operation, which is used to synthesize RDX and as a building block in synthesizing DELTA. Sec. 4.3 describes the synthesis of DELTA via automatic reasoning. Sec. 4.4 puts together the elements needed to synthesize unordered statements. Finally, Sec. 4.5 describes the synthesis of ordered statements.

Implementation Language and Notational Conventions We assume the language contains standard constructs for sequencing, conditions, looping, and evaluation of arithmetic and Boolean expressions such as the ones used in Elixir. Operations on sets are realized by methods on set data structures. We assume that the language allows static typing by the notation v : t, meaning that variable v has type t. To promote succinctness, variables do not require declaration and come into scope upon initialization. We write v_i..j to denote the sequence of variables v_i, ..., v_j. Record types are written as record[f_1..k], meaning that an instance r of the record allows accessing the values of fields f_1..k, written as r[f_i]. We use static loops (loops preceded by the static keyword) to concisely denote loops over a statically-known range, which the compiler unrolls, instantiating the induction variables in the loop body as needed. In defining procedures, we use the notation f[statArgs](dynArgs) to mean that f is specialized for the statically-given arguments statArgs and accepts at runtime the dynamic arguments dynArgs. We note that, since we assume a single graph instance, we will usually not explicitly include it in the generated code.

Graph Data Structure. We assume the availability of a graph data structure supporting methods for reading and updating attributes, and scanning the outgoing and incoming edges of a given node. The code statements corresponding to these methods are as follows. Let v_n and v_m be variables referencing the graph nodes n and m, respectively. Let a be a node attribute and b be an edge attribute. Let d_a and d_b be variables of the appropriate types for attributes a and b, having the values d and d′, respectively.

• d_a := get(a, v_n) assigns Att^G(a, n) to d_a, and d_b := get(b, v_n, v_m) assigns Att^G(b, n, m) to d_b.

• set(a, v_n, d_a) updates the value of the attribute a on the node n to d: Att′^G = Att^G[(a, n) ↦ d], and set(b, v_n, v_m, d_b) updates the value of the attribute b on the edge (n, m) to d′: Att′^G = Att^G[(b, n, m) ↦ d′].

• edge(v_n, v_m) checks whether (n, m) ∈ E^G.
• succs(v_n) and preds(v_n) return (iterators to) the sets of nodes {s | (n, s) ∈ E^G} and {p | (p, n) ∈ E^G}, respectively.

• nodes returns (an iterator to) the set of graph nodes V^G.

In addition, we require that the graph data structure be linearizable.²
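For the examples in this paper, one can picture the assumed interface roughly as follows; this is an illustrative C++ sketch (the type Graph and all member names are ours, not Elixir's actual generated API), with attributes specialized to the SSSP program of Fig. 2.

#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative graph structure for the SSSP examples: a "dist" node
// attribute, a "wt" edge attribute, and adjacency in both directions.
struct Graph {
    std::vector<std::vector<int>> out, in;        // successors / predecessors
    std::vector<int64_t> dist;                    // node attribute "dist"
    std::unordered_map<int64_t, int64_t> wt;      // edge attribute "wt"

    static int64_t key(int n, int m) { return ((int64_t)n << 32) | (uint32_t)m; }

    int64_t get_dist(int n) const { return dist[n]; }
    void set_dist(int n, int64_t d) { dist[n] = d; }
    int64_t get_wt(int n, int m) const { return wt.at(key(n, m)); }
    bool edge(int n, int m) const { return wt.count(key(n, m)) != 0; }
    const std::vector<int>& succs(int n) const { return out[n]; }
    const std::vector<int>& preds(int n) const { return in[n]; }
    int num_nodes() const { return (int)dist.size(); }
};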

4.1 Synthesizing Atomic Operator Application

Let op = [nt_1..k, et_1..m, bexp] → [nUpd_1..k, eUpd_1..m] be an operator consisting of the following elements, for i = 1..k and j = 1..m: (i) node attributes nt_i = nodes(node n_i, a_i v_i); (ii) edge attributes et_j = edges(src s_j, dst d_j, b_j w_j); (iii) a guard expression bexp = Gd^op; (iv) node updates nUpd_i = v_i ↦ nExp_i; and (v) edge updates eUpd_j = w_j ↦ eExp_j.

² In practice, our graph implementation is optimized for non-morphing actions. We rely on the locks acquired by the synthesized code to correctly synchronize concurrent accesses to graph attributes.


def apply[op](mu : record[n_1..k]) =
  static for i = 1..k { n_i := mu[n_i] }
  static for i = 1..k { lk_i := n_i }
  sort(lock_less, lk_1..k)
  static for i = 1..k { lock(lk_i) }
  static for i = 1..k { v_i := get(a_i, n_i) }
  static for j = 1..m { w_j := get(b_j, s_j, d_j) }
  if checkShape[op](mu) ∧ Code(bexp)
    static for i = 1..k { set(a_i, n_i, Code(nExp_i)) }
    static for j = 1..m { set(b_j, s_j, d_j, Code(eExp_j)) }
  static for i = 1..k { unlock(lk_i) }

(a) Code([[op]])

def checkShape[op](mu : record[n_1..k]) : bool =
  static for i = 1..k { n_i := mu[n_i] }
  // Now s_j and d_j correspond to µ(s_j), µ(d_j)
  static for i = 1..k
    static for j = i+1..k
      if n_i = n_j // Check if µ is one-to-one.
        return false
  static for j = 1..m
    if ¬edge(s_j, d_j) // Check for missing edges.
      return false
  return true

(b) Code((G, µ) |= R^op)

def checkGuard[op](mu : record[n_1..k]) : bool =
  static for i = 1..k { n_i := mu[n_i] }
  static for i = 1..k { v_i := get(a_i, n_i) }
  static for j = 1..m { w_j := get(b_j, s_j, d_j) }
  return Code(bexp)

(c) Code((G, µ) |= Gd^op)

Figure 5: Operator-related procedures.

We note that, in referring to pattern nodes, the names of variables n_i, s_j, d_j, etc. are not significant in themselves, but rather stand for different ways of indexing the actual set of variable names. For example, n_1 and s_2 may both stand for a variable ‘a’.

Fig. 5 shows the code we emit, as procedure definitions, for (a) evaluating an operator, (b) checking a shape constraint, and (c) checking a value constraint.

The procedure apply uses synchronization to ensure atomicity. The procedure first reads the nodes from the matching variable ‘mu’ into local variables. It then copies the variables to another set of variables used for locking. We assume a total order over all nodes, implemented by the procedure lock_less, which we use to ensure absence of deadlocks. The statement sort(lock_less, lk_1..k) sorts the lock variables, i.e., swaps their values as needed, using the sort procedure. Next, the procedure acquires the locks in ascending order (we use spin locks), thus avoiding deadlocks. Then, the procedure reads the node and edge attributes from the graph and evaluates the guard. If the guard holds, the update expressions are evaluated and used to update the attributes in the graph. Finally, the locks are released.

Since operators do not morph the graph, checkShape does not require any synchronization. The procedure checkGuard is synchronized using the same strategy as apply.

def apply[relaxEdge](mu : record[a, b]) =
  a := mu[a]; b := mu[b];
  lk_1 := a; lk_2 := b;
  if lock_less(lk_2, lk_1) // inline sort
    swap(lk_1, lk_2);
  lock(lk_1); lock(lk_2);
  ad := get(dist, a); bd := get(dist, b);
  w := get(wt, a, b);
  if ad + w < bd // test guard
    set(dist, b, ad + w);
  unlock(lk_1); unlock(lk_2);

Figure 6: Code([[relaxEdge]]).

Fig. 6 shows the code we emit for Code([[relaxEdge]]).
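For concreteness, here is a C++ rendering of what such a synthesized apply might look like, assuming the illustrative Graph sketch above and std::mutex in place of spin locks; this is our sketch, not Elixir's actual output.

#include <algorithm>
#include <cstdint>
#include <mutex>
#include <vector>

// One lock per node, sized to g.num_nodes() at startup (hypothetical).
std::vector<std::mutex> node_locks;

// Apply relaxEdge to matching (a, b). Locks are acquired in a global
// order (here: node id) to avoid deadlock; checkShape guarantees a != b.
bool apply_relax_edge(Graph& g, int a, int b) {
    int lk1 = a, lk2 = b;
    if (lk2 < lk1) std::swap(lk1, lk2);          // inline sort
    std::lock_guard<std::mutex> l1(node_locks[lk1]);
    std::lock_guard<std::mutex> l2(node_locks[lk2]);
    int64_t ad = g.get_dist(a), bd = g.get_dist(b);
    int64_t w = g.get_wt(a, b);
    if (ad + w < bd) {                           // test guard
        g.set_dist(b, ad + w);
        return true;                             // operator fired
    }
    return false;                                // guard was disabled
}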

4.2 Synthesizing EXPAND

Let R be a pattern and let v_1..m ⊆ V^R and v_m+1..k = V^R \ v_1..m be two complementary subsets of its nodes such that v_1..m induces a connected subgraph of R. We now develop a procedure that accepts a matching µ ∈ D → V^G, where D is any superset of v_1..m, and computes all matchings µ′ ∈ V^R → V^G such that µ(v_i) = µ′(v_i) for i = 1..m.

We can bind the variables v_m+1..k to graph nodes in different orders, but it is more efficient to choose an order that enables scanning the edges incident to nodes that are already bound. The alternative requires scanning the entire set of graph nodes for each unbound pattern node and checking whether it is a neighbor of some bound node, which is too inefficient. We represent an efficient order by a permutation u_m+1..k of v_m+1..k and by an auxiliary sequence T(R, v_m+1..k) = (u_m+1, w_m+1, dir_m+1), ..., (u_k, w_k, dir_k), where each tuple defines the connection between an unbound node u_j and a previously bound node w_j and the direction of the edge between them: forward for false and reverse otherwise. More formally, for every j = m+1..k we have that w_j ∈ v_1..j and if dir_j = false then (u_j, w_j) ∈ E^R, and otherwise (w_j, u_j) ∈ E^R.

The procedure expand, shown in Fig. 7, first updates µ′ on v_1..m and then uses T(R, v_m+1..k) to bind the nodes v_m+1..k. Each node is bound to all possible values by a loop using the procedure expandEdge, which handles one tuple (u_j, w_j, dir_j). The loops are nested to enumerate all combinations of bindings.

We note that a matching computed by this enumeration does not necessarily satisfy the shape constraints of R, since some pattern nodes may be bound to the same graph node, and not all edges of R may be present between the corresponding pairs of bound nodes. Matchings that do not satisfy the shape constraint or the guard can be filtered out by testing each matching with checkShape and checkGuard, respectively.

We use expand to define Code(RDX[[op]](G,µ)) in Fig. 8.



def expand[op, v1..m, T : record[nm+1..k]](mu : record[n1..k],
                                           f : record[n1..k] ⇒ unit) =
  mu′ := record[n1..k]               // expanded matching
  static for i = 1..m { mu′[vi] := mu[vi] }
  expandEdge[m+1,
    expandEdge[m+2,
      ...
        expandEdge[k, f(mu′)] ... ]

// Inner function
def expandEdge[i, code] =
  [si, di, diri] := T[i]
  if diri = true
    for s ∈ succs(mu[vs])
      mu′[vi] := s
      code                           // inline code
  else                               // dir = in
    for p ∈ preds(mu[vd])
      mu′[vi] := p
      code                           // inline code

Figure 7: Code for computing EXPAND[[op, v1..m]](G, µ) and applying a function f to each matching.

def redexes[op](f : record[n1..k] ⇒ unit) =
  for v ∈ nodes
    mu := record[n1]
    mu[n1] := v
    expand[op, n1, T(R, n2..k)](mu, f′)

def f′(mu : record[n1..k]) =
  if checkShape[op](mu) ∧ checkGuard[op](mu)
    f(mu)

Figure 8: Code for computing RDX[[op]](G, µ) and applying a function f to each redex.
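For intuition, the following is a self-contained C++ sketch of what redexes and expand amount to when specialized to a single-edge pattern such as relaxEdge with unit weights; the Graph type and the inlined checks are our illustration, not Elixir's generated code.

#include <functional>
#include <vector>

// Illustrative graph: adjacency lists plus a per-node dist attribute.
struct Graph {
  std::vector<std::vector<int>> succ;  // succ[v] = successors of v
  std::vector<int> dist;               // dist attribute of each node
};

void redexes(const Graph& g, const std::function<void(int, int)>& f) {
  for (int a = 0; a < (int)g.succ.size(); ++a)    // bind pattern node a
    for (int b : g.succ[a]) {                     // expand along one edge
      bool shape_ok = (a != b);                   // one-to-one condition
      bool guard_ok = g.dist[a] + 1 < g.dist[b];  // relaxEdge guard, w = 1
      if (shape_ok && guard_ok) f(a, b);          // (a, b) is a redex
    }
}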

4.3 Synthesizing DELTA via Automatic Reasoning

We now explain how to automatically obtain an overapproximation of DELTA[[op, op′]](G, µ) for any two operators op = [R, Gd] → [Upd] and op′ = [R′, Gd′] → [Upd′] and matching µ, and how to emit the corresponding code.

The definition of DELTA given in Sec. 3 is global in the sense that it requires searching for redexes in the entire graph, which is too inefficient. We observe that we can redefine DELTA by localizing it to a subgraph affected by the application of the operator, as we explain next.

For the rest of this subsection, we associate matchings with their corresponding patterns using the notational convention µR.

Let µR and µR′ be two matchings corresponding to the operators above. We say that µR and µR′ overlap if the matched subgraphs intersect: µR(V^R) ∩ µR′(V^R′) ≠ ∅. Then, the following equality holds:

DELTA[[op, op′]](G, µR) =
    let G′ = [[op]](G, µR)
    in { µR′ | µR′ overlaps µR,
         (G, µR′) ⊭ Rop′, Gdop′,
         (G′, µR′) ⊨ Rop′, Gdop′ } .

We note that any overapproximation of DELTA can be used to correctly compute the operational semantics of an iterate statement; however, tighter approximations reduce useless work. We proceed by developing an overapproximation of the local definition of DELTA.

Given a matching µR, the set of overlapping matchings µR′ can be classified into statically-defined equivalence classes, as follows. If µR′ overlaps µR, then the overlap between µR(V^R) and µR′(V^R′) induces a partial function ρ : V^R′ ⇀ V^R defined by ρ(x) = y if µR′(x) = µR(y). We call ρ the influence function of R and R′, and we denote the domain of ρ by ρdom. Two matchings µ¹R′ and µ²R′ are equivalent if they induce the same influence function ρ. We denote the equivalence class of an influence function ρ by [ρ], and we can compute it by

[ρ] = EXPAND[[op′, ρdom]](G, µR) .

Let infs(op, op′) = ρ1..k denote the influence functions for the redex patterns Rop and Rop′. We define the function shift : Match × (V^R′ ⇀ V^R) → Match, which accepts a matching µR and an influence function ρ and returns the part of a matching µR′ identified by ρ:

shift(µR, ρ) ≝ {(x, v) | x ∈ ρdom, v = µR(ρ(x))} .
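As a concrete instance (our illustration, using the relaxEdge patterns discussed below): take the influence function ρ that identifies the source node c of the new pattern with the destination node b of the applied one, i.e., ρ = {c ↦ b} and ρdom = {c}. Then

shift(µR, ρ) = {(c, µR(b))} ,

a partial matching that binds only c; EXPAND[[op′, {c}]] completes it by enumerating the successor edges of µR(b).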

The first overapproximation we obtain is

DELTA1[[op, op′]](G, µR) ≝ ⋃_{ρ ∈ infs(op,op′)} EXPAND[[op′, ρdom]](G, shift(µR, ρ)) .

An obvious way to obtain a still tighter approximation is to filter out matchings that do not satisfy the shape and value constraints of op′.

We say that an influence function ρ is useless if for all graphs G and all matchings µR′ the following holds: for G′ = [[op]](G, µR), either (G, µR′) ⊨ Rop′, Gdop′, meaning that an active element elem⟨op′, µR′⟩ has already been scheduled, or (G′, µR′) ⊭ Rop′, Gdop′, meaning that the application of op to (G, µR) does not modify the graph in a way that makes µR′ a redex in G′. Otherwise, we say that ρ is useful. We denote the set of useful influence functions by useInfs(op, op′). We can obtain a tighter approximation, DELTA2[[op, op′]](G, µR), via the useful influence functions:

DELTA2[[op, op′]](G, µR) ≝ ⋃_{ρ ∈ useInfs(op,op′)} EXPAND[[op′, ρdom]](G, shift(µR, ρ)) .

We use automated reasoning to find the set of useful influence functions.

Influence Patterns. For every influence function ρ, we define an influence pattern, constructed as follows.

1. Start with the redex pattern Rop and a copy R′ of Rop′ in which all variables have been renamed to fresh names.



2. Identify the nodes of R′ with those of R according to ρ, rename the node attribute variables in R′ to the variables used in the corresponding nodes of R, and similarly rename the edge attribute variables of identified edges.

Example 4.1 (Influence Patterns). Fig. 9 shows the six influence patterns for the operator relaxEdge (for now, ignore the text below the graphs). Here Rop consists of the nodes a and b (and the connecting edge), and R′ consists of the nodes c and d (and the connecting edge). We display identified nodes by showing both names.

Intuitively, the patterns determine that candidate redexes are of one of the following types: a successor edge of b, a successor edge of a, a predecessor edge of a, a predecessor edge of b, an edge from b to a, or the edge from a to b itself.

Query Programs. To detect useless influence functions, we generate a straight-line program over the variables of the corresponding influence pattern, of the following form:

assume(Guard)
assume !(Guard')   // comment out if identity pattern
update(C)
assert !(Guard')

Intuitively, the program constructs the following verification condition: if (i) the guard of R, Guard, holds (first assume); (ii) the guard of R′, Guard', does not hold (second assume); and (iii) the updates assign new values; then (iv) the guard of R′ does not hold for the updated values. Proving the verification condition means that the corresponding influence function is useless.

The case of op = op′ with the identity influence function is special. The compiler needs to check whether the guard is strong, and otherwise emit an error message. This is done by constructing a query program in which the second assume statement is removed.

We pass these programs to a program verification tool (we use Boogie [6] and Z3 [12]), asking it to prove the last assertion. This amounts to checking the satisfiability of a propositional formula (a single conjunction, in fact) over the theories corresponding to the attribute types in our language: integer arithmetic and set theory. When the verifier proves the condition, we remove the corresponding influence function. If the verifier is unable to prove the condition, or a timeout is reached, we conservatively consider the function useful.

Example 4.2 (Query Programs). Fig. 9 shows the query programs generated by the compiler for each influence pattern. Out of the six influence patterns, the verifier is able to rule out all except (a) and (e), which together represent the edges outgoing from the destination node, with the special case where an outgoing edge links back to the source node. The verifier is also able to prove that the guard is strong for (f). This results in the tightest approximation of DELTA.
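As a worked instance (our gloss, not the paper's text), consider the query program for pattern (b) in Fig. 9. Its verification condition is

(ad + w < bd) ∧ ¬(ad + w2 < dd) ⟹ ¬(ad + w2 < dd) ,

which is a propositional tautology: the update assigns only new_bd, and the asserted guard mentions neither bd nor new_bd, so the second assumption carries over unchanged. This is why the verifier can discharge pattern (b).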

def delta2[op, op′, ρ1..m](mu : record[n1..k],
                           f : record[n1..k] ⇒ unit) =
  static for i = 1..m
    // mu′ = shift(mu, ρi)
    mu′ = record[n1..k]
    for j = 1..k
      mu′[inv_rho[j]] = mu[j]
    Code(EXPAND[[op′, ρi dom]](G, mu′))

def f′(mu : record[n1..k]) =
  if checkShape[op′](mu) ∧ checkGuard[op′](mu)
    f(mu)

Figure 10: Code for computing DELTA2[[op, op′]](G, µ) and applying a function f to each matching.

We note that if the user specifies positive edge weights (weight : unsigned int), then case (e) is discovered to be spurious.

Fig. 10 shows the code we emit for DELTA2. We represent influence functions by appropriate records and supply an inverse function inv_rho for every influence function rho.

4.3.1 Optimizations

Our compiler applies a few useful optimizations, which are not shown in the procedures above.

Reducing Overheads in checkShape. The procedure expand uses the auxiliary data structure T to compute potential matchings. In doing so, it already checks a portion of the shape constraint, namely the edges of the redex pattern that are included in T. The compiler therefore omits checking these edges. Often, T includes all of the pattern edges; in that case we specialize checkShape to check only the one-to-one condition.

Reusing Potential Matchings. In cases where two operators have the same shape, expand computes the matchings once and reuses them for both.

4.4 Synthesizing Unordered Statements

We implement the operational semantics defined in Sec. 3.3.3 by utilizing the Galois system runtime, which enables one to: (i) automatically construct a concurrent worklist from a dynamic scheduling expression, and (ii) process the elements in the worklist in parallel with a given function. We use the latter capability by passing the code we synthesize for operator application, followed by the code for DELTA2[[op, op′]](G, µ), which inserts the elements it finds into the worklist for further processing.
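Schematically (our sketch; WorkList stands in for the concurrent worklists obtained from the Galois runtime and is not the actual Galois API), the synthesized loop body for an edge-pattern operator looks like this:

#include <utility>

struct Graph;                                        // assumed graph type
bool apply_op(Graph&, int a, int b);                 // atomic guarded update
template <typename F>
void for_each_influenced(Graph&, int a, int b, F f); // DELTA2 enumeration

template <typename WorkList>
void run_unordered(Graph& g, WorkList& wl) {
  std::pair<int, int> m;                 // a matching for an edge pattern
  while (wl.pop(m)) {                    // threads pop items in parallel
    auto [a, b] = m;
    if (!apply_op(g, a, b)) continue;    // guard may have become false
    // Push only the matchings that DELTA2 identifies as possibly enabled.
    for_each_influenced(g, a, b, [&](int c, int d) { wl.push({c, d}); });
  }
}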

4.5 Synthesizing Ordered Statements

Certain algorithms, such as [5, 19], have additional properties that enable optimizations over the baseline ordered parallelization scheme discussed in Sec. 3.4. For example, in the case of breadth-first search (BFS), one can show that when processing work at priority level i, all new work is at priority level i + 1.



[Pattern drawings omitted. Each pattern overlays the applied pattern (nodes a and b, edge weight w) with the new pattern (nodes c and d, edge weight w2); identified nodes show both names: (a) c = b; (b) c = a; (c) d = a; (d) d = b; (e) c = b and d = a; (f) c = a and d = b. Nodes carry the dist attributes ad, bd, cd, dd.]

(a) assume(ad + w < bd)
    assume !(bd + w2 < dd)
    new_bd = ad + w
    assert !(new_bd + w2 < dd)

(b) assume(ad + w < bd)
    assume !(ad + w2 < dd)
    new_bd = ad + w
    assert !(ad + w2 < dd)

(c) assume(ad + w < bd)
    assume !(cd + w2 < ad)
    new_bd = ad + w
    assert !(cd + w2 < ad)

(d) assume(ad + w < bd)
    assume !(cd + w2 < bd)
    new_bd = ad + w
    assert !(cd + w2 < new_bd)

(e) assume(ad + w < bd)
    assume !(bd + w2 < ad)
    new_bd = ad + w
    assert !(new_bd + w2 < ad)

(f) // check whether guard is strong
    assume(ad + w < bd)
    new_bd = ad + w
    assert !(ad + w < new_bd)

Figure 9: Influence patterns and corresponding query programs for relaxEdge. (b), (c), (d), and (f) are spurious patterns.

This allows us to optimize the implementation to contain only two buckets: Bc, which holds work items at the current priority level, and Bn, which holds work items at the next priority level. Hence, we can avoid the overheads associated with the generic scheme, which supports an unbounded number of buckets. Additionally, since Bc is effectively read-only while operating on work at level i, we can exploit this to synthesize efficient load-balancing schemes when distributing the work contained in Bc to the worker threads. Currently, Elixir uses these two insights to synthesize specialized dynamic schedulers (using utilities from the OpenMP library) for problems such as breadth-first search.
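A sequential C++ sketch of this two-bucket scheme follows (our illustration; in the synthesized code, the per-level loop is what runs in parallel):

#include <vector>

// Work items are node ids here; process(item, Bn) applies the operator
// and pushes any newly enabled work, all of which lands at the next level.
template <typename ProcessFn>
void run_leveled(std::vector<int> initial, ProcessFn process) {
  std::vector<int> Bc = std::move(initial);  // current-level bucket
  std::vector<int> Bn;                       // next-level bucket
  while (!Bc.empty()) {
    for (int item : Bc)                      // Bc is read-only here
      process(item, Bn);
    Bc.swap(Bn);                             // advance one priority level
    Bn.clear();
  }
}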

4.5.1 Automating the Optimizations

We now discuss how we use automated reasoning to enable the above optimizations. What we must show is that if the priority of the active elements at the current level has some arbitrary value k, then all new active elements have the same priority k + s, where s ≥ 1. We heuristically guess values of s by taking all constant numeric values appearing in the program, s = C1, ..., Cn. We illustrate this process through the BFS example. In the case of BFS, the worklist delta consists only of a case similar to that of Fig. 9(a), with all weights equal to one. The query program we construct is shown in Fig. 11. The program checks that the difference between the priority of the shape resulting from the operator application and that of the shape prior to the application is an existentially quantified positive constant. Additionally, we must guarantee that when we initialize the worklist, all work is at the same priority level. Our compiler emits a simple check on the priority value of each item inserted into the worklist during initialization to guarantee that this condition is satisfied.

assume(ad == k)
assume(s == C_i)
assume(ad + 1 < bd)
new_bd = ad + 1
assume(cd == new_bd)
assume(cd + 1 < dd)
assert(ad == k & new_bd == k + s)

Figure 11: Query program to enable the leveled worklist optimization. C_i stands for a heuristically guessed value of s.

Dimension            Value Range
Worklist (WL)        {CF, CL, OBM, BS, LGEN, LOMP}
Group (GR)           {a, b, NONE}
Unroll Factor (UF)   {0, 1, 2, 10, 20, 30}
VC Check (VC)        {ALL, NONE, LOCAL}
SC Check (SC)        {ALL, NONE}

Table 1: Dimensions explored by our synthesized algorithms.

5. Design Space Exploration

Elixir makes it possible to automatically generate a large number of program variants for solving an irregular problem like SSSP, and to evaluate which one performs best on a given input and architecture. Tab. 1 shows the dimensions of the design space supported in Elixir and, for each dimension, the range of values explored in our evaluations.

Worklist Policy (WL): The dynamic scheduler is implemented by a worklist data structure. To implement the LIFO, FIFO, and approximate metric policies, Elixir uses worklists from the Galois system [20, 23]. These worklists can be composed to provide more complex policies.



To reduce overhead, they manipulate chunks of work items. We refer to the chunked versions of FIFO/LIFO as CF/LF, and to the (approximate) metric-ordered worklist composed with a CL as OBM. We implemented a worklist (LGEN) to support general, level-by-level execution (metric policy). For some programs, Elixir can prove that only two levels are active at any time; in these cases, it can synthesize an optimized, application-specific scheduler using OpenMP primitives (LOMP). Alternatively, it can use a bulk-synchronous worklist (BS) provided by the Galois library.

Grouping: In the case of the SSSP relaxEdge operator, we can group on either a or b, creating a "push-based" or a "pull-based" version of the algorithm, respectively. Additionally, Elixir uses grouping to determine the type of worklist items. For example, worklist items for SSSP can be edges (a, b), but if the group b directive is used, it is more economical to use node a as the worklist item. In our benchmarks, we consider using either edges or nodes as worklist items, since this is the choice made in all practical implementations.

Unroll Factor: Unrolling produces a composite operator that explores the subgraph in depth-first order.

Shape/Value constraint checks (VC/SC): We consider the following class of heuristics to optimize worklist manipulation. After the execution of an operator op, the algorithm may need to insert into the worklist a number of matchings µ, which constitute the delta of op. Before inserting each such µ, we can check whether the shape constraint (SC) and/or the value constraint (VC) is satisfied by µ and, if it is not, avoid inserting it, thus reducing overhead. Eliding such checks at this point is always safe, with the potential cost of populating the worklist with useless work.

In practice, there are many more choices, such as the order in which constraints are checked and whether they are checked completely or partially. In certain cases, eliding check ci may be more efficient, since performing ci may require holding locks longer. Elixir allows the user to specify which SC/VC checks should be performed, and provides three useful default policies: ALL for doing all checks, NONE for doing no checks, and LOCAL for doing only the checks that can be performed using graph elements already accessed by the currently executing operator. The last is especially useful in the context of parallel execution. In cases where both VC and SC are applied, we always check them in the order SC, VC.

6. Empirical Evaluation

To evaluate the effectiveness of Elixir, we perform studies on three problems: single-source shortest path (SSSP), breadth-first search (BFS), and betweenness centrality (BC). We use Elixir to automatically enumerate and synthesize a number of program variants for each problem, and compare the performance of these programs to that of existing hand-tuned implementations. In the SSSP comparison, we use a hand-parallelized code from the Lonestar benchmark suite [18]. In the BFS comparison, we use a hand-parallelized code from Leiserson and Schardl [19], and for BC, we use a hand-parallelized code from Bader and Madduri [5]. In all cases, our synthesized solutions perform competitively, and in some cases they outperform the hand-optimized implementations. More importantly, these solutions were produced through a simple enumeration-based exploration of the design space, and do not rely on expert knowledge on the user's part to guide the search.

Elixir produces both serial and parallel C++ implementations. It uses graph data structures from the Galois library, but all synchronization is done by the code generated by Elixir, so the synchronization code built into Galois graphs is not used. The Galois graph classes use a standard graph API, and it is straightforward to substitute a different graph class if desired. Implementations of standard collections such as sets and vectors are taken from the C++ standard library. In our experiments, we use the following classes of input graphs:

Road networks: These are real-world road networks of the USA from the DIMACS shortest paths challenge [1]. We use the full USA network (USA-net) with 24M nodes and 58M edges, the Western USA network (USA-W) with 6M nodes and 15M edges, and the Florida network (FLA) with 1M nodes and 2.7M edges.

Scale-free graphs: These are scale-free graphs that were generated using the tools provided by the SSCA v2.2 benchmark [3]. The generator is based on the Recursive MATrix (R-MAT) scale-free graph generation algorithm [10]. The size of the graphs is controlled by a SCALE parameter; a graph contains N = 2^SCALE nodes and M = 8 × N edges, with each edge having a strictly positive integer weight with maximum value C = 2^SCALE. For our experiments, we removed multi-edges from the generated graphs. We denote a graph with SCALE = X as rmatX.

Random graphs: These graphs contain N = 2^k nodes and M = 4 × N edges. N − 1 edges connect the nodes in a circle to guarantee the existence of a connected component; all other edges are chosen randomly, following a uniform distribution, to connect pairs of nodes. We denote a graph with k = X as randX.

We ran our experiments on an Intel Xeon machine running Ubuntu Linux 10.04.1 LTS 64-bit. It contains four 6-core 2.00 GHz Intel Xeon E7540 (Nehalem) processors. The CPUs share 128 GB of main memory. Each core has a 32 KB L1 cache and a unified 256 KB L2 cache. Each processor has an 18 MB L3 cache that is shared among its cores. For SSSP and BC, the compiler used was GCC 4.4.3; for BFS, it was Intel C++ 12.1.0. All reported running times are the minimum of five runs. The chunk sizes in all our experiments are fixed at 1024 for CF and 16 for CL.



Dimension       Value Ranges
Group           {a, b, NONE}
Worklist        {CF, OBM, LGEN}
Unroll Factor   {0, 1, 2, 10, 20, 30}
VC check        {ALL, NONE}
SC check        {ALL, NONE}

Table 2: Dimensions explored by our synthetic SSSP variants.

Variant   GR   WL     UF   VC   SC   fPr
v50       b    OBM    2    ✓    ✓    ad/∆
v62       b    OBM    2    ×    ✓    ad/∆
v63       b    OBM    10   ×    ✓    ad/∆
dsv7      b    LGEN   0    ✓    ✓    ad/∆

Table 3: Chosen values and priority functions (fPr) for the best-performing SSSP variants (✓ denotes ALL, × denotes NONE).

One aspect of our implementation that we have not yet optimized is the initialization of the worklist before the execution of a parallel loop. Our current implementation simply iterates over the graph, checks the operator guards, and populates the worklist whenever a guard is satisfied. In most algorithms, the optimal worklist initialization is much simpler. For example, in SSSP we need only initialize the worklist with the source node (when nodes are the worklist items). A straightforward way to synthesize this code is to ask the user for a predicate that characterizes the state before each parallel loop. For SSSP, this predicate would assert that the distance of the source is zero and the distance of every other node is infinity. With this assertion, we can use our delta inference infrastructure to synthesize the optimal worklist initialization code. This feature is not currently implemented, so the running times that we report (both for our programs and for the programs we compare against) exclude this part and include only the parallel loop execution time.
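For concreteness, here is a sketch of that unoptimized initialization scan for relaxEdge with unit weights and nodes as work items (the types and the inlined guard are our illustration):

#include <vector>

struct Graph {
  std::vector<std::vector<int>> succ;  // succ[v] = successors of v
  std::vector<int> dist;               // dist attribute of each node
};

template <typename WorkList>
void init_worklist(const Graph& g, WorkList& wl) {
  for (int a = 0; a < (int)g.succ.size(); ++a)
    for (int b : g.succ[a])
      if (g.dist[a] + 1 < g.dist[b])   // guard holds: (a, b) is a redex
        wl.push(a);                    // node as the work item (group b)
  // Given the user-supplied precondition (dist(source) = 0, all other
  // distances infinite), this whole scan reduces to pushing the source.
}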

6.1 Single-Source Shortest Path

We synthesize both ordered and unordered versions of SSSP. In Tab. 2, we present the range of explored values in each dimension for the synthetic SSSP variants. In Tab. 3, we present the combinations that lead to the three best-performing asynchronous SSSP variants (v50, v62, v63) and the best-performing delta-stepping variant (dsv7). In Fig. 12a and Fig. 12b we compare their running times with that of an asynchronous, hand-optimized Lonestar implementation on the FLA and USA-W road networks. We observe that in both cases the synthesized versions outperform the hand-tuned implementation, with the leveled version also delivering competitive performance.

All algorithms are parallelized using the Galois infrastructure; they use the same worklist configuration, with ∆ = 16384, and the same graph data structure implementation. The value of ∆ was chosen through enumeration and gives the best performance for all variants. The Lonestar version is a hand-tuned, lock-free implementation, loosely based on the classic delta-stepping formulation [22]. It maintains a worklist of pairs [v, dv*], where v is a node and dv* is an approximation of the shortest-path distance of v (following the original delta-stepping implementation). The Lonestar version does not implement any of our static scheduling transformations. All synthetic variants perform fine-grained locking to guarantee atomic execution of operators, checking of the VC, and evaluation of the priority function. For the synthetic delta-stepping variant dsv7, Elixir uses LGEN, since new work after the application of an operator can be distributed over various (lower) priority levels. An operator in dsv7 works over a source node a and its incident edges (a, b), which belong to the same priority level.

In Fig. 12c we present the runtime distribution of all synthetic SSSP variants on the FLA network. We summarize here a couple of interesting observations from studying the runtime distributions in more detail. Examining the ten variants with the worst running times, we observed that they all use a CF (chunked FIFO) worklist policy and operate either on a single edge or on the immediate neighbors of a node (through grouping), whereas the ten best-performing variants all use OBM. This is not surprising: with OBM there are fewer updates to node distances, and the algorithm converges faster. To get the best performance, though, we must combine OBM with the static scheduling transformations. Interestingly, combining CF with grouping and aggressive unrolling (by a factor of 20) produces a variant that performs only two to three times worse than the best-performing variant on both input graphs.

6.2 Breadth-First Search

We experiment with both ordered and unordered versions of BFS. In Tab. 4 and Tab. 5, we present the range of explored values for the synthetic BFS variants and the combinations that give the best performance, respectively. In Fig. 13, we present a runtime comparison between the three best-performing BFS variants (both asynchronous and leveled) and two highly optimized, handwritten, lock-free parallel BFS implementations. The first handwritten implementation is from the Lonestar benchmark suite and is parallelized using the Galois system. The second is an implementation from Leiserson and Schardl [19], parallelized using Cilk++. We experiment with three different graph types. For the rmat20 and rand23 graphs, the synthetic variants perform competitively with the other algorithms. For the USA-net graph, they outperform the handwritten implementations at high thread counts (20 and 24 threads).

To understand these results, consider the structure of the input graphs and the nature of the algorithms. Leveled BFS algorithms try to balance exposing parallelism against work-efficiency by working on one level at a time. If the amount of available work per level is small, they do not exploit the available parallel resources effectively.



Figure 12: Runtime comparison of SSSP algorithms on the FLA and USA-W inputs, and runtime distribution of the synthetic variants on the FLA input. [Plots omitted: (a) FLA runtimes and (b) USA-W runtimes, time (ms) vs. threads (1–24), for Lonestar, v50, v62, v63, and dsv7; (c) FLA runtime distribution across all synthetic variants.]

Asynchronous BFS algorithms try to overcome this problem by being more optimistic: to expose more parallelism, they speculatively work across levels. By appropriately picking the priority function and engineering the algorithm efficiently, the goal is to reduce the mis-speculation introduced by eagerly working on multiple levels. Focusing on graph structure, we observe that scale-free graphs exhibit the small-world phenomenon: most nodes are not neighbors of one another, but most nodes can be reached from every other node by a small number of "hops". This means that the diameter of the graph, and hence the number of levels, is small (12 for rmat20). The random graphs that we consider also have a small diameter (17 for rand23). The road networks, on the other hand, naturally have a much larger diameter (6261 for USA-net). The smaller the diameter of the graph, the larger the number of nodes per level, and therefore the larger the amount of available work to be processed in parallel. Our experimental results support these intuitions. For low-diameter graphs, the best-performing synthetic variants are mostly leveled algorithms (v17, v18, v19). For USA-net, which has a large diameter, the per-level parallelism is small, which makes the synthetic asynchronous algorithms (v11, v12, v14) more efficient; in fact, at higher thread counts (above 20) they marginally outperform even the highly tuned handwritten implementations. For all three variants we use ∆ = 8. This effectively merges a small number of levels together and allows a small amount of speculation, letting the algorithms mine more parallelism. Notice that, as with SSSP, all three asynchronous variants combine some static scheduling (a small unroll factor plus grouping) with a good dynamic scheduling policy to achieve the best performance.

The main take-away from these experiments is that no one algorithm is best suited for all inputs, especially in the domain of irregular graph algorithms.

Dimension       Value Ranges
Group           {b, NONE}
Worklist        {OBM, LOMP, BS}
Unroll Factor   {0, 1, 2}
VC check        {ALL, NONE}
SC check        {ALL, NONE}

Table 4: Dimensions explored by our synthetic BFS variants.

This validates our original assertion that a single solution for an irregular problem may not be adequate, so it is desirable to have a system that can synthesize competitive solutions tailored to the characteristics of a particular input.

For level-by-level algorithms, there is also a spectrum of interesting choices for the worklist implementation. Elixir can deduce that BFS under the metric ad scheduling policy can have only two simultaneously active priority levels, as discussed in Sec. 4.5. Therefore, it can use a customized worklist in which a bucket Bk holds work for the current level and a bucket Bk+1 holds work for the next. Hence, we can avoid the overheads associated with LGEN, which supports an unbounded number of buckets. BS is a worklist that exploits this insight. Additionally, since no new work is added to Bk while working on level k, threads can scan the bucket in read-only mode, further reducing overheads. Elixir exploits both insights by synthesizing a custom worklist, LOMP, using OpenMP primitives. LOMP is parameterized by an OpenMP scheduling directive to explore load-balancing policies for the threads querying Bk (in our experiments we used the STATIC policy).
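For illustration, an LOMP-style level loop might look as follows in C++ with OpenMP; relax_out_edges and the per-thread next buckets are our assumptions, not Elixir's actual generated scheduler. The caller is assumed to size next_per_thread to the thread count and merge the per-thread buckets at the level boundary.

#include <omp.h>
#include <vector>

struct Graph;  // assumed graph type
// Hypothetical helper: applies the operator to node's outgoing edges and
// pushes newly enabled work into the supplied next-level bucket.
void relax_out_edges(Graph&, int node, std::vector<int>& next);

void process_level(Graph& g, const std::vector<int>& Bc,
                   std::vector<std::vector<int>>& next_per_thread) {
  // Bc is read-only at this level, so a static partition balances the
  // load without any synchronization on the worklist itself.
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < (long)Bc.size(); ++i)
    relax_out_edges(g, Bc[i], next_per_thread[omp_get_thread_num()]);
}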

6.3 Betweenness Centrality

The betweenness centrality (BC) of a node is a metric that captures the importance of individual nodes in the overall network structure. Informally, it is defined as follows.



Figure 13: Runtime comparison of BFS algorithms. [Plots omitted: time (ms) vs. threads (1–24) on (a) rmat20, comparing Lonestar, v16, v17, v19, and cilk; (b) rand23, comparing Lonestar, v17, v18, v19, and cilk; (c) USA-net, comparing Lonestar, v11, v12, v14, and cilk.]

Variant   GR   WL     UF   VC   SC   fPr
v11       b    OBM    1    ✓    ✓    ad/∆
v12       b    OBM    2    ✓    ✓    ad/∆
v14       b    OBM    1    ✓    ×    ad/∆
v16       b    OBM    0    ✓    ×    ad/∆
v17       b    BS     0    ✓    ✓    ad
v18       b    LOMP   0    ✓    ✓    ad
v19       b    BS     0    ✓    ×    ad

Table 5: Chosen values and priority functions for BFS variants. We chose ∆ = 8. (✓ denotes ALL, × denotes NONE.)

Let G = (V, E) be a graph and let s, t be a fixed pair of graph nodes. The betweenness score of a node u is the percentage of shortest paths between s and t that include u. The betweenness centrality of u is the sum of its betweenness scores over all possible pairs s, t in the graph. The best-known algorithm for computing BC is Brandes' algorithm [7]. In short, Brandes' algorithm considers each node s in the graph as a source node and computes the contribution of s to the betweenness value of every other node u as follows. In a first phase, it starts from s and explores the graph forward, building a DAG with all the shortest-path predecessors of each node. In a second phase, it traverses the graph backwards and computes the contribution to the betweenness of each node. These two steps are performed for all possible sources s. For space efficiency, practical approaches to parallelizing BC (e.g., [5]) focus on processing a single source node s at a time and parallelize the above two phases for each such s. Additionally, since it is computationally expensive to consider all graph nodes as possible sources, they consider only a subset of source nodes (in practice this provides a good approximation of betweenness values for real-world graphs [4]).
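For reference, the backward phase implements Brandes' dependency accumulation [7]: writing σs(u) for the number of shortest paths from s to u and preds(w) for the shortest-path predecessors of w, the contribution of source s to each node v is

δs(v) = Σ_{w : v ∈ preds(w)} (σs(v) / σs(w)) · (1 + δs(w)),   BC(v) = Σ_{s ≠ v} δs(v) .

This corresponds to the update performed by the updBC rule in the Elixir program of Fig. 15 (Appendix A).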

In Tab. 6 and Tab. 7, we present the range of explored values for the synthetic BC variants and the combinations that give the best performance, respectively.

Dimension       Forward Phase Ranges   Backward Phase Ranges
Group           {a, b, NONE}           {a}
Worklist        {LOMP, BS}             {CF}
Unroll Factor   {0}                    {0}
VC check        {ALL, NONE}            {LOCAL}
SC check        {ALL, NONE}            {ALL, NONE}

Table 6: Dimensions explored by the forward and backward phases in our synthetic BC variants.

Variant   GR     WL     UF   VC      SC      fPr
v1        NONE   BS     0    (✓,L)   (✓,✓)   ad
v14       b      LOMP   0    (✓,L)   (✓,✓)   ad
v15       b      BS     0    (✓,L)   (×,×)   ad
v16       b      LOMP   0    (✓,L)   (×,×)   ad
v24       b      LOMP   0    (×,L)   (×,×)   ad

Table 7: Chosen values and priority functions for BC variants (✓ denotes ALL, × denotes NONE, L denotes LOCAL). For the backward phase, most parameters have a fixed range of values (see Tab. 6). In the VC and SC columns, a pair (F, B) denotes that F is used in the forward phase and B in the backward phase. fPr is the priority function of the forward phase.

We synthesized solutions that perform a leveled parallelization of the forward phase and an asynchronous parallelization of the backward phase. In Fig. 14 we present a runtime comparison between the three best-performing BC variants and a handwritten, OpenMP-parallel BC implementation by Bader and Madduri [5], which is publicly available in the SSCA benchmark suite [3]. All algorithms perform the computation outlined above for the same five source nodes in the graph, i.e., they execute the forward and backward phases five times. The reported running times are the sum of the individual running times of all parallel loops.



We observe that on the USA-W road network our synthesized versions outperform the handwritten code, while on the rmat20 graph the handwritten implementation outperforms our synthesized versions. We believe this is mainly due to the following. During the forward phase, both the handwritten and the synthesized versions build a shortest-path DAG by recording, for each node u, the set p(u) of shortest-path predecessors of u; p(u) therefore contains a subset of the immediate neighbors of u. In the second phase of the algorithm, the handwritten version walks the DAG backward to update the value of each node: for each node u, it iterates over the contents of p(u) and updates each w ∈ p(u) appropriately. Our synthetic codes instead examine all incoming edges of u and use p(u) to dynamically identify the appropriate subset of neighbors, pruning out all other in-neighbors. In rmat graphs, we expect the in-degree of authority nodes to be large, while in the road networks the maximum in-degree is much smaller; we therefore expect our iteration pattern to be a bottleneck on the first class of graphs. A straightforward way to handle this problem is to add support in our language for multiple edge types in the graph. With explicit predecessor edges in the graph, instead of treating p(u) as yet another attribute of u, our delta inference algorithm would be able to infer the more optimized iteration pattern. We plan to add this support in future extensions of our work.
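The two iteration patterns at issue can be sketched in C++ as follows (illustrative types only; neither is the actual code of [5] or of Elixir):

#include <set>
#include <vector>

struct NodeData {
  std::set<int> preds;         // p(u): shortest-path predecessors of u
  std::vector<int> in_edges;   // all in-neighbors of u in the graph
};

// Hand-written style: walk the stored predecessor set directly.
template <typename UpdateFn>
void backward_handwritten(const NodeData& u, UpdateFn update) {
  for (int w : u.preds) update(w);       // touches only p(u)
}

// Synthesized style: scan every in-edge and filter by membership in
// p(u); the cost grows with the in-degree of u.
template <typename UpdateFn>
void backward_synthesized(const NodeData& u, UpdateFn update) {
  for (int w : u.in_edges)
    if (u.preds.count(w)) update(w);     // prune non-predecessors
}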

7. Related Work

We discuss related work in program synthesis, term and graph rewriting, and finite differencing.

Synthesis Systems: The SPIRAL system uses recursive mathematical formulas to generate divide-and-conquer implementations of linear transforms [29]. Divide-and-conquer is also used in the Pochoir compiler [34], which generates code for finite-difference computations given a finite-difference stencil, and in the synthesis of dynamic programming algorithms [28]. This approach cannot be used to synthesize high-performance implementations of graph algorithms, since most graph algorithms cannot be expressed using mathematical identities; furthermore, the divide-and-conquer pattern is not useful here because the divide step requires graph partitioning, which usually takes longer than solving the problem itself. Green-Marl [16] is an orchestration language for graph analysis. Basic routines like BFS and DFS are assumed to be primitives written by expert programmers, and the language permits the composition of such traversals. Elixir gives programmers a finer level of control and provides a richer set of scheduling policies; in fact, BFS is one of the applications presented in this paper for which Elixir can automatically generate multiple parallel variants competitive with handwritten third-party code.

Figure 14: Runtime comparison of BC algorithms. [Plots omitted: time (ms) vs. threads (1–24) on (a) rmat20, comparing SSCA2, v14, v16, and v24; (b) USA-W, comparing SSCA2, v1, v14, and v15.]

There is also a greater degree of automation in Elixir, since the system can explore large numbers of scheduling policies automatically. Green-Marl provides support for nested parallelism, which Elixir currently does not. In [23], Nguyen et al. describe a synthesis procedure for building high-performance worklists. Elixir uses their worklists for dynamic scheduling and adds static scheduling and synthesis from a high-level specification of operators.

Another line of work focuses on synthesis from logic specifications [17, 33]: the user writes a logical formula, and a system synthesizes a program from it. These specifications are at a much higher level of abstraction than Elixir's. Another line of work, focusing on concurrency, is Sketching [32] and Paraglide [35]. There, the goal is to start from a (possibly partial) sequential implementation of an algorithm and infer synchronization to create a correct concurrent implementation. Automation is used to prune a large part of the state space of possible solutions or to verify the correctness of each solution [36].

Term and Graph Rewriting: Term and graph rewriting [30] are well-established research areas. Systems such as GrGen [14], PROGRES [31], and Graph Programming (GP) [26] use graph rewriting techniques for problem solving.



Their goals, however, differ from ours: in that setting, the goal is to find a schedule of actions that leads to a correct solution. If a schedule does not lead to a solution, it fails, and techniques such as backtracking are employed to continue the search. In our case, every schedule is a solution, and we are interested in schedules that yield efficient solutions. Additionally, none of these systems focuses on concurrency or on optimizing concurrency overheads.

Graph rewriting systems try to perform efficient incremental graph pattern matching using techniques such as Rete networks [8, 15]. In a similar spirit, systems based on dataflow constraints try to perform incremental computations efficiently using runtime techniques [13]. Unlike Elixir, none of these approaches focuses on parallel execution. In addition, Elixir synthesizes efficient incremental computations using compile-time techniques to infer high-quality deltas.

Finite-differencing: Finite differencing [24] has been used to automatically derive efficient data structures and algorithms from high-level specifications [9, 21]. This work is not focused on parallelism. Differencing can be used to derive incremental versions of fixpoint computations [9]. Techniques based on differencing rely on a set of rules, most often supplied manually, to incrementally compute complicated expressions. Elixir automatically infers a sound set of rules for our problem domain, tailored to a given program, using an SMT solver.

8. Conclusion and Future Work

In this paper we present Elixir, a system that is the first step towards synthesizing high-performance, parallel implementations of graph algorithms. Elixir starts from a high-level specification with two main components: (i) a set of operators that describe how to solve a particular problem, and (ii) a specification of how to schedule these operators to produce an efficient solution. Elixir synthesizes efficient parallel implementations with a guaranteed absence of concurrency bugs such as data races and deadlocks. Using Elixir, we automatically enumerated and synthesized a large number of solutions for interesting graph problems and showed that our solutions perform competitively against highly tuned hand-parallelized implementations. This shows the potential of our approach for improving the practice of parallel programming in the complex field of irregular graph algorithms.

As mentioned in the introduction, there are two main restrictions on the supported specifications. First, Elixir supports only operators whose neighborhoods contain a fixed number of nodes and edges. Second, Elixir does not support mutations of the graph structure. We believe Elixir can be extended to handle such algorithms, but we leave this for future work.

Another interesting open question is how to integrate Elixir specifications into a larger project that mixes other code with a basic graph algorithm. Currently, Elixir allows the user to insert fragments of uninterpreted C++ code inside an operator. This way, application-specific logic can be embedded into the algorithmic kernel easily, under the assumption that the uninterpreted code fragment does not affect the behavior of the graph kernel. Assessing the effectiveness of this solution, and checking that consistency is preserved by transitions to the uninterpreted mode, are subjects of future work.

Acknowledgments

We would like to thank the anonymous referees and Noam Rinetzky for their helpful comments, and Xin Sui for many useful discussions.

References

[1] 9th DIMACS Implementation Challenge. http://www.dis.uniroma1.it/~challenge9/download.shtml, 2009.
[2] Galois system. http://iss.ices.utexas.edu/?p=projects/galois, 2011.
[3] D. Bader, J. Gilbert, J. Kepner, and K. Madduri. HPCS scalable synthetic compact applications graph analysis (SSCA2) benchmark v2.2, 2007. http://www.graphanalysis.org/benchmark/.
[4] D. A. Bader, S. Kintali, K. Madduri, and M. Mihail. Approximating betweenness centrality. In WAW, 2007.
[5] D. A. Bader and K. Madduri. Parallel algorithms for evaluating centrality indices in real-world networks. In ICPP, 2006.
[6] M. Barnett, B. E. Chang, R. DeLine, B. Jacobs, and K. Rustan M. Leino. Boogie: A modular reusable verifier for object-oriented programs. In FMCO, 2005.
[7] U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25:163–177, 2001.
[8] H. Bunke, T. Glauser, and T. Tran. An efficient implementation of graph grammars based on the RETE matching algorithm. In Graph Grammars and Their Application to Computer Science, 1991.
[9] J. Cai and R. Paige. Program derivation by fixed point computation. Sci. Comput. Program., 11(3), 1989.
[10] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In SIAM Data Mining, 2004.
[11] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, editors. Introduction to Algorithms. MIT Press, 2001.
[12] L. De Moura and N. Bjørner. Z3: An efficient SMT solver. In TACAS, 2008.
[13] C. Demetrescu, I. Finocchi, and A. Ribichini. Reactive imperative programming with dataflow constraints. In OOPSLA, 2011.
[14] R. Geiß, G. Batz, D. Grund, S. Hack, and A. Szalkowski. GrGen: A fast SPO-based graph rewriting tool. In Graph Transformations, 2006.
[15] A. H. Ghamarian, A. Jalali, and A. Rensink. Incremental pattern matching in graph-based state space exploration. In GraBaTs, 2010.
[16] S. Hong, H. Chafi, E. Sedlar, and K. Olukotun. Green-Marl: A DSL for easy and efficient graph analysis. In ASPLOS, 2012.



[17] S. Itzhaky, S. Gulwani, N. Immerman, and M. Sagiv. A simple inductive synthesis methodology and its applications. In OOPSLA, 2010.
[18] M. Kulkarni, M. Burtscher, C. Cascaval, and K. Pingali. Lonestar: A suite of parallel irregular programs. In ISPASS, 2009.
[19] C. E. Leiserson and T. B. Schardl. A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). In SPAA, 2010.
[20] A. Lenharth, D. Nguyen, and K. Pingali. Priority queues are not good concurrent priority schedulers. Technical Report TR-11-39, UT Austin, Nov 2011.
[21] Y. A. Liu and S. D. Stoller. From Datalog rules to efficient programs with time and space guarantees. ACM TOPLAS, 31(6), 2009.
[22] U. Meyer and P. Sanders. ∆-stepping: A parallelizable shortest path algorithm. J. Algorithms, 49(1):114–152, 2003.
[23] D. Nguyen and K. Pingali. Synthesizing concurrent schedulers for irregular algorithms. In ASPLOS, 2011.
[24] R. Paige and S. Koenig. Finite differencing of computable expressions. ACM TOPLAS, 4(3):402–454, 1982.
[25] K. Pingali, D. Nguyen, M. Kulkarni, M. Burtscher, M. A. Hassaan, R. Kaleem, T. H. Lee, A. Lenharth, R. Manevich, M. Mendez-Lojo, D. Prountzos, and X. Sui. The TAO of parallelism in algorithms. In PLDI, 2011.
[26] D. Plump. The graph programming language GP. In CAI, 2009.
[27] D. Prountzos, R. Manevich, and K. Pingali. Elixir: A system for synthesizing concurrent graph programs. Technical Report TR-12-27, UT Austin, Aug 2012.
[28] Y. Pu, R. Bodik, and S. Srivastava. Synthesis of first-order dynamic programming algorithms. In OOPSLA, 2011.
[29] M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, 2005.
[30] G. Rozenberg, editor. Handbook of Graph Grammars and Computing by Graph Transformation: Volume I. Foundations. World Scientific Publishing Co., Inc., 1997.
[31] A. Schürr, A. J. Winter, and A. Zündorf. The PROGRES approach: Language and environment. In Handbook of Graph Grammars and Computing by Graph Transformation. World Scientific Publishing Co., Inc., 1999.
[32] A. Solar-Lezama, C. G. Jones, and R. Bodik. Sketching concurrent data structures. In PLDI, 2008.
[33] S. Srivastava, S. Gulwani, and J. Foster. From program verification to program synthesis. In POPL, 2010.
[34] Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C. K. Luk, and C. E. Leiserson. The Pochoir stencil compiler. In SPAA, 2011.
[35] M. Vechev and E. Yahav. Deriving linearizable fine-grained concurrent objects. In PLDI, 2008.
[36] M. Vechev, E. Yahav, and G. Yorsh. Abstraction-guided synthesis of synchronization. In POPL, 2010.

 1 Graph [ nodes(node : Node, dist : int,
 2               sigma : double, delta : double,
 3               nsuccs : int, preds : set[Node],
 4               bc : double, bcapprox : double)
 5         edges(src : Node, dst : Node, dist : int)
 6 ]
 7
 8 source : Node
 9
10 // Shortest path rule.
11 SP = [ nodes(node a, dist ad)
12        nodes(node b, dist bd, sigma sigb, preds pb, nsuccs nsb)
13        edges(src a, dst b)
14        (bd > ad + 1) ] →
15      [ bd = ad + 1
16        sigb = 0
17        nsb = 0
18      ]
19
20 // Record predecessor rule.
21 RP = [ nodes(node a, dist ad, sigma sa, nsuccs nsa)
22        nodes(node b, dist bd, sigma sb, preds pb)
23        edges(src a, dst b, dist ed)
24        (bd == ad + 1) & (ed != ad) ] →
25      [ sb = sb + sa
26        pb = pb + a
27        nsa = nsa + 1
28        ed = ad
29      ]
30
31 // Update BC rule.
32 updBC = [ nodes(node a, nsuccs nsa, delta dela, sigma sa)
33           nodes(node b, nsuccs nsb, preds pb, bc bbc,
34                 bcapprox bbca, delta delb, sigma sb)
35           edges(src a, dst b)
36           (nsb == 0 & a in pb) ] →
37         [ nsa = nsa - 1
38           dela = dela + sa / sb * (1 + delb)
39           bbc = bbc - bbca + delb
40           bbca = delb
41           pb = pb - a
42         ]
43
44 backwardInv : ∀a : Node, b : Node :: a ∈ preds(b) =⇒ ¬(b ∈ preds(a))
45 Forward = iterate (SP or RP) >> metric ad >> fuse >> group b
46 Backward = iterate {backwardInv} updBC >> group a
47 main = Forward; Backward

Figure 15: Elixir program for betweenness centrality.

A. Betweenness Centrality

Fig. 15 shows an Elixir program for solving the betweenness centrality problem.

The Elixir language allows a programmer to specify invariants (in first-order logic) and use them to annotate actions, as shown in lines 44 and 45. (We avoided discussing annotations until this point to simplify the presentation of statements.) Our compiler adds these invariants as constraints to the query programs to further optimize the inference of useful influence patterns. Additionally, we could use these invariants to optimize the redexes procedure so as to avoid scanning the entire graph for redexes. In all of our benchmarks, this optimization would reduce the search for redexes to just those involving the source node.
