


An Adaptive Load Balancer For Graph Analytics on GPUs

Vishwesh Jatala 1,2, Loc Hoang 1,3, Roshan Dathathri 1,3, Gurbinder Gill 1,3, V. Krishna Nandivada 4,5, and Keshav Pingali 1,3

1 The University of Texas at Austin, Texas, United States
2 [email protected]
3 {loc,roshan,gill,pingali}@cs.utexas.edu
4 Indian Institute of Technology Madras, Chennai, India
5 [email protected]

Abstract. Load-balancing among the threads of a GPU for graph analytics workloads is difficult because of the irregular nature of graph applications and the high variability in vertex degrees, particularly in power-law graphs. We describe a novel load balancing scheme to address this problem. Our scheme is implemented in the IrGL compiler to allow users to generate efficient load-balanced code for a GPU from high-level sequential programs. We evaluated several graph analytics applications on up to 16 distributed GPUs using IrGL to compile the code and the Gluon substrate for inter-GPU communication. Our experiments show that this scheme can achieve an average speed-up of 2.2× on inputs that suffer from severe load imbalance problems when previous state-of-the-art load-balancing schemes are used.

Keywords: Load Balancing, GPUs, Graph Processing, Parallelization

1 Introduction

Graphics processing units (GPUs) have become popular platforms for processing graph analytical applications [10,2,5,1,11]. In spite of the computational advantages provided by GPUs, achieving good performance for graph analytical applications remains a challenge. Specifically, load balancing across GPU threads as well as among multiple GPUs is a difficult problem: many distributed applications execute in bulk-synchronous rounds, and imbalance among threads within a GPU in a round may cause all threads to wait for stragglers to complete.

Load balancing in multi-GPU systems is a difficult problem for many reasons. The first reason is that the set of vertices to be processed in a computational round is statically unpredictable and may vary dramatically from round to round; therefore, static load balancing techniques do not work well. Another complication is that most large graphs today are power-law graphs in which vertex degrees follow a power-law distribution (i.e., a few vertices have orders of magnitude more neighbors than the rest); therefore, simple load balancing schemes that assign vertices to threads may not perform well. Finally, good load balancing schemes must account for the architecture of modern GPUs and their hierarchy of threads: thread blocks, warps, and threads.

Several graph processing frameworks have proposed load balancing strategies for graph analytics on GPUs [8,6,11,9]. Most of these strategies involve dynamically partitioning vertices or edges evenly across the thread blocks, warps, or threads of the GPU. However, they have one or more of these limitations: (i) they do not load balance across thread blocks, (ii) they have high memory or computation overheads, or (iii) they require high programming effort.

We present an adaptive load balancing strategy called ALB that addresses load imbalance at runtime. In each computation round, it classifies vertices based on their degrees, since vertex degrees provide an estimate of load imbalance. Edges of very high-degree vertices are evenly assigned across all threads in all thread blocks using a novel cyclic edge distribution strategy that accounts for the memory access patterns of the GPU threads. All other vertices are evenly distributed across thread blocks, warps, or threads, similar to a prior load balancing scheme [8]. We implemented our strategy in the IrGL compiler [10], which permits users to write sequential graph analytics programs without knowledge of GPU architectures. The compiler-generated code interoperates with the Gluon communication substrate [2], enabling it to run on multiple GPUs in a distributed-memory cluster.

We evaluated the benefits of our approach on a single machine with up to 8 GPUs and on a distributed GPU cluster with up to 16 GPUs, comparing against other frameworks that support different load balancing strategies. Our experiments show that our load-balanced code achieves an average speedup of (1) 1.6× on a single GPU and (2) 2.2× on multiple GPUs for many graph applications. Our load-balanced code also achieves an average speedup of 2.2× compared to other third-party frameworks on power-law graphs, while incurring negligible overhead on inputs that do not suffer from heavy load imbalance.

2 Background on Graph Analytics and GPUs

Graph Analytics: A graph consists of vertices, edges, and their associated labels. Vertex labels are initialized at the start and updated repeatedly until some global quiescence condition is reached. Updates to labels are performed by applying an operator to active vertices in the graph [12]. Push operators read the label of the active vertex and update the labels of its immediate neighbors. Pull operators read the labels of the immediate neighbors of the active vertex and update the label of the active vertex. To process a graph on a distributed cluster, the input graph is partitioned among the machines. Execution occurs in bulk-synchronous parallel (BSP) rounds: in each round, hosts compute on local partitions and then participate in a global synchronization in which the labels of vertices are made consistent.
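To make the push/pull distinction concrete, the following is a minimal sketch of a push-style operator for a shortest-path-like computation over a CSR graph; the array names (rowStart, edgeDst, edgeWt, dist) are our own illustration, not IrGL's API. A pull-style operator would instead iterate over the in-edges of the active vertex and update the active vertex's own label, which typically needs no atomics.

    __global__ void pushRelax(const int* rowStart, const int* edgeDst,
                              const int* edgeWt, int* dist, int nVertices) {
        int src = blockIdx.x * blockDim.x + threadIdx.x;
        if (src >= nVertices) return;
        // Push: read the active vertex's label and update the labels of its
        // immediate neighbors; atomics resolve concurrent writes to a neighbor.
        int d = dist[src];
        for (int e = rowStart[src]; e < rowStart[src + 1]; ++e)
            atomicMin(&dist[edgeDst[e]], d + edgeWt[e]);
    }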

[Figure 1: Thread block load imbalance for various configurations. (a) SSSP on rmat25 for rounds 0, 1, and 2; (b) BFS on rmat25 and road-USA; (c) SSSP and PR on rmat25 in the first round. Each panel plots thread block load (number of processed edges) against thread block number (0-600).]

GPU Execution: GPUs offer higher memory bandwidth and more concurrency than most CPUs; both can be exploited for high-performance graph analytics. A GPU executes multithreaded programs called kernels. A kernel is launched on the GPU with a fixed number of threads. The CUDA programming model (used for NVIDIA GPUs) organizes kernels hierarchically: each kernel executes as a collection of thread blocks, also called cooperative thread arrays (CTAs). The threads in a CTA are divided into sets called warps, and the threads in a warp execute program instructions in an SPMD manner. Once a CTA finishes its execution, another CTA can be launched. A kernel ends once all CTAs (thread blocks) finish.
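As a minimal illustration of this hierarchy (our example, assuming a 1-D launch), each thread can locate its CTA, warp, and lane from CUDA built-in variables:

    #include <cstdio>

    __global__ void whereAmI() {
        int cta  = blockIdx.x;                            // thread block (CTA) id
        int warp = threadIdx.x / warpSize;                // warp within the CTA
        int lane = threadIdx.x % warpSize;                // lane within the warp
        int tid  = blockIdx.x * blockDim.x + threadIdx.x; // global thread id
        if (tid == 0) printf("CTA %d, warp %d, lane %d\n", cta, warp, lane);
    }
    // launched as, e.g., whereAmI<<<numCTAs, threadsPerCTA>>>();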

3 Existing Load Balancing Schemes

Vertex/Edge-Based: Vertex-based load balancing [9] assigns roughly equal numbers of vertices to GPU threads (a sketch appears below). Each thread processes the edges of the vertices assigned to it. This scheme works well if the number of edges connected to each vertex is roughly the same. Otherwise, if the degree distribution is skewed, as in power-law graphs, it results in severe load imbalance at the inter-thread and intra-thread-block levels. Edge-based load balancing [13] assigns roughly equal numbers of edges to each GPU thread, and each thread updates its assigned edges' endpoints. This balances the workload, but it needs a graph representation that allows a thread to quickly access the endpoints and data of an edge from the edge ID, such as the coordinate (COO) format or an edge list format, which requires significantly more space than the compressed sparse row (CSR) or compressed sparse column (CSC) formats.

Thread-Warp-CTA (TWC): TWC load balancing [8] assigns a vertex and its edges to a thread, a warp, or a thread block (CTA) based on the degree of the vertex. Each vertex is assigned to one of three bins (small, medium, and large) based on its degree. Vertices in the large bin are processed at the thread block level, vertices in the medium bin at the warp level, and vertices in the small bin at the thread level. This scheme load balances all threads within a thread block and does not need edge endpoints as edge-based balancing does.
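For reference, the vertex-based scheme reduces to one active vertex per thread over the CSR arrays (a sketch with our own array names; active holds the current worklist); the inner loop's trip count is the vertex degree, so a thread that draws a huge-degree vertex stalls its warp and thread block:

    __global__ void vertexBalanced(const int* rowStart, const int* edgeDst,
                                   const int* active, int nActive) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nActive) return;
        int src = active[i];  // one active vertex per thread
        for (int e = rowStart[src]; e < rowStart[src + 1]; ++e) {
            // apply the operator to (src, edgeDst[e]); this loop runs
            // degree(src) times, so degree skew becomes thread imbalance
        }
    }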

This scheme may result in load imbalance if degree distributions within a bin vary significantly, particularly for the large bin, which has no upper bound on degree. To demonstrate this, we measured the load distribution (number of processed edges) on different thread blocks in the first three rounds of a single-source shortest path (sssp) computation on the rmat25 graph, using the TWC load-balancing scheme in D-IrGL [2].


1 procedure InspectWithTWC(Graph g, Worklist wl, Work work):
2   foreach src in wl do
3     degree = getOutDegree(src);
4     if degree >= THRESHOLD then
5       work.push(src)
6     else
7       distribute edges of src to thread/warp/CTA based on degree

Algorithm 1: Inspection Phase of Adaptive Load Balancer

1 procedure LB(Graph g, Worklist wl, Work work, Worklist prefixWork):
2   edgelist = getMyEdges(work, prefixWork)          # edge distribution policy
3   foreach edge in edgelist do
4     computeEndPoints(edge, src, dst, prefixWork)   # use binary search
5     applyOperator(g, src, dst, wl)                 # apply operator

Algorithm 2: Execution Phase of Adaptive Load Balancer

Figure 1a shows that the first two rounds take most of the execution time and that the computational load across thread blocks is imbalanced. Thread block imbalance can also vary across different applications for the same input. We use Figure 1c to illustrate this: sssp suffers from thread block load imbalance but pagerank does not. Finally, the computational load can be distributed differently across thread blocks for different inputs. Figure 1b shows the thread block distribution for bfs on the road-USA and rmat25 inputs: bfs exhibits thread block load imbalance on rmat25 but not on road-USA.

Vertex Across CTA: Some schemes [6,11] distribute the edges of a vertex across thread blocks (CTAs) while using a CSR (or CSC) format. In the CSR format, only the destination of an edge can be quickly accessed. To access the source, Enterprise [6] fixes the source in a round (it only considers bfs), whereas a load balancing scheme called LB in Gunrock [11] finds the source on the fly. Enterprise adds another bin (huge) to TWC; for each vertex in the huge bin, it processes vertices in rounds (barriers) and distributes the edges of the (source) vertex in a round across threads in all CTAs. LB assigns the edges of all vertices equally among all threads (in all CTAs), and each thread uses binary search to find the source of each edge assigned to it. LB uses this distribution for all vertices in all rounds, so the benefits might not offset the overheads of binary search. Thus, Enterprise and LB have computation overheads due to barriers and binary searches, respectively.

4 Adaptive Load Balancer (ALB)

4.1 Design

ALB detects thread-block-level (CTA) load imbalance with minimal overhead and efficiently balances the load among all threads of all thread blocks.

[Figure 2: Cyclic and blocked edge distribution and binary search for vertices. The example uses 8 huge vertices v0..v7 with degrees (work) 40, 75, 115, 20, 50, 45, 25, 30 and prefix sums 40, 115, 230, 250, 300, 345, 370, 400; 5 thread blocks of 4 threads each (20 threads total), threshold 20, 400 total units of work, and 20 units of work per thread. The distribution of edges to threads T0..T19 is shown in the thread blocks (e.g., in TB0, threads T0..T3 start at edges E0..E3 under cyclic and at E0, E20, E40, E60 under blocked), and the binary search over the prefix sums is shown as trees: under cyclic, all threads T0..T19 follow the same path, whereas under blocked they split across subtrees.]

In graph analytical applications, CTA load imbalance occurs when some CTAs process active vertices with huge degrees while others process active vertices with low degrees (Section 3). To detect imbalance, ALB uses an inspection phase to separate high-degree active vertices from lower-degree vertices that will not cause load imbalance under the TWC scheme (Algorithm 1). In the execution phase (Algorithm 2), ALB distributes the processing of the high-degree active vertices across all CTAs, eliminating load imbalance for such vertices, and it uses the TWC scheme for the other vertices.

Inspection (Algorithm 1): Inspection iterates over the active vertices (Line 2) and uses a lightweight check (Line 4) to determine whether a vertex is high-degree based on a threshold value. If so, the vertex is pushed onto a separate worklist (Line 5); if not, it is assigned to a TWC bin and processed with TWC (Line 7).

Execution (Algorithm 2): Each thread in each thread block (1) identifies the edges that it must process (Line 2), (2) computes the endpoints of those edges (Line 4), and (3) applies the operator [12] along each edge (Line 5). There are many distribution policies for assigning edges to threads in step (1). Step (2) searches for the source (or destination) of each edge in the graph.

Distribution: We use two policies, illustrated in Figure 2: cyclic and blocked. In the cyclic distribution, edges are assigned to threads in a round-robin manner, while in the blocked distribution, each thread is assigned a contiguous range of edges. More precisely, if the number of threads is p and the total number of edges of all huge vertices is w, thread T_i is assigned edges e_i, e_{p+i}, e_{2p+i}, ... under the cyclic policy and edges e_{i*(w/p)}, e_{i*(w/p)+1}, ..., e_{(i+1)*(w/p)-1} under the blocked policy (assuming p divides w).
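As a sketch, the two policies reduce to simple index arithmetic, matching the compute_first_edge/get_next_edge hooks in the generated code of Figure 4 (the helper names here are our own):

    // Cyclic: thread i processes edges i, i+p, i+2p, ...
    __device__ int firstEdgeCyclic(int i)       { return i; }
    __device__ int nextEdgeCyclic(int e, int p) { return e + p; }

    // Blocked: thread i processes the contiguous range [i*(w/p), (i+1)*(w/p)).
    __device__ int firstEdgeBlocked(int i, int w, int p) { return i * (w / p); }
    __device__ int nextEdgeBlocked(int e)                { return e + 1; }

With p = 20 and w = 400 as in Figure 2, thread T1 starts at edge e1 under the cyclic policy (then e21, e41, ...) and at edge e20 under the blocked policy (then e21, e22, ...).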

Search: If the graph is stored in COO format, edge endpoints are readily available, but like most graph analytical systems, our system stores the graph in CSR (or CSC) format to save space. Therefore, to find the source (or destination) of an edge, we perform a binary search (Line 4) on prefixWork, which is the prefix


sum of the out- (or in-) degrees of the huge vertices in work (a sketch of this search appears at the end of this subsection). For example, in Figure 2, thread T4 needs to process edge e4 in the cyclic distribution, so it performs a binary search in prefixWork to find the source vertex v0 (which owns the first 40 edges). The performance of the binary search is sensitive to the distribution. In the cyclic distribution of Figure 2, consecutive threads (T0..T19) process consecutive edges (E0..E19), whose binary search computations follow the same trajectory in the binary search tree, leading to low thread divergence and good locality. In the blocked distribution, threads T0..T19 compute their source vertices by following different paths in the binary search tree, leading to thread divergence and poor locality. Hence, we use the cyclic distribution by default.

Threshold: The threshold for classifying huge vertices can affect the performance of applications. Setting it to 0 places all vertices in the huge bin; this may be good for load balancing, but there will be overhead from the binary searches. Conversely, setting it to a value larger than the maximum degree in the graph ensures that no vertex is placed in the huge bin; this eliminates binary search but hurts load balance. Determining the optimal threshold is difficult due to several unknown hardware parameters, such as the hardware thread block scheduling, warp scheduling, and cache replacement policies. However, setting the threshold to the number of threads launched in a kernel ensures that the threads in each warp follow at most two divergent branches during the binary search with the cyclic distribution, since the threads will be assigned edges of at most two different vertices per cyclic round of edge distribution. Setting the threshold to less than the number of threads launched leads to more divergent branches, resulting in thread divergence and poor locality. Thus, we set the threshold to the number of threads by default.
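The endpoint search of Algorithm 2 (Line 4) can be sketched as follows (our function name, not the generated IrGL code; prefixWork[k] holds the prefix sum of the degrees of huge vertices 0..k): the source of a global edge id is the first huge vertex whose prefix sum exceeds that id.

    __device__ int findSourceIndex(const int* prefixWork, int numHuge, int edgeId) {
        // Smallest index k such that edgeId < prefixWork[k].
        int lo = 0, hi = numHuge - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (edgeId < prefixWork[mid]) hi = mid;
            else                          lo = mid + 1;
        }
        return lo;  // work[lo] is the source; the offset within its edges is
                    // edgeId - (lo > 0 ? prefixWork[lo - 1] : 0)
    }

For the example of Figure 2, edge id 4 yields index 0 (prefixWork[0] = 40), i.e., e4 belongs to v0, matching the T4 example above.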

4.2 Analysis

Memory Accesses: Table 1 shows the notation used in analyzing the number of memory accesses in a computation round for the cyclic and blocked distribution policies. We set the threshold t to the number of threads T.

Each instance of a warp processes W edges, so each warp needs e/(W * Nw) instances to process all its edges. Each divergent branch in a warp during binary search requires O(log(v)) memory accesses. We assume there is no locality among the accesses generated by different warps (the worst case).

Table 1: Notation.

Parameter  Description
t          Threshold
T          Number of threads
W          Warp size
Nw         Number of warps
e          Sum of the degrees of huge vertices
v          Number of huge vertices

In the cyclic distribution, the threads in a warp process successive edges. As the threads within a warp follow at most two divergent paths during binary search, the number of memory accesses in each warp instance is O(log(v)). Thus, the total number of memory accesses across all Nw warps is O(e * log(v)/W).

In the blocked distribution, each thread in a warp processes non-contiguous edges. In the worst case, each thread can follow a different divergent branch.


1  WL = Worklist()
2  Kernel("SSSP", [('Graph', 'g')],
3    [ForAll(src in WL.items()) {
4       // for each active node
5       ForAll(edge in g.edges(src)) {
6         // for each neighbor
7         dst = g.getDestination(edge)
8         // relaxation operator for (src, dst)
9         newDist = g.getDist(src) + g.getWeight(edge)
10        if (atomicMin(g.curDist(dst), newDist))
11          WL.push(dst)  // push if updated
12      }
13    }
14  ]),
15  Iterate("while", "WL", SSSP(graph), Init(WL))

Fig. 3: A snippet of a single-source shortest path (sssp) program written in IrGL.

Hence, the number of memory accesses in each warp instance is O(W * log(v)). Consequently, the total number of memory accesses across all Nw warps is O(e * log(v)).

Memory Usage: In each round, ALB uses two worklists to keep track of the huge vertices and their prefix sums; the latter reduces the overhead of binary search. Although these incur O(V) space overhead, in practice v ≪ V, where v is the number of huge vertices.
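The two memory-access bounds above follow from the same counting argument; in the notation of Table 1:

\[
\text{cyclic:}\quad \underbrace{\frac{e}{W N_w}}_{\text{instances per warp}} \cdot N_w \cdot O(\log v) \;=\; O\!\left(\frac{e \log v}{W}\right),
\qquad
\text{blocked:}\quad \frac{e}{W N_w} \cdot N_w \cdot O(W \log v) \;=\; O(e \log v),
\]

where the per-instance cost is O(log v) for cyclic (at most two divergent branches per warp) and O(W log v) for blocked (up to W divergent branches per warp).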

4.3 Code Generation

Single GPU: We implemented ALB in the IrGL compiler [10] to generate CUDA code for a single GPU. The compiler supports programming constructs to traverse vertices and edges in parallel. Figure 3 shows a snippet of the sssp program written in IrGL. The outer loop (Line 3) iterates over the active vertices in each round, and the inner loop (Line 5) processes the neighbors of the active vertex. The sssp operator (known as the relaxation operator) is applied in each iteration of the inner loop, and it corresponds to a unit of work (load) for this program.

For the sssp program in Figure 3, the ALB code generated by our modified IrGL compiler is shown in Figure 4. Lines 3-5 show the generated code for the inspection phase, and the function SSSP_LB shows the generated code for the execution phase. The main loop (Lines 22-26) invokes the inspection phase to identify the vertices in the huge bin (a worklist denoted work). In the presence of load imbalance, it computes the prefix sum over the huge bin (Line 25) and then invokes the execution phase (Line 26).

Distributed GPUs: Most multi-GPU graph analytical systems use the BSP execution model. Hence, thread block imbalance within a GPU can lead to stragglers and exacerbate load imbalance among GPUs. To show the strengths of ALB, we use ALB-generated code with the CuSP [4] graph partitioner and the Gluon [2] communication substrate (the Abelian [3] compiler can generate Gluon communication code automatically).


1  __global__ void SSSP_Inspect_with_TWC(Graph g, WL wl, WL work, ...) {
2    for (src = tid; src < wl.end(); src += nthreads) {
3      degree = getOutDegree(src);
4      if (degree >= THRESHOLD)
5        work.push(src);
6      else
7        ; // distribute edges of src to thread/warp/CTA based on its degree
8    }
9  }
10 __global__ void SSSP_LB(Graph g, WL wl, WL work, WL prefixWork, ...) {
11   int total_edges = getTotalEdges(prefixWork);
12   int edges_per_thread = total_edges / nthreads;
13   int current_edge = compute_first_edge(tid);        // distribution policy
14   while (current_edge < edges_per_thread) {          // loop over assigned edges
15     int src, dst;
16     compute_end_points(current_edge, src, dst, prefixWork);
17     ...                                              // SSSP relaxation operator for (src, dst)
18     get_next_edge(tid, current_edge);                // distribution policy
19   }
20 }
21 Initialize(wl, work, prefixWork);                    // initialize worklists
22 while (!wl.empty()) {
23   SSSP_Inspect_with_TWC<<<#blocks, #threads>>>(g, wl, work, ...);
24   if (!work.empty()) {
25     compute_prefix_sum(work, prefixWork);
26     SSSP_LB<<<#blocks, #threads>>>(g, wl, work, prefixWork, ...);
27   }
28 }

Fig. 4: A snippet of our compiler-generated CUDA code for the sssp program.

Table 2: Inputs and their key properties.

                  rmat25   orkut    road-USA  rmat26   rmat27   twitter40  uk2007
|V|               33.5M    3.1M     23.9M     67.1M    134M     41.6M      106M
|E|               536.8M   234M     57.7M     1,074M   2,147M   1,468M     3,739M
|E|/|V|           16       76       2         16       16       35         35
max Dout          125.7M   33,313   9         239M     453M     2.99M      15,402
max Din           14,733   33,313   9         18,211   21,806   0.77M      975,418
Approx. diameter  3        6        6,261     3        3        12         115
Size (GB)         4.3      1.8      0.6       8.6      18       12         29

CuSP partitions the input graph using various partitioning policies selectable at runtime, and Gluon manages the synchronization of vertex labels. We denote the resulting system D-IrGL (ALB).

5 Evaluation Methodology

For our evaluation, we used 2 different setups on the Bridges supercomputer. For single-GPU and single-host multi-GPU experiments, we used a single machine (referred to as Bridges-Volta) with 2 Intel Xeon Gold 6148 CPUs (128GB RAM) and 8 NVIDIA Volta V100 GPUs, each with 16GB of memory. For multi-host multi-GPU experiments, we used multiple machines (referred to as Bridges-Pascal), each with 2 Intel Broadwell E5-2683 v4 CPUs (128GB RAM) and 2 NVIDIA Tesla P100 GPUs, each with 16GB of memory. The network interconnect is Intel Omni-Path.


All applications were compiled using CUDA 9.0, gcc 6.3.0, and MVAPICH2-2.3. The threshold for huge vertices in ALB is the number of threads launched for the application kernel, which depends on the GPU architecture: 163840 on Bridges-Volta and 114688 on Bridges-Pascal.
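These two values match the SM count times the maximum number of resident threads per SM (80 × 2048 on a V100 and 56 × 2048 on a P100). A minimal sketch of how such a threshold can be computed at runtime (our helper, not part of IrGL):

    #include <cuda_runtime.h>
    #include <cstdio>

    // Maximum number of threads resident on the GPU at once:
    // SM count x max resident threads per SM.
    static int albThreshold(int device) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, device);
        return prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    }

    int main() {
        printf("ALB huge-vertex threshold: %d\n", albThreshold(0));
        return 0; // 163840 on a V100 (80 x 2048); 114688 on a P100 (56 x 2048)
    }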

Table 2 lists the input graphs used; the table is split into small graphs, which are evaluated on Bridges-Volta (a single machine), and large graphs, which are evaluated on Bridges-Pascal (multiple machines). We run on up to 16 GPUs (8 machines).

We evaluated five applications: breadth-first search (bfs), connected components (cc), k-core decomposition (kcore), pagerank (pr), and single-source shortest path (sssp). All applications are run until convergence. The reported execution time is an average of three runs, excluding graph construction time.

For single-host multi-GPU experiments, we compare with Gunrock [11], and for multi-host multi-GPU experiments, we compare with D-IrGL [2] and Lux [5]. Gunrock is a single-host multi-GPU graph analytical framework that supports TWC and LB load balancing. For Gunrock, we used bfs, sssp, and cc; it does not have kcore, and we omit pr because it did not produce correct results. Gunrock provides bfs with and without direction optimization; the other systems do not have direction optimization, so we evaluated bfs in Gunrock without it. D-IrGL and Lux are distributed multi-GPU graph analytical frameworks. D-IrGL uses TWC load balancing. Applications in D-IrGL use push operators (traverse outgoing edges), except for pr and kcore, which use pull operators (traverse incoming edges). Lux uses a variant of TWC load balancing. For Lux, we use only cc and pr: the other applications are either not available or not correct.

6 Experimental Results

6.1 Performance Analysis on a Single GPU Platform

Table 3 compares the performance of our adaptive load balancing approach, D-IrGL (ALB), with D-IrGL (TWC), D-IrGL (LB), Gunrock (TWC), and Gunrock (LB). D-IrGL (TWC) is the default D-IrGL. D-IrGL (LB) is D-IrGL (ALB) without adaptive balancing: it balances all active vertices regardless of their degrees. Gunrock (TWC) uses TWC, and Gunrock (LB) uses Gunrock's LB.

D-IrGL and ALB: D-IrGL (ALB) is 4.6× faster on average than D-IrGL (TWC) for bfs, sssp, cc, and kcore on rmat25. These application and input configurations have heavy load imbalance across thread blocks in some iterations, and ALB detects this and balances the workload. D-IrGL (ALB) does not improve pr runtime on rmat25 because pr iterates over the incoming neighbors of a vertex, as opposed to the other applications, which iterate over outgoing neighbors; the in-degree distribution (Din in Table 2) is not as skewed as the out-degree distribution (Dout), so large imbalance does not exist. Similar reasoning explains why D-IrGL (ALB) does not outperform D-IrGL (TWC) for applications on the other graphs: both road-USA and orkut have low load imbalance (the skew of in- and out-degrees is relatively small), so ALB does not detect huge vertices or thread block load imbalance.

[Figure 5: Thread block load distribution of D-IrGL for rmat25 on 1 V100 GPU. (a) bfs and pr using TWC; (b) bfs using ALB; (c) pr using ALB. Panels (b) and (c) separate the LB load, the TWC load, and the total load per thread block.]

Table 3: Execution time (ms) on a single V100 GPU.

Input     App    Gunrock (TWC)  Gunrock (LB)  D-IrGL (TWC)  D-IrGL (LB)  D-IrGL (ALB)
rmat25    bfs    1419.6         299.9         1015.7        120.8        113.7
          sssp   1321.2         346.3         1418.0        150.2        142.4
          cc     347.7          382.3         648.3         265.3        142.4
          pr     -              -             1418.0        3948.4       1423.0
          kcore  -              -             1561.3        317.5        247.6
orkut     bfs    62.2           36.5          17.7          16.6         18.4
          sssp   166.4          137.1         55.3          67.5         57.3
          cc     22.9           23.5          28.0          35.8         29.0
          pr     -              -             1562.7        2889.3       1578.6
          kcore  -              -             86.0          154.2        90.6
road-USA  bfs    355.7          335.7         2287.0        4266.7       2531.0
          sssp   20883.6        20497.2       6436.0        16465.3      8902.0
          cc     31.3           33.7          3066.0        -            3300.7
          pr     -              -             450.0         877.2        458.5
          kcore  -              -             3.0           14.4         3.4

D-IrGL (LB) illustrates the importance of selectively balancing only high-degree (huge) vertices: D-IrGL (LB) performs worse than D-IrGL (ALB) for most configurations. Even though D-IrGL (LB) addresses the load imbalance for rmat25, it suffers from the overhead of balancing all vertices in every iteration for the other configurations. In particular, it significantly underperforms on road-USA: since road-USA does not have imbalance, the binary search overhead from balancing leads to worse performance.

Gunrock and ALB: Gunrock (LB) generally performs better than Gunrock (TWC) due to thread block load imbalance in TWC. D-IrGL (ALB) outperforms both LB and TWC in Gunrock for most of the applications on rmat25 and orkut due to its adaptivity: it fixes imbalance if it exists, and if a round does not have imbalance, ALB incurs minimal load balancing overhead.

Gunrock (LB) performs better than D-IrGL (ALB) for bfs and cc on road-USA. Gunrock uses an explicit sparse worklist to maintain the active vertices, whereas D-IrGL uses an implicit dense worklist (a boolean vector).


As bfs and cc have few active vertices in a round of computation, implicit worklists do not perform well: finding the active vertices requires iterating over the entire worklist.

Thread Block Load Distribution: We examine the thread block load distribution with TWC and with ALB to better understand why ALB performs well. Figure 5a shows the thread block work distribution of D-IrGL (TWC) for bfs and pr on rmat25. Figures 5b and 5c show the distribution with D-IrGL (ALB) for bfs and pr, respectively. The figures show the distributions for the two kernels launched in our approach: (1) LB, which distributes the work of the huge active vertices among all thread blocks, and (2) TWC, which distributes the load of all other active vertices to threads/warps/CTAs based on their degrees. The figures also show the total load (the sum of the TWC and LB loads).

D-IrGL (TWC) has heavy load imbalance for bfs. D-IrGL (ALB), however, has a more balanced load distribution than D-IrGL (TWC) because ALB distributes the load of high-degree active vertices equally among all thread blocks. As pr does not have thread block load imbalance in D-IrGL, it performs well with both D-IrGL (TWC) and D-IrGL (ALB). Notably, ALB does not instantiate the LB kernel for pr, as it detects that no imbalance exists.

[Figure 6: D-IrGL (ALB) with cyclic and blocked distribution for rmat25 on 1 V100 GPU; execution time for bfs, cc, kcore, pr, and sssp under each distribution.]

Analyzing Threshold: The threshold for a high-degree (huge) vertex in our experiments is the thread count, but to explore the effect of the threshold on performance, we varied the threshold value by powers of 2 starting from 1 (and including 0) for pr and sssp on road-USA and rmat25. We found that for configurations with low imbalance (pr on both graphs and both applications on road-USA), any threshold value from 2^8 (the thread block size) upward performed similarly, while for the configuration with imbalance (sssp on rmat25), execution time suffered when the threshold was made larger than the max degree (making it equivalent to TWC, as no vertex is then considered huge). For both inputs, which have widely different characteristics, ALB performed similarly for any threshold in the range of 2^8 to 2^18. In other words, there are many good thresholds, and whether a threshold value is good depends more on the architecture than on the input or application.

Cyclic vs. Blocked Distribution: Figure 6 compares the performance of the cyclic and blocked distributions for D-IrGL (ALB) on rmat25. The cyclic distribution outperforms the blocked distribution for all configurations (except pr), and it is on average 1.7× faster. As explained in Section 4, the cyclic distribution has better locality and benefits from lower thread divergence compared to the blocked distribution.

6.2 Performance Analysis on Multi-GPU Platforms

Single-Host Multi-GPU: Figure 7a shows strong scaling of D-IrGL (TWC), D-IrGL (ALB), Gunrock (TWC), and Gunrock (LB) for small graphs on Bridges-Volta.

[Figure 7: Strong scaling of different systems on multi-GPU platforms. (a) D-IrGL (ALB), D-IrGL (TWC), Gunrock (LB), and Gunrock (TWC) on Bridges-Volta (V100 GPUs, 1-8 GPUs) for orkut, rmat25, and road-USA; (b) D-IrGL (ALB), D-IrGL (TWC), and Lux on Bridges-Pascal (P100 GPUs, up to 16 GPUs) for rmat26, rmat27, twitter40, and uk2007. Each panel plots execution time (sec, log scale) against the number of GPUs for bfs, cc, kcore, pr, and sssp.]

The trends are similar to those on 1 Volta GPU. For all applications on orkut and road-USA and for pr on rmat25, D-IrGL (ALB) and D-IrGL (TWC) perform similarly because ALB's inspection phase does not detect thread block load imbalance (i.e., huge vertices). Both are also better than Gunrock in most cases. For bfs, cc, kcore, and sssp on rmat25, D-IrGL (ALB) outperforms D-IrGL (TWC), Gunrock (TWC), and Gunrock (LB). ALB improves computation on each GPU, thereby reducing the total execution time.

Multi-Host Multi-GPU: Figure 7b shows the strong scaling of D-IrGL (ALB), D-IrGL (TWC), and Lux for large graphs on Bridges-Pascal (we do not evaluate Gunrock because it is restricted to a single host). Both D-IrGL (ALB) and D-IrGL (TWC) perform better than Lux for all configurations. The only difference between ALB and TWC is in the computation on each GPU, and D-IrGL (ALB) performs similar to or better than D-IrGL (TWC).

D-IrGL (ALB) performs slightly better than D-IrGL (TWC) for pr on uk2007, but for the other applications on uk2007, they perform similarly. uk2007's max out-degree (15,402) is smaller than the threshold (114,688), so for push-style algorithms like bfs, cc, kcore, and sssp, ALB does not detect any huge vertices; uk2007's max in-degree (975,418) is much higher, so for pr, ALB detects the imbalance and corrects it. On twitter40, D-IrGL (ALB) is slightly faster than D-IrGL (TWC) in most cases because twitter40's max out-degree and max in-degree are both much higher than the threshold.


The max in-degree of rmat26 and rmat27 is smaller than the threshold, so D-IrGL (ALB) and D-IrGL (TWC) perform similarly for pr. For the other applications on rmat26 and rmat27, ALB detects and addresses thread block load imbalance because their max out-degree is much higher than the threshold. In these cases, D-IrGL (ALB) is faster than D-IrGL (TWC) by 4.3× on average.

[Figure 8: Execution time breakdown of D-IrGL (ALB) vs. D-IrGL (TWC) on 16 P100 GPUs for rmat26 and rmat27: computation time vs. non-overlapping communication time for bfs, cc, kcore, pr, and sssp.]

Analysis: To analyze the performance improvement for rmat26 and rmat27, Figure 8 breaks the total execution time on 16 GPUs down into computation time and non-overlapping communication time. The computation time of an application is the time spent executing the kernels of the application on the GPU; we report the maximum over all GPUs, so computation time accounts for load imbalance. The rest of the execution time is the non-overlapping communication time, including the time to synchronize vertex labels among the GPUs.

The results show that most applications in D-IrGL (TWC) spend most of the execution time in computation. rmat26 and rmat27 have a very large max out-degree (Dout in Table 2), so push-style applications in D-IrGL (TWC) suffer from thread block load imbalance on one of the GPUs. D-IrGL (ALB) reduces the computation time on the (straggler) GPUs by balancing load across the thread blocks on each GPU. This in turn balances the load among the GPUs, thereby reducing the total execution time.

Summary: Our adaptive load balancer (ALB) improves application performance significantly on configurations with thread block load imbalance on both single-GPU and multi-GPU platforms, and it incurs minimal overhead on configurations that have a balanced thread block load.

7 Related Work

GPU Graph Analytics: GPUs have been widely adopted for graph analytics, spanning single-host single-GPU systems [10,14,7], CPU-GPU heterogeneous systems [3], single-host multi-GPU systems [1,11], and multi-host multi-GPU systems [2,5]. This paper presents an adaptive load balancer that can be incorporated into all these systems to improve the performance on each GPU.

Load Balancing for GPU Graph Analytics: Prior load-balancing schemes [9,8,6,11] were described in Section 3. Unlike the vertex-based load-balancing scheme [9] and TWC [8], ALB balances load across thread blocks. In contrast to edge-based load-balancing schemes [13], ALB has lower memory overhead as it uses the CSR format instead of the COO format. Enterprise [6] is restricted to bfs and has overhead from barriers; ALB uses only a constant number of barriers (one per bin)


to balance the load for any graph application. Gunrock's LB [11] has overhead from many binary searches because it distributes the edges of all vertices among threads and uses a blocked distribution. ALB dynamically chooses the vertices that would benefit from edge distribution among threads and uses a cyclic distribution to minimize binary search overhead.

8 Conclusion

We presented an adaptive load balancing mechanism for GPUs that detects the presence of thread block load imbalance and distributes load equally among thread blocks at runtime. We implemented our strategy in the IrGL compiler and evaluated its effectiveness using up to 16 GPUs. Our approach improves the performance of applications by 2.2× on average for inputs with load imbalance, compared to previous load-balancing schemes.

References

1. Ben-Nun, T., Sutton, M., Pai, S., Pingali, K.: Groute: An Asynchronous Multi-GPU Programming Model for Irregular Computations. In: PPoPP (2017)
2. Dathathri, R., Gill, G., Hoang, L., Dang, H.V., Brooks, A., Dryden, N., Snir, M., Pingali, K.: Gluon: A Communication-optimizing Substrate for Distributed Heterogeneous Graph Analytics. In: PLDI (2018)
3. Gill, G., Dathathri, R., Hoang, L., Lenharth, A., Pingali, K.: Abelian: A Compiler for Graph Analytics on Distributed, Heterogeneous Platforms. In: Aldinucci, M., Padovani, L., Torquati, M. (eds.) Euro-Par (2018)
4. Hoang, L., Dathathri, R., Gill, G., Pingali, K.: CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics. In: IPDPS (2019)
5. Jia, Z., Kwon, Y., Shipman, G., McCormick, P., Erez, M., Aiken, A.: A Distributed Multi-GPU System for Fast Graph Processing. VLDB (2017)
6. Liu, H., Huang, H.H.: Enterprise: Breadth-first Graph Traversal on GPUs. In: SC (2015)
7. Meng, K., Li, J., Tan, G., Sun, N.: A Pattern Based Algorithmic Autotuner for Graph Processing on GPUs. In: PPoPP (2019)
8. Merrill, D., Garland, M., Grimshaw, A.: Scalable GPU Graph Traversal. In: PPoPP (2012)
9. Nasre, R., Burtscher, M., Pingali, K.: Data-driven versus Topology-driven Irregular Computations on GPUs. In: IPDPS (2013)
10. Pai, S., Pingali, K.: A Compiler for Throughput Optimization of Graph Algorithms on GPUs. In: OOPSLA (2016)
11. Pan, Y., Wang, Y., Wu, Y., Yang, C., Owens, J.D.: Multi-GPU Graph Analytics. In: IPDPS (2017)
12. Pingali, K., Nguyen, D., Kulkarni, M., Burtscher, M., Hassaan, M.A., Kaleem, R., Lee, T.H., Lenharth, A., Manevich, R., Mendez-Lojo, M., Prountzos, D., Sui, X.: The TAO of Parallelism in Algorithms. In: PLDI (2011)
13. Sariyuce, A.E., Kaya, K., Saule, E., Catalyurek, U.V.: Betweenness Centrality on GPUs and Heterogeneous Architectures. In: GPGPU (2013)
14. Wang, H., Geng, L., Lee, R., Hou, K., Zhang, Y., Zhang, X.: SEP-graph: Finding Shortest Execution Paths for Graph Processing Under a Hybrid Framework on GPU. In: PPoPP (2019)