
The figure above shows the impact of our virtualized version of the betweenness centrality algorithm, which performs multiple BFSs at the same time and is enhanced with some level of parallelization. Computing betweenness centrality is harder than computing closeness centrality; furthermore, the latter is more amenable to vectorization. The following figure shows the impact of the proposed techniques while computing closeness centrality on multicore CPUs, the Intel Xeon Phi accelerator, and graphics processing units. With the proposed techniques, we can process 80 billion edges per second.
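To make the "multiple BFSs at the same time" idea concrete, the sketch below runs a batch of up to 64 breadth-first searches simultaneously by packing one bit per source into a machine word; this is the essence of the bitwise closeness centrality formulation described under BETTER NETWORK REPRESENTATIONS below. It is a plain Python illustration and not the tuned CPU/Phi/GPU kernels behind the measurements; the function name and the 1/(sum of distances) convention for closeness are assumptions.

    def batched_closeness(adj, sources):
        # adj: list of neighbor lists; sources: up to 64 source vertices per batch.
        # One bit per source is packed into an integer, so a single OR over a
        # neighbor list advances up to 64 BFS frontiers at the same time.
        n = len(adj)
        reach = [0] * n                  # bit i set: vertex already reached from sources[i]
        frontier = [0] * n               # bits that were newly reached in the last level
        farness = [0] * len(sources)     # per-source sum of distances to reached vertices
        for i, s in enumerate(sources):
            reach[s] |= 1 << i
            frontier[s] |= 1 << i
        level = 0
        while True:
            level += 1
            nxt = [0] * n
            for u in range(n):
                if frontier[u]:
                    for v in adj[u]:
                        nxt[v] |= frontier[u]     # push all frontier bits to v at once
            frontier = [0] * n
            active = False
            for v in range(n):
                new = nxt[v] & ~reach[v]          # sources that reach v for the first time
                if new:
                    active = True
                    reach[v] |= new
                    frontier[v] = new
                    i, bits = 0, new
                    while bits:
                        if bits & 1:
                            farness[i] += level   # v is at distance `level` from sources[i]
                        bits >>= 1
                        i += 1
            if not active:
                break
        # Closeness taken here as 1 / (sum of distances to reachable vertices); this
        # convention is an assumption, other definitions (e.g. harmonic) exist.
        return {s: (1.0 / farness[i] if farness[i] else 0.0) for i, s in enumerate(sources)}

With 64-bit words this sweeps 64 sources per traversal; the CPU, Phi, and GPU versions mentioned above additionally spread the OR operations over wide SIMD registers and many concurrent threads.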

[1] Regularizing graph centrality computations, A. E. Sarıyüce, E. Saule, K. Kaya, Ü. V. Çatalyürek, Journal of Parallel and Distributed Computing 76, 106-119, 2015.
[2] Hardware/software vectorization for closeness centrality on multi-/many-core architectures, A. E. Sarıyüce, E. Saule, K. Kaya, Ü. V. Çatalyürek, IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014.
[3] Incremental algorithms for closeness centrality, A. E. Sarıyüce, K. Kaya, E. Saule, Ü. V. Çatalyürek, IEEE International Conference on Big Data, 487-492, 2013.
[4] Betweenness centrality on GPUs and heterogeneous architectures, A. E. Sarıyüce, K. Kaya, E. Saule, Ü. V. Çatalyürek, Proceedings of the 6th Workshop on General Purpose Processing Using Graphics Processing Units (GPGPU), 2013.
[5] Shattering and compressing networks for betweenness centrality, A. E. Sarıyüce, E. Saule, K. Kaya, Ü. V. Çatalyürek, Proceedings of the SIAM International Conference on Data Mining, 686-694, 2013.
[6] Fast recommendation on bibliographic networks with sparse-matrix ordering and partitioning, O. Küçüktunç, K. Kaya, E. Saule, Ü. V. Çatalyürek, Social Network Analysis and Mining 3 (4), 1097-1111, 2013.
[7] STREAMER: a distributed framework for incremental closeness centrality computation, A. E. Sarıyüce, E. Saule, K. Kaya, Ü. V. Çatalyürek, IEEE International Conference on Cluster Computing (CLUSTER), 1-8, 2013.

•  Relational data at hand is growing at an increasing rate.
•  Graphs are powerful combinatorial structures to model, process, and query relational information.
•  Graph DBs and graph-based query processing tools that can process billions of relationships are gaining more attention.
•  Using HPC hardware such as vectorization-capable multicore processors, Graphics Processing Units (GPUs), or other accelerators such as the Xeon Phi and Field Programmable Gate Arrays (FPGAs) for parallel graph query processing is promising.
•  Depending on the query, solutions exploiting these devices and the inner parallelism of the problem can be more than 100x faster than a Hadoop/Spark based approach.
•  Implementing an accelerator-based solution is not an easy task and requires thorough expertise since, although these devices excel at dense and regular data, the relational information at hand is usually sparse.
•  We deal with sparse data by using smart data structures, performing graph manipulations, and restructuring the computation for each architecture we target.
•  Here, we will use network centrality computations as a case study.

The best network representation changes with respect to the hardware at hand. For example, on graphics processing units (GPUs), the best format is neither CRS nor COO. Due to the characteristics of GPUs, better parallelization with better load balancing requires special treatment for the nodes with a large number of neighbors. In vertex-based parallelization, each vertex is assigned to a single thread, but this makes load balancing within a GPU warp (a block of 32 threads) much harder since node degrees in social networks follow a long-tail distribution. In edge-based parallelization, each edge (friendship, relation, etc.) is assigned to a single thread; the load is perfectly balanced, but the parallelization overhead is significant. To solve these problems, we propose a hybrid representation based on virtual vertices obtained by decomposing large neighborhoods. We then extend this representation with a strided memory access pattern to obtain better memory locality while computing betweenness centrality. As the figures above show, straightforward vertex- and/or edge-based representations are not sufficient to utilize the full performance of GPUs for betweenness centrality. The proposed hybrid representations are much faster and yield up to 2x performance improvement when compared to the basic ones. Unlike multicore CPUs, GPUs have thousands of cores on multiple streaming processors; hence the centrality algorithms at hand need to be restructured.

Regularization of memory access patterns: Although many of the existing algorithms leverage parallel processing, one of the most common forms of parallelism available in almost all of today's processors, namely instruction-level parallelism via vectorization, is often overlooked due to the nature of sparse graph kernels. Vectorization is not easy to achieve in graph algorithms; to exploit its full potential and enable it for the simultaneous graph traversal approach, we provide an ad-hoc closeness centrality (CC) formulation based on bitwise operations and propose hardware and software vectorization for that formulation on cutting-edge hardware. Our approach for closeness centrality serves as an example of how vectorization can be utilized for graph kernels that require multiple BFS traversals.
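The virtual-vertex decomposition described above can be sketched as follows: every vertex whose degree exceeds a threshold mdeg is split into several virtual vertices, each covering at most mdeg consecutive entries of its adjacency list, so that assigning one GPU thread per virtual vertex gives bounded, balanced work. The Python sketch below shows only the construction; the array names are ours, and the stride variant and the GPU kernels are not shown.

    def build_virtual_csr(xadj, adj, mdeg=4):
        # xadj, adj: standard CSR arrays (row pointers and concatenated neighbor lists).
        # Returns three arrays of equal length: vmap[i] is the original vertex that
        # virtual vertex i belongs to, and adj[vbeg[i]:vend[i]] is its slice of the
        # shared adjacency array (at most mdeg neighbors per virtual vertex).
        vmap, vbeg, vend = [], [], []
        n = len(xadj) - 1
        for u in range(n):
            start, end = xadj[u], xadj[u + 1]
            while start < end:
                stop = min(start + mdeg, end)
                vmap.append(u)
                vbeg.append(start)
                vend.append(stop)
                start = stop
        return vmap, vbeg, vend

Updates made on behalf of a virtual vertex (for example to the path counts of its original vertex) still have to be atomic, since several virtual copies of the same vertex may be processed by different threads.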

___BETTER NETWORK REPRESENTATIONS___

_________REFERENCES_________

Kamer Kaya, Faculty of Engineering and Natural Sciences
High Performance, Parallel Graph Processing

____BETTER HPC ALGORITHMS____

Big Data Industry Workshop, Data Analytics Application and Research Center (VERİM), 30 October 2017

_____NETWORK CENTRALITY_____

Network centrality is a concept used to find the most central vertices/nodes in a network. Such nodes are called super-spreaders, whether for advertisements, gossip, or even diseases. Centrality metrics were originally developed to identify the sociologically most influential nodes in human relationships. The sequential algorithms in the literature can take hours to compute the exact centrality scores of a large-scale network. Furthermore, using HPC hardware for graph processing is not a straightforward task: graph-based queries are hard to parallelize, especially on modern HPC hardware, since the graphs are sparse and the memory access pattern throughout the execution is neither regular nor predictable.

______BETTER REORDERINGS______

INTRODUCTION

Algorithm 1: Sequential BC
Data: G = (V, E)
bc[v] ← 0, ∀v ∈ V
1 for each s ∈ V do
      S ← empty stack, Q ← empty queue
      P[v] ← empty list, σ[v] ← 0, d[v] ← −1, ∀v ∈ V
      Q.push(s), σ[s] ← 1, d[s] ← 0
      ▷ Forward phase: BFS from s
      while Q is not empty do
          v ← Q.pop(), S.push(v)
          for all w ∈ Γ(v) do
              if d[w] < 0 then
                  Q.push(w)
                  d[w] ← d[v] + 1
              if d[w] = d[v] + 1 then
2                 σ[w] ← σ[w] + σ[v]
                  P[w].push(v)
      ▷ Backward phase: back propagation
3     δ[v] ← 1/σ[v], ∀v ∈ V
      while S is not empty do
          w ← S.pop()
          for v ∈ P[w] do
4             δ[v] ← δ[v] + δ[w]
      ▷ Update bc values by using Equation (5)
      for v ∈ V do
          if v ≠ s then
5             bc[v] ← bc[v] + (δ[v] × σ[v] − 1)
return bc
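For reference, a plain sequential Python transcription of Algorithm 1 follows. It is only meant to make the pseudocode concrete and is not the paper's parallel CPU/GPU code; the guard for vertices unreachable from s is an addition, since the pseudocode implicitly assumes every vertex is reached.

    from collections import deque

    def betweenness(adj):
        # adj: list of neighbor lists of an unweighted graph with vertices 0..n-1.
        n = len(adj)
        bc = [0.0] * n
        for s in range(n):
            # Forward phase: BFS from s, counting shortest paths (sigma) and
            # recording predecessor lists (P); S keeps the visit order.
            sigma = [0] * n
            d = [-1] * n
            P = [[] for _ in range(n)]
            sigma[s], d[s] = 1, 0
            Q, S = deque([s]), []
            while Q:
                v = Q.popleft()
                S.append(v)
                for w in adj[v]:
                    if d[w] < 0:
                        d[w] = d[v] + 1
                        Q.append(w)
                    if d[w] == d[v] + 1:
                        sigma[w] += sigma[v]
                        P[w].append(v)
            # Backward phase: back-propagate the dependencies in reverse visit order.
            delta = [1.0 / sigma[v] if sigma[v] else 0.0 for v in range(n)]
            while S:
                w = S.pop()
                for v in P[w]:
                    delta[v] += delta[w]
            for v in range(n):
                if v != s and sigma[v]:
                    bc[v] += delta[v] * sigma[v] - 1.0
        return bc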

3. BETWEENNESS CENTRALITY ON GPU

As mentioned above, there are two existing studies on computing betweenness centrality by using GPUs. In the first one, Shi and Zhang developed a software package, gpu-fan (http://bioinfo.vanderbilt.edu/gpu-fan/), to do biological network analysis [20]. Later, Jia et al. compared vertex- and edge-based techniques on GPUs for BC computations [7]. Both of these works employ fine-grain parallelism: all threads work concurrently while executing a single, level-synchronized BFS. That is, all the frontier vertices at the current level ℓ must be processed before the vertices at level ℓ+1, and for each level the algorithms initiate a GPU kernel to visit the vertices/edges on that level. The difference between the vertex- and edge-based parallelism arises from the implementation of a forward/backward step and the corresponding graph storage scheme. To ease the memory accesses, the former uses the compressed sparse row (CSR) format, and the latter uses the coordinate (COO) format for graph storage. Figures 1(b) and 1(c) show these storage schemes for a toy graph with 10 vertices and 17 edges as given in Figure 1(a).

3.1 Vertex-based parallelism

In vertex-based parallelism, all the edges of a single vertex are processed by a single thread. The pseudocode of the forward and backward phases of the vertex-based approach is given in Algorithm 2. Let u be a frontier vertex at level ℓ and v be one of its neighbors. There can be three cases for v: if d[v] = −1 then v has not been visited before and will be a frontier vertex at level ℓ+1. In this case, the kernel understands that the next frontier will not be empty and sets cont to true. It also increases σ[v] by σ[u], since all shortest paths from the source vertex to u will be a prefix of at least one shortest path from s to v (line 4 of Algorithm 2). This operation must be atomic since there can be other threads concurrently trying to update σ[v]. Hence, in vertex-based parallelism a single atomic operation per successor-predecessor edge is necessary. If v has been visited before, it can be either at level ℓ or ℓ−1. In the latter case, the kernel sets u as one of the predecessors of v, i.e., P_v[u] ← 1. To store the predecessor information P, Shi and Zhang used an n × n-bit array. Considering the size of real-world networks and graphs and the amount of memory available on modern GPUs, this is not practical even for mid-size networks. The n² storage is actually overkill, since a successor-predecessor relationship can be established only by an edge and there are only …

Figure 1: A toy graph G with 10 vertices and 17 edges (a), its CSR representation (b), its COO representation (c), its virtual-CSR representation (d), and its stride-CSR representation (e). In the figures, n is the number of vertices, n′ is the number of virtual vertices, and m is the number of edges. For the virtualization in (d) and (e), mdeg = 4 is used. The memory usage of each representation is given in terms of the number of entries it has.
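As a concrete (if tiny) illustration of the CSR and COO layouts discussed above, the arrays below store a 4-vertex undirected graph with edges 0-1, 0-2, 1-2, and 2-3; this is not the paper's 10-vertex toy graph, whose drawing is not reproduced in this transcript.

    # CSR (vertex based): neighbors of v are adj[xadj[v] : xadj[v + 1]].
    xadj = [0, 2, 4, 7, 8]
    adj  = [1, 2, 0, 2, 0, 1, 3, 2]
    # COO (edge based): one (source, target) pair per stored directed edge.
    coo = [(0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1), (2, 3), (3, 2)]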


[Figure 4: Comparison of GPU implementations (speedup with respect to the CPU with 1 thread for the GPU vertex, GPU edge, GPU virtual, and GPU stride variants).]

[Figure 5: Absolute performance for sequential CPU and GPU stride, expressed in Traversed Edges Per Second (TEPS), on amazon0601, com-orkut, loc-gowalla, soc-LiveJournal, soc-sign-epinions, web-Google, web-NotreDame, and wiki-Talk.]

… a warp, the number of completely empty blocks, and minimize the overhead due to thread divergence.

Figure 6(a) shows the impact for the sequential CPU case. On the CPU, graph reduction has almost no impact on com-orkut, but it brings a 7-fold improvement on wiki-Talk; on average, graph reduction brings a 2-fold improvement. Graph ordering brings barely any improvement on two graphs, but it brings a 53% improvement on web-Google. When graph reduction and ordering are combined, the graph modifications bring up to a 7.6-fold improvement, with an average of 2.21.

Figure 6(b) shows the impact of the graph manipulations for the GPU Stride implementation. The behavior is similar to the one observed on the CPU. The most notable difference is that ordering harms performance on soc-sign-epinions and wiki-Talk, though it brings a 35% improvement on web-Google.

Note that there are two sources of improvement stemming from graph reduction. First, since there are fewer vertices, there are fewer sources to execute. Second, the graph is smaller, which makes each source faster.
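As the conclusion below notes, the graph reduction used here is the removal of degree-1 vertices. The sketch below shows only the structural peeling step; the actual BC-preserving reduction also carries per-vertex weights for the removed parts so that their contributions to the centrality scores are not lost, which is omitted in this illustration.

    from collections import deque

    def peel_degree_one(adj):
        # adj: dict mapping each vertex to the set of its neighbors.
        # Repeatedly removes vertices of degree 1; removing one vertex may expose
        # its neighbor as a new degree-1 vertex, hence the worklist.
        adj = {u: set(vs) for u, vs in adj.items()}
        work = deque(u for u, vs in adj.items() if len(vs) == 1)
        removed = []
        while work:
            u = work.popleft()
            if u not in adj or len(adj[u]) != 1:
                continue                  # degree changed since u was queued
            (v,) = adj[u]
            adj[v].discard(u)
            del adj[u]
            removed.append(u)
            if len(adj[v]) == 1:
                work.append(v)
        return adj, removed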

5.3 Heterogeneous execution

In the last set of experiments, we evaluated the performance of using the CPU alone, the GPU alone, and the CPU and GPU together for BC computation. Figure 7 shows the performance obtained by using only the CPUs (8 CPU threads), using only the GPU (the two proposed methods, Virtual and Stride), and using both the CPUs and the GPU at the same time (labeled as "Heterogeneous"). Notice that in the latter (heterogeneous) case, we utilize only 7 threads on the CPU to dedicate one core to driving the GPU.

The source-based parallelism used on the CPU shows an average parallel speedup of 6, which indicates that the parallel CPU implementation, even though not linear, is fairly efficient. (Figure 7 shows an average speedup of 13, but a factor of 2.2 comes from the graph modifications and not from parallelism.)

The GPU Stride implementation reaches higher performance than the parallel CPU implementation on 5 graphs (amazon0601, com-orkut, soc-LiveJournal, web-Google, and wiki-Talk), while the CPU implementation obtains higher performance on 3 graphs (web-Google, soc-sign-epinions, and loc-gowalla). If one computes the geometric mean, on average the parallel CPU implementation and the GPU implementation reach the same performance (less than 1% difference in the average). This indicates that the correct choice between CPU and GPU for betweenness centrality is strongly input dependent, which makes a heterogeneous collaboration between CPU and GPU important.

Using both the CPU and the GPU allows us to reach the highest performance on all the graphs of our dataset. It improves the best single-device performance by a factor of 1.29 on web-NotreDame, where the performance of the CPU and the GPU are the most different, and by a factor of 1.95 on amazon0601, where the performance of the CPU and the GPU are the most similar.

6. CONCLUSION AND FUTURE WORK

In this work, we investigated a set of techniques to speed up the betweenness centrality computation on GPUs and CPU/GPU heterogeneous architectures. Our techniques include leveraging the topological properties of the graph, i.e., compressing it by removing degree-1 vertices, as well as utilizing the architectures efficiently. We provided four different GPU algorithms and compared them experimentally. Combining all the techniques yields a speedup of 104 on a large social network. Our techniques in the GPU algorithms can be applied to shortest-path algorithms, and the compression techniques we provided can be used to speed up graph algorithms with objectives similar to betweenness centrality.

The efficiency of our GPU implementation depends on the diameters of the graphs. In the worst case, the diameter can be n and the total work will be quadratic in the number of vertices. Social networks, in general, obey the small-world phenomenon and their diameters are small. As future work, we plan to investigate faster betweenness centrality computation techniques for graphs with large diameters using existing [11, 14] and novel techniques. We also plan to incorporate further graph compression techniques [18] to be used in heterogeneous architectures. Apart from that, we are planning to make a more detailed analysis of the proposed GPU algorithms on social networks with different characteristics, such as diameter, density, and degree distribution.


1.  Closeness centrality identifies the most influential nodes based on their shortest distances to the other nodes in the network.
2.  Betweenness centrality identifies the most influential nodes based on the probability of them being on the shortest paths between all node pairs.
3.  Eigenvector centrality assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more than equal connections to low-scoring nodes (formal definitions of the three metrics are sketched right after this list).
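For concreteness, the standard unnormalized formulations of the three metrics are given below; normalization conventions and the treatment of disconnected pairs vary in the literature, so these are one common choice rather than the poster's exact definitions.

\[
\mathrm{cc}(v) = \frac{1}{\sum_{u \neq v} d(v,u)}, \qquad
\mathrm{bc}(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}, \qquad
\lambda \, x_v = \sum_{u \in \Gamma(v)} x_u .
\]

Here d(v,u) is the shortest-path distance between v and u, σ_st is the number of shortest paths between s and t, σ_st(v) is the number of those passing through v, Γ(v) is the neighborhood of v, and the eigenvector centrality scores x are the entries of the principal eigenvector of the adjacency matrix.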

The size of the memory footprint of the data structures one uses in practice should be of the same order of magnitude as the number of edges. The Compressed Row Storage (CRS, vertex based) and Coordinate (COO, edge based) formats are two widely used data structures in practice. One way to regularize the memory access patterns is graph reordering: this process permutes the vertices and renumbers them to obtain a more cache/memory-friendly pattern. All the matrices above are created from the same citation network (nodes correspond to papers, edges correspond to citations); the edges are shown in red. By putting the edges closer to each other, we make the computation access nearby memory locations within a short time window. We can also partition/cluster the nodes in addition to reordering to obtain much better memory locality. The following figure and table show the structure of the graph with four partitions and the number of edges/non-zeros inside and outside the block diagonal of the matrix, respectively. The figure above shows the execution time to compute the eigenvector centrality of the nodes in the citation network with the CRS and COO formats and various network ordering schemes. As the figure shows, more than 2x speedup is possible just by partitioning and permuting the vertices.
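A reordering is just a permutation of the vertex ids applied to the stored graph. The sketch below relabels a CSR graph with a given permutation, which could come from a BFS/RCM-style ordering or from a partitioner; both are only examples, as the poster does not commit to a specific tool, and the array and function names are ours.

    def permute_csr(xadj, adj, perm):
        # perm[old] = new vertex id.  Rebuilds the CSR arrays so that rows appear
        # in the new order and every neighbor is relabeled with its new id.
        n = len(xadj) - 1
        inv = [0] * n
        for old, new in enumerate(perm):
            inv[new] = old                         # inverse permutation: new -> old
        new_xadj, new_adj = [0], []
        for new in range(n):
            old = inv[new]
            # copy the old row, relabeling every neighbor with its new id
            new_adj.extend(sorted(perm[v] for v in adj[xadj[old]:xadj[old + 1]]))
            new_xadj.append(len(new_adj))
        return new_xadj, new_adj

After the permutation, vertices that are numbered close to each other tend to be accessed together, which is exactly the locality effect measured in the figure above.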



Table 3: Performance of the closeness centrality algorithms (in MTEPS).

Graph         CPU-DO   CPU-SpMM   PHI-DO   PHI-SpMM   GPU-VirCC   GPU-SpMM
Amazon         1,985     15,146    1,535     34,743        542      40,602
Gowalla        4,340     12,588    2,077     29,409        594      34,759
Google         1,736     10,391    1,632     23,953        516      43,206
NotreDame      2,925      8,956    1,828     16,858        418      22,462
WikiTalk       2,122     11,611    1,940     17,876        462      20,881
Orkut          3,073     28,393    2,548     68,290        801      85,335
LiveJournal    1,879     23,283      326     56,589        609      55,862

Fig. 14. Impact of the number of threads per vertex on the performance of GPU-SpMM.

… simultaneous BFSs, respectively, are used. On the GPU, the maximum possible number of simultaneous BFSs is used for each graph, as described above. For the non-vectorized variants, the direction-optimized CC variant performs the best on the CPU and the Xeon Phi, while the GPU-VirCC algorithm with simultaneous BFSs performs best on the GPU. On average, the vectorized algorithm is 5.9 times faster than the best existing non-vectorized one on the CPU, 21.0 times faster on the Intel Xeon Phi, and 70.4 times faster on the NVIDIA Tesla K20c.
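As a quick sanity check of these averages against Table 3, the per-graph vectorized/non-vectorized ratios can be read off directly; for Amazon, for instance:

\[
\frac{15{,}146}{1{,}985} \approx 7.6 \ \text{(CPU)}, \qquad
\frac{34{,}743}{1{,}535} \approx 22.6 \ \text{(Xeon Phi)}, \qquad
\frac{40{,}602}{542} \approx 74.9 \ \text{(GPU)},
\]

which is in line with the reported averages of 5.9, 21.0, and 70.4 over all graphs (some graphs are above the average, some below).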

5. Conclusion and future work

In this work, we proposed new algorithms and parallelization techniques to make betweenness and closeness centrality computations faster on commonly available cutting-edge hardware. There are two traditional ways to execute centrality computations in parallel: either each thread traverses the graph from its own source, or all the threads collaboratively traverse the graph from a single source. We deviated from the traditional approaches by using all the threads in the system to collaboratively traverse the graph from many sources simultaneously. This scheme makes the computations more regular and allows a better utilization of modern computing devices. The experimental evaluation of the proposed algorithms shows that significant improvements can be obtained over the best known algorithms for centrality computation on the same device, without using additional hardware: an improvement by a factor of 5.9 on CPU architectures, 70.4 on GPU architectures, and 21.0 on the Intel Xeon Phi.

Fig. 15. Comparison of GPU-based closeness centrality algorithms.

Fig. 16. Vectorization works: CPU-SpMM is the compiler-vectorized implementation executed on the CPU (32 threads) with B = 4096. PHI-SpMM is the corresponding Xeon Phi variant with B = 8192. For the GPU-based implementation, the maximum possible B value is used for each graph, and a vertex is assigned to a warp (32 threads).

VERİM, Data Analytics Application and Research Center