

    Centralities in Large Networks: Algorithms and Observations

U Kang (SCS, CMU)

Spiros Papadimitriou (Google; work done while at IBM)

Jimeng Sun (IBM T.J. Watson)

Hanghang Tong (IBM T.J. Watson)

    Abstract

Node centrality measures are important in a large number of graph applications, from search and ranking to social and biological network analysis. In this paper we study node centrality for very large graphs, up to billions of nodes and edges. Various definitions for centrality have been proposed, ranging from very simple (e.g., node degree) to more elaborate. However, measuring centrality in billion-scale graphs poses several challenges. Many of the traditional definitions such as closeness and betweenness were not designed with scalability in mind. Therefore, it is very difficult, if not impossible, to compute them both accurately and efficiently. In this paper, we propose centrality measures suitable for very large graphs, as well as scalable methods to effectively compute them. More specifically, we propose effective closeness and LineRank, which are designed for billion-scale graphs. We also develop algorithms to compute the proposed centrality measures in MapReduce, a modern paradigm for large-scale, distributed data processing. We present extensive experimental results on both synthetic and real datasets, which demonstrate the scalability of our approach to very large graphs, as well as interesting findings and anomalies.

    1 Introduction

Centrality is widely used for measuring the relative importance of nodes within a graph [5, 12]. For example, who are the most well-connected people in a social network? Who is critical for facilitating the transmission of information in a terrorist network [21]? Which proteins are the most important for the lethality of a cell in protein interaction networks [16]? In general, the concept of centrality has played an important role in the understanding of various kinds of networks by researchers from computer science, network science, sociology, and the recently emerging field of computational social science [23].

Traditionally, centrality has typically been studied for graphs of relatively small size. However, in the past few years, the proliferation of digital data collection has led to very large graphs, such as the web, online social networks, user preferences, online communications, and so on. Many of these networks reach billions of nodes and edges, requiring terabytes of storage.

1.1 Challenges for Large Scale Centralities Measuring centrality in very large graphs poses two key challenges.

First, some definitions of centrality have inherently high computational complexity. For example, shortest-path or random-walk betweenness [6, 25] have complexity at least $O(n^3)$, where $n$ is the number of nodes in a graph. Furthermore, some of the faster estimation algorithms require operations that are not amenable to parallelization, such as all-sources breadth-first search. Finally, it may not be straightforward or even possible to develop accurate approximation schemes. In summary, centrality measures should ideally be designed with scalability in mind from the outset. Traditionally, this has not always been the case [12]. However, with the recent availability of very large networks, there is a clear need for scalable measures.

Second, even if a centrality measure is designed in a way that avoids expensive or non-parallelizable operations, developing algorithms that are efficient, scalable, and accurate is not straightforward. Clever approximation or parallelization schemes may need to be employed in order to achieve all of these goals.

1.2 Problem Definitions In this paper we tackle the problem of efficiently and effectively measuring centrality for billion-scale networks. More specifically, we address the following problems:

1. Careful Design. How can we carefully design centrality measures that avoid inherent limitations to scalability and parallelization, yet are sufficiently informative?

2. Algorithms. How can we compute the large-scale centralities quickly for billion-scale graphs? How can we leverage modern, large-scale, distributed data processing infrastructures?

3. Observations. What are the key patterns and observations on centralities in large, real-world networks?

In particular, we study popular definitions for three types of centrality: degree (baseline), closeness (diameter-based), and betweenness (flow-based), which cover the spectrum from simpler to more elaborate. Except for degree, the other two centrality measures, closeness and betweenness, are prohibitively expensive to compute, and thus impractical for large networks. On the other hand, although simple to compute, degree centrality gives limited information since it is based on a highly local view of the graph around each node. We need a set of centrality measures that enrich our understanding of very large networks and are tractable to compute.

1.3 Our Contributions To address the difficulty of computing centralities on large-scale graphs, we propose two new measures, one diameter-based and one flow-based.

The first measure we propose is the effective closeness of a node, which approximates the average shortest path length starting from the node in question. We employ sketches and MapReduce [9] in order to make its computation tractable.

The second measure we propose is LineRank, which intuitively measures the flow through a node. Our notion of flow is derived by finding the stationary probabilities of a random walk on the line graph of the original graph and then aggregating the flows at each node (see Section 4 for the formal definition of the line graph). In addition, we extend the line graph to weighted graphs as well as directed graphs. The advantage of this approach is that the stationary probabilities can be efficiently computed in a distributed setting through MapReduce. However, materialization of the line graph is prohibitively expensive in terms of space. We show that we can decompose the computation in a way that uses a much sparser matrix than the original line graph, thus making the entire computation feasible.

Finally, we analyze the centralities in large, real-world networks with these new centrality algorithms to find important patterns. In summary, the main contributions of this paper are the following:

1. Careful Design. We propose two new large-scale centralities: effective closeness, a diameter-based centrality, and LineRank, a flow-based centrality. Both are carefully designed with billion-scale networks in mind from the beginning.

2. Algorithms. We develop efficient parallel algorithms to compute effective closeness and LineRank on MapReduce by using approximation and efficient line graph decomposition of billion-scale graphs. We perform experiments to show that both of our proposed centralities have linear scale-up in the number of edges and machines.

3. Observations. Using our large-scale centralities, we analyze real-world graphs including YahooWeb, Enron, and DBLP. We report important patterns, including high effective closeness for high degree nodes, the distinguishing ability of effective closeness for low degree nodes, and the ability of LineRank to discriminate among relatively high degree nodes.

Symbol     Definition
n          number of nodes in a graph
m          number of edges in a graph
A          adjacency matrix of a graph
indeg(v)   in-degree of node v
outdeg(v)  out-degree of node v
N(r, v)    number of neighbors of node v within r steps
G          original graph
L(G)       line graph of the original graph G
S(G)       source incidence matrix of the graph G
T(G)       target incidence matrix of the graph G

Table 1: Table of symbols

The rest of the paper is organized as follows: Section 2 presents related work on node centrality. In Section 3 we define our proposed large-scale centralities. In Section 4 we develop scalable algorithms to efficiently and effectively evaluate those measures. Section 5 presents scalability and accuracy results on synthetic and real datasets. After showing patterns and observations of centralities in large, real-world graphs in Section 6, we conclude the paper in Section 7. Table 1 lists the symbols used in this paper.

    2 Related Work

Related work forms two groups: centrality measures on graphs and parallel graph mining using Hadoop.

2.1 Centrality Measures on Graphs Centrality has attracted a lot of attention as a tool for studying various kinds of networks, including social, information, and biological networks [12, 5, 16]. The centrality of a node in a network is interpreted as the importance of the node. Many centrality measures have been proposed, based on how the importance is defined.

In this section, we discuss various centrality measures around the three main centrality groups [12, 5], which represent distinct types of walks.

Degree related measures. The first group of centrality measures is the degree related measures. The degree centrality, the simplest yet most popular centrality measure, belongs to this group. The degree centrality $c^{DEG}_i$ of node $i$ is defined to be the degree of the node.

A way of interpreting the degree centrality is that it counts the number of paths of length 1 that emanate from a node. A generalization of the degree centrality is the K-path centrality, which is the number of paths of length less than or equal to $k$ that emanate from a node. Several variations of the K-path centrality exist based on the type of path: geodesic, edge-disjoint, and vertex-disjoint K-paths are among them [5].

Another line of centralities is based on walks on the graph. The Katz centrality [20] counts the number of walks starting from a node, giving penalties to longer walks. In mathematical form, the Katz centrality $c^{KATZ}_i$ of node $i$ is defined by $c^{KATZ}_i = e_i^T \left( \sum_{j=1}^{\infty} (\beta A)^j \right) \mathbf{1}$, where $e_i$ is a column vector whose $i$th element is 1 and all other elements are 0. The $\beta$ is a positive penalty constant that controls the weight on walks of different lengths. A slight variation of the Katz measure is the Bonacich centrality [4], which allows a negative $\beta$. The Bonacich centrality $c^{BON}_i$ of node $i$ is defined to be $c^{BON}_i = e_i^T \left( \beta^{-1} \sum_{j=1}^{\infty} (\beta A)^j \right) \mathbf{1}$, where the negative weight allows subtracting the even-numbered walks from the odd-numbered walks, which has an interpretation in exchange networks [5]. The Katz and Bonacich centralities are special cases of the Hubbell centrality [15]. The Hubbell centrality $c^{HUB}_i$ of node $i$ is defined to be $c^{HUB}_i = e_i^T \left( \sum_{j=0}^{\infty} X^j \right) y$, where $X$ is a matrix and $y$ is a vector. It can be shown that $X = \beta A$ and $y = \beta A \mathbf{1}$ lead to the Katz centrality, and $X = \beta A$ and $y = A \mathbf{1}$ lead to the Bonacich centrality.

Except for degree, most of these variations require some parameter, which may not be easy to determine in real networks. Also, computationally, degree is the only one that can be efficiently measured for large networks; it will serve as the baseline measure in this paper.

Diameter related measures. The second group of centrality measures is the diameter related measures, which count the lengths of walks. The most popular centrality measure in this group is Freeman's closeness centrality [12]. It measures centrality by computing the average of the shortest distances to all other nodes. Let $S$ be the matrix whose $(i, j)$th element contains the length of the shortest path from node $i$ to node $j$. Then the closeness centrality $c^{CLO}_i$ of node $i$ is defined to be $c^{CLO}_i = e_i^T S \mathbf{1}$.

As we will see in Section 6, diameter-based measures are effective in differentiating low degree nodes. However, the existing diameter-based measure does not scale up, and therefore an efficient computational method needs to be developed, which is one of the foci of this paper.

Flow related measures. The last group of centrality measures is the flow related measures. They are called flow related since the information flowing through edges is considered. The most well-known centrality in this group is Freeman's betweenness centrality [12]. It measures how much a given node lies on the shortest paths of other nodes. Let $b_{jk}$ be the number of shortest paths from node $j$ to $k$, and $b_{jik}$ be the number of shortest paths from node $j$ to $k$ that pass through node $i$. The betweenness centrality $c^{BET}_i$ of node $i$ is defined to be $c^{BET}_i = \sum_{j,k} \frac{b_{jik}}{b_{jk}}$. The naive algorithm for computing the betweenness involves all-pairs shortest paths, which requires $O(n^3)$ time and $O(n^2)$ storage. Brandes [6] gave a faster algorithm that runs $n$ single-source shortest-path computations, requiring $O(n + m)$ space and running in $O(nm)$ and $O(nm + n^2 \log n)$ time on unweighted and weighted networks, respectively, where $n$ is the number of nodes and $m$ is the number of edges in a graph.

Newman [25] proposed an alternative betweenness centrality based on random walks on the graph. The main idea is that instead of considering shortest paths, it considers all possible walks and computes the betweenness from them. Specifically, let $R$ be the matrix whose $(j, k)$th element $R_{jk}$ contains the probability that a random walk, starting from $j$ with absorbing node $k$, passes through node $i$. Then Newman's betweenness centrality $c^{NBE}_i$ of node $i$ is defined to be $c^{NBE}_i = \sum_{j \neq i \neq k} R_{jk}$. Computing Newman's betweenness centrality requires $O(mn^2)$ time, which is prohibitively expensive.

None of the existing flow related measures are scalable to large networks. In this paper, we propose LineRank, a new flow-based measure scalable to large networks.

2.2 Parallel Graph Mining using Hadoop MapReduce is a distributed programming framework [10] for processing web-scale data. MapReduce has two benefits: (a) the data distribution, replication, fault-tolerance, and load balancing are handled automatically; and furthermore, (b) it uses the familiar concept of functional programming. The programmer needs to define only two functions, a map and a reduce. The general framework is as follows [22]: (a) the map stage reads the input file and emits (key, value) pairs; (b) the shuffling stage sorts the output and distributes it to reducers; (c) the reduce stage processes the values with the same key and emits other (key, value) pairs, which become the final result.

Hadoop [1] is the open source equivalent of MapReduce. Hadoop uses its own distributed file system, HDFS, and provides a high-level language called PIG [26]. Due to its excellent scalability, ease of use, and cost advantage, Hadoop has been used for important graph mining algorithms (see [27, 19, 18, 17]). Other variants which provide advanced MapReduce-like systems include SCOPE [7], Sphere [14], and Sawzall [28].
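To make the three stages concrete, here is a minimal single-process simulation of the map, shuffle, and reduce stages that counts node out-degrees from an edge list. The function names are our own illustration, not Hadoop's API.

from collections import defaultdict

# Map stage: for each edge, emit (source node, 1) so degrees can be counted.
def map_fn(edge):
    src, dst = edge
    yield (src, 1)

# Reduce stage: all values sharing a key arrive together; sum them.
def reduce_fn(key, values):
    yield (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle stage: group the mappers' (key, value) pairs by key.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    result = []
    for key in sorted(groups):
        result.extend(reduce_fn(key, groups[key]))
    return result

edges = [(1, 2), (1, 3), (2, 3), (4, 1)]
print(run_mapreduce(edges, map_fn, reduce_fn))  # [(1, 2), (2, 1), (4, 1)]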

    3 Large-scale Centrality Measures

In this section, we propose centrality measures which are designed for large-scale, distributed computation. We first review well-known centrality measures and analyze the computations required. While some centralities are easy to compute, others suffer from inherent limitations in achieving scalability, as explained in Section 2. We propose alternative centrality measures that follow similar motivation and intuition as existing measures, but are much more suitable for distributed computation on very large graphs.

Following the classification of centralities in Section 2, we focus on the three most common and representative types of centrality measures: degree (local), diameter-based (closeness), and flow-based (betweenness).

3.1 Degree Degree centrality has a very simple and intuitive definition: it is the number of neighbors of a node. Despite, or perhaps because of, its simplicity, it is very popular and used extensively. Not surprisingly, it is also the easiest to compute. The degree centrality vector $C^{DEG}$ of a graph with an adjacency matrix $A$ can be represented in matrix-vector multiplication form by

(3.1) $C^{DEG} = A \mathbf{1}$.

Thus, the degree centrality of a large network can be exactly computed by a large-scale matrix-vector multiplication. The major limitation of degree based centrality is that it only captures the local information of a node. In many applications, we need more informative measures that can further distinguish among nodes that have almost equally low degrees, or almost equally high degrees (see Section 6).
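As an illustration of Equation (3.1), the sketch below computes the degree vector with a single sparse matrix-vector multiplication. Here scipy stands in for the MapReduce matrix-vector job the paper uses, and the toy graph is our own.

import numpy as np
from scipy.sparse import csr_matrix

# Edge list of a small undirected 4-node path; rows/cols index nodes 0..3.
rows = np.array([0, 1, 1, 2, 2, 3])
cols = np.array([1, 0, 2, 1, 3, 2])
A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(4, 4))

# Equation (3.1): C_DEG = A 1 -- one sparse matrix-vector multiplication.
c_deg = A @ np.ones(4)
print(c_deg)  # [1. 2. 2. 1.]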

3.2 Diameter-based Measures Closeness centrality is the most popular diameter-based centrality measure. While degree centrality considers only one-step neighbors, closeness centrality considers all nodes in the graph, and gives high scores to nodes which have short average distances to all other nodes. Closeness of a node is typically defined as the inverse of the average over the shortest distances to all other nodes; to simplify formulas, we omit the inverse. Exact computation requires an all-pairs shortest paths algorithm. Unfortunately, this operation requires $O(n^3)$ time. For the billion-scale graphs we consider in this work, computing closeness centrality is prohibitively expensive. To address this computational issue, we propose to use an accurate approximation instead of exact computation, leading to the following notion of centrality.

DEFINITION 1. (EFFECTIVE CLOSENESS) The effective closeness centrality $C^{ECL}(v)$ of a node $v$ is defined as the approximate average distance from $v$ to all other nodes.

We will next define the notion of approximation more precisely, by describing the approximation scheme we employ. Let $N(r, v)$ be the number of neighbors of node $v$ within $r$ steps, and $N_v(r)$ be the number of nodes whose shortest distance to $v$ is exactly $r$. Notice that $N_v(r) = N(r, v) - N(r-1, v)$. Based on these quantities, standard closeness can be defined by

(3.2) $\mathrm{closeness}(v) = \frac{\sum_{r=1}^{d} r \, N_v(r)}{n} = \frac{\sum_{r=1}^{d} r \, (N(r,v) - N(r-1,v))}{n}$

where $d$ is the diameter of the graph and $n$ is the number of nodes. Let us assume that we can easily get $\hat{N}(r, v)$, an unbiased estimate of $N(r, v)$. Define $\hat{N}_v(r)$ to be $\hat{N}(r, v) - \hat{N}(r-1, v)$. By the linearity of expectation, $\hat{N}_v(r)$ gives an unbiased estimate of $N_v(r)$. Thus, by using this approximation, we can define the effective closeness $C^{ECL}(v)$ by

(3.3) $C^{ECL}(v) = \frac{\sum_{r=1}^{d} r \, \hat{N}_v(r)}{n} = \frac{\sum_{r=1}^{d} r \, (\hat{N}(r,v) - \hat{N}(r-1,v))}{n}$

The remaining question is how to efficiently get an accurate approximation $\hat{N}(r, v)$. For this purpose, we use the Flajolet-Martin [11] algorithm for estimating the number of unique items in a multiset. While many algorithms exist for this estimation (e.g., [3, 8, 13]), we choose the Flajolet-Martin algorithm because it gives an unbiased estimate, as well as a tight $O(\log n)$ space bound [2]. The main result of the Flajolet-Martin algorithm is that we can represent a set with $n$ unique items using a bitstring of size $O(\log n)$, and the bitstring can be used to estimate the number $n$ of unique items in the set. By construction, the bitstring of the union of two sets can be obtained by bitwise-ORing the bitstrings of the two sets. In our case, each node starts with a bitstring encoding a set containing only the node itself. At every step, each node updates its bitstring by bitwise-ORing it with the bitstrings of its neighbors. This process continues until the bitstrings of all nodes converge.
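A toy single-sketch version of the Flajolet-Martin idea is sketched below. The hash choice and the correction constant 0.77351 follow the standard construction, but a single bitstring is noisy; practical implementations (including, presumably, the paper's) combine several sketches.

import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant

def rho(item):
    # Position of the lowest set bit of the item's hash (rho in FM).
    h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
    return (h & -h).bit_length() - 1

def fm_add(bitstring, item):
    # Adding an item sets one bit; duplicates set the same bit again.
    return bitstring | (1 << rho(item))

def fm_union(bs_a, bs_b):
    # The sketch of a union of sets is the bitwise-OR of their sketches,
    # which is exactly the property the per-node neighborhood updates exploit.
    return bs_a | bs_b

def fm_estimate(bitstring):
    # R = position of the lowest zero bit; the estimate is 2^R / PHI.
    r = 0
    while bitstring & (1 << r):
        r += 1
    return (2 ** r) / PHI

sketch = 0
for item in range(1000):
    sketch = fm_add(sketch, item)
print(fm_estimate(sketch))  # rough estimate of 1000; one sketch is noisy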

3.3 Flow-based Measures Betweenness centrality is the most common and representative flow-based measure. In general, the betweenness centrality of a node $v$ is the number of times a walker visits node $v$, averaged over all possible starting and ending nodes. Different types of walks lead to different definitions for betweenness centrality. In Freeman's betweenness [12], the walks always follow the shortest path from the starting to the ending node. In Newman's betweenness [25], the walks are absorbing random walks. Both of these popular definitions require prohibitively expensive computations: the best algorithm for shortest-path betweenness has $O(n^2 \log n)$ complexity, while the best for Newman's betweenness has $O(mn^2)$ complexity.

Since existing measures do not scale well, we propose a new flow-based measure, called LineRank. The main idea is to measure the importance of a node by aggregating the importance scores of its incident edges. This represents the amount of information that flows to the node. Several non-trivial questions need to be addressed for LineRank to be useful. First, how can we define the edge importance? Second, how do we compute it efficiently?

For the first question, we define the edge importance by the probability that a random walker, visiting edges via nodes with random restarts, will stay at the edge. To define this random walk precisely, we induce a new graph, called the directed line graph, from the original graph.

DEFINITION 2. (DIRECTED LINE GRAPH) Given a directed graph $G$, its directed line graph $L(G)$ is a graph such that each node of $L(G)$ represents an edge of $G$, and there is an edge from a node $e_1$ to $e_2$ in $L(G)$ if, for the corresponding edges $(u_1, v_1)$ and $(u_2, v_2)$ in $G$, $v_1 = u_2$.

For example, see a graph and its directed line graph in Figure 1. There is an edge from the node (4, 1) to (1, 2) in $L(G)$ since the edge (4, 1) is followed by the edge (1, 2) in $G$.

(a) Original graph G (b) Directed line graph L(G)

Figure 1: Original graph $G$ and its corresponding directed line graph $L(G)$. The rectangular nodes in (b) correspond to edges in (a). There is an edge from a node $e_1$ to $e_2$ in $L(G)$ if, for the corresponding edges $(u_1, v_1)$ and $(u_2, v_2)$ in $G$, $v_1 = u_2$, i.e., the first edge is followed by the second. For example, there is an edge from the node (4, 1) to (1, 2) in $L(G)$ since the edge (4, 1) is followed by (1, 2) in $G$.
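Definition 2 is mechanical enough to state as code. The sketch below materializes $L(G)$ for a small graph chosen by us in the spirit of Figure 1; for billion-scale graphs, this explicit construction is exactly what Section 4 avoids.

from collections import defaultdict

def directed_line_graph(edges):
    # Build L(G): one node per edge of G, and an edge e1 -> e2 whenever
    # e1 = (u1, v1) is followed by e2 = (u2, v2), i.e. v1 == u2.
    by_source = defaultdict(list)       # node u -> edges whose source is u
    for e in edges:
        by_source[e[0]].append(e)
    line_edges = []
    for e1 in edges:
        for e2 in by_source[e1[1]]:     # edges whose source is e1's target
            line_edges.append((e1, e2))
    return line_edges

# Toy directed graph containing (4, 1) followed by (1, 2), as in the text.
G = [(1, 2), (2, 3), (3, 1), (4, 1)]
for e1, e2 in directed_line_graph(G):
    print(e1, "->", e2)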

Now think of a random walker visiting nodes on the line graph. A walker staying at a node at the current step will move to a neighboring node with high probability $c$, or to a random node with low probability $1 - c$, so that the walk mixes well. We seek the stationary probability of this random walk. Edges in the original graph are associated with the stationary probabilities, by which we define LineRank as follows.

DEFINITION 3. (LINERANK) Given a directed graph $G$, the LineRank score of a node $v \in G$ is computed by aggregating the stationary probabilities of its incident edges on the line graph $L(G)$.

Another important question is how to determine edge weights in the line graph. The random walk on the line graph is performed with transition probabilities proportional to edge weights. For example, in Figure 2, the node $e_1$ in $L(G)$, which corresponds to the edge $(u_1, v_1)$ in $G$, transits to either $e_2 = (v_1, v_2)$ or $e_3 = (v_1, v_3)$ with probability proportional to $w_2$ and $w_3$, respectively.

For an unweighted original graph, the line graph is also unweighted. However, for a weighted original graph, the line graph should have appropriate edge weights. We propose to multiply the weights of the adjacent edges in the original graph to compute the edge weights in the line graph. That is, assume two adjacent edges $e_1 = (u_1, v_1)$ and $e_2 = (v_1, v_2)$ in $G$ have weights $w_1$ and $w_2$, respectively. Then the edge $(e_1, e_2)$ in $L(G)$ has the weight $w_1 w_2$, where $e_1$ and $e_2$ are the nodes in $L(G)$ corresponding to $(u_1, v_1)$ and $(v_1, v_2)$ in $G$, respectively. This weighting scheme enables a random walker to transit in proportion to the original edge weights in $G$, after normalization.

(a) Original graph G (b) Directed line graph L(G)

Figure 2: Weights of the original graph $G$ and the corresponding directed line graph $L(G)$. The rectangular nodes in (b) correspond to edges in (a). If two consecutive edges $(u_1, v_1)$ and $(v_1, v_2)$ in $G$ have weights $w_1$ and $w_2$, respectively, then the corresponding induced edge in $L(G)$ has the weight $w_1 w_2$. For example, the edge $(e_1, e_2)$ in $L(G)$ has the weight $w_1 w_2$.

DEFINITION 4. (WEIGHTS IN DIRECTED LINE GRAPH) If two consecutive edges $(u_1, v_1)$ and $(v_1, v_2)$ in $G$ have weights $w_1$ and $w_2$, respectively, then the corresponding induced edge in $L(G)$ has the weight $w_1 w_2$.

The remaining challenge is to compute LineRank on the line graph. In the next section we show how to design efficient algorithms for LineRank, as well as effective closeness, on billion-scale graphs using Hadoop [1], an open-source MapReduce framework.

    4 Large-scale Centrality Algorithms

In this section, we describe Hadoop algorithms to compute centralities for large-scale graphs. Specifically, we focus on effective closeness and LineRank, and propose efficient algorithms.

4.1 Effective Closeness The effective closeness requires $\hat{N}(r, v)$, an approximation of $N(r, v)$, which is the number of neighbors of node $v$ within $r$ steps. As described in Section 3, we use the Flajolet-Martin algorithm for the approximation. The Hadoop algorithm for effective closeness iteratively updates the Flajolet-Martin (FM) bitstrings of every node. The crucial observation is that the bitstring update operation can be represented in a form similar to matrix-vector multiplication [18]. Specifically, let $b(r-1, v)$ be node $v$'s bitstring encoding the set of nodes within distance $r-1$. Then the next-step bitstring $b(r, v)$ is computed by bitwise-ORing the current bitstring $b(r-1, v)$ of $v$ with the current bitstrings of the neighbors of $v$:

(4.4) $b(r, v) = b(r-1, v) \;\text{BITWISE-OR}\; \{ b(r-1, u) \mid (v, u) \in E \}$

Since the above equation is a generalized form of matrix-vector multiplication, a repeated matrix-vector multiplication with BITWISE-OR customization computes the approximation $\hat{N}(r, v)$, and thus can compute the effective closeness using Equation (3.3), as shown in Algorithm 1. The InitialBitstring (line 2) and DecodeBitstring (lines 11 and 13) operations create and decode the FM bitstrings. The sum_cur and sum_next variables are used to check whether $r$ has reached the maximum diameter of the graph, and to finish the computation early if possible.

Algorithm 1 Effective Closeness
Input: Edge set E = {(i, j)} of a graph G with |V| = n
Output: Effective Closeness C_ECL = {(score_v)}

1:  for v = 1 to n do
2:      b(0, v) ← InitialBitstring;
3:      C_ECL(v) = 0;
4:  end for
5:  sum_next ← 0;
6:  for r = 1 to MaxIter do
7:      sum_cur ← sum_next;
8:      sum_next ← 0;
9:      // Update effective closeness of nodes
10:     for v = 1 to n do
11:         N̂(r − 1, v) ← DecodeBitstring(b(r − 1, v));
12:         b(r, v) = b(r − 1, v) BITWISE-OR {b(r − 1, u) | (v, u) ∈ E};
13:         N̂(r, v) ← DecodeBitstring(b(r, v));
14:         C_ECL(v) = C_ECL(v) + r(N̂(r, v) − N̂(r − 1, v));
15:         sum_next = sum_next + C_ECL(v);
16:     end for
17:     // Check whether the effective closeness converged
18:     if sum_next = sum_cur then
19:         break for loop;
20:     end if
21: end for
22: C_ECL(v) = C_ECL(v)/n;
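As a cross-check of Algorithm 1's logic, here is a small in-memory sketch in which exact neighborhood sets stand in for the FM bitstrings: set union plays the role of the bitwise-OR of sketches, and the set size plays the role of DecodeBitstring. The example graph is our own; the paper's actual implementation distributes the bitstring updates over MapReduce.

def effective_closeness(n, edges, max_iter=100):
    # Exact-set analogue of Algorithm 1 (sets stand in for FM bitstrings).
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    reach = [{v} for v in range(n)]   # nodes within 0 steps: self only
    score = [0.0] * n
    for r in range(1, max_iter + 1):
        prev_sizes = [len(s) for s in reach]
        new_reach = [set(s) for s in reach]
        for v in range(n):
            for u in adj[v]:
                new_reach[v] |= reach[u]   # the "bitwise-OR" with neighbors
        reach = new_reach
        grew = False
        for v in range(n):
            delta = len(reach[v]) - prev_sizes[v]  # N_v(r): nodes at distance r
            score[v] += r * delta
            grew = grew or delta > 0
        if not grew:                 # all "bitstrings" converged
            break
    return [s / n for s in score]

# 4-node path 0-1-2-3: middle nodes have smaller average distance.
print(effective_closeness(4, [(0, 1), (1, 2), (2, 3)]))  # [1.5, 1.0, 1.0, 1.5]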

The effective closeness algorithm is much more efficient than the standard closeness. The effective closeness requires $O(dm)$ time, where $d$ is the diameter of the graph and $m$ is the number of edges, since it requires at most $d$ matrix-vector multiplications. In contrast, the standard closeness requires $O(n^3)$ time, where $n$ is the number of nodes, which is much longer than $O(dm)$ given that real-world graphs are sparse ($m \ll n^2$).

4.2 LineRank The LineRank score of a node aggregates the stationary probabilities of a random walk over its incident edges on the line graph $L(G)$. Materializing $L(G)$ explicitly is prohibitively expensive; instead, it can be decomposed into the product of two sparse incidence matrices (Lemma 4.1):

(4.6) $L(G) = T(G) \, S(G)^T$

The stationary probabilities are computed by the power method, which repeatedly multiplies $L(G)$ with a random initial vector. Thanks to the decomposition (4.6), we can multiply $L(G)$ with a vector $v$ by first multiplying $S(G)^T$ with $v$, then multiplying $T(G)$ with the previous result. After computing the stationary probability of the random walk on the line graph, we aggregate the edge scores for each node. This can be done by right multiplying the edge scores by the overall incidence matrix $B(G)$ of $G$, where $B(G) = S(G) + T(G)$. Algorithm 2 shows the complete LineRank algorithm.

Algorithm 2 LineRank
Input: Edge set E = {(i, j, weight)} with |E| = m, damping factor c = 0.85
Output: LineRank vector linerank

1:  Build incidence matrices S(G) and T(G) from E
2:  // Compute normalization factors
3:  d1 ← S(G)^T 1;
4:  d2 ← T(G) d1;
5:  d ← 1 ./ d2;
6:  // Run iterative random walks on T(G) S(G)^T
7:  v ← random initial vector of size m;
8:  r ← (1/m) 1; // restart prob.
9:  while v does not converge do
10:     v1 ← d ∘ v; // Hadamard product
11:     v2 ← S(G)^T v1;
12:     v3 ← T(G) v2;
13:     v ← c v3 + (1 - c) r; // add the restart probability
14: end while
15: linerank ← (S(G) + T(G))^T v;

Figure 3: Dependency of the different MapReduce jobs in the LineRank computation. Each shaded box represents a MapReduce job. Notice that the most expensive operations are matrix-vector multiplications, which can be performed efficiently since the sparse matrices $S(G)$ and $T(G)$ are used instead of the dense $L(G)$, thanks to the line graph decomposition in Lemma 4.1.

We describe the above algorithm in detail and also illustrate it through a flowchart in Figure 3.

Building Incidence Matrices. We first construct the incidence matrices $S(G)$ and $T(G)$ from the sparse adjacency matrix $E$. These matrices can be built in $O(m)$ time by reading edges and emitting the corresponding outputs.

Computing Normalization Factors. The $i$th element of the diagonal matrix $D$ contains the sum of the $i$th column of $L(G)$. $D$ is used to column-normalize $L(G)$ so that the resulting matrix can be used for the power iteration. The ./ in line 5 represents the element-wise inverse operation.

Random Walk on the Line Graph. In lines 7 to 14, the random walk on the decomposed line graph is performed. Notice that all the operations are either matrix-vector multiplications, vector additions, or vector Hadamard products (line 10), none of which are expensive. Also, notice that the matrices $S(G)$ and $T(G)$ contain only $m$ nonzero elements each, which is typically much smaller than $L(G)$ if explicitly constructed.

Final LineRank Score. The edge scores are summed up in line 15 to get the final LineRank score of each node.

Note that the most expensive operation in Algorithm 2 is matrix-vector multiplication, which can be performed efficiently in Hadoop [19].
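For readers who want to trace Algorithm 2 end to end, here is an in-memory sketch that mirrors it line by line, with scipy sparse matrices standing in for the MapReduce matrix-vector jobs. The toy edge list, tolerance, and iteration cap are our own choices, and the sketch assumes every edge's target has at least one outgoing edge (otherwise the normalization in line 5 would divide by zero).

import numpy as np
from scipy.sparse import csr_matrix

def linerank(edges, n, c=0.85, tol=1e-10, max_iter=100):
    # S and T are the m x n source/target incidence matrices, so T @ S.T
    # is the line graph L(G) of Lemma 4.1 -- never materialized here.
    m = len(edges)
    src = np.array([e[0] for e in edges])
    dst = np.array([e[1] for e in edges])
    ones_m = np.ones(m)
    S = csr_matrix((ones_m, (np.arange(m), src)), shape=(m, n))
    T = csr_matrix((ones_m, (np.arange(m), dst)), shape=(m, n))
    # Normalization factors (lines 3-5 of Algorithm 2).
    d1 = S.T @ ones_m
    d2 = T @ d1
    d = 1.0 / d2           # assumes every edge's target has out-edges
    # Power iteration on the decomposed line graph (lines 7-14).
    v = np.random.rand(m)
    r = np.ones(m) / m     # restart probability vector
    for _ in range(max_iter):
        v_new = c * (T @ (S.T @ (d * v))) + (1 - c) * r
        converged = np.abs(v_new - v).sum() < tol
        v = v_new
        if converged:
            break
    # Aggregate edge scores onto nodes (line 15).
    return (S + T).T @ v

edges = [(0, 1), (1, 2), (2, 0), (3, 0)]  # small directed cycle plus a tail
print(linerank(edges, n=4))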

4.3 Analysis We analyze the time and space complexity of LineRank. The main result is that, thanks to our line graph decomposition (Lemma 4.1), LineRank has the same complexity as random walks on the original graph, although the line graph is much bigger than the original graph.

LEMMA 4.2. (TIME COMPLEXITY OF LINERANK) LineRank takes $O(km)$ time, where $k$ is the number of iterations and $m$ is the number of edges in the original graph.

Proof. The time complexity is dominated by the while loop in lines 9 to 14. Inside the while loop, the most expensive operations are the matrix-vector multiplications, which take $O(m)$ time since the number of nonzero elements in $S(G)$ or $T(G)$ is $m$.

The number $k$ of iterations depends on the ratio of the absolute values of the two largest eigenvalues of the line graph. An advantage of LineRank is that one can stop the computation after a few iterations to get reasonable accuracy, while the other betweenness centralities cannot be stopped in an any-time fashion. A similar result holds for the space complexity: LineRank requires the same space as random walks on the original graph.

LEMMA 4.3. (SPACE COMPLEXITY OF LINERANK) LineRank requires $O(m)$ space.

Proof. The space complexity is dominated by the incidence matrices $S(G)$ and $T(G)$, which have $m$ elements each.


    5 Experiments

We present our experimental evaluation, which has a two-fold goal. The first goal is to demonstrate the efficiency and scalability of our proposed solutions, by focusing on the following two questions:

Q1 How fast are our proposed large-scale centralities, compared to the standard centralities?

Q2 How do our algorithms scale with the graph size, as well as with the number of machines?

The second goal is to study the effectiveness of our approach on real graphs. More specifically, we focus on the following two questions:

Q3 How well does effective closeness approximate standard closeness?

Q4 What are the patterns of centralities in real networks? Are there correlations between centralities? Are there outliers?

After summarizing the datasets used in the experiments, the rest of this section first addresses questions (Q1-2), and then (Q3). (Q4) is answered in Section 6.

5.1 Datasets and setup The graphs used in our experiments, along with their main characteristics, are summarized in Table 2.² We use both real-world and synthetic datasets.

The YahooWeb graph contains the links between web hosts. The weight of an edge is the number of web pages between the hosts. The Enron data contain the email exchanges of Enron employees, where the weight is the number of emails between the two people. AS-Oregon contains the router connection information. DBLP Authors contains co-author relationships among prolific authors; according to DBLP, authors are prolific if they have published at least 50 papers. The weight of an edge is the number of papers co-authored by the incident authors. Note that the degree does not necessarily correspond to the total number of papers authored, since the dataset represents the induced subgraph among prolific authors only. We chose this version of the DBLP dataset to facilitate experiments and comparisons with standard measures of centrality.

Scalability experiments are performed on synthetic datasets, since this allows flexibility in choosing graphs of any size. We used a generator based on Kronecker multiplication [24], which produces realistic graphs.

² YahooWeb: released under NDA. Kronecker: http://www.cs.cmu.edu/~ukang/dataset. Enron: http://www.cs.cmu.edu/~enron. AS-Oregon: http://topology.eecs.umich.edu/data.html. DBLP: http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/prolific/index.html, also at http://www.cs.cmu.edu/~ukang/dataset.

Name          Nodes   Edges     Description
YahooWeb      24 M    1 B       WWW links between hosts
Kronecker     177 K   1,977 M   synthetic
              120 K   1,145 M
              59 K    282 M
              19 K    40 M
Enron         80 K    575 K     Email
AS-Oregon     13 K    74 K      Router connections
DBLP Authors  3 K     22 K      DBLP prolific authors

Table 2: Summary of datasets and main characteristics.

We implemented our scalable algorithms in Java using Hadoop version 0.20.1. Large-scale experiments were run on the Yahoo! M45 cluster, using 10 to 70 machines. For standard measures of centrality on small graphs, we used the iGraph package for R on a single machine.

5.2 Scalability and efficiency Figure 4 shows the results from experiments on efficiency and scalability.

Figures 4(a,d) show the running time for standard centrality measures (closeness and shortest-path betweenness) on a single machine. The running time clearly grows super-linearly with respect to the number of edges.

Figures 4(b,c,e,f) show the running time of our distributed algorithms for our proposed centralities. For each of the two centrality measures (effective closeness and LineRank), we vary both the number of machines and the size of the graphs.

Both effective closeness and LineRank show linear scale-up with respect to the number of edges, in Figures 4(b,e). For this set of experiments, the number of machines was fixed at 50.

Scale-up is also near-linear with respect to the number of machines. Figures 4(c,f) show the scale-up $1/T_M$, where $T_M$ is the running time with $M$ machines. The scale-up score is normalized so that it is 1 when $M = 10$. Both algorithms scale near-linearly with respect to the number of machines. For this set of experiments, the size of the graph was fixed at 282M edges.

5.3 Effective Closeness Figure 5 shows the scatter plots of standard closeness versus our proposed effective closeness, on relatively small graphs where it is feasible to compute the former measure. Each point in a scatter plot corresponds to a node in the graph. Across all datasets there exist clear linear correlations between the two measures, with correlation coefficients of at least 0.978. Therefore, effective closeness is a good substitute for standard closeness. More importantly, effective closeness can also be used on billion-scale graphs, while standard closeness is limited to small graphs.


(a) Closeness: time vs. edges (b) Effective Closeness: time vs. edges (c) Effective Closeness: time vs. machines
(d) Betweenness: time vs. edges (e) LineRank: time vs. edges (f) LineRank: time vs. machines

Figure 4: (a, d): Scalability of the standard closeness and betweenness. Notice that running time grows super-linearly with respect to the number of edges, and is quite long even for a relatively small graph with 325 K nodes. (b, c): Scalability of effective closeness. The running time is for one iteration on the Kronecker datasets. Experiment (c) is performed on a graph with 282 million edges. Scale-up is linear with respect to the number of edges, and near-linear with respect to the number of machines. (e, f): Scalability of LineRank. The running time is for one power iteration on the line graphs of the Kronecker graphs. Experiment (f) is performed on a graph with 282 million edges. As with effective closeness, the scale-up is linear both with respect to the number of edges and the number of machines.

(a) DBLP Authors: Corr. Coefficient 0.994 (b) Enron: Corr. Coefficient 0.992 (c) AS-Oregon: Corr. Coefficient 0.978

Figure 5: The correlation between our proposed effective closeness and the standard closeness on small, real-world graphs. Each score is normalized to sum to 1. The two measures are near-linearly correlated, with correlation coefficients of at least 0.978, but our solution is much faster to compute.


    6 Patterns of Centralities in Large Networks

In this section, we present patterns of centralities in real-world, large-scale networks. Figure 6 shows the relationships of degree (the baseline measure) against effective closeness as well as LineRank. Each point in the scatter plots corresponds to a node in the graph. We have the following observations for our proposed measures.

6.1 Effective Closeness and Degree In Figures 6 (a), (c), and (e), we observe that high degree nodes have high effective closeness. This reflects that high degree nodes have higher chances to reach all nodes within a small number of steps, due to their many connections.

OBSERVATION 1. (EFF. CLO. FOR HIGH DEGREE NODES) High degree nodes have high effective closeness, reflecting their higher chances to reach other nodes quickly.

In contrast to high degree nodes, low degree nodes have various effective closeness values. This implies that nodes that are hard to differentiate by the degree measure can now be easily separated by our effective closeness measure. The reason is that, if a node $v$ has an effective closeness $f$, then a neighboring node of $v$ will also have an effective closeness similar to $f$. Thus, two nodes with the same degree can have different effective closeness based on which nodes they connect to. For example, in the DBLP prolific authors dataset, both Foto N. Afrati and Massimo Pocino have degree 5 (the degree here is the number of prolific co-authors). However, despite having the same degree, Foto N. Afrati has 1.6 times larger effective closeness than Massimo Pocino, since she has co-authored a paper with Jeffrey D. Ullman, who has the highest effective closeness. Similarly, in the Enron dataset, Kenneth Lay, the CEO of Enron, has high effective closeness. [email protected] has degree 1, but 1.81 times higher effective closeness than [email protected] with the same degree, since [email protected] has exchanged email with the CEO. Finally, in the YahooWeb dataset, the site www.robertsonbonded.com has degree 1 but has high effective closeness 4.4 × 10⁻⁸, which is more than 4 times larger than the effective closeness of some pages with the same degree. The reason is that www.robertsonbonded.com is pointed to by dmoz.org, which has very high effective closeness. Thus, we conclude that effective closeness gives additional useful information not conveyed by the degree.

OBSERVATION 2. (EFF. CLO. FOR LOW DEGREE NODES) Low degree nodes have varying effective closeness based on the closeness of their neighbors. For this reason, effective closeness can be used to distinguish low degree nodes.

6.2 LineRank and Degree The effective closeness gives another dimension of information which can be used to differentiate nodes further than is possible by degree alone. However, nodes with high degree tend to have high effective closeness, and thus cannot be distinguished by effective closeness. LineRank can be used to distinguish high degree nodes. In contrast to the degree, which considers only one-step neighbors, LineRank also considers the quality of the connections of a node's neighbors, where the quality is acquired from the stationary probabilities of random walks over the whole graph. Thus, some nodes have high degree but relatively low LineRank due to the quality of their edges. For example, Noga Alon has the highest degree (the number of prolific co-authors) in the DBLP prolific authors dataset, but his LineRank is smaller than Jeffrey D. Ullman's, since Noga Alon co-authored 125 papers, fewer than the 199 papers that Jeffrey D. Ullman co-authored with other prolific authors. On the other hand, some authors have high LineRank compared to their degree. For example, Philip S. Yu has 26 prolific co-authors but published 147 papers with them, and thus has higher LineRank than Kenneth A. Ross, who has 58 prolific co-authors but published 66 papers with them. The same applies to Micha Sharir, who has 34 prolific co-authors but 223 co-authored papers, and thus has higher LineRank. Similarly, in the Enron data, the CEO Kenneth Lay has the highest degree, but his LineRank is smaller than that of Jeff Dasovich, the governmental affairs executive, since Jeff exchanged about 10 times more email than the CEO, probably due to his role. In the YahooWeb data, the top 3 highest degree hosts (www7.calle.com, dmoz.org, and www.dmoz.org) are different from the top 3 highest LineRank hosts (geocities.yahoohost.com, www.angelfire.com, and members.aol.com). Again, the reason for this difference is the strength of the connections: the top 3 highest LineRank hosts have more total neighboring pages than the top 3 highest degree hosts. We conclude that LineRank gives yet additional useful information for distinguishing high degree nodes.

OBSERVATION 3. (LINERANK FOR HIGH DEGREE NODES) High degree nodes have varying LineRank based on the strength of their incident edges. Thus, LineRank can be used to distinguish high degree nodes.

(a) DBLP Authors: Effective Closeness vs. Degree (b) DBLP Authors: LineRank vs. Degree
(c) Enron: Effective Closeness vs. Degree (d) Enron: LineRank vs. Degree
(e) YahooWeb: Effective Closeness vs. Degree (f) YahooWeb: LineRank vs. Degree

Figure 6: [Best Viewed In Color] Scatter plots of all pairs of large-scale centrality measures. The effective closeness and the degree centralities are normalized. Notice that high degree nodes have high effective closeness, since they can reach other nodes within a small number of steps due to their many neighbors. However, low degree nodes have varying effective closeness. Thus, effective closeness can be used to distinguish nodes with low degrees. High degree nodes can be distinguished by LineRank.

    7 Conclusion

In this paper we address the challenges of computing informative measures of centrality on billion-scale graphs. The main contributions are the following:

1. Careful Design. We propose effective closeness, a diameter-based centrality, and LineRank, a flow-based centrality, both of which are by design suitable for large-scale, distributed processing platforms.

2. Algorithms. We develop scalable and effective algorithms for MapReduce by using approximation and efficient line graph decomposition. We perform experiments on large datasets with Hadoop, and demonstrate the effectiveness as well as the scalability of our method for billion-scale graphs.

3. Observations. We show how our proposed measures can reveal interesting correlations and anomalies in real-world graphs. We report that nodes with high effective closeness have high degree. Furthermore, we show that effective closeness and LineRank can be used for discriminating low degree nodes and high degree nodes, respectively.

Research on social network analysis and computational social science [23] can benefit significantly from our proposed large-scale centralities and efficient algorithms. Future research directions include extending the current algorithms to time-evolving networks.

    Acknowledgments

Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

    References

[1] Hadoop information. http://hadoop.apache.org/.
[2] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. 1996.
[3] K. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations. SIGMOD, 2007.
[4] P. Bonacich. Power and centrality: a family of measures. American Journal of Sociology, 92:1170-1182, 1987.
[5] S. P. Borgatti and M. G. Everett. A graph-theoretic perspective on centrality. Social Networks, 2006.
[6] U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 2001.
[7] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: easy and efficient parallel processing of massive data sets. VLDB, 2008.
[8] M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards estimation error guarantees for distinct values. PODS, 2000.
[9] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[10] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI, 2004.
[11] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 1985.
[12] L. Freeman. Centrality in networks: I. Conceptual clarification. Social Networks, 1979.
[13] M. N. Garofalakis and P. B. Gibbons. Approximate query processing: Taming the terabytes. VLDB, 2001.
[14] R. L. Grossman and Y. Gu. Data mining using high performance data clouds: experimental studies using Sector and Sphere. KDD, 2008.
[15] C. Hubbell. An input-output approach to clique identification. Sociometry, 28:377-399, 1965.
[16] H. Jeong, S. P. Mason, A.-L. Barabasi, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, (411):41-42, 2001.
[17] U. Kang, D. H. Chau, and C. Faloutsos. Mining large graphs: Algorithms, inference, and discoveries. IEEE International Conference on Data Engineering, 2011.
[18] U. Kang, C. Tsourakakis, A. P. Appel, C. Faloutsos, and J. Leskovec. Radius plots for mining tera-byte scale graphs: Algorithms, patterns, and observations. SIAM International Conference on Data Mining, 2010.
[19] U. Kang, C. Tsourakakis, and C. Faloutsos. PEGASUS: A peta-scale graph mining system - implementation and observations. IEEE International Conference on Data Mining, 2009.
[20] L. Katz. A new index derived from sociometric data analysis. Psychometrika, 18:39-43, 1953.
[21] V. Krebs. Mapping networks of terrorist cells. Connections, 24(3):43-52, 2002.
[22] R. Lämmel. Google's MapReduce programming model revisited. Science of Computer Programming, 70:1-30, 2008.
[23] D. Lazer, A. Pentland, L. Adamic, S. Aral, A.-L. Barabasi, D. Brewer, N. Christakis, N. Contractor, J. Fowler, M. Gutmann, T. Jebara, G. King, M. Macy, D. Roy, and M. Van Alstyne. Computational social science. Science, (323):721-723, 2009.

[24] J. Leskovec, D. Chakrabarti, J. Kleinberg, and C. Faloutsos. Realistic, mathematically tractable graph generation and evolution, using Kronecker multiplication. In PKDD, 2005.

[25] M. E. J. Newman. A measure of betweenness centrality based on random walks. Social Networks, 2005.
[26] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD '08, pages 1099-1110, 2008.
[27] S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with Map-Reduce. ICDM, 2008.
[28] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal, 2005.