[IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining - Real Time Distributed Community Structure Detection in Dynamic Networks

Download [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining - Real Time Distributed Community Structure Detection in Dynamic Networks

Post on 09-Dec-2016

221 views

Category:

Documents

6 download

TRANSCRIPT

<ul><li><p>Real Time Distributed Community StructureDetection in Dynamic Networks</p><p>Valerie GalluzziDepartment of Computer Science</p><p>University of Iowa14 MacLean Hall Iowa City, IA, 52242-2429</p><p>USATelephone: (319)353-2037</p><p>Email: valerie-galluzzi@uiowa.edu</p><p>AbstractCommunities can be observed in many real-worldgraphs. In general, a community can be thought of as a portionof a graph in which intra-community links are dense while inter-community links are sparse. Automatic community structuredetection has been well studied in static graphs. However, manypractical applications of community structure involve networksin which communities change dynamically over time. Severalmethods of detecting the community structure of dynamic graphshave been proposed, however most treat the dynamic graph as aseries of static snapshots, which creates unrealistic assumptions.Others require large amounts of computational resources orrequire knowledge of the dynamic graph from start to nish,relegating them to post-processing. For those who desire real-timecommunity structure detection distributed over the observingnetwork, these solutions are insufcient. This paper proposes anew method of community structure detection which allows forreal time distributed detection of community structure.</p><p>I. INTRODUCTIONMany common phenomena can be modeled as networks.</p><p>Examples from everyday life include information networkslike the internet, transportation networks, social networksboth electronic and real-world, and epidemiological contactnetworks. In many of these networks we observe clusters ofnodes which are densely connected while their connections toother nodes are comparatively few. Such clusters are knownas communities, and are common features of social networks.</p><p>Automatic detection of communities is essential for someapplications. One example is the understanding of diseasediffusion through a social network, where a sophisticatedunderstanding of the underlying community structure couldhelp with better understanding disease spread and guide thedeployment of interventions like vaccines or warnings toincrease hand washing. Automatically detecting communitystructure in graphs that are static, or unchanging over time, iswell explored, and a survey can be found here [1]. However,communities in social network graphs often change over time,leading to dynamic community structure which is difcult tocorrectly detect automatically.</p><p>Some algorithms for automatically detecting communitystructure in dynamic graphs have been proposed. Initial at-tempts tracked communities rather than detecting them in thenetwork [2], [3], [4]. However, these could not nd a newlydeveloped community that appears in the network.</p><p>2 2 2 2 2 2 2 2 2 2 2 2</p><p>Time</p><p>1</p><p>2</p><p>3</p><p>4</p><p>Snapshot</p><p>Sequence A Sequence B Sequence C</p><p>Fig. 1. Edge sequences. All sequences map to same static snapshot. If edgeswere weighted by number of times observed, weights would be 2.</p><p>More recent algorithms apply known static communitydetection algorithms to static graphs created from snapshotsof the dynamic network [6], [7]. However, these need someway of capturing a dynamic network in a series of staticsnapshots, which can be challenging. For instance, each ob-served sequence of edges in gure 1 could map to the samestatic snapshot, even if we choose to weight edges by thenumber of contacts. This is despite the fact that the differentsequences could mean different outcomes. For instance, in anepidemiological setting, a disease starting at the leftmost nodeand capable of infecting one more node with each timestepwould infect 4 nodes in A, 3 nodes in B, and 5 in C. Shouldthey all map to the same static snapshot? These questions makeit difcult to create good mappings to static snapshots.</p><p>Others assume that the entire network is known from therst timestep to the last, and apply post-processing techniquesto produce community structure assignments [8]. These meth-ods preclude real time results.</p><p>However, there exist some applications for which commu-nity structure detection should happen in real time, ideallyin a distributed fashion, a situation which current methodscannot accommodate. An example is the University of IowaComputational Epidemiology groups deployment of sensors</p><p>2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining</p><p>978-0-7695-4799-2/12 $26.00 2012 IEEEDOI 10.1109/ASONAM.2012.213</p><p>1268</p><p>2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining</p><p>978-0-7695-4799-2/12 $26.00 2012 IEEEDOI 10.1109/ASONAM.2012.213</p><p>1236</p></li><li><p>in the University of Iowa Hospital [9]. Sensors are small radioequipped devices, and in this case they are attached to peopleand objects. A contact between two sensors is detected whentheir radios are able to communicate with each other and soa network is created by which health care worker movementsand interactions can be observed through the contacts of theirattached sensors.</p><p>An ideal application of community structure detection inthis setting would be assessments of individual risk for in-fection given travel between and within health care workercommunities. If this assessment occurred in post processing,interventions would be difcult to apply on an individualizedbasis as the collected sensor data is anonymized to protectworker privacy. A centralized real time approach would dependupon good connections between sensors and a centralizedcommunity structure detector, which may be difcult to ensuregiven that the sensors are frequently moving. A more idealsetting would be risk levels which are calculated in real timeon a health care workers sensor based on their travel betweencommunities and then displayed to the worker, avoiding pri-vacy concerns and connectivity and post processing issues.</p><p>Another application would be a system that delivers ads tomobile devices. Said system might wish to differentiate thosewho frequent the location from those who are passing throughto deliver different content. This could be accomplished bydetermining whether or not the user is a member of a localcommunity. If community assessment were real time anddistributed then ads could be delivered immediately and the adsystem could be deployed without worrying about connectivityto a master community structure detector.</p><p>These applications require real time assessment of com-munity structure in a distributed fashion, which cannot berealized by current community structure detection algorithms.This paper aims to introduce such an algorithm.</p><p>II. ALGORITHMThe following section describes our algorithm which is</p><p>lightweight and customizable. It is inspired by the staticnetwork community detection algorithm proposed by [10].</p><p>A. DenitionsA dynamic graph G is a series of graphs G =</p><p>G0, G1, ..., GT where Gt(V,Et) is the graph with vertex setV and the edge set consisting of the edges at time stept 0, 1, ..., T . Dene ci as the one community of which iis a member. Let h be the maximum number of contacts iwill remember. Let a history Hi be the list of the communityafliations of node is h most recent contacts at the time imade contact with them.</p><p>Let d be a parameter such that the most recent contact inHi has an importance of h while the next most recent hasimportance of max(hd, 1), and the kth most recent contactin Hi has importance of max(h dk, 1). Let p &gt;&gt; h bea number such that 1/p is the probability that the node willrandomly relabel itself by changing its community to its ownid. Let tr be the number of contacts that will be notied ofrelabeling.</p><p>TABLE IALGORITHM FOR EACH NODE</p><p>Require: Hi be the list of the community label of is h most recent contactsat time of contact</p><p>Require: ri = True if i is currently in a relabeling phase, Falseotherwise</p><p>Require: rci is the number of contacts for which i has been in therelabeling phase</p><p>Require: h, d, p, tr dened as described in II-AFindCommunity(i:Self ID, j:Connected Node):if |Hi| h then</p><p>Remove the oldest contact in Hiend ifHi = Hi + cjif ri then</p><p>With probability 1/pLet ai = ci and bi = iRelabel ci with birci = 0ri = True</p><p>end ifif ri then</p><p>Replace all instances of ai in Hi with biend ifif rj then</p><p>Replace all instances of aj in Hi with bjend ifLet Li be a list of all communities mentioned in HiLet all communities in Li have weight 1for lk in Hi do</p><p>Add max(h dk, 1) to the community lk in list Liend forci = the highest weight community in Liif rci = tr then</p><p>ri = Falseend ifif ri then</p><p>rci = rci + 1end if</p><p>B. Algorithm</p><p>Observe the following:1) Individuals are more likely to interact with fellow com-</p><p>munity members than with members of other communi-ties</p><p>2) Interactions are captured as edges between two individ-uals.</p><p>Let the set Ei be all edges with i as an endpoint in subsetSb,e = Gb, Gb+1, ..., Ge where [b, e] is the interval of interest.Given these two observations, we can see that if in set Ei nodei interacts (shares edges) mostly with members of community lthen it is probably a member of community l during that time.If individuals are known to change community infrequentlythen the current community association of the individual islikely to be predictive of future community association.</p><p>Using these insights, we can estimate community mem-bership on an individual basis in real time using only localinformation with the algorithm shown in table I.</p><p>C. Analysis</p><p>1) Runtime: If this algorithm were simulated using thealgorithm in table II, it would run in O(TEh) time. However,the T and E gures are not relevant to the individual node</p><p>12691237</p></li><li><p>TABLE IIALGORITHM FOR SIMULATION</p><p>Set ci = i i {Initialize each nodes community to be its ID}Set Hi = {i} i {Initialize each nodes history to a meeting with itself}Set ri = False i {Initialize relabeling ag to False for all nodes}Set rci = 0 i {Initialize the relabeling counter to 0 for all nodes}for t in {0, 1, ..., T} do</p><p>for (i, j) in Et doFindCommunity(i,j)</p><p>end forend for</p><p>6</p><p>5</p><p>4</p><p>3</p><p>2</p><p>1</p><p>4</p><p>1</p><p>52</p><p>3</p><p>4</p><p>2</p><p>24</p><p>2</p><p>2</p><p>2</p><p>22</p><p>2</p><p>4</p><p>3</p><p>52</p><p>1</p><p>4</p><p>2</p><p>24</p><p>2</p><p>2</p><p>2</p><p>22</p><p>2</p><p>2</p><p>3</p><p>54</p><p>1</p><p>2</p><p>2</p><p>24</p><p>2</p><p>2</p><p>2</p><p>24</p><p>2</p><p>2</p><p>3</p><p>14</p><p>5</p><p>2</p><p>2</p><p>24</p><p>5</p><p>2</p><p>2</p><p>24</p><p>5</p><p>2</p><p>1</p><p>34</p><p>5</p><p>2</p><p>2</p><p>34</p><p>5</p><p>2</p><p>2</p><p>34</p><p>5</p><p>1</p><p>2</p><p>34</p><p>5</p><p>1</p><p>2</p><p>34</p><p>5</p><p>1</p><p>2</p><p>34</p><p>5</p><p>A: Simultaneous B: Ordered C: Randomly OrderedTime</p><p>Fig. 2. Ordering Policies: h = 1. Depending on policy, convergance can beimpossible. Labels updated according to found edge in each timestep.</p><p>when this is run in a distributed fashion, so each node willspend O(h) time on this algorithm.</p><p>2) Importance of Update Order: Let us consider the condi-tions under which a given community can converge to havingone label. One factor which is important is the order inwhich nodes update their community label when an edgeis observed. As an example, see gure 2. In this gure thecommunity afliation is shown as a label on the node. Weassume h = 1 and show how the ordering of node updatescan effect convergence.</p><p>Example A shows the result if nodes can instantaneouslycommunicate their community afliationin that case everymeeting results in a simple swap of community afliation. InExample B it is assumed that if an edge (i, j) exists then i willupdate before j if i &lt; j. In this case community afliationis pushed around the network. The only way to converge onone afliation in Example B is to be extremely lucky withthe order of edges. The only way to break these cases is bychoosing the rst updating point randomly, as in Example C.</p><p>All simulation experiments use the random updating strat-egy shown in Example C to improve convergence probability.</p><p>3) Relabeling: The purpose of randomly relabeling nodesis to enable detection of the splitting of a community. A splitoccurs when a community breaks into separate communities.</p><p>Without random relabeling it is possible that both commu-nities formed by a community split will retain the same label.If that is the case, then it is impossible to tell the two commu-</p><p>nities apart. By randomly replacing a nodes community labelwith its personal label and propagating that label through thenetwork we ensure that a communitys label is the label of amember in that community. This is essential when we need toupdate community labels after a split has occurred.</p><p>III. RESULTS</p><p>Unfortunately there are no dynamic benchmark graphs thatcould be used to evaluate the performance of this algorithm.Therefore to analyze the performance of this algorithm itwas run on several articially generated random networks. Itwas additionally run on a more realistic network in whichnodes randomly moving through space, the results of whichappear in section III-F. Although random relabeling is essentialfor handling community splits, it injects noise (communitychanges) into the system by design and is therefore notfeatured in many of the results to enhance the clarity of gures.</p><p>A. Networks Used</p><p>Four networks were used for testing. All networks consistedof 100 nodes. In all methods edges are generated by choosinga random node in the network and joining it to another randomnode in its community. A stabilization parameter n is chosen tobe longer than the expected maximum convergence time of thealgorithm. Between experiments the value of n could changeto accommodate the longer or shorter expected maximumconvergence times when using different parameters.</p><p>Join This network illustrates the joining of two communi-ties. Initially the network is partitioned into two communitiesof 50 nodes each. In every iteration a random contact isgenerated. After n contacts have been made the two networksjoin together for n contacts.</p><p>Split This network illustrates the splitting of a community.In this network all nodes are in the same community for ncontacts, and then the community splits into two communitiesof 50 nodes each for n contacts.</p><p>Connected This network contains two communities of equalsize that are connected by some intermediary nodes. Thisnetwork is simulated for n contacts.</p><p>B. Convergence and h</p><p>In gure 3 experiments were run on the Join network withd = 0 and without relabeling to show the effect of changes inh on the convergence time (the time for the number of errorsto stabilize). The gure shows that larger values of h producelonger convergence times.</p><p>The intuition behind this increase is this. In the case of onenode needing to decide which community it is in, it could takeup to h contacts for it to decide. As h gets longer this decisiontime also gets longer. Therefore when a node is supposed tojoin with a community it belongs to, it takes longer for it tojoin the community, therefore lengthening convergence time.</p><p>The errors shown in the graph are summed over windowsof 33 contacts. An error occurs when a pair of nodes iseither in the same community when they belong to differentcommunities or vice versa. Therefore errors are calculated over</p><p>12701238</p></li><li><p>0 5000 10000 15000 20000</p><p>Number of Contacts</p><p>0</p><p>10000</p><p>20000</p><p>30000</p><p>40000</p><p>50000</p><p>60000</p><p>70000</p><p>80000</p><p>90000</p><p>NumberofErrors</p><p>Various History Sizes and Their Performance</p><p>5</p><p>30</p><p>60</p><p>90</p><p>120</p><p>Fig. 3. Varying Values of h and Their Performance. Simulation was runon a network of 100 nodes split into two communities for 10000 contactsthen joined into one community for 10000 contacts. d = 0, relabeling was...</p></li></ul>

Recommended

View more >