[IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining - Crawling Social Internetworking Systems

Download [IEEE 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012) - Istanbul (2012.08.26-2012.08.29)] 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining - Crawling Social Internetworking Systems

Post on 10-Mar-2017




2 download

Embed Size (px)


<ul><li><p>Crawling Social Internetworking SystemsFrancesco Buccafurri</p><p>DIMET Dept.Univ. of Reggio Calabria</p><p>Italye-mail: bucca@unirc.it</p><p>Gianluca LaxDIMET Dept.</p><p>Univ. of Reggio CalabriaItaly</p><p>e-mail: lax@unirc.it</p><p>Antonino NoceraDIMET Dept.</p><p>Univ. of Reggio CalabriaItaly</p><p>e-mail: a.nocera@unirc.it</p><p>Domenico UrsinoDIMET Dept.</p><p>Univ. of Reggio CalabriaItaly</p><p>e-mail: ursino@unirc.it</p><p>AbstractIn new generation social networks, we expect thatthe paradigm of Social Internetworking Systems (SISs, for short)will be more and more important. In this new scenario, therole of Social Network Analysis is of course still crucial but thepreliminary step to do is designing a good way to crawl theunderlying graph. While this aspect has been deeply investigatedin the field of social networks, it is an open issue when movingtowards SISs. Indeed, we cannot expect that a crawling strategywhich is good for social networks, is still valid in a SocialInternetworking Scenario, due to its specific topological features.In this paper, we first confirm the above claim and, then, define anew crawling strategy specifically conceived for SISs. Finally, weshow that it fully overcomes the drawbacks of the state-of-the-artcrawling strategies.Keywords: Social Networks, Crawling Strategies, Social</p><p>Internetworking Systems</p><p>I. INTRODUCTION</p><p>In the last years, social networks have become one of themost popular communication media on the Internet [14]. Theresulting universe is a constellation of several social networks,each forming a community with specific connotations, alsoreflecting multiple aspects of people personal life. Despite thisinherent heterogeneity, the possible interaction among distinctsocial networks is the basis of a new emergent internetworkingscenario enabling a lot of strategic applications, whose mainstrength will be just the integration of possibly differentcommunities yet preserving their diversity and autonomy. Thisconcept is very recent and only a few commercial attempts toimplement Social Internetworking Systems (SISs, for short)have been proposed [1], [2], [3], [4]. In this new scenario,the role of Social Network Analysis [15], [21], [11], [25],[8], [24], [23] is of course still crucial in studying theevolution of structures, individuals, interactions, and so on,and in extracting powerful knowledge from them. But thenecessary prerequisite is to have a good way to crawl theunderlying graph. In the past, several crawling strategies forsingle social networks have been proposed. Among them themost representative ones are Breadth First Search (BFS, forshort) [25], RandomWalk (RW, for short) [20] and Metropolis-Hastings Random Walk (MH, for short) [13]. They werelargely investigated for single social networks highlightingtheir pros and cons [13], [17]. But, what happens when wemove towards Social Interneworking Scenarios? The questionopens in fact a new issue that, to the best our knowledge,has not been investigated in the literature. Indeed, this issue</p><p>is far to be trivial, since we cannot presumably expect that acrawling strategy which is good for social networks, is stillvalid in a Social Internetworking Scenario, due to its specifictopological features.</p><p>This paper gives a contribution in this setting. In particular,through a deep experimental analysis of the above existingcrawling strategies, conducted in a multi-social-network set-ting, it reaches the conclusion that they are little adequateto this new context, thus enforcing the need of designingnew crawling strategies specific for SISs. Starting from thisresult, it gives its second important contribution, consistingin the definition of a new crawling strategy, called Bridge-Driven Search (BDS, for short), which relies on a featurestrongly characterizing a SIS. Indeed BDS is centered on theconcept of bridge, which represents the structural elementthat interconnects different social networks. Bridges are thosenodes of the graph corresponding to users who joined morethan one social network and explicitly declared their differentaccounts. By an experimental analysis we show that BDSfits the desired features, thus overcoming the drawbacks ofexisting strategies.</p><p>The plan of this paper is as follows: Section II presentsrelated literature. The state-of-the-art crawling strategies BFS,RW and MH are analyzed in Section III, showing why they arenot adequate for a SIS. In Section IV, our crawling strategyis illustrated. Section V is devoted to the presentation of theexperiments carried out to evaluate BDS. Finally, Section VIprovides a look at our future research efforts.</p><p>II. RELATED WORK</p><p>With the increase of both the number and the dimension ofsocial networks the development of crawling strategies becamea very challenging issue. For instance, the problem of samplingfrom large graphs is discussed in [19], [22]. Given a commu-nication network, the approach of [12] aims at recognizingits overall design as well as at identifying important nodesand links in it. An investigation on the statistical propertiesof sampled scale-free networks is proposed in [18]. In [7],the authors compare the structures of Cyworld, MySpace andOrkut. In particular, they analyze the degree distribution, theclustering property, the degree correlation and the evolutionover time of Cyworld. Some methods to produce a smallrealistic sample from a large real network are presented in[16]. In [25], the social network graph crawling problem is</p><p>2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining</p><p>978-0-7695-4799-2/12 $26.00 2012 IEEEDOI 10.1109/ASONAM.2012.87</p><p>505</p><p>2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining</p><p>978-0-7695-4799-2/12 $26.00 2012 IEEEDOI 10.1109/ASONAM.2012.87</p><p>506</p></li><li><p>investigated in such a way as to answer questions like: (i) howfast crawlers into consideration discover nodes/links; (ii) howdifferent social networks and the number of protected usersaffect crawlers; (iii) how major graph properties are studied.A framework of parallel crawlers based on BFS and operatingon eBay is described in [10]. In [17], the impact of differentgraph traversal techniques (namely, BFS, DFS, Forest Fire,Snowball Sampling) on the computation of the average nodedegree of a network is analyzed. Finally, an analysis of theFacebook friendship graph is proposed in [13].</p><p>III. MOTIVATIONS</p><p>Several crawling strategies for single social networks havebeen proposed in the literature. Among these strategies, twovery popular ones are BFS and RW. BFS implements theclassical Breadth First Search visit, whereas RW selects thenext node to be visited uniformly at random among theneighbors of the current node. A more recent strategy is MH[13]. At each iteration, it randomly selects a node w from theneighbors of the current node v. Then, it randomly generatesa number p belonging to the real interval [0, 1]. If p (v)(w) ,where (v) ((w), resp.) is the outdegree of v (w, resp.), thenit moves from v to w. Otherwise, it stays in v. Observe thatthe higher the degree of a node, the higher the probability thatMH discards it.In the past, these crawling strategies were deeply investi-</p><p>gated when applied on a single social network. This analysisshowed that none of them is always better than the other ones.Indeed, each of them could be the optimal one for a specific setof analyses. However, no investigation about the applicationof these strategies in a SIS has been carried out. Thus, wehave no evidence that they are still valid in this new context.In order to reason about this, let us start by considering astructural peculiarity of a SIS. It is the existence of bridges,which, we recall, are those nodes of the graph corresponding tousers who joined more than one social network and explicitlydeclared their different accounts. We expect that these nodesplay a crucial role in the crawling of a SIS since they allow thecrossing of different social networks, thus discovering the SISintrinsic nature (in fact related to interconnections). Bridgesare not standard nodes, due to their role; as a consequence,we cannot see a SIS just as a huge social network. Besidesthese intuitive considerations about bridges, we can help ourreasoning also with two results obtained in [9], namely: Fact(i) the fraction of bridges in a social network is low, and Fact(ii) bridges have high degrees on average.Now, the question is, what about the capability of existing</p><p>crawling strategies of finding bridges? The deep knowledgeabout BFS, RW and MH provided by the literature allows thefollowing conjectures to be drawn:1) BFS tends to explore a local neighborhood of the seed it</p><p>starts from. As a consequence, if bridges are not presentin this neighborhood or their number is low (and thisis highly probable due to Fact (i)), the crawled samplefails in covering many social networks. Furthermore,it is well known that BFS tends to favor power users</p><p>and, therefore, presents bias in some network parameters(e.g., the average degree of the nodes of the crawledportions are overestimated) [17].</p><p>2) Differently from BFS, RW does not consider only alocal neighborhood of the seed. In fact, it selects thenext node to be visited uniformly at random amongthe neighbors of the current node. Again, due to Fact(i), the probability that RW selects a bridge as the nextnode is low. As a consequence, the crawled sample doesnot cover many social networks and, if more than onesocial network is represented in it, the coupling degreeof the crawled portions of social networks is low. Finally,analogously to BFS, RW tends to favor power users(and, consequently, to present bias in some networkparameters [17]); this feature only marginally influencesthe capability of RW to find bridges since, in any case,their number is low.</p><p>3) MH has been conceived to unfavor power users and,more in general, nodes having high degrees which are,instead, favored by BFS and RW. It performs verywell in a single social network [13] especially in theestimation of the average degree of nodes. However, dueto Fact (ii), MH will penalize bridges. As a consequence,the sample crawled by MH does not cover many socialnetworks present in the SIS.</p><p>In sum, from the above reasoning, we expect that both BFS,RW and MH are substantially inadequate in the context ofSISs. As it will be described in Section V, this conclusionis fully confirmed by a deep experimental campaign, whichclearly highlights the above drawbacks. Thus we need todesign a specific crawling strategy for SISs.</p><p>IV. PROPOSED CRAWLING STRATEGY</p><p>In the design of the proposed crawling strategy we start bythe analysis of some aspects limiting BFS, RW and MH in aSIS, in order to overcome them. Recall that BFS performs abreadth first search on a local neighborhood of a seed. Now,the average distance between two nodes of a single socialnetwork is generally less than the one between two nodesof different social networks. Indeed, in order to pass froma social network to another, it is necessary to cross a bridge,and since bridges are few, it could be necessary to generatea long path before reaching one of them. As a consequence,the local neighborhood considered by BFS includes one or asmall number of social networks. To overcome this problem,a depth first search, instead of a breadth first search, could beperformed. For this purpose, the way of proceeding of RW andMH could be included in our crawling strategy. However, sincethe number of bridges in a social network is low, the simplechoice to go in-depth blindly does not favor the crossing froma social network to another. Even worse, since MH penalizesthe nodes with a high degree, it tends to unfavor bridges,rather than favor them. Again, in the above reasoning, we haveexploited Facts (i) and (ii) introduced in the previous section.A solution that overcomes the above problems could consistin implementing a non-blind depth first search in such a</p><p>506507</p></li><li><p>way as to favor bridges in the choice of the next node tovisit. This is the choice we have done, and the name we giveto our strategy, i.e., Bridge-Driven Search (BDS, for short),clearly reflects this approach. However, in this way, it becomesimpossible to explore (at least partially) the neighborhoodof the current node because the visit proceeds in-depth veryquickly and, furthermore, as soon as a bridge is encountered,there is a cross to another social network. The overall resultof this way of proceeding would be an extremely fragmentedcrawled sample. In order to face this problem, given the currentnode, our crawling strategy explores a fraction of its neighborsbefore performing an in-depth search of the next node to visit.In order to formalize our crawling strategy, we need to</p><p>introduce the following parameters: nf (node fraction). It represents the fraction of the non-</p><p>bridge neighbors of the current node which should bevisited. It ranges in the real interval (0,1]. For example,when nf is equal to 1, our strategy behaves like BFSsince all neighbors of the current node are selected. Thisparameter is used to tune the portion of the current nodeneighborhood which has to be taken into account and,hence, it balances the breadth and depth of the visit.</p><p> bf (bridge fraction). It represents the fraction of thebridge neighbors of the current node which should bevisited. Like nf , it ranges in the real interval (0,1].Clearly, this parameter is greater than 0 in order to allowthe visit of at least one bridge (if any), thus resulting incrossing to another social network.</p><p> btf (bridge tuning factor). It is a real number belongingto [0,1] which allows the filtering of the bridges to bevisited among the available ones, on the basis of theirdegree. Its role will be explained in the following.</p><p>The pseudo-code of BDS is shown in Algorithm 1.</p><p>V. EXPERIMENTS</p><p>In this section, we present our experiment campaign con-ceived to determine the performances of BDS and to compareit with BFS, RW and MH when they operate in a SIS.</p><p>A. Techniques and Metrics</p><p>In order to perform our experiments, we implemented thefour crawling strategies considered in this paper. Since wewanted to analyze the behavior of these strategies on a SIS,we had to extract not only connections among the accounts ofdifferent users in the same social network but also connectionsamong the accounts of the same user in different socialnetworks. In order to detect these connections, two standardsencoding human relationships, namely XFN [6] and FOAF[5], are generally exploited. Interestingly enough, they allowthe identification of bridge nodes as those nodes linked by meedges; suitable instructions are present in XFN and FOAF todefine these edges and, ultimately, to derive bridges.In our experiments, we considered a SIS consisting of</p><p>four social networks, namely Twitter, LiveJournal, YouTubeand Flickr. These networks are compliant with the XFN andFOAF standards and have been largely analyzed in the past in</p><p>Algorithm 1 BDSNotation We denote by N(x) the function returning the number of the non-bridge</p><p>neighbors of the node x, and b...</p></li></ul>


View more >