CommonFinder: A decentralized and privacy-preserving common-friend measurement method for the distributed online social networks

Download CommonFinder: A decentralized and privacy-preserving common-friend measurement method for the distributed online social networks

Post on 23-Dec-2016

212 views

Category:

Documents

0 download

TRANSCRIPT

  • pod

    ollege

    ince 4

    the distributed online social networks (DOSN for short),e.g., Safebook [1], LotusNet [2], Cuckoo [3], diaspora [4],Peerson [5], Vis-a-Vis [6] have been proposed to better

    ecommendationtworks are able tocomplete knowl-DOSNs, itilling tose of the

    leakage. As a result, it is highly desirable to develoable and privacy-preserving methods to quantify thebility of being friends for DOSN users.

    One of the most popular metric for friend recommenda-tion is based on the Number of Common Friends (NCF forshort), which has been shown to be very effective to rec-ommend new friends or nd old friends [7]. Measuring

    http://dx.doi.org/10.1016/j.comnet.2014.02.0201389-1286/ 2014 Elsevier B.V. All rights reserved.

    Corresponding author. Tel.: +86 13308491230 (Y. Wang).E-mail addresses: yongquanf@nudt.edu.cn (Y. Fu), wangyijie@nudt.

    edu.cn (Y. Wang), wpeng@nudt.edu.cn (W. Peng).

    Computer Networks 64 (2014) 369389

    Contents lists available at ScienceDirect

    Computer N

    journal homepage: www.elsproviders can peek at users proles at will for targetedadvertisements or even sell these sensitive information tothird parties for prots. As a result, improving the privacyprotection of social networks becomes increasinglyimportant.

    Fortunately, borrowing the success of the P2P systems,

    for users, which is known as the friend rproblem. The centralized online social necompute possible friends by using theedge of users proles. Unfortunately, formonly believed that end hosts are unwtheir proles to unknown entities becauis com-publishprivacyp scal-possi-1. Introduction

    Online social networks such as Facebook, YouTube,Flickr have become quite popular these days. However,there are also hot debates over the privacy protection ofthese centralized social services. For example, the service

    protect users privacy by hosting the DOSN infrastructuresbased on decentralized end hosts. The key idea is to storepersonal data on decentralized nodes and to enforce strictaccess rules on who can visit a users prole.

    To increase the popularity of distributed online socialnetworks, an open question is how to nd potential friendsArticle history:Received 19 November 2012Received in revised form 9 October 2013Accepted 3 February 2014Available online 7 March 2014

    Keywords:Distributed online social networkPrivacyCommon friendsFriend recommendationCoordinateBloom lterDistributed social networks have been proposed as alternatives for offering scalable andprivacy-preserving online social communication. Recommending friends in the distributedsocial networks is an important topic. We propose CommonFinder, a distributed common-friend estimation scheme that estimates the numbers of common-friends between anypairs of users without disclosing the friends information. CommonFinder uses privacy-preserving Bloom lters to collect a small number of common-friend samples, andproposes low-dimensional coordinates to estimate the numbers of common friends fromeach user to any other users. Simulation results on real-world social networks conrm thatCommonFinder scales well, converges quickly and is resilient to incomplete measurementsand measurement noises.

    2014 Elsevier B.V. All rights reserved.a r t i c l e i n f o a b s t r a c tCommonFinder: A decentralized andcommon-friend measurement methonline social networks

    Yongquan Fu a, Yijie Wang a,, Wei Peng ba Science and Technology on Parallel and Distributed Processing Laboratory, CHunan Province 410073, ChinabCollege of Computer, National University of Defense Technology, Hunan Provrivacy-preservingfor the distributed

    of Computer, National University of Defense Technology,

    10073, China

    etworks

    evier .com/ locate/comnet

  • preservation of users privacy. Section 4 then analyzes typ-ical characteristics of social-network data sets. Section 5next presents an overview of our proposed NCF predictionmethods. Section 6 presents the coordinate computationprocess. Section 7 next applies the decentralized coordi-nates to select top-k users for recommending friends.Section 8 then measures the degree of the privacy protec-tion offered by CommonFinder. Section 9 next comparesCommonFinders performance with extensive simulationresults. Section 10 next conrms CommonFinder is robustto Sybil accounts. Section 11 then summarizes related

    370 Y. Fu et al. / Computer Networks 64 (2014) 369389NCF values in the distributed online social networks ishowever, a difcult task. First, disclosing friend lists tonon-friend users could severely leak users privacy infor-mation [8,9]. Second, the NCF computation has to scalewell since most end hosts have limited bandwidthcapacity.

    Talash [10] computes the common friends based onexchanging friend lists between pairs of users, which notonly leaks personal information, but also does not scalewell. The PSI approach [1114] represents a list of friendswith coefcients of a polynomial. It then exchanges Homo-morphic encrypted coefcients via the communicationlinks, and nally computes the sums of coefcients, whichcorrespond to the set of common items in two friend lists.The PSI approach works well for semi-honest users, but in-creases the transmission bandwidth and the computationcosts.

    We propose a scalable and privacy-preserving distrib-uted NCF estimation method called CommonFinder thatestimates the NCF values between any pairs of users andis able to select top-k users that have the largest NCFvalues.

    CommonFinder represents a users friend list with thewell-known Bloom lter. Our privacy analysis (Section 8)proves that the Bloom lter provides the differential pri-vacy [15,16] for the friend list: Given a user is Bloom lter,a curious entity is unable to determine who are user isfriends.

    Unfortunately, exchanging the Bloom lters may needseveral KBytes bandwidth costs in order to control the falsepositives of the Bloom lter. To further increase the scala-bility of estimating the NCF values, we propose a distrib-uted maximum margin matrix factorization method toestimate the NCF values by low-dimensional coordinates.As a result, the transmission size is xed to be the length ofthe coordinate that is independent of the length of the friendlists or Bloom lters.

    We model the NCF completion problem with the maxi-mum margin matrix factorization method and provide asystematical design of a distributed NCF prediction methodthat signicantly extends our prior work [17,18]. TheMMMF method maps contiguous matrix factorization re-sults to discrete NCF values with adaptive thresholds.These thresholds are learnt during the process of optimiz-ing the matrix-factorization model. To adapt to the decen-tralization of users, we reformulate the MMMF method inseparable objective functions that involve each users coor-dinate and a small number of neighbors. For each user, wedesign a fully decentralized conjugate gradient optimiza-tion method to adjust the coordinates of each user thatconverges quickly.

    Finally, we present extensive privacy analysis and sim-ulation results over the real-world social network topolo-gies. Our results show that CommonFinder not onlyprotects users friend lists, but also signicantly improvesthe prediction accuracy and is more robust against incom-plete or incorrect NCF measurements than previousmethods.

    The rest of the paper is organized as follows. Section 2introduces the background information. Section 3 nextdenes the problem of predicting NCF values withwork. Section 12 concludes the paper. Table 1 shows keyparameters.

    2. Background

    2.1. Distributed online social networks

    We introduce the basic characteristics for existingDOSNs.

    2.1.1. Social graphLet a user be an online entity that participates in the

    DOSN. Each user A has a friend list SA, where a friend Bof a user A is a user B that establishes the social link orfriendship link with user A on the DOSN. The commonfriends of two users A and B are represented by the subsetof users that are both friends of user A and B at the sametime.

    Users and social links form a social graph. Each userand his/her friends are adjacent on the social graph, whichcorrespond to one-hop neighbors to each other. The de-gree of each user amounts to the size of its neighbors onthe social graph. The number of hops between two usersamounts to the length of the shortest path for these twousers on the social graph.

    2.1.2. Privacy-preserving message routingUsers data are usually stored into his/her own com-

    puter. For improving the availability of users data whenthey are ofine, some DOSNs like Safebook replicate eachusers data on his/her friends computers. These friendsreplication will be used for ofine users.

    The social graph is maintained via some kind of Peer-to-Peer substrates. When a user joins the DOSN, each userscomputer needs to be registered in the Peer-to-Peersubstrate consisting of decentralized computers. Usersreal names are decoupled with randomized keys calledidentiers that are generated by cryptographic hash

    Table 1Notations and their meanings.

    kI The number of hash functions in a Bloom lter

    mI The length of a Bloom lterN The number of users on the DOSNbX The coordinate distance matrixY The pairwise NCF matrix between a set of usersL The maximal NCF valued The coordinate dimensionh The NCF-mapping thresholdsk The number of recommended users

  • user to another non-friend user may be eavesdropped by

    exactly matched users or fails when timeout events

    ment to the search process, most OSNs also recommend

    Y. Fu et al. / Computer Networks 64 (2014) 369389 371friends based on users proles and friend lists.

    2.2.2. Recommending friendsTo recommend friends, the OSNs have to locate the set

    of users that are likely to be future friends for a user. TheNumber of Common Friends (NCF) is frequently used toquantify the probabilities of being friends between users.Then recommending friends for each user A is realized byselecting the top-k users having the largest NCF valueswith user A. These top-k users will serve as the recom-mended future friends. Afterwards, users are promptedwith these top-k users in the OSNs web pages. Each userlater can send invitations to those people in the list. TheNCF values may be appended as a proof of the friendship.Upon receiving the acknowledge of accepting the invita-tion of being friends, a new friendship link is thenestablished on the DOSNs.occur.

    Unfortunately, searching real-life friends is time-con-suming for users, since users have to manually look upevery friend he/she has remembered. Therefore, comple-curious or malicious users.

    2.2. Establishing friendship on the DOSNs

    We introduce how users locate real-life friends on theDOSNs.

    2.2.1. Searching friendsA user that logs into an OSN usually searches real-life

    friends with their names and lters out non-friend usersby checking their proles. The search procedure differsfor centralized and distributed OSNs:

    When centralized OSNs receive users requests, theback-end servers query the OSNs databases to selectmatched users proles. For most DOSNs, since users proles are stored ondecentralized nodes, users requests are routed amongthe Peer-to-Peer substrate. The response either locatesfunctions like SHA-1. The identier is originally used forrouting messages among users, but for the DOSN context,these identiers also hide persons real-world names, sinceeach identier can be assumed as a perfectly randomvariable.

    The identiers are robust against dictionary attacks bycurious users. Since the space of the identiers is huge en-ough to prohibit the enumeration because of the over-whelming computational overhead. For example, thereare 2160 kind of possible strings for 160-bit identiers.

    Existing DOSNs usually route messages among usersalong the social connections in a hop by hop manner, in or-der to protect the privacy of the end-to-end communica-tion. Since each logical link corresponds to the real-liftfriendship, otherwise, sending a request message from a2.3. Maximum margin matrix factorization basedcollaborative ltering

    We next introduce the MaximumMargin Matrix Factor-ization (MMMF) method [19] for the collaborative lteringproblem, which seeks to predict rating scores from users togoods, given a partially available set of rating histories.

    2.3.1. Predicting rating scores with matrix factorizationGiven Nc customers and Ng kinds of items, we can orga-

    nize the set of rating scores as a Nc-by-Ng rating matrix Y,where Yij denotes the rating score from user i to the itemcorresponding to the jth column. The objective of the col-laborative ltering problem is to complete the missingitems in the matrix Y. One of the most popular collabora-tive ltering approach is the matrix factorization, whichtrains a low rank matrix bX to approximate the rating ma-trix Y.

    To obtain the matrix bX, the matrix factorization has tominimize a loss function that quanties the differencesbetween Y (training set) and bX. For a complete matrix Y,the Singular Value Decomposition (SVD) method yieldsthe optimal low-rank approximation [20]. But when Yhas some missing items, SVD becomes less accurate withincreasing missing items, since it treats all missing itemsas the same constant values.

    Worse still, nding the optimal matrix bX for an incom-plete matrix Y is likely to fail due to the overtting phe-nomenon: although items in the training set are predictedperfectly, the items out of the training set however receivelarge prediction errors. Fortunately, in order to mitigatethe overtting issue, it is well known that we can controlthe capacity of the loss function by combining the lossfunction with a regularization model.

    Further, the matrix bX are generally contiguous values,while the rating scores in Y are usually discrete integers,e.g., from one star to ve stars. As a result, we have tomap the contiguous outputs to discrete rating values,which is generally a classication problem. Unfortunately,SVD simply rounds the contiguous values to nearby inte-gers, which may degrade classication errors.

    2.3.2. Overview of MMMFSuppose there are a total number of L rating scores, in

    order to optimize the classication of rating scores, MMMFintroduces a N-by-L 1 threshold matrix h. The thresholdmatrix h is unknown a prior and must be learnt from thetraining process. The ith row vector ~hi of the matrix h isused for mapping the ith row vector of the matrix bX to dis-crete rating scores.

    MMMF classies rating scores according to the SupportVector Machine (SVM). Suppose that we need to group bX toa binary matrix Y, the SVM groups bX to two separate clas-ses by locating a hyperplane that separates the items fromone group with items in the other group. In MMMF, thethreshold matrix is analogous to the hyperplane that sepa-rates items in bX to discrete rating scores. The optimalhyperplane of a SVM is dened as the one with the largestmargin between pairs of items from two groups, where themargin amounts to the maximum distance of the slab thatis orthogonal with the hyperplane.

  • query results on stored items. The differential-privacy

    372 Y. Fu et al. / Computer Networks 64 (2014) 3693892.3.3. Mapping bX to rating scoresMMMF maps the ith row vector in matrix bX by the

    threshold vector~hi, the ith row vector of the threshold ma-trix h. Let hi

    ! hi1; . . . ; hiL1

    be a vector of thresholdssorted in the ascending order. We create N adjacentintervals:

    1; hi1 ; hi1; hi2 ; . . . ; hiL1;1

    with ~hi as L 1 separating points.For each item bXij, its rating score is calculated as the in-

    dex of the interval that contains the item bXij. For example,suppose there are two thresholds 0:2;0:3 that separatethe whole range of real values into three intervals:

    1;0:2 ; 0:2;0:3 ; 0:3;1 An item bXij 0:2 is then mapped to 2, since bXij is within

    the second interval 0:2;0:3 .

    2.3.4. Soft margin loss functionAn optimal classication of the estimated matrix bX to

    the rating score matrix Y means that, each item bXij shouldbe contained in the interval hiYij1; hiYij

    i, otherwise, we

    have a misclassication.The hard margin matrix factorization seeks a matrix bX

    matching observed rating scores by classifying each itembXij with immediate thresholds hiYij1 and hiYij of the Yijthinterval:

    hiYij1 1 < bXij < hiYij 1 1for all ij 2 S. Unfortunately, the hard margin is sensitive tonoises. Noises are common in bX, since bX consists of inher-ent uncertainty. Further, Eq (1) is non-differentiable, caus-ing the nontrivial difculty for the optimization process.

    For improving the robustness against noises, the softmargin matrix factorization relaxes the hard constraint ofEq. (1) by adding slack variables nij P 0

    for all ij 2 S:

    hiYij1 1 nij < bXij < hiYij 1 nij 2As the optimal solution for Eq. (2) amounts to that by opti-mizing the hinge loss function h z max 0;1 z [19](the distance from the classication margin), we transformEq. (2) to:

    L Yij; bXij h bXij hiYij1 h hiYij bXij 3Enforcing the hinge loss functionmakes Eq. (3) become dif-ferentiable, therefore, Eq. (3) can be solved via the convexoptimization.

    Further, MMMF also penalizes the errors of classica-tion with all thresholds in each threshold vector in orderto further improve the robustness of Eq. (3):

    L Yij; bXij XL1r1Lr; bXij XYij1

    r1h bXij hir XL1

    rYijh hir bXij

    XL1r1

    h Trij r;Yij hir bXij 4

    where the indicator function Trij determines whether athreshold item r in hi

    !is larger than the rating score Yij:Df maxA;B;AB1

    f A f B k k1 8

    [15]method provides very strong protection of the privacy:For any two databases D1 and D2 that differ only oneitems x, i.e., D1 D2 x. An adversary is unable to dis-tinguish whether the entity x is in the database D1 orD2.

    A random function j provides -differential privacy iffor any two databases D1 and D2 having one different item,i.e., where jD1 D2j 1 and all /#Range j PrjD1 2 / 6 exp PrjD2 2 / 7where Range denotes the output of the random function j[15]. The random function j adds some random noises tothe query results. The parameter is public to all users.Setting is up to the users decisions.

    The degree of the privacy protection of the randomfunction j depends on the scale of the added noise andthe sensitivity of the query result. The sensitivity statesthe maximum difference of the query results due to addingor removing an entity from the database:

    Denition 1. For a function f : D! Rd over a certaindomain D, the sensitivity of a function f is2.3.5. Formulating objective functionMMMF seeks to optimize Eq. (4) with robustness

    against the overtting phenomenon. To that end, MMMFnds a matrix bX and a classication matrix h by minimiz-ing the loss function in Eq. (4) plus two regularized items Uk k2F Vk k2F , where k kF denotes the Frobenius norm thatamounts to the squared root of the sum of squared items inthe matrix U or V. The regularized item is used for thecapacity control in order to mitigate the overtting prob-lem of Eq. (4).

    The objective function of MMMF can be written as:

    JU;V ; h XR1r1

    Xi;j2X;Yij>0

    hTrijhir UiVj a2 Uk k2F Vk k2F

    6

    where a is a trade-off constant.MMMF solves Eq. (6) based on the PolakRibire variant

    of the nonlinear conjugate gradient methods (PR-CG) in acentralized manner. PR-CG incurs a linear computationcomplexity, converges quickly and is robust to missing orerroneous inputs [21].

    2.4. Differential privacy

    The differential privacy [15,16] represents one of thestate-of-art privacy-protection techniques. A databasestores a set of items. Users interact with the databasethrough statistical queries and the database respondsTrij r;Yij 1 r P Yij1 r < Yij

    5

  • inject fake friend-list information into the system (see

    based on the NCF metric, the OSN needs to know the sizesof the intersections of the friend lists among two-hop

    top-k recommended users of user A according to the NCF

    DOSN faces severely imbalanced computation and trans-mission costs.

    Y. Fu et al. / Computer Networks 64 (2014) 369389 373users.Unfortunately, directly disclosing the friend lists to

    non-friend causes severe privacy leakage. This is becausethe friendship links on the social graph can serve as the n-gerprints to locate users activities [23]. For example, userspersonal proles could be accurately inferred from theirfriends even they hide their proles [8]. Therefore, to bet-ter protect users privacy, centralized OSNs allow for usersto hide friend lists from non-friend.

    Further, disclosing friend lists to non-friend is also vul-nerable to the Identity Cloning Attack (ICA) [9]. Supposethat a user T obtains a large number of friend lists of usersSection 10).

    3.2. Disclosing friend lists leaks privacy

    To recommend each user a set of candidate friendsFor real-valued functions, the most popular differentialprivacy approach [15] is to add independently identicaldistributed (i.i.d) noises from the Laplace distribution tothe query result, which provides -differential privacy:

    Theorem 1. For a function f : D! Rd over a certain domainD, a random function

    Mx f x Laplace Df

    d9

    provides -differential privacy [15].

    3. Problem denition

    We rst introduce how we model the behaviors ofusers. We next present the requirements of recommendingfriends based on the numbers of common friends.

    3.1. Adversary model

    Users Trust Friends. We assume that users trust theirfriends and allow their friends to visit their proles includ-ing their friend lists, since online friends correspond to thereal-world friendships. In the social graph, it means thateach user trusts his one-hop neighbors. This model allowsfor the anonymous communication process on the DOSN,since each user can recursively sends messages to his orher friends.

    Users Are Semi-honest. We further assume that usersare semi-honest [22]: users may be curious to learn thefriend lists of non-friend, but they follow the common-friend measurement protocol and do not claim fake friendsabout his/her friend list. Particularly, each user is able toeavesdrop on the logical communication links connectingthat user.

    Although the semi-honest model is simple, designingalgorithms for DOSNs turns out to be nontrivial due tothe large scale and the decentralization of nodes. Further,we will extend our model to tolerate Sybil users that could4. Analyzing the NCF metric of the online socialnetworks

    To better understand the characteristics of pairwise NCFvalues, we next analyze the NCF metric with OSN data sets.metric.

    3.3. Problem requirements

    We next summarize key requirements for computingpairwise NCFs in DOSN settings:

    Privacy-preserving. Friend lists should be treated asprivacy information since disclosing them may leakusers privacy and suffer from the ICA attack in Sec-tion 3.2. However, hiding the friend lists from non-friend signicantly increases the difculty of computingthe NCF values. Decentralized. As there exist no centralized entitiesthat are able to collect the friend lists of all userson the DOSN, users have to compute the NCF valueswith their two-hop neighbors on the social graph ina distributed manner. Further, the NCF-computingprocess among two-hop online users should still workdespite some users may join or leave the DOSN atwill. Scalable. Computing the NCF values should require lowbandwidth overhead and scale well with increasingfriends, since each user only has limited computingand bandwidth resources.

    3.4. A naive privacy-preserving NCF-computation solution

    One simple decentralized approach is to let each friendB of a user A be the broker of computing the NCF values foruser A [?]. User B computes the NCF values between A andthose in Bs friend list. To that end, user B requests thefriend lists from user A and one friend C, and then calcu-lates the intersection of two friend lists of A and C. Wecan see that the friend lists are always kept at one-hopneighbors. Since each user trusts one-hop neighbors onthe social graph, the privacy of the friend list is preserved.

    For calculating the common friends between any pair oftwo-hop users, each broker j needs O

    Pp2List j List p

    com-

    munication overhead, where List denotes the list offriends of a user. The computation overhead is.

    OP

    p2List j P

    q2List j ;pqjList p jjList q j

    for querying theexistence of each item over the friend lists. Due to theskewed distributions of friend lists shown in Fig. 1(a), theon the OSNs. To establish the friend link with each user A,user T creates an account with a friend list by copying fromfriend lists of user A and some other users for randomiza-tion. Then user T will have many common friends withuser A, therefore, user T will be likely to be listed in the

  • 0.20.30.40.50.60.70.80.9

    1Pr

    (X>x

    )FacebookYouTubeFlickr

    1000

    0.20.30.40.50.60.70.80.9

    1

    N

    Pr (X

    >x)

    FacebookYouTubeFlickr

    F dis

    0.05

    0.1

    0.15

    0.2

    Pr (X

    >x)

    FacebookYouTubeFlickr

    s of ke

    374 Y. Fu et al. / Computer Networks 64 (2014) 3693894.1. Data sets

    We choose three representative social graphs that areall collected by the online social networks research groupin the Max Planck Institute for Software Systems [2426].Table 2 summarizes the basic information for each of thenetworks.

    4.2. Distributions of numbers of friends

    We rst plot the Complementary Cumulative Distribu-tion Function (CCDF) of the number of friends for all users.Fig. 1(a) shows that the sizes of friend lists are quite non-uniform. Most friend lists have modest sizes, but a smallnumber of friend lists could have thousands of friends.For Facebook and YouTube data sets, only 5% of users havemore than 100 friends, while for the Flickr data set, over30% of users have over 100 friends.

    4.3. Distributions of numbers of common friends

    We then plot the NCF values between online users inFig. 1(b). The NCF values are non-uniformly distributed.For the Facebook and YouTube data sets, most one-hopor two-hop users have only one common friends, but thereexists 10% of pairs having a large number of commonfriends. For the Flickr data set, there are much more com-mon friends than the other two data sets, where around30% of pairs have more than 500 common friends.

    1 10 100 10000

    0.1

    # Friends

    (a) Degree distribution.

    1 5000

    0.1

    (b) NC

    Fig. 1. The CCDF plot4.4. Triangle inequality violations of numbers of commonfriends

    We next test whether the pairwise NCF values meet thetriangle-inequality assumption required by the metricspace model. The triangle inequality states that the sumof two NCF values is always not smaller the third NCF valuefor a triple i; j; q: Yij Yjq P Yiq. We dene the triangle

    Table 2Dataset summary.

    Network Date # Users

    Facebook 12/29/2008, 1/3/2009 60,29Flickr 11/2/2006, 104 days 2,570,53YouTube 1/15/2007 1,157,82inequality violation (TIV) as TIVijq YiqYijYjq. When TIVijq > 1,this triple i; j; q has a TIV.

    From Fig. 1(c), most triples follow the triangle inequal-ity, but there exist about 5% of triples violating the triangleinequalities. Further, for the YouTube and Flickr data sets,some triples may have large degrees of the triangleinequality violation. Since the pairwise NCF values do notfulll the metric-space requirements, there exist inherentdistortions for predicting NCF values with the metric spacebased methods such as LandmarkMDS.

    5. Designing a privacy-preserving NCF computationmethod for DOSNs

    We next present a scalable and privacy-preserving NCFprediction method that measures NCF values with con-stant bandwidth overhead.

    5.1. Intuitions

    Since our major objective is to recommend friendsbased on top-k NCF values, we do not necessarily knowthe exact list of common friends. Further, the recommen-dation also tolerates a low degree of prediction errors, asdifferent people have varying subjective views on friends.We can obtain a more efcient NCF computation methodwith sketches of friend lists that have compact storagestructures. For obtaining the pairwise NCF values scalablywith the protection of users privacy, our work applies

    1500 2000 2500

    CF

    tribution.

    0 1 2 3 4 50

    Yik/(Yij+Yjk)

    (c) Triangle inequality violation.y statistical metrics.the well-studied Bloom lters and the collaborative lter-ing methods to measure NCF values the decentralizedOSN settings.

    5.2. Architecture

    We present the main components in the CommonFind-er method. For ease of presentation, assume that a user Bob

    # Links Authors

    0 1,545,686 Viswanath et al. [24]5 33,140,018 Cha et al. [25]7 4,945,382 Mislove et al. [26]

  • logs into the DOSN. As plotted in Fig. 2, at the top level,CommonFinder provides two interfaces for the upper-layerfriend recommendation application:

    Top-k ranking: Selects at most k two-hop users havingthe largest numbers of common friends with Bob. Theseusers will serve as the recommended friends for Bob.The number of common friends for a pair of users iscomputed via the NCF-prediction component. NCF prediction: Computes the NCF value between Boband any other DOSN user based on their decentralizedcoordinates. Each users coordinates is freely exchangedamong DOSN links.

    The coordinate-maintenance component manages ausers coordinate in a fully distributed manner. Each useris assigned a random coordinate when he/she logs intothe system; afterwards, a daemon incrementally maintainseach users coordinate.

    For maintaining the coordinates, the neighbor-manage-

    messages.

    Y. Fu et al. / Computer Networks 64 (2014) 369389 375The Bloom-lter component represents each usersfriend list with a Bloom lter. It uses a bit array to repre-sent a list of items. Each item is hashed to a number of bitsaccording to a set of hash functions. The storage and trans-mission overhead are reduced by a constant factor com-pared to the friend lists. To test whether an element is inthe friend list, we compute the bits in the bit array usingthe same set of hash functions, and test whether these bitsare all set to ones. If yes, then the Bloom lter returns thatthe element is in the friend list.

    5.3. Exchanging bloom lters for privacy protection

    We select the Bloom lter as the sketch structure. EachBloom lter has a certain probability of false positives, i.e.,

    Top-k Ranking NCF prediction

    Bloom-filter maintenance

    Coordinate maintenance

    Neighbormanagement

    Fig. 2. The main components of the NCF computation method.ment component selects neighbors from online users hav-ing non-zero NCF values. This neighbor-managementcomponent sends heart-beat messages to these users andpulls back online users coordinates via the piggybackit may claim a user that is not in the friend list to be in thelist. The false positives are often treated as one drawback ofthe Bloom lter. However, each user can state that he/sheis not represented by the Bloom lter because of the uncer-tainty caused by false positives. Therefore, we can use thefalse positives to protect the privacy of users friend lists,since a curious user cannot exactly infer who is in thefriend list by querying the sketch.

    Suppose that we represent each friend list with a Bloomlter, two users can exchange their Bloom lters to eachother without worrying about the loss of privacy of thefriend lists. In Section 8.2, we prove that the Bloom lterprovides differential privacy for the friend lists.

    5.4. Scale the NCF computation with decentralized prediction

    The transmission and computation overhead of theBloom lter are still linear with the sizes of friend lists.For controlling these overhead, we predict NCF values withdecentralized coordinates.

    5.4.1. OverviewWe compute the NCF values with the Bloom lters for a

    small fraction of users, then we predict the NCF values forthe rest of pairs of two-hop users with low-dimensionalcoordinates. Accordingly, each user has a low-dimensionalcoordinate, the NCF value between a pair of users amountsto the coordinate distance between these two users. Esti-mating NCF values with decentralized coordinates has atleast two desirable advantages:

    Scalability: The coordinates incur xed computationand communication costs, which are independent ofthe sizes of the friend lists. Privacy-protection: Since coordinates are low-dimen-sional real-valued vectors, exchanging coordinates onlyreveals the NCF values without the information of thefriend lists.

    Let Y denote a N-by-N matrix that denotes the pairwiseNCF values for pairs of users. For correctly predicting NCFvalues, the coordinates must be optimized with respectto an objective function. Selecting an appropriate objec-tive functions is nontrivial:

    Sparse: The NCF matrix Y may be quite sparse, becauseYij amounts to zero for any pair i; j of users that aremore than two hops away on the social graph. Distributed: Since users are decentralized in DOSN,optimizing the coordinates has to be performed in a dis-tributed manner, where each user adjusts his/her owncoordinate adaptively with respect to a number ofneighbors.

    Predicting NCF values is analogous to completing therating scores by users in the collaborative ltering eld,since both are discrete integers. We thus propose to predictNCF values for decentralized pairs of users by borrowingthe success of the collaborative ltering eld [27]. TheMMMFmethod (see Section 2.3) is robust to missing items,but we have to transform the centralized MMMF

  • The matrix bX is represented by a linear combination of

    users. For each user i, we set its coordinate as two low-

    Second, CommonFinder fully utilizes the low-dimen-

    tion overhead. Therefore, CommonFinder scales well withdecentralized users on the DOSNs.

    Finally, as NCF has been a very popular metric for rec-ommending friends on centralized OSNs, CommonFindercan serve as a replacement for existing NCF computationin centralized OSNs, which can improve the scalabilityand increase the responsiveness to users requests.

    6. NCF Prediction with decentralized coordinates

    We next present how to estimate the NCF between anypair of users combining the Bloom lters and the

    376 Y. Fu et al. / Computer Networks 64 (2014) 369389dimensional vectors ~ui;~v i and a vector of threshold val-ues ~hi, where ~ui denotes ith row vector of U, ~v i denotesthe ith column vector of V, and ~hi represents the ith rowvector of h. Accordingly, the coordinate distance bXij be-tween two users i and j amounts to the dot productbXij uiv j.5.4.3. NCF prediction

    To predict NCF values to another user j, each user imapsthe coordinate distance bXij to a discrete NCF value with his/her own threshold vector ~hi. Algorithm 1 summarizes keysteps to compute the NCF values with coordinates. Step 2computes the distance bXij. Steps 37 determine the indexof the interval that contains bXij. We can see that user i doesnot need user js threshold vector. Fig. 3 shows an exampleof computing NCF values by users.

    Algorithm 1. Mapping the coordinate distance to the NCFvalue.

    1 MapXtoY (~ui, ~v j, ~hi)input: ~ui: user is coordinate component, ~v j: user jscoordinate component,~hi: User is threshold vector.output: y

    2 bXij ~ui ~v j;3 y = 1;4 for l 1! L 1 do5 y bXij P hil?1 : 0;6 end7 return y;two low-rank matrices: bX U V , where U is a N dmatrix, V is a d N matrix, and d N. Each row vectorof the matrix bX is represented by the inner product ofvectors from U and V , i.e., bXij Pdm1uimvmj. The threshold matrix h hilf g, where i 2 S; l 2 1; L 1 ,is learnt from MMMFs optimization process.

    For predicting NCF values with the MMMF method, weassign row vectors of the matrices bX and h to decentralizedmethod to a decentralized MMMF method since users aredecentralized in the DOSN context. We solve the distrib-uted MMMF method with a novel decentralized conjugateoptimization method that converges quickly to stablepositions.

    5.4.2. Coordinate structureWe next introduce the structure of the coordinates.

    Let S denote a set of DOSN users. Let N be the size ofset S. Let d be a constant parameter (d N). Recall thatthe MMMF method computes a N-by-N matrix bX and aN-by-L 1 threshold matrix h for predicting discreterating scores. The ith row vector of matrix h is used toseparate real values of the ith row vector in matrix bXto discrete values.fore exchanging coordinates among non-friend users doesnot disclose users friend lists.sional coordinates for calculating pairwise NCF values thathave constant transmission bandwidth costs and computa-5.4.4. Coordinate disseminationIn order to predict pairwise NCF values, users need to

    exchange the low-dimensional vectors corresponding tothe matrix factorization model, but hides each usersthreshold vector from other users. For example, each useri can request any other user js coordinate components~uj;~v j. However, we do not allow each user to requestother users threshold vector. As a result, each user is onlyable to predict NCF values from himself/herself to otherusers, but cannot determine the NCF values of an arbitrarypair of users.

    Sending the threshold vectors could leak users friend-ship information. Since knowing the nonzero NCF valuesof a user pair indicates that two users are either friendsor friends of friends, thus the social contacts of users canbe easily derived. Consequently, users privacy settingcould be violated and even worse, users otherwise secretinformation could be inferred as discussed in Section 3.2.

    5.5. Benets of decentralized NCF prediction

    We next summarize the benets of the decentralizedNCF prediction process.

    First, CommonFinder protects users privacy via twocomplement data structures. Bloom lters anonymizeusers identiers and hide the existence of any specicusers with a degree of false positives. Coordinates do notcontain any information about who is in a friend list, there-

    l=1

    -0.2 0.3

    l=2 l=3

    >0.3(-0.2,0.3]

    thresholdsCoordinate distance

    levels

    Fig. 3. Mapping the coordinate distance to NCF values.

  • decentralized coordinates. We rst measure a small num-ber of NCF values with the Bloom lters as inputs to main-tain the coordinates. We next present the neighborhoodmanagement process with respect to which the coordi-

    Alice queries Bobs Bloom lter with her friend list, and

    After establishing several links with friends on the social

    his/her own coordinate with a set of neighbors coordi-

    Y. Fu et al. / Computer Networks 64 (2014) 369389 377graph, user i puts these friends into his/her neighbor setSi, otherwise, this user will restart the search processlater.1

    1 For bootstrapping users online friends, the DOSN may also advertisepopular users proles to each users.vice versa. The subset of friends represented in theBloom lter is returned as the estimated commonfriends.

    Assume that an attacker maintains a dictionary of iden-tiers crawled over the DOSN. The attackers cannot exactlyknow whether an identier is indeed in the set, since theBloom lter is a probabilistic data structure with a false po-sitive probability for a query. The false positives are statis-tical valid for any query, therefore a user is always able todeny that he/she is a friend of another user due to the falsepositives.

    6.2. Neighbor management

    Each user i maintains a set Si of neighbors with a max-imal capacity Sm (Sm 20 by default). Each neighbor is rep-resented with three elds: the identier, the Bloom lterand the coordinate.

    The neighbor set Si covers user is friends. If a user hastoo few friends, we also include a randomized set of two-hop users as neighbors. But we deliberately set usersfriends to have a higher priority than two-hop users sincefriends are trusted. The two-hop users help if there are toofew friends. Since we probe NCF values with Bloom lters,users privacy is still preserved.

    When a user i joins the DOSN system, he/she rst ndsreal-life friends that have already registered in the DOSN.nates are optimized. We then present a distributed algo-rithm to adjust each users coordinate adaptively andscalably.

    6.1. Measuring NCF values based on bloom lters

    We represent each friend with an identier created bycryptographic hash functions like SHA-1 for data routingin the DOSN. Then each friend list is represented by aBloom lter. We use the Dynamic Bloom lter (DBF) [28]to represent a dynamic set of friends because of its simplic-ity and low computational complexity. The DBF [28] in-cludes multiple standard Bloom lters where eachstandard Bloom lter stores only c items.

    Each user then obtains other nodes Bloom lters andqueries them to estimate the set of common friends withhis/her own friend list:

    Alice and Bob exchange their Bloom lters.nates in a fully decentralized manner. We decompose(10) into N sub-objective consisting of coordinates andNCF measurements among each user i and his/her neigh-bors Si:

    JY ~ui; ~v i; ~hi

    XL1r1

    Xj2Si

    h Trij r;Yij hir ~ui~v j

    k2

    ~uik k2F ~v ik k2F

    11

    The objective JY ~ui; ~v i; ~hi

    seeks to optimize user iscoordinates. Since Eq. (11) is dened for each user iseparately, each user i is able to optimize Eq. (11) in aEach online user has a daemon process that triggers aperiodical procedure to sample two-hop users on the socialgraph for user i. The daemon process randomly selectsone of user is friends, say user j, as the counterpart forcommunication. Then user is daemon requests the dae-mon process at user j to send the contact address of afriend of user j. The daemon process at user j then selectsa user q uniformly at random from his own friend listand piggybacks user qs data to the daemon process at useri. Finally, the daemon process at user i puts user qsdata into set Si. If user j does not respond within 20 s,the daemon process at user i removes user j from his/herneighbor set.

    6.3. Decentralized MMMF optimization

    We next formulate the objective function of optimizingusers coordinates for accurate NCF prediction. We formu-late the problem of the NCF prediction with the MMMFmethod. Readers are referred to Section 2.3 for the back-ground of the MMMF method.

    We formulate the objective of predicting NCFs withMMMFs optimization function in Eq. (6) as:

    J u;v ; h XL1r1

    Xi;j 2X

    h Trij r;Yij hir ~ui~v j

    k2

    XNi1

    ~uik k2F ~v ik k2F !

    10

    where X denotes the set of pairs of users with observedNCF values and k is a regularized constant. We can see thatEq. (10) requires all users information, as a result, solvingEq. (10) belongs to the centralized optimization problemwhere an entity collects the NCF matrix Y and computesthe coordinates for all users. Unfortunately, such a central-ized formulation not only leaks the friendship informationof all users, but also does not adapt to the decentralizationof DOSN users where users may leave the systemdynamically.

    As a result, we need a decentralized optimization pro-cess that is able to adapt the dynamics of DOSN users. Tothat end, we decompose Eq. (10) to a set of distributedoptimization problems that are solved by decentralizedusers. The key idea is to let each user iteratively optimize

  • of the last round. Then step 6 calculates the moving step.Step 7gradie

    378 Y. Fu et al. / Computer Networks 64 (2014) 369389distributed manner based on the local information of itsneighbors.

    6.4. Distributed coordinate maintenance

    We adjust each users coordinate in a fully distributedmanner via a novel implementation of the PR-CG meth-od. We rst present how we transform the centralizedPR-CG method to a distributed algorithm for adjustingusers coordinates. We next adapt the coordinate mainte-nance process to the dynamics of decentralizedcoordinates.

    6.4.1. Decentralized conjugate gradient optimizationPR-CG is an iterative optimization algorithm. To mini-

    mize a nonlinear function f x, it iteratively updates thevector ~x according to the conjugate direction of ~x untilreaching a local minima.

    Transforming the iterative PR-CG process to a distrib-uted PR-CG method (D-PR-CG) algorithm is straightfor-ward. Each user i optimizes Eq. (11) in separate rounds.For each round, user i adjusts the vector ~xi once by oneround of the PR-CG method. Further, every two adjacentrounds is separated by s seconds.

    (i) Setting Parameters.We dene the vector ~xi for user iwith the concatenation

    of coordinate components of i, i.e.,

    xi ~ui; ~v i;~hih i

    We next derive the gradient of the vector ~xi as

    rxJY xi ! @JY@ui ;@JY@v i

    ;@JY@hi

    12

    where @JY@ui, @JY@v i

    and @JY@hi

    are partial derivatives with respect tothe coordinate components ~ui, ~v i and ~hi:

    @JY@uiz kuiz

    Xj2Si

    XL1r1

    Trij r;Yij h0 Trij r;Yij hir ui!v j! v jz

    and

    @JY@v iz

    kv iz Xj2Si

    XL1r1

    Trji r;Yji h0 Trji r;Yji

    hjr uj!v i!

    ujz

    for z 2 1; d , and

    @JY@hir Xj2Si

    Trij r;Yij h0 Trij r;Yij hir u!i v!j

    for r 2 1; L 1 .(ii) Bootstrapping.First, we initialize necessary parameters for bootstrap-

    ping the optimization process. Each user is coordinate isinitialized as a randomized vector, where each dimensionis selected from the interval 10;10. Then, user i adjustshis coordinate periodically, separated by s second.

    We set the steepest direction and the conjugate direc-tion for each user i:direction and the conjugate direction and then reconstructuser is coordinate components based on the updatedvector ~xi.next adds the vector ~xi with ai times of conjugatent direction. Finally, steps 812 cache the steepest Compute the steepest direction Dx as the reverse direc-tion of the gradient rxJY xi :

    Dx rxJY xi

    since the reverse of rxJY xi indicates the direction of themaximum decrease for f x. Set the conjugate direction Kx1 with the steepestdirection Dx in the rst round.

    (iii) Incremental Adjustment.Using the steepest direction to adjust the coordinate is

    sensitive to input noises such as the coordinate oscillationor erroneous NCF values. We therefore use the conjugatedirection to adjust the coordinates. At each round l > 1,we updates the conjugate direction Kx of vector ~x as theweighed sum of the steepest direction and the conjugatedirection of the last round:

    Kxl Dxl bl Kxl1 13

    where bl > 0 controls the capacity of the conjugate direc-tion in the current round. We use a revised PolakRibirescalar:

    bi !max 0;DxilT Dxil Dxi1l 1

    Dxil 1TDxil 1

    ( )14

    that automatically resets the conjugate direction. Theparameter bl trades off the robustness and the convergencespeed: Decreasing bl favors the steepest direction, but be-comes more sensitive to noises, while increasing bl reducesthe effect of the steepest direction, but slows down the conver-gence speed of the optimization process.

    We determines the step al of moving the vector ~x. Weuse the well-studied line-search method [21] to be awareof the convergence status of the vector ~x. the line searchmethod minimizes the objective function (11). The moreaccurate the vector~x is, the smaller the movement step be-comes. Therefore, the coordinates become more stablewith increasing rounds of adjustments.

    We then adjust the vector~x at a small distance with theconjugate direction Klx and the movement step al:~x ~x al Klx 15

    Algorithm 2 summarizes the above process for adjust-ing the coordinates. Step 2 calculates the concatenation~xiof the coordinate components of user i. Then step 3 com-putes the new steepest direction D as the reverse directionof the current gradient of~xi. Step 4 then computes the Po-lak-Ribire scalar b that trades off well between therobustness of the optimization process and the conver-gence speed. Step 5 next updates the conjugate directionwith the steepest direction D and the conjugate gradient

  • As users coordinates are modied at each round byAl sco ipr -qu r,fre i-na hla eco em

    ]pr hap sco oth epo

    llne sth -qu -di o

    nate distances. Unfortunately, since we only care about the

    lists of top-k users from his/her friends. The top k users

    Y. Fu et al. / Computer Networks 64 (2014) 369389 379recorded historical records. The minibatch approach worksquite well in practice. For instance, CommonFinderscoordinates converge within twenty rounds.gorithm 2, we have to adapt to the dynamics of userordinates. One simple approach is to let each userobe the NCF values to neighbors in set Si and to reest these users coordinates at each round. Howevequently measuring NCF values or requesting coordtes to all neighbors incurs heavy trafcs at nodes witrge friend lists. Therefore, we have to adapt to thordinate modications of neighbors and control theasurement trafcs.We follow the well-studiedminibatch approach [2931oposed in the network-coordinate eld. The minibatcproach reuses historical positions of most neighborordinates, since historical coordinates become closer te current positions as coordinates converge to stablsitions.Each user caches the NCF values and coordinates of aighbors in the memory; at each round, each user refresheis cache by probing the NCF value to one neighbor and reests this neighbors coordinate, and then updates his coornate with respect to all cached neighbors according tFrom Algorithm 2, the lengths of steepest directionand the conjugate direction amount to the twice sizeof the coordinate. The overall space overhead forAlgorithm 2 is O jSij 2d L . The communicationcomplexity is linear with the number of neighbors. Inpractice, we limit the number jSij of friends, the lengthd of coordinate vector to be around 20, the communica-tion complexity and the storage overhead of Algorithm 2is quite modest.

    6.4.2. Minibatch based coordinate updateAlgorithm 2. Distributed PolakRibire nonlinear conju-gate gradient optimization algorithm.

    1 Update~ui;~v i;~hi;Dxi;Kxi; Si;Yinput: a user is current coordinate~ui, ~v i,~hi, a user issteepest direction Dxi, a user is conjugate directionKxi, the set of neighbors Si, the NCF values from auser i to its neighbors in Si, the coordinates ofneighbors in Si.

    2 xi ~ui;~v i;~hih i

    ;

    3 D rxJY xi ;4 b DT DDxi

    DxTiDxi

    ;

    5 K D bKxi;6 ai argmin

    aiJY xi aiK ;

    7 xi xi aiK;8 Dxi D;9 Kxi K;10 ~ui xi1 : d;11 ~v i xi d 1 : 2d;12 ~hi xi 2d 1 : 2d L ;in the merged list will be returned as the top-k recom-mended friends. The initiator initializes an empty list Listand then merges each newly received list dListkj sent fromhis/her friend j. Both Listi and dListkj are represented withthe linked-list data structure for exible expansion.

    Theorem 2 conrms that merging the local top-k usersyields the correct results. Extending to cases where someNCF values are identical is straightforward, since usershaving the same NCF values with the initiator areexchangeable with each other.

    Theorem 2. For ease of presentation, we assume thatpairwise NCF values are different. Suppose that an initiator iseeks the top-k ranking list of users. By merging local rankinglists from friends of the initiator i, user i has the correct top-kranking list.

    Proof. Let i; j and j; q be two pairs of friends in thesocial graph. Assume that user q is among the top-k rank-ing list for i, but is pruned from the local merging step byuser j. We will prove by contradiction.

    Since user q is pruned by user j, user js top-k rankinglist for i must exclude user q. In other words, user j has atleast k friends having larger NCF values with i than user q.As a result, user qmust not be among the top-k ranking listfor user i. Thus, a contradiction happens. Therefore, user qmust be included in the top-k ranking list of the initiator i,the correct top-k list is preserved. h

    8. Privacy analysis

    8.1. Coordinates hide users friend lists

    Recall that in CommonFinder, each user i can send hiscoordinate component ~ui;~v i to any other user. The~ui;~v i components help other users compute the numberof common friends with user i (see Algorithm 1).top-k users, most of the communication is wasteful.We present a divide-and-conquer based method to re-

    duce the transmission bandwidth costs.Spreading: A user (called the initiator) spreads the

    ranking task to all of his/her friends. Upon receiving theranking task, each user j lters out those who have alreadybeen the initiators friends and selects k users from the restof friends having the largest NCF values with the initiator.The NCF values are computed with the decentralized coor-dinates. Finally, user j piggybacks the initiator with thetop-k users.

    Merging: The initiator then aggressively merges the7. Top-k ranking

    Having presented the decentralized coordinates, wenext show how to use coordinates to select top-k usershaving the highest NCF values as recommended friendson the DOSN.

    A naive approach is to aggregate all coordinates of two-hop users and to compute the descending order of coordi-

  • Intuitively, since ~ui;~v i are two low-dimensional real- Proof. Let I1 and I2 denote the bit arrays for the Bloom l-ters BFS1 and BFS2, respectively.

    We rst compute the probability Pi that two ith bits inthe bit arrays I1 and I2 are different, i.e., I1i 0 andI2i 1. In other words, Pi is the probability that the ithbit in BFS2 does not change after inserting the itemxn1f g, which amounts to the probability that I1i is stillzero after inserting n keys:

    pi enkImI 17

    Since after inserting n keys into a sized-mI Bloom lterwith kI hash functions, the probability that a bit is still zero

    amounts to 1 1mI nkI enkImI .

    We next show the expected L1 difference between theBloom lters BFS1 and BFS2. Let kI hashed positions forthe item xn1 be a1; . . . ; akI

    .

    [

    380 Y. Fu et al. / Computer Networks 64 (2014) 3693898.2.1. Sensitivity of a Bloom lterLet mI and kI be the length of the bit array I and the

    number of hash functions for a Bloom lter. Letf : S! BFS be a function that maps a set of identifers Sto a Bloom lter BFS. Given two sets S1 x1; . . . ; xnf g,S2 x1; . . . ; xn; xn1f g that differ in only one identier xn1.

    We compute the L1 difference of the bit arrays of twoBloom lters BFS1 and BFS2:

    g S1; S2 XmIi1jIBF S1 i IBF S2 i j 16

    in order to reveal the difference of the bit arrays by hashingS1 and S2 into two Bloom lters. Theorem 3 states the ex-pected number of different bits for two Bloom lters repre-senting two sets S1 and S2 that differ in one item.

    Theorem 3. Given two sets S1 and S2 satisfying thatS2 S1 [ xn1f g. The expected number of different bitsbetween the Bloom lter BFS1 and BFS2 amounts toE g S1; S2 kIe

    nkImI .8.2. Bloom lters yield differential privacy

    We next show that the Bloom lters also protect theprivacy of users friend lists. Intuitively, since Bloom ltermay introduce false positives, i.e., the estimation may notcoincide with the ground-truth common friends. Foraround 99% of all cases, the Bloom lter is able to predictcorrect common friends. Moreover, we can turn the falsepositives into a way of protecting the privacy: a user candeny that a user is his (or her) friend, since the Bloom ltercould fail due to the false positives. Increasing the size ofthe Bloom lter may increase the accuracy of estimation,but still fail to distinguish whether a user is in the Bloomlter with 100% certainty.valued vectors, which do not encode any informationabout user is friend list. Therefore, disclosing the ~ui;~v icomponents of user i to his non-friend does not leak useris friend list.

    CommonFinder requires to hide each users thresholdvector ~h. This is because for each user i, his/her thresholdvector~h is only useful for computing the numbers of com-mon friends between i and other users. Further, disclosing~h leaks users friend links. This is because for each user i, amalicious user can compute the set of users having somecommon friends with user i, indicating that these usersare at most two-hop away from user i on the social graph.Therefore, these users are either user is friends or friendsof user is friends. Consequently, the friendship links aredisclosed to malicious users.

    Suppose a malicious user collects user is coordinatecomponents ~ui;~v i, but does not known user is thresholdvector. We can see that this malicious user cannot deter-mine the number of common friends from i to other users,since only the threshold vector uniquely determines theNCF values.0 200 400 600 800 10001

    2

    n

    E

    Fig. 4. Expected number of different bits with increasing items n.Let Zl be the indicator function denoting whether twoalth bits in BFS1 and BFS2 are identical:

    Zl 1 I1 al I2 al 0 else

    ;

    where al 2 1;mI . Let Z be the sum of kI indicatorfunctions:

    Z XkIl1

    Zl 18

    We compute the expected number E Z of different bitsfor BFS1 and BFS2 as:

    E Z EXkIl1

    Zal

    " #XkIl1

    E Zal XkI

    l11Pal kIe

    nkImI

    19We next plot the expected number E Z of different bits

    as the number of items n increases. From Fig. 4, the num-ber of different bits is mostly lower than four. Further,the expected number of different bits drops quickly asthe number of item increases.

    3

    4

    5

    6

    diffe

    rent

    bits

    ]

    m=5000,k=3m=5000,k=6m=10000,k=3m=10000,k=6

  • 9.1.1. Evaluation processOur evaluation tries to ask whether CommonFinder can

    NCF values among two-hop users with CommonFinderand related methods.

    Unfortunately, since most DOSNs are still under devel-opments, no DOSN data sets are publicly available yet.Since DOSNs have consistent social functionalities withthe centralized OSNs, we use the wide-adopted data setsof centralized OSNs introduced in Section 4. These datasets correspond to static social graphs, which do not revealwhen each user adds friends. As a result, our simulation as-

    the accuracy. We set the default number of neighbors for

    All experiments are performed on a laptop with

    Y. Fu et al. / Computer Networks 64 (2014) 369389 381estimate the number of common friends accurately andscalably: (1) How does the Bloom lter based NCF estima-tion method scale for real-world social networks? (2) Is thedecentralized coordinate more accurate than existing NCFestimation methods? (3) How does CommonFinder scalefor real-world social networks.

    The simulation seeks to be general enough for anyexisting DOSN proposals. Therefore, rather than integrat-ing with specic DOSNs, the simulation focuses on predict-ing NCF values for pairs of users. The simulation takes thesocial graph among users as the input and then predicts8.2.2. Differential privacy of bloom ltersHaving shown that the Bloom lters are slightly modi-

    ed by adding or removing an item, we next derive thecapability of the differential privacy for the Bloom lter(see Section 2.4 for brief introduction).

    Theorem 4. Given a sized-mI Bloom lter with ki hash func-tions. Let n be the number of items inserted into this Bloom

    lter. This Bloom lter provides kI ln 1 exp nkImI

    -

    differential privacy.

    Proof. For two sets S2 and S02 that differ in one item, wecompute the ratio of the probabilities that two Bloom l-ters representing sets S2 and S02 having the same bit array:

    Pr IS2 I

    Pr IS02 IjS02 S2 [ yf g

    Pr y does not change the bit array of IS2

    YkIi1

    Pr IS2 hi y 1

    1exp nkImI

    kI exp kI ln 1exp nkImI

    We see that

    Pr IS2 I

    6 Pr IS02 IjS02 S2 [ yf g exp kI ln 1 exp nkImI

    Therefore, representing two sets that differ in one item

    with the Bloom lters provides kI ln 1 exp nkImI

    -

    differential privacy. hTheorem 4 conrms that using the Bloom lter repre-

    senting a friend list provides differential privacy for usersin this friend list. As a result, a malicious user cannot becertain about who is a users friends.

    9. Evaluation

    We next test CommonFinders performance with exten-sive simulations.

    9.1. Experimental setup2.13 GHz CPU and a 2 GB RAM. We have implemented allrelated methods with Matlab 7.0. The experiments are

    Table 3CommonFinder parameters.

    Parameter Value

    Coordinate dimension d 5Number of neighbors for adjusting coordinates 20Regularized parameter k 0.3Neighbor-selection method RandomRounds of coordinate updates 60adjusting the coordinate to 20, which is modest com-pared to the sizes of most users friend lists. We cong-ure the regularized parameter k to 0.3, which yieldsstable accuracy for optimizing the decentralized coordi-nates. We select neighbors randomly for each user fromthe whole set of candidate users due to its robustness.Finally, we evaluate CommonFinders accuracy after eachusers coordinate is updated in 60 rounds. In fact, allusers coordinates have converged to stable positionswithin 20 rounds.

    9.1.3. Performance metricsTo quantify the accuracy of estimations, we use the

    popular metric Normalized Mean Average Error (NMAE),which is dened asPi;j:Yij>0jYij bY ijjP

    i;j:Yij>0Yij20

    where Yij is the ground-truth value, bY ij is the estimated va-lue. Smaller NMAE values correspond to higher predictionaccuracy.sumes that each user has all friends that are available onthe social graph.

    9.1.2. Comparison methodsWe compare CommonFinder with two representative

    NCF prediction methods, LandmarkMDS, a geometriccoordinate based method [32], and ProximityEmbed, amatrix-factorization based method [33]. We congure theparameters of LandmarkMDS and ProximityEmbed accord-ing to the recommended parameters.

    We represent each user with a 160-bit randomizedstring generated with a SHA-1 cryptographic hash func-tion. We set CommonFinders default parameters inTable 3. We choose the coordinate dimension to be ve,since further increasing the dimension does not improve

  • repeated in ten times and we report the average resultsand the corresponding standard deviations.

    9.2. Baseline performance of measuring pairwise NCF values

    To tune the coordinate positions, CommonFinder usesthe DBF to probe the NCF values from a user to his friends.Moreover, DBF also lists the set of common friends that arenot used in the coordinate computation process. The ob-tained NCF values to a set of neighbors are then treatedas the inputs to adjust each users coordinates. Supposethe NCF measurements by the DBF is inaccurate, the coor-dinate adjustment will be impaired accordingly. In otherwords, DBF sets up the upper bound for the predictionaccuracy of the NCF values.

    We therefore rst characterize the baseline perfor-mance of measuring the NCF values for the CommonFind-er. We focus on DBFs performance with n 6 1000, since

    9.2.1. Sensitivity to Bloom-lter parameters

    items per SBF also decreases the degradation of the falsepositive probability of DBF.

    Second, from Fig. 5(b), the storage size of the DBF in-creases with decreasing number c of items inserted into aSBF or the SBF size m. For c 6 50, the storage size increasesquickly with increasing SBF size m. As a result, we need tochoose the number c of items to be bigger than 50 in orderto control the increment of the storage size.

    We next show the detailed variation of performance byxing c or m. From Fig. 6(a) and (b), we see that increasingthe upper bound c exponentially increases the false posi-tive probability of DBF, but decreases the storage size sig-nicantly. As a result, there is a natural trade off betweenthe false positive probability and the storage size. FromFig. 6(c) and (d), increasing the SBF size m exponentiallydecreases the false positive probability, but also increasesthe storage size. As a result, there are no optimal choicesfor DBFs parameters, we need to select c and m balancingthe false positive probability and the storage size. For the

    we next test DBFs accuracy in measuring the pairwise NCF

    exchanging DBFs between pairs of two-hop users. From

    iesnd

    upper bound .

    (bfi

    andability

    382 Y. Fu et al. / Computer Networks 64 (2014) 369389Fig. 5. Variation of the performance of DBF as we vary the upper bound cnumber k of hash functions is selected to optimize the false positive probTo investigate DBFs sensitivity to parameter settings,we compute DBFs false positive probability and calculateits storage size with increasing SBF size m and the numberc of items inserted into each SBF.

    First, from Fig. 5(a), increasing the SBF size m from 0 to1000 bits signicantly decreases the false positive proba-bility of DBF, compared to DBFs with SBF size m > 1000.As a result, we need to select a SBF size m bigger than1000 bits in order to control the false positive probability.Besides, decreasing the upper bound c of the number of

    050

    100150

    200

    01000

    20003000

    4000

    00.20.40.60.8

    1

    cm

    FP

    (a) The expected false positive probabilitas a function of the Bloom filter size amost of the number of online userss friends is below1000 as shown in Section 4.1. To scale well with dynamicitems, Dynamic Bloom Filter (DBF) [28] includes multipleCBF where each CBF stores only c items. After having c in-serted items, a new CBF is added to store subsequent inser-tions. The false positive probability of a DBF is:

    fm;k;c;n1 1 1ekc=m k bn=cc

    1 1ekncbn=cc=m k 21Fig. 7, we see that the storage sizes of all social networktopologies are quite modest, implying that the Bloom l-ters are quite practical for measuring the common friends.

    050100150200

    01000

    20003000

    4000

    0

    20

    40

    60

    cm

    Stor

    age

    (KB)

    ) Storage sizes as a function of the Bloomlter size and upper bound

    the SBF size m. The number of items inserted into the DBF n 1000. Theof each SBF.values for two-hop users on the social graph. We computethe Percentage of incorrect intersection (PII for short) asthe percentage of correct NCF results by DBF.

    Table 4 shows the results on three social network topol-ogies. We see that more than 99% of all tests are correct,which implies that the Bloom lter based NCF measure-ments are quite accurate.

    9.2.3. DBFs bandwidth requirementsWe next compute the bandwidth requirements ofrest of the paper, we choose c 100, m 2000 as the de-fault parameters of the DBF.

    9.2.2. DBFs accuracyHaving determined the default parameters for the DBF,

  • 50 100 150 2001010

    108

    106

    104

    102

    100

    Items Per SBF (c)

    FP n=10000n=1000n=500n=100

    (a) The expected false positive probability asa function of the upper bound

    (bb

    1000 2000 3000108

    106

    104

    102

    100

    SBF size (m)

    FP n=10000n=1000n=500n=100

    (c) The false positive probabilities as a func-tion of the SBF size

    (d

    Fig. 6. The expected false positive probability and the storage size of DBF as wparameters for the experiments are as follows: n 1000, c 100, m 2000. Tprobability of each SBF.

    Table 4Percentages of incorrect intersection results onthree data sets. DBF is congured with the defaultparameters m 2000, c 100.

    Data sets PII

    Facebook 0.000015YouTube 0.000012Flickr 0.00842

    0.5 1 2 3 4 50

    0.2

    0.4

    0.6

    0.8

    1

    Storage (KB) (log scale)

    Pr (X

    >x)

    Facebook

    YouTube

    Flickr

    Fig. 7. The storage sizes of DBF for three social network topologies.

    Y. Fu et al. / Computer Networks 64 (2014) 369389 383101

    100

    101

    102

    Stor

    age

    (KBy

    tes)

    n=10000n=1000n=500n=1009.3. Convergence of the coordinates

    We next test the convergence of the NCF estimations inrounds of coordinate updates. Each user selects at mosttwenty friends for the coordinate update in Algorithm 2in each round.

    Fig. 8 shows the dynamics of estimation accuracy withincreasing rounds of coordinate updates. The estimationaccuracy is quickly improved as the coordinate updaterounds increase. The coordinates have converged to stable

    50 100 150 200

    Items Per SBF (c)) Storage size as a function of the upper

    ound

    1000 2000 3000101

    100

    101

    102

    SBF size (m)

    Stor

    age

    (KBy

    tes)

    n=10000n=1000n=500n=100

    ) Storage size as a function of the SBF size

    e vary the upper bound c and the SBF size m, respectively. The defaulthe number k of hash functions is selected to optimize the false positive

    0 10 20 30 40 500

    5

    10

    15

    20

    25

    Round

    NM

    AE

    FacebookYouTubeFlickr

    Fig. 8. Convergence of CommonFinder on three data sets.

  • positions after ten rounds of coordinate updates. Furthercoordinate updates do not signicantly improve the esti-mation accuracy.

    Furthermore, we also measure the computation time ofCommonFinder for each user. The averaged computationtime of each user on Facebook, YouTube and Flickr is lessthan 0.1 s, which indicates that the coordinate-update pro-cess is quite efcient.

    9.4. Comparison of coordinates based NCF prediction methods

    Having conrmed the convergence of the coordinatecomputation, we next compare CommonFinders coordi-nate based NCF prediction method with existing methods.From Fig. 9, we can see that CommonFinder outperformsLandmarkMDS and ProximityEmbed in several times.CommonFinder is also consistently accurate on different

    384 Y. Fu et al. / Computer Networks 64 (2014) 369389social networks. However, we also see that the perfor-mance on different data sets varies. This is because differ-ent social networks have varying distributions of commonfriends.

    9.4.1. Robustness to incomplete measurementsWe next test the robustness of the NCF estimations to

    incomplete measurements because of ofine users. Werandomly set a fraction of pairwise NCF values to be zero.Then we compare the NCF estimation results of differentmethods with the available NCF observations. We set Com-monFinders parameters to be the default values in Table 3.

    Fig. 10 plots the dynamics of the accuracy with increas-ing missing measurements. CommonFinder is stably accu-rate when the missing fraction is below 0.8, and gracefullydegrades the accuracy as the missing fraction increases to0.9. Finally, CommonFinder is signicantly inaccurate after90% of NCF probing results are missing. On the other hand,the proximity embedding approach degrades the accuracysignicantly as the missing fraction increases. We can seethat CommonFinder is quite robust to a wide range ofmissing measurements.

    9.4.2. Robustness to incorrect measurementsSince some NCF-probing results could be incorrect due

    to the false positives of the Bloom lters, we next test

    Facebook YouTube Flickr0

    0.2

    0.4

    0.6

    0.8

    Data Sets

    NM

    AE

    CommonFinderLandmarkMDSProximityEmbed

    Fig. 9. Comparing the estimation accuracy of different methods on threedata sets.the NCF-estimation accuracy with increasing percentageof incorrect NCF measurements. We select a fraction ofNCF probing results uniformly at random and add an errorconstant e (e > 0) to the ground-truth NCF values. We varythe percentage of erroneous NCF results and the error con-stant e to verify the robustness against erroneous NCFobservations. We set CommonFinders parameters to bethe default values in Table 3.

    Fig. 11 shows the results. CommonFinder gracefully de-grades the accuracy as the error constant e increases. Onthe other hand, ProximityEmbed signicantly degradesthe accuracy. Furthermore, CommonFinder and Proximity-Embed are less robust on the Facebook data set than therest two data sets. This is because the NCF values on theFacebook data set are much lower than those on the othertwo data sets, which makes the NCF estimation results onthe Facebook data set be more sensitive to erroneous NCFprobing results on the other two data sets.

    9.4.3. Accuracy of top-k predictionWe next compare the accuracy of estimating top-k

    highest NCF values. For each user i and a constant param-

    eter k, we calculate the multiplicative ratio

    Py2bSkYiyPx2Sk

    Yix, where

    Sk denotes the set of ground-truth top-k users with the

    highest NCF values, bSk denotes the set of estimate top-kusers with the NCF prediction methods. We can see thatthe ratio values are at most 1 and larger ratios lead to bet-ter prediction results.

    From Fig. 12, we see that CommonFinder signicantlyoutperforms both ProximityEmbed and LandmarkMDSmethods. The multiplicative ratios for CommonFinder aremostly larger than 0.9 and become closer to one as we in-crease the number k of recommended users. This is be-cause CommonFinder predicts pairwise NCF values muchmore accurate than ProximityEmbed and LandmarkMDS.Further, the ProximityEmbed is much more accurate thanthe LandmarkMDS. This is because ProximityEmbed usesthe matrix factorization to adapt the triangle inequalityviolations of the NCF values, while LandmarkMDS assumesthe triangle inequality to hold for all triples, which how-ever does not hold for social data sets.

    9.5. Resilience to decentralized nodes

    We next test whether CommonFinder is sensitive toerroneous coordinates due to the decentralization of users.We divide the overall set of users in the data sets into twoequal halves. Only half users in the data sets are added tothe simulator at the rst round, some users friends arein the second half of users and thus may not be addedyet. The other half users join the system after 40 rounds.After that cycle, all links between friends in the data setsare present in the simulator. As a result, the erroneouscoordinates are injected into the system after 40 rounds.

    To quantify the resilience of the coordinate, we calcu-late the Coordinate Drift with the L1 norm dened asX2dLl1jxi l ~xi l j 22

  • 0 0.2 0.4 0.6 0.8 10

    0.10.20.30.40.50.60.70.80.9

    11.1

    Percent of Missing Measurements

    NM

    AE

    CommonFinderProximityEmbedLandmarkMDS

    (a) Facebook

    0 0.2 0.4 0.6 0.8 10

    0.20.40.60.8

    11.21.41.61.8

    22.2

    Percent of Missing Measurements

    NM

    AE

    CommonFinderProximityEmbedLandmarkMDS

    (b) YouTube

    0 0.2 0.4 0.6 0.8 10

    0.2

    0.4

    0.6

    0.8

    1

    Percent of Missing Measurements

    NM

    AE

    CommonFinderProximityEmbedLandmarkMDS

    (c) FlickrFig. 10. Accuracy comparison by varying the percentage of missing measurements to neighbors.

    0.01 0.02 0.03 0.04 0.05102

    101

    100

    101

    Percent of Incorrect Measurements

    NM

    AE

    ProximityEmbed,e=3ProximityEmbed,e=10CommonFinder,e=3CommonFinder,e=10LandmarkMDS,e=3LandmarkMDS,e=10

    (a) Facebook

    0.01 0.02 0.03 0.04 0.05102

    101

    100

    101

    Percent of Incorrect Measurements

    NM

    AE

    ProximityEmbed,e=3ProximityEmbed,e=10CommonFinder,e=3CommonFinder,e=10LandmarkMDS,e=3LandmarkMDS,e=10

    (b) YouTube

    0.01 0.02 0.03 0.04 0.05102

    101

    100

    101

    Percent of Incorrect Measurements

    NM

    AE

    ProximityEmbed,e=3ProximityEmbed,e=10CommonFinder,e=3CommonFinder,e=10LandmarkMDS,e=3LandmarkMDS,e=10

    (c) FlickrFig. 11. Accuracy comparison by varying the percent of incorrect measurements.

    0 10 20 300.5

    0.6

    0.7

    0.8

    0.9

    1

    Number of Recommendations

    Rat

    io

    ProximityEmbedLandmarkMDSCommonFinder

    (a) Facebook

    0 10 20 300

    0.2

    0.4

    0.6

    0.8

    1

    Number of Recommendations

    Rat

    io ProximityEmbedLandmarkMDSCommonFinder

    (b) YouTube

    0 10 20 300.5

    0.6

    0.7

    0.8

    0.9

    1

    Number of Recommendations

    Rat

    io

    ProximityEmbedLandmarkMDSCommonFinder

    (c) FlickrFig. 12. Accuracy of selecting two-hop users with the top-k NCF values.

    20 40 60 800

    0.10.20.30.40.50.60.70.80.9

    1

    Round

    NM

    AE

    0 00.511.522.533.54

    Drif

    t

    AccuracyDrift

    (a) Facebook

    0 20 40 60 8000.10.20.30.40.50.60.70.80.9

    1

    Round

    NM

    AE

    00.511.522.533.54

    Drif

    t

    AccuracyDrift

    (b) YouTube

    0 20 40 60 800

    0.10.20.30.40.50.60.70.80.9

    1

    Round

    NM

    AE

    00.511.522.533.54

    Drif

    t

    AccuracyDrift

    (c) FlickrFig. 13. The effect of high-error coordinates on the rate of convergence.

    Y. Fu et al. / Computer Networks 64 (2014) 369389 385

  • where xi and ~xi denote the updated coordinate and the last-round coordinate, respectively. We plot the mean coordi-nate drifts and the mean/standard deviations of the esti-mation errors in Fig. 13.

    The rst half users converge to stable coordinates with-in 20 rounds and keep steady until 40 rounds. Further-more, the coordinate drifts are reduced to be below 0.1after the coordinates are stabilized after 20 rounds.

    When the other half users join the system after 40rounds, the overall coordinate errors increase sharply,since the coordinates of newly-joined users are randomlyinitialized and thus may incur high errors. Furthermore,the coordinate drifts also increase due to the similar rea-sons. However, the whole set of coordinates convergewithin the next 20 rounds to stable positions. The newlystabilized coordinates have the similar accuracy as thosebefore 40 rounds. Furthermore, the coordinate drifts alsodecrease to similar degrees as those before 40 rounds. Asa result, the coordinate-update process is quite resilientto erroneous coordinates.

    9.6. Parameter sensitivity of optimizing decentralizedcoordinate

    We then test whether the coordinate adjustment pro-cess is sensitive to parameter settings. We vary one param-eter and set the rest of CommonFinders parameters to bethe default values in Table 3. We report the mean NMAEvalues and the standard deviations.

    9.6.1. Coordinate dimensionFig. 14(a) plots the accuracy with increasing coordinate

    dimensions. CommonFinder accurately predicts NCFs evenwith low coordinate dimensions. The standard deviationsare also quite low, which imply that the NCF estimationis quite stable.

    9.6.2. Number of neighborsFig. 14(b) shows the dynamics of accuracy with increas-

    ing neighbors. CommonFinder becomes more accuratewith increasing neighbors, but keeps stably accurate whenthe number of neighbors exceeds 20. Therefore, Common-Finder is able to predict accurate NCFs with a modest num-ber of neighbors.

    9.6.3. Regularized constant kFig. 14(c) plots the accuracy of CommonFinder by

    increasing regularized constant k. The accuracy decreaseswith increasing k for the YouTube and Flickr data sets,but keeps stably accurate when we vary k for the Facebookdata set. Therefore, we should choose a smaller k for updat-ing coordinates.

    9.6.4. Choice of neighborsWe next evaluate the choice of neighbors on the

    accuracy of CommonFinder. We test four popular neigh-bor-selection rules: (i) Random, we choose neighborsuniformly at random from the whole set of friends.(ii) Closest, we choose neighbors as friends that have the

    0.02

    (s

    (n

    inder

    386 Y. Fu et al. / Computer Networks 64 (2014) 3693890 5 10 150

    Coordinate Dimension(a) NMAE as a function of the coordinate di-mension.

    0 0.5 1 1.5 20

    0.05

    0.1

    0.15

    0.2

    NM

    AE

    FacebookYouTubeFlickr

    (c) NMAE as a function of the parameter

    Fig. 14. NMAE of CommonF0.040.060.08

    0.10.120.140.160.18

    0.2

    NM

    AE

    FacebookYouTubeFlickr10 20 30 400

    0.05

    0.1

    0.15

    0.2

    0.25

    Number of Neighbors

    NM

    AE

    FacebookYouTubeFlickr

    b) NMAE as a function of the neighborhoodize.

    FB YT FR0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    Data Sets

    NM

    AE

    RandomClosestFarthestHybrid

    d) NMAE as a function of the choice ofeighbors.

    as we vary its parameters.

  • lowest NCFs. (iii) Farthest, we choose neighbors as friendsthat have the highest NCFs. (iv) Hybrid, we select halfneighbors using the Closest based selection and the otherhalf neighbors using the Farthest based selection.Fig. 14(d) shows the results. FB, YT and FR denote the Face-book, YouTube and Flickr data sets, respectively. The Ran-dom policy provides the highest accuracy, while otherthree policies incur much higher errors than the Randompolicy, since the randomness in neighbors brings muchhigher diversity in the distribution of NCFs, which makesthe optimization process to be robust to bad local minima.

    10. Extending the NCF computation to Sybil accounts

    coordinate computation process. In order to destabilizethe coordinates of normal users, the Sybil accounts needto have friend links with some normal users and then cre-ate fake Bloom lters and/or generate fake coordinatepositions.

    10.3.1. Fake Bloom ltersFake Bloom lters bring incorrect NCF measurements.

    However, as discussed in the simulation section, Common-Finder is robust to the incorrect NCF measurements be-cause of the inaccuracy brought by the Bloom lters. Thisis because the decentralized MMMF method used by Com-monFinder mitigates the problem of overtting by regular-

    0 Ran) Yoe inc

    Y. Fu et al. / Computer Networks 64 (2014) 369389 38710.1. Sybil accounts

    Prior work [34,35] broadly group those fake accountsfor spamming or performing privacy attacks as Sybil ac-counts. To launch the attacks, the Sybil accounts generallyseek to integrate with the social graph by establishingfriend links with normal users. Therefore, they have strongmotivations to use fake friend lists to obtain informationabout other users. Further, these Sybil accounts usuallycollude together to establish friend links betweenthemselves.

    If Sybil accounts do not establish friend links with nor-mal users, we can see that the CommonFinder protocol willnot be affected, since each user adjusts its coordinate withfriends. Therefore, our analysis assumes that Sybil ac-counts have established several friend links with normalusers. We call these normal users as infected ones.

    10.2. Learn information about other users

    We argue that the CommonFinder protocol does notleak sensitive privacy information to Sybil accounts. First,since we enforce the one-hop trustiness, non-infectedusers friend lists will not be disclosed to Sybil accounts.Second, during the coordinate adjustment process, disclos-ing the Bloom lter and the coordinate of a user do not leakthis users privacy information due to the randomizationbrought by these two data structures.

    10.3. Destabilize coordinate system

    The coordinate system, however may be sensitive to Sy-bil accounts, when they inject fake information into the

    0 0.1 0.2 0.3 0.40

    0.01

    0.02

    0.03

    0.04

    0.05

    Percent of Random Coordinates

    NM

    AE

    (a) Facebook

    0 0.10

    0.05

    0.1

    0.15

    0.2

    Percent of

    NM

    AE

    (bFig. 15. The accuracy of the NCF prediction process as wizing the coordinates with nuclear norms that is robust tonoises [36].

    10.3.2. Fake coordinatesFake coordinates generally refer to those that are cre-

    ated by not following the CommonFinder protocol. Sup-pose that a Sybil user sends a fake coordinate to a friendB that is a normal user. As a normal user always has someother normal users as friends, user B will move its coordi-nates with both Sybil and normal neighbors. When there isonly a small fraction of Sybil accounts for a normal user,this normal users coordinate will keep to converge to sta-ble positions since the conjugate gradient optimizationtries to balance the coordinates from Sybil and normalusers.

    As it is difcult to model the Sybil nodes behaviors. Inthis paper, we use a simple model to represent uncertaincoordinates: To simulate the randomness of coordinate er-rors in decentralized settings, we randomize the coordi-nates of a set of online nodes. For each round of thecoordinate update process of a node, we randomly choosea fraction p of neighbors to have random coordinates. Byvarying the percent p of random coordinates, we are ableto study impact of the degree of node dynamics on thecoordinates accuracy.

    Fig. 15 plots the dynamics of CommonFinders accuracyas we increase the percent of randomized coordinates.Increasing randomized coordinates only smoothly de-grades the accuracy.

    11. Related work

    Talash [10] calculates the common friends by exchang-ing two friend lists, which however, does not scale well

    .2 0.3 0.4dom CoordinatesuTube

    0 0.1 0.2 0.3 0.40.05

    0.06

    0.07

    0.08

    0.09

    Percent of Random Coordinates

    NM

    AE

    (c) Flickrrease the percent of random coordinates for each user.

  • methods need a large amount of bandwidth cost and com-

    decentralized coordinate approaches select neighbors that

    388 Y. Fu et al. / Computer Networks 64 (2014) 369389are used for updating coordinates uniformly at randomfrom the whole set of nodes, which could leak the coordi-nate positions to non-friend users. On the contrary, Com-monFinder selects neighbors from the friend list of eachuser, which avoids the disclosure of coordinates to un-known users.

    12. Conclusion

    To estimate common friends in distributed social net-works scalably with privacy protection, we propose a dis-tributed common-friends prediction method thatcombines the privacy-preserving common-friend mea-surement by using Bloom lters and the distributed com-mon-friend estimation with decentralized coordinates.Simulation results show that our coordinate based methodestimates the NCF values scalably and accurately withmodest transmission costs. Besides, CommonFinder is alsoresilient to incomplete NCF measurements and noisy NCFmeasurements. We plan to implement our method as aplugin for distributed social networks.

    Acknowledgements

    This work was supported by the National Grand Funda-mental Research 973 Program of China (Grant No.2011CB302601), the National Natural Science Foundationof China (Grant No. 61379052), the National High Technol-ogy Research and Development 863 Program of China(Grant No. 2013AA01A213), the Natural Science Founda-tion for Distinguished Young Scholars of Hunan Province(Grant No. S2010J5050), Specialized Research Fund forputation cost, which renders them to be less practical forbandwidth-limited end hosts.

    Song et al. [33] estimate diverse proximity metrics be-tween users through a matrix factorization process thatis computed in a centralized manner, which is impracticalfor distributed online social networks. Dong et al. [37] pro-pose a secure and privacy-preserving dot product protocolfor static coordinates computed by [33], in order to esti-mate the social proximity in mobile social networks.

    Finally, network coordinate methods such as Vivaldi[29], LandmarkMDS [32] can be used to predict NCF values.Network coordinates embed hosts into a geometric spaceand estimate the pairwise distance of two nodes basedon the corresponding coordinate distance. However, thesewith increasing friends and also discloses sensitive per-sonal information to unknown entities.

    The Private Set Intersection (PSI) or private matchingmethod [1114] computes the intersection of two setswith privacy protection. Generally, each set is representedby a list of coefcients of a polynomial. Estimating theintersection of two sets is implemented as the arithmeticoperations on the coefcients of two polynomials. To im-prove the privacy protection, the homomorphic encryptionis enforced over the arithmetic operations in order to hidethe coefcients to the other user. Unfortunately, the PSIthe Doctoral Program of Higher Education (Grant No.20124307110015).

    References

    [1] L.A. Cutillo, R. Molva, T. Strufe, Safebook: Feasibility of TransitiveCooperation for Privacy on a Decentralized Social Network, in: Proc.of IEEE WOWMOM09, pp. 16.

    [2] L.M. Aiello, G. Ruffo, LotusNet: tunable privacy for distributed onlinesocial network services, Comput. Commun. 35 (2012) 7588.

    [3] T. Xu, Y. Chen, J. Zhao, X. Fu, Cuckoo: Towards Decentralized, Socio-aware Online Microblogging Services and Data Measurements, in:Proc. of HotPlanet 10, pp. 4:14:6.

    [4] Diaspora, Diaspora Alpha, 2011. .[5] S. Buchegger, D. Schiberg, L.-H. Vu, A. Datta, PeerSoN: P2P Social

    Networking: Early Experiences and Insights, in: Proc. of SNS 09, pp.4652.

    [6] A. Shakimov, H. Lim, R. Cceres, L.P. Cox, K.A. Li, D. Liu, A. Varshavsky,Vis--Vis: Privacy-preserving Online Social Networking via VirtualIndividual Servers, in: Proc. of COMSNETS11, pp. 110.

    [7] D. Liben-Nowell, J. Kleinberg, The Link Prediction Problem for SocialNetworks, in: Proc. of CIKM 03, pp. 556559.

    [8] A. Mislove, B. Viswanath, K.P. Gummadi, P. Druschel, You are whoyou know: inferring user proles in online social networks, in:Proceedings of the third ACM international conference on Websearch and data mining, WSDM 10, pp. 251260.

    [9] L. Bilge, T. Strufe, D. Balzarotti, E. Kirda, All your contacts are belongto us: automated identity theft attacks on social networks, in:Proceedings of the 18th international conference on World wideweb, WWW 09, pp. 551560.

    [10] R. Dhekane, B. Vibber, Talash: Friend Finding In Federated SocialNetworks, in: Proc. of LDOW2011.

    [11] M.J. Freedman, K. Nissim, B. Pinkas, Efcient private matching andset intersection, in: Proc. of EUROCRYPT 2004, pp. 119.

    [12] E.D. Cristofaro, P. Gasti, G. Tsudik, Fast and private computation ofset intersection cardinality, Cryptology ePrint Archive, Report 2011/141, 2011. .

    [13] L. Kissner, D. Song, Privacy-Preserving Set Operations, in: Proc. ofCRYPTO 2005, pp. 241257.

    [14] D. Dachman-Soled, T. Malkin, M. Raykova, M. Yung, Efcient robustprivate set intersection, in: Applied Cryptography and NetworkSecurity, vol. 5536, 2009, pp. 125142.

    [15] C. Dwork, Differential privacy: a survey of results, in: TAMC, pp. 119.

    [16] F. McSherry, K. Talwar, Mechanism design via differential privacy,in: Proc. of FOCS 07, pp. 94103.

    [17] Y. Fu, Y. Wang, On the feasibility of common-friend measurementsfor the distributed online social networks, in: The 7th InternationalICST Conference on Communications and networking in China(CHINACOM), 2012, pp. 2429.

    [18] Y. Fu, Y.Wang, BCE: A privacy-preserving common-friend estimationmethod for distributed online social networks withoutcryptography, in: The 7th International ICST conference onCommunications and Networking in China (CHINACOM), 2012, pp.212217.

    [19] J.D.M. Rennie, N. Srebro, Fast maximum margin matrix factorizationfor collaborative prediction, in: Proc. of ICML05, pp. 713719.

    [20] G.H. Golub, C.F. Van Loan, Matrix Computations, third ed., JohnsHopkins University Press, Baltimore, MD, USA, 1996.

    [21] J.R. Shewchuk, An Introduction to the Conjugate GradientMethod Without the Agonizing Pain, Technical Report, Pittsburgh,PA, USA, 1994. .

    [22] O. Goldreich, Foundations of Cryptography, Basic Applications, 2,Cambridge University Press, 2004.

    [23] J. Becker, H. Chen, Measuring privacy risk in online social networks,in: Proc. of W2SP 2009: Web 2.0 Security and Privacy.

    [24] B. Viswanath, A. Mislove, M. Cha, K.P. Gummadi, On the evolution ofuser interaction in Facebook, in: Proc. of WOSN 09, pp. 3742.

    [25] M. Cha, A. Mislove, K.P. Gummadi, A Measurement-driven analysisof information propagation in the Flickr Social Network, in: Proc. ofWWW 09, pp. 721730.

    [26] A. Mislove, M. Marcon, K.P. Gummadi, P. Druschel, B. Bhattacharjee,Measurement and analysis of online social networks, in: Proc. ofIMC07.

    [27] X. Su, T.M. Khoshgoftaar, A survey of collaborative lteringtechniques, Adv. Artif. Intell. (2009).

  • [28] D. Guo, J. Wu, H. Chen, Y. Yuan, X. Luo, The dynamic bloom lters,IEEE Trans. Knowl. Data Eng. 22 (2010) 120133.

    [29] F. Dabek, R. Cox, F. Kaashoek, R. Morris, Vivaldi: a DecentralizedNetwork Coordinate System, in: Proc. of ACM SIGCOMM 2004, pp.1526.

    [30] J. Ledlie, P. Gardner, M.I. Seltzer, Network Coordinates in the Wild,in: Proc. of 4th USENIX Conference on Networked Systems Designand Implementation, USENIX Association, Cambridge, MA, 2007. pp.2222.

    [31] Y. Liao, W. Du, P. Geurts, G. Leduc, DMFSGD: A Decentralized MatrixFactorization Algorithm for Network Distance Prediction, CoRR abs/1201.1174, 2012.

    [32] B. Eriksson, P. Barford, R.D. Nowak, Estimating hop distance betweenarbitrary host pairs, in: Proc. of IEEE INFOCOM09, pp. 801809.

    [33] H.H. Song, T.W. Cho, V. Dave, Y. Zhang, L. Qiu, Scalable proximityestimation and link prediction in online social networks, in: Proc. ofIMC 09, pp. 322335.

    [34] G. Danezis, P. Mittal, Sybilinfer: Detecting sybil nodes using socialnetworks, in: NDSS.

    [35] Z. Yang, C. Wilson, X. Wang, T. Gao, B.Y. Zhao, Y. Dai, Uncoveringsocial network sybils in the wild, in: Proceedings of the 2011 ACMSIGCOMM conference on Internet measurement conference, IMC 11,pp. 259268.

    [36] E.J. Cands, Y. Plan, Matrix completion with noise, Proc. IEEE 98(2010) 925936.

    [37] W. Dong, V. Dave, L. Qiu, Y. Zhang, Secure friend discovery in mobilesocial networks, in: Proc. of IEEE INFOCOM11.

    Yongquan Fu received the B.S. degree incomputer science and technology from theSchool of Computer of Shandong University,China, in 2005, and received the M.S. in

    Yijie Wang received the PhD degree from theNational University of Defense Technology,China in 1998. She was a recipient of theNational Excellent Doctoral Dissertation(2001), a recipient of Fok Ying Tong EducationFoundation Award for Young Teachers (2006)and a recipient of the Natural Science Foun-dation for Distinguished Young Scholars ofHunan Province (2010). Now she is a Profes-sor in the National Key Laboratory for Paralleland Distributed Processing, National Univer-sity of Defense Technology. Her research

    interests include network computing, massive data processing, paralleland distributed processing.

    Wei Peng received the M.Sc. and the Ph.Ddegrees in Computer Science from theNational University of Defense Technology,China, in 1997 and 2000 respectively. From2001 to 2002, he is a research assistant at theSchool of Computer from the National Uni-versity of Defense Technology, China, then aresearch fellow from 2003. He was a visitingscholar of University of Essex, UK, from May2007 to April 2008. His major interests areinternet routing, network science, and mobilewireless networks.

    Y. Fu et al. / Computer Networks 64 (2014) 369389 389Computer Science and technology in theSchool of Com- puter Science of NationalUniversity of Defense Technology, China, in2008. He is currently an assistant professor inthe School of Computer Science of NationalUniversity of Defense Technology. He is amember of CCF and ACM. His current researchinterests lie in the areas of cloud manage-

    ment, distributed online social networks and distributed algorithms.

    CommonFinder: A decentralized and privacy-preserving common-friend measurement method for the distributed online social networks1 Introduction2 Background2.1 Distributed online social networks2.1.1 Social graph2.1.2 Privacy-preserving message routing

    2.2 Establishing friendship on the DOSNs2.2.1 Searching friends2.2.2 Recommending friends

    2.3 Maximum margin matrix factorization based collaborative filtering2.3.1 Predicting rating scores with matrix factorization2.3.2 Overview of MMMF2.3.3 Mapping ? to rating scores2.3.4 Soft margin loss function2.3.5 Formulating objective function

    2.4 Differential privacy

    3 Problem definition3.1 Adversary model3.2 Disclosing friend lists leaks privacy3.3 Problem requirements3.4 A naive privacy-preserving NCF-computation solution

    4 Analyzing the NCF metric of the online social networks4.1 Data sets4.2 Distributions of numbers of friends4.3 Distributions of numbers of common friends4.4 Triangle inequality violations of numbers of common friends

    5 Designing a privacy-preserving NCF computation method for DOSNs5.1 Intuitions5.2 Architecture5.3 Exchanging bloom filters for privacy protection5.4 Scale the NCF computation with decentralized prediction5.4.1 Overview5.4.2 Coordinate structure5.4.3 NCF prediction5.4.4 Coordinate dissemination

    5.5 Benefits of decentralized NCF prediction

    6 NCF Prediction with decentralized coordinates6.1 Measuring NCF values based on bloom filters6.2 Neighbor management6.3 Decentralized MMMF optimization6.4 Distributed coordinate maintenance6.4.1 Decentralized conjugate gradient optimization6.4.2 Minibatch based coordinate update

    7 Top- ? ranking8 Privacy analysis8.1 Coordinates hide users friend lists8.2 Bloom filters yield differential privacy8.2.1 Sensitivity of a Bloom filter8.2.2 Differential privacy of bloom filters

    9 Evaluation9.1 Experimental setup9.1.1 Evaluation process9.1.2 Comparison methods9.1.3 Performance metrics

    9.2 Baseline performance of measuring pairwise NCF values9.2.1 Sensitivity to Bloom-filter parameters9.2.2 DBFs accuracy9.2.3 DBFs bandwidth requirements

    9.3 Convergence of the coordinates9.4 Comparison of coordinates based NCF prediction methods9.4.1 Robustness to incomplete measurements9.4.2 Robustness to incorrect measurements9.4.3 Accuracy of top- ? prediction

    9.5 Resilience to decentralized nodes9.6 Parameter sensitivity of optimizing decentralized coordinate9.6.1 Coordinate dimension9.6.2 Number of neighbors9.6.3 Regularized constant ? 9.6.4 Choice of neighbors

    10 Extending the NCF computation to Sybil accounts10.1 Sybil accounts10.2 Learn information about other users10.3 Destabilize coordinate system10.3.1 Fake Bloom filters10.3.2 Fake coordinates

    11 Related work12 ConclusionAcknowledgementsReferences

Recommended

View more >