CommonFinder: A decentralized and privacy-preserving common-friend measurement method for the distributed online social networks
Post on 23Dec2016
Computer Networks 64 (2014) 369–389
Journal homepage: www.elsevier.com/locate/comnet

Yongquan Fu (a), Yijie Wang (a, *), Wei Peng (b)
(a) Science and Technology on Parallel and Distributed Processing Laboratory, College of Computer, National University of Defense Technology, Hunan Province 410073, China
(b) College of Computer, National University of Defense Technology, Hunan Province 410073, China
(*) Corresponding author. Tel.: +86 13308491230 (Y. Wang). Email addresses: yongquanf@nudt.edu.cn (Y. Fu), wangyijie@nudt.edu.cn (Y. Wang), wpeng@nudt.edu.cn (W. Peng).
http://dx.doi.org/10.1016/j.comnet.2014.02.020
1389-1286/© 2014 Elsevier B.V. All rights reserved.

Article info

Article history: Received 19 November 2012; Received in revised form 9 October 2013; Accepted 3 February 2014; Available online 7 March 2014.

Keywords: Distributed online social network; Privacy; Common friends; Friend recommendation; Coordinate; Bloom filter.

Abstract

Distributed social networks have been proposed as alternatives for offering scalable and privacy-preserving online social communication. Recommending friends in the distributed social networks is an important topic. We propose CommonFinder, a distributed common-friend estimation scheme that estimates the numbers of common friends between any pairs of users without disclosing the friends' information. CommonFinder uses privacy-preserving Bloom filters to collect a small number of common-friend samples, and proposes low-dimensional coordinates to estimate the numbers of common friends from each user to any other users. Simulation results on real-world social networks confirm that CommonFinder scales well, converges quickly and is resilient to incomplete measurements and measurement noises.

1. Introduction

Online social networks such as Facebook, YouTube, Flickr have become quite popular these days. However, there are also hot debates over the privacy protection of these centralized social services. For example, the service providers can peek at users' profiles at will for targeted advertisements or even sell these sensitive information to third parties for profits. As a result, improving the privacy protection of social networks becomes increasingly important.

Fortunately, borrowing the success of the P2P systems, the distributed online social networks (DOSN for short), e.g., Safebook [1], LotusNet [2], Cuckoo [3], diaspora [4], Peerson [5], Vis-a-Vis [6] have been proposed to better protect users' privacy by hosting the DOSN infrastructures based on decentralized end hosts. The key idea is to store personal data on decentralized nodes and to enforce strict access rules on who can visit a user's profile.

To increase the popularity of distributed online social networks, an open question is how to find potential friends for users, which is known as the friend recommendation problem. The centralized online social networks are able to compute possible friends by using the complete knowledge of users' profiles. Unfortunately, for DOSNs, it is commonly believed that end hosts are unwilling to publish their profiles to unknown entities because of the privacy leakage. As a result, it is highly desirable to develop scalable and privacy-preserving methods to quantify the possibility of being friends for DOSN users.

One of the most popular metric for friend recommendation is based on the Number of Common Friends (NCF for short), which has been shown to be very effective to recommend new friends or find old friends [7]. Measuring NCF values in the distributed online social networks is however, a difficult task. First, disclosing friend lists to non-friend users could severely leak users' privacy information [8,9]. Second, the NCF computation has to scale well since most end hosts have limited bandwidth capacity.

Talash [10] computes the common friends based on exchanging friend lists between pairs of users, which not only leaks personal information, but also does not scale well. The PSI approach [11–14] represents a list of friends with coefficients of a polynomial. It then exchanges Homomorphic encrypted coefficients via the communication links, and finally computes the sums of coefficients, which correspond to the set of common items in two friend lists. The PSI approach works well for semi-honest users, but increases the transmission bandwidth and the computation costs.

We propose a scalable and privacy-preserving distributed NCF estimation method called CommonFinder that estimates the NCF values between any pairs of users and is able to select top-k users that have the largest NCF values.

CommonFinder represents a user's friend list with the well-known Bloom filter. Our privacy analysis (Section 8) proves that the Bloom filter provides the differential privacy [15,16] for the friend list: Given a user i's Bloom filter, a curious entity is unable to determine who are user i's friends.

Unfortunately, exchanging the Bloom filters may need several KBytes bandwidth costs in order to control the false positives of the Bloom filter. To further increase the scalability of estimating the NCF values, we propose a distributed maximum margin matrix factorization method to estimate the NCF values by low-dimensional coordinates. As a result, the transmission size is fixed to be the length of the coordinate that is independent of the length of the friend lists or Bloom filters.

We model the NCF completion problem with the maximum margin matrix factorization method and provide a systematical design of a distributed NCF prediction method that significantly extends our prior work [17,18]. The MMMF method maps contiguous matrix factorization results to discrete NCF values with adaptive thresholds. These thresholds are learnt during the process of optimizing the matrix-factorization model. To adapt to the decentralization of users, we reformulate the MMMF method in separable objective functions that involve each user's coordinate and a small number of neighbors. For each user, we design a fully decentralized conjugate gradient optimization method to adjust the coordinates of each user that converges quickly.

Finally, we present extensive privacy analysis and simulation results over the real-world social network topologies. Our results show that CommonFinder not only protects users' friend lists, but also significantly improves the prediction accuracy and is more robust against incomplete or incorrect NCF measurements than previous methods.

The rest of the paper is organized as follows. Section 2 introduces the background information. Section 3 next defines the problem of predicting NCF values with preservation of users' privacy. Section 4 then analyzes typical characteristics of social-network data sets. Section 5 next presents an overview of our proposed NCF prediction methods. Section 6 presents the coordinate computation process. Section 7 next applies the decentralized coordinates to select top-k users for recommending friends. Section 8 then measures the degree of the privacy protection offered by CommonFinder. Section 9 next compares CommonFinder's performance with extensive simulation results. Section 10 next confirms CommonFinder is robust to Sybil accounts. Section 11 then summarizes related work. Section 12 concludes the paper. Table 1 shows key parameters.
2. Background

2.1. Distributed online social networks

We introduce the basic characteristics for existing DOSNs.

2.1.1. Social graph

Let a user be an online entity that participates in the DOSN. Each user A has a friend list $S_A$, where a friend B of a user A is a user B that establishes the social link or friendship link with user A on the DOSN. The common friends of two users A and B are represented by the subset of users that are both friends of user A and B at the same time.

Users and social links form a social graph. Each user and his/her friends are adjacent on the social graph, which correspond to one-hop neighbors to each other. The degree of each user amounts to the size of its neighbors on the social graph. The number of hops between two users amounts to the length of the shortest path for these two users on the social graph.

2.1.2. Privacy-preserving message routing

Users' data are usually stored into his/her own computer. For improving the availability of users' data when they are offline, some DOSNs like Safebook replicate each user's data on his/her friends' computers. These friends' replication will be used for offline users.

The social graph is maintained via some kind of Peer-to-Peer substrates. When a user joins the DOSN, each user's computer needs to be registered in the Peer-to-Peer substrate consisting of decentralized computers. Users' real names are decoupled with randomized keys called identifiers that are generated by cryptographic hash functions like SHA-1. The identifier is originally used for routing messages among users, but for the DOSN context, these identifiers also hide persons' real-world names, since each identifier can be assumed as a perfectly random variable.

The identifiers are robust against dictionary attacks by curious users. Since the space of the identifiers is huge enough to prohibit the enumeration because of the overwhelming computational overhead. For example, there are $2^{160}$ kind of possible strings for 160-bit identifiers.

Existing DOSNs usually route messages among users along the social connections in a hop by hop manner, in order to protect the privacy of the end-to-end communication. Since each logical link corresponds to the real-life friendship, otherwise, sending a request message from a user to another non-friend user may be eavesdropped by curious or malicious users.

Table 1. Notations and their meanings.
$k_I$: The number of hash functions in a Bloom filter
$m_I$: The length of a Bloom filter
$N$: The number of users on the DOSN
$\hat{X}$: The coordinate distance matrix
$Y$: The pairwise NCF matrix between a set of users
$L$: The maximal NCF value
$d$: The coordinate dimension
$\theta$: The NCF-mapping thresholds
$k$: The number of recommended users

2.2. Establishing friendship on the DOSNs

We introduce how users locate real-life friends on the DOSNs.

2.2.1. Searching friends

A user that logs into an OSN usually searches real-life friends with their names and filters out non-friend users by checking their profiles. The search procedure differs for centralized and distributed OSNs:

- When centralized OSNs receive users' requests, the backend servers query the OSNs' databases to select matched users' profiles.
- For most DOSNs, since users' profiles are stored on decentralized nodes, users' requests are routed among the Peer-to-Peer substrate. The response either locates exactly matched users or fails when timeout events occur.

Unfortunately, searching real-life friends is time-consuming for users, since users have to manually look up every friend he/she has remembered. Therefore, complement to the search process, most OSNs also recommend friends based on users' profiles and friend lists.

2.2.2. Recommending friends

To recommend friends, the OSNs have to locate the set of users that are likely to be future friends for a user. The Number of Common Friends (NCF) is frequently used to quantify the probabilities of being friends between users. Then recommending friends for each user A is realized by selecting the top-k users having the largest NCF values with user A. These top-k users will serve as the recommended future friends. Afterwards, users are prompted with these top-k users in the OSNs' web pages. Each user later can send invitations to those people in the list. The NCF values may be appended as a proof of the friendship. Upon receiving the acknowledge of accepting the invitation of being friends, a new friendship link is then established on the DOSNs.

2.3. Maximum margin matrix factorization based collaborative filtering
We next introduce the Maximum-Margin Matrix Factorization (MMMF) method [19] for the collaborative filtering problem, which seeks to predict rating scores from users to goods, given a partially available set of rating histories.

2.3.1. Predicting rating scores with matrix factorization

Given $N_c$ customers and $N_g$ kinds of items, we can organize the set of rating scores as a $N_c$-by-$N_g$ rating matrix $Y$, where $Y_{ij}$ denotes the rating score from user i to the item corresponding to the jth column. The objective of the collaborative filtering problem is to complete the missing items in the matrix $Y$. One of the most popular collaborative filtering approach is the matrix factorization, which trains a low rank matrix $\hat{X}$ to approximate the rating matrix $Y$.

To obtain the matrix $\hat{X}$, the matrix factorization has to minimize a loss function that quantifies the differences between $Y$ (training set) and $\hat{X}$. For a complete matrix $Y$, the Singular Value Decomposition (SVD) method yields the optimal low-rank approximation [20]. But when $Y$ has some missing items, SVD becomes less accurate with increasing missing items, since it treats all missing items as the same constant values.

Worse still, finding the optimal matrix $\hat{X}$ for an incomplete matrix $Y$ is likely to fail due to the overfitting phenomenon: although items in the training set are predicted perfectly, the items out of the training set however receive large prediction errors. Fortunately, in order to mitigate the overfitting issue, it is well known that we can control the capacity of the loss function by combining the loss function with a regularization model.

Further, the matrix $\hat{X}$ are generally contiguous values, while the rating scores in $Y$ are usually discrete integers, e.g., from one star to five stars. As a result, we have to map the contiguous outputs to discrete rating values, which is generally a classification problem. Unfortunately, SVD simply rounds the contiguous values to nearby integers, which may degrade classification errors.

2.3.2. Overview of MMMF

Suppose there are a total number of $L$ rating scores, in order to optimize the classification of rating scores, MMMF introduces a $N$-by-$(L-1)$ threshold matrix $\theta$. The threshold matrix $\theta$ is unknown a prior and must be learnt from the training process. The ith row vector $\vec{\theta}_i$ of the matrix $\theta$ is used for mapping the ith row vector of the matrix $\hat{X}$ to discrete rating scores.

MMMF classifies rating scores according to the Support Vector Machine (SVM). Suppose that we need to group $\hat{X}$ to a binary matrix $Y$, the SVM groups $\hat{X}$ to two separate classes by locating a hyperplane that separates the items from one group with items in the other group. In MMMF, the threshold matrix is analogous to the hyperplane that separates items in $\hat{X}$ to discrete rating scores. The optimal hyperplane of a SVM is defined as the one with the largest margin between pairs of items from two groups, where the margin amounts to the maximum distance of the slab that is orthogonal with the hyperplane.

2.3.3. Mapping $\hat{X}$ to rating scores

MMMF maps the ith row vector in matrix $\hat{X}$ by the threshold vector $\vec{\theta}_i$, the ith row vector of the threshold matrix $\theta$. Let $\vec{\theta}_i = (\theta_{i1}, \ldots, \theta_{i(L-1)})$ be a vector of thresholds sorted in the ascending order. We create N adjacent intervals:

$(-\infty, \theta_{i1}),\ (\theta_{i1}, \theta_{i2}),\ \ldots,\ (\theta_{i(L-1)}, \infty)$

with $\vec{\theta}_i$ as $L-1$ separating points.

For each item $\hat{X}_{ij}$, its rating score is calculated as the index of the interval that contains the item $\hat{X}_{ij}$. For example, suppose there are two thresholds $(0.2, 0.3)$ that separate the whole range of real values into three intervals:

$(-\infty, 0.2),\ (0.2, 0.3),\ (0.3, \infty)$

An item $\hat{X}_{ij} = 0.2$ is then mapped to 2, since $\hat{X}_{ij}$ is within the second interval $(0.2, 0.3)$.

2.3.4. Soft margin loss function

An optimal classification of the estimated matrix $\hat{X}$ to the rating score matrix $Y$ means that, each item $\hat{X}_{ij}$ should be contained in the interval $(\theta_{i,Y_{ij}-1}, \theta_{i,Y_{ij}})$, otherwise, we have a misclassification.

The hard margin matrix factorization seeks a matrix $\hat{X}$ matching observed rating scores by classifying each item $\hat{X}_{ij}$ with immediate thresholds $\theta_{i,Y_{ij}-1}$ and $\theta_{i,Y_{ij}}$ of the $Y_{ij}$th interval:

$\theta_{i,Y_{ij}-1} + 1 < \hat{X}_{ij} < \theta_{i,Y_{ij}} - 1$    (1)

for all $ij \in S$. Unfortunately, the hard margin is sensitive to noises. Noises are common in $\hat{X}$, since $\hat{X}$ consists of inherent uncertainty. Further, Eq. (1) is non-differentiable, causing the non-trivial difficulty for the optimization process.

For improving the robustness against noises, the soft margin matrix factorization relaxes the hard constraint of Eq. (1) by adding slack variables $\xi_{ij} \ge 0$ for all $ij \in S$:

$\theta_{i,Y_{ij}-1} + 1 - \xi_{ij} < \hat{X}_{ij} < \theta_{i,Y_{ij}} - 1 + \xi_{ij}$    (2)

As the optimal solution for Eq. (2) amounts to that by optimizing the hinge loss function $h(z) = \max(0, 1 - z)$ [19] (the distance from the classification margin), we transform Eq. (2) to:

$L(Y_{ij}, \hat{X}_{ij}) = h(\hat{X}_{ij} - \theta_{i,Y_{ij}-1}) + h(\theta_{i,Y_{ij}} - \hat{X}_{ij})$    (3)

Enforcing the hinge loss function makes Eq. (3) become differentiable, therefore, Eq. (3) can be solved via the convex optimization.

Further, MMMF also penalizes the errors of classification with all thresholds in each threshold vector in order to further improve the robustness of Eq. (3):

$L(Y_{ij}, \hat{X}_{ij}) = \sum_{r=1}^{L-1} L(r, \hat{X}_{ij}) = \sum_{r=1}^{Y_{ij}-1} h(\hat{X}_{ij} - \theta_{ir}) + \sum_{r=Y_{ij}}^{L-1} h(\theta_{ir} - \hat{X}_{ij}) = \sum_{r=1}^{L-1} h\big(T^r_{ij}(r, Y_{ij})\,(\theta_{ir} - \hat{X}_{ij})\big)$    (4)

where the indicator function $T^r_{ij}$ determines whether a threshold item r in $\vec{\theta}_i$ is larger than the rating score $Y_{ij}$:

$T^r_{ij}(r, Y_{ij}) = +1$ if $r \ge Y_{ij}$; $-1$ if $r < Y_{ij}$    (5)

2.3.5. Formulating objective function

MMMF seeks to optimize Eq. (4) with robustness against the overfitting phenomenon. To that end, MMMF finds a matrix $\hat{X}$ and a classification matrix $\theta$ by minimizing the loss function in Eq. (4) plus two regularized items $\|U\|_F^2 + \|V\|_F^2$, where $\|\cdot\|_F$ denotes the Frobenius norm that amounts to the squared root of the sum of squared items in the matrix $U$ or $V$. The regularized item is used for the capacity control in order to mitigate the overfitting problem of Eq. (4).

The objective function of MMMF can be written as:

$J(U, V, \theta) = \sum_{r=1}^{R-1} \sum_{(i,j) \in \Omega,\ Y_{ij} > 0} h\big(T^r_{ij}\,(\theta_{ir} - U_i V_j)\big) + \frac{\alpha}{2}\big(\|U\|_F^2 + \|V\|_F^2\big)$    (6)

where $\alpha$ is a tradeoff constant.

MMMF solves Eq. (6) based on the Polak–Ribière variant of the nonlinear conjugate gradient methods (PRCG) in a centralized manner. PRCG incurs a linear computation complexity, converges quickly and is robust to missing or erroneous inputs [21].

2.4. Differential privacy

The differential privacy [15,16] represents one of the state-of-art privacy-protection techniques. A database stores a set of items. Users interact with the database through statistical queries and the database responds query results on stored items. The differential-privacy [15] method provides very strong protection of the privacy: For any two databases D1 and D2 that differ only one items x, i.e., D1 D2 x. An adversary is unable to distinguish whether the entity x is in the database D1 or D2.

A random function j provides differential privacy if for any two databases D1 and D2 having one different item, i.e., where jD1 D2j 1 and all /#Range j

PrjD1 2 / 6 exp PrjD2 2 /    (7)

where Range denotes the output of the random function j [15]. The random function j adds some random noises to the query results. The parameter is public to all users. Setting is up to the users' decisions.

The degree of the privacy protection of the random function j depends on the scale of the added noise and the sensitivity of the query result. The sensitivity states the maximum difference of the query results due to adding or removing an entity from the database:

Definition 1. For a function $f: D \to \mathbb{R}^d$ over a certain domain D, the sensitivity of a function f is

Df maxA;B;AB1 f A f B k k1    (8)
For real-valued functions, the most popular differential privacy approach [15] is to add independently identical distributed (i.i.d) noises from the Laplace distribution to the query result, which provides differential privacy:

Theorem 1. For a function $f: D \to \mathbb{R}^d$ over a certain domain D, a random function

Mx f x Laplace Df d    (9)

provides differential privacy [15].

3. Problem definition

We first introduce how we model the behaviors of users. We next present the requirements of recommending friends based on the numbers of common friends.

3.1. Adversary model

Users Trust Friends. We assume that users trust their friends and allow their friends to visit their profiles including their friend lists, since online friends correspond to the real-world friendships. In the social graph, it means that each user trusts his one-hop neighbors. This model allows for the anonymous communication process on the DOSN, since each user can recursively sends messages to his or her friends.

Users Are Semi-honest. We further assume that users are semi-honest [22]: users may be curious to learn the friend lists of non-friend, but they follow the common-friend measurement protocol and do not claim fake friends about his/her friend list. Particularly, each user is able to eavesdrop on the logical communication links connecting that user.

Although the semi-honest model is simple, designing algorithms for DOSNs turns out to be non-trivial due to the large scale and the decentralization of nodes. Further, we will extend our model to tolerate Sybil users that could inject fake friend-list information into the system (see Section 10).

3.2. Disclosing friend lists leaks privacy

To recommend each user a set of candidate friends based on the NCF metric, the OSN needs to know the sizes of the intersections of the friend lists among two-hop users.

Unfortunately, directly disclosing the friend lists to non-friend causes severe privacy leakage. This is because the friendship links on the social graph can serve as the fingerprints to locate users' activities [23]. For example, users' personal profiles could be accurately inferred from their friends even they hide their profiles [8]. Therefore, to better protect users' privacy, centralized OSNs allow for users to hide friend lists from non-friend.

Further, disclosing friend lists to non-friend is also vulnerable to the Identity Cloning Attack (ICA) [9]. Suppose that a user T obtains a large number of friend lists of users on the OSNs. To establish the friend link with each user A, user T creates an account with a friend list by copying from friend lists of user A and some other users for randomization. Then user T will have many common friends with user A, therefore, user T will be likely to be listed in the top-k recommended users of user A according to the NCF metric.

3.3. Problem requirements

We next summarize key requirements for computing pairwise NCFs in DOSN settings:

- Privacy-preserving. Friend lists should be treated as privacy information since disclosing them may leak users' privacy and suffer from the ICA attack in Section 3.2. However, hiding the friend lists from non-friend significantly increases the difficulty of computing the NCF values.
- Decentralized. As there exist no centralized entities that are able to collect the friend lists of all users on the DOSN, users have to compute the NCF values with their two-hop neighbors on the social graph in a distributed manner. Further, the NCF-computing process among two-hop online users should still work despite some users may join or leave the DOSN at will.
- Scalable. Computing the NCF values should require low bandwidth overhead and scale well with increasing friends, since each user only has limited computing and bandwidth resources.

3.4. A naive privacy-preserving NCF-computation solution

One simple decentralized approach is to let each friend B of a user A be the broker of computing the NCF values for user A [?]. User B computes the NCF values between A and those in B's friend list. To that end, user B requests the friend lists from user A and one friend C, and then calculates the intersection of two friend lists of A and C. We can see that the friend lists are always kept at one-hop neighbors. Since each user trusts one-hop neighbors on the social graph, the privacy of the friend list is preserved.

For calculating the common friends between any pair of two-hop users, each broker j needs $O\big(\sum_{p \in List(j)} List(p)\big)$ communication overhead, where $List(\cdot)$ denotes the list of friends of a user. The computation overhead is $O\big(\sum_{p \in List(j)} \sum_{q \in List(j),\ p \neq q} |List(p)|\,|List(q)|\big)$ for querying the existence of each item over the friend lists. Due to the skewed distributions of friend lists shown in Fig. 1(a), the DOSN faces severely imbalanced computation and transmission costs.

4. Analyzing the NCF metric of the online social networks

To better understand the characteristics of pairwise NCF values, we next analyze the NCF metric with OSN data sets.
[Fig. 1. The CCDF plots of key statistical metrics: (a) Degree distribution (x-axis: # Friends); (b) NCF distribution (x-axis: NCF); (c) Triangle inequality violation (x-axis: Yik/(Yij+Yjk)). Y-axis: Pr (X>x). Series: Facebook, YouTube, Flickr.]

4.1. Data sets

We choose three representative social graphs that are all collected by the online social networks research group in the Max Planck Institute for Software Systems [24–26]. Table 2 summarizes the basic information for each of the networks.

Table 2. Dataset summary.
Network | Date | # Users | # Links | Authors
Facebook | 12/29/2008, 1/3/2009 | 60,290 | 1,545,686 | Viswanath et al. [24]
Flickr | 11/2/2006, 104 days | 2,570,535 | 33,140,018 | Cha et al. [25]
YouTube | 1/15/2007 | 1,157,827 | 4,945,382 | Mislove et al. [26]

4.2. Distributions of numbers of friends

We first plot the Complementary Cumulative Distribution Function (CCDF) of the number of friends for all users. Fig. 1(a) shows that the sizes of friend lists are quite non-uniform. Most friend lists have modest sizes, but a small number of friend lists could have thousands of friends. For Facebook and YouTube data sets, only 5% of users have more than 100 friends, while for the Flickr data set, over 30% of users have over 100 friends.

4.3. Distributions of numbers of common friends

We then plot the NCF values between online users in Fig. 1(b). The NCF values are non-uniformly distributed. For the Facebook and YouTube data sets, most one-hop or two-hop users have only one common friends, but there exists 10% of pairs having a large number of common friends. For the Flickr data set, there are much more common friends than the other two data sets, where around 30% of pairs have more than 500 common friends.

4.4. Triangle inequality violations of numbers of common friends

We next test whether the pairwise NCF values meet the triangle-inequality assumption required by the metric space model. The triangle inequality states that the sum of two NCF values is always not smaller the third NCF value for a triple $(i, j, q)$: $Y_{ij} + Y_{jq} \ge Y_{iq}$. We define the triangle inequality violation (TIV) as $TIV_{ijq} = \frac{Y_{iq}}{Y_{ij} + Y_{jq}}$. When $TIV_{ijq} > 1$, this triple $(i, j, q)$ has a TIV.

From Fig. 1(c), most triples follow the triangle inequality, but there exist about 5% of triples violating the triangle inequalities. Further, for the YouTube and Flickr data sets, some triples may have large degrees of the triangle inequality violation. Since the pairwise NCF values do not fulfill the metric-space requirements, there exist inherent distortions for predicting NCF values with the metric space based methods such as Landmark-MDS.

5. Designing a privacy-preserving NCF computation method for DOSNs

We next present a scalable and privacy-preserving NCF prediction method that measures NCF values with constant bandwidth overhead.

5.1. Intuitions

Since our major objective is to recommend friends based on top-k NCF values, we do not necessarily know the exact list of common friends. Further, the recommendation also tolerates a low degree of prediction errors, as different people have varying subjective views on friends. We can obtain a more efficient NCF computation method with sketches of friend lists that have compact storage structures. For obtaining the pairwise NCF values scalably with the protection of users' privacy, our work applies the well-studied Bloom filters and the collaborative filtering methods to measure NCF values the decentralized OSN settings.

5.2. Architecture

We present the main components in the CommonFinder method. For ease of presentation, assume that a user Bob logs into the DOSN. As plotted in Fig. 2, at the top level, CommonFinder provides two interfaces for the upper-layer friend recommendation application:
- Top-k ranking: Selects at most k two-hop users having the largest numbers of common friends with Bob. These users will serve as the recommended friends for Bob. The number of common friends for a pair of users is computed via the NCF-prediction component.
- NCF prediction: Computes the NCF value between Bob and any other DOSN user based on their decentralized coordinates. Each user's coordinates is freely exchanged among DOSN links.

The coordinate-maintenance component manages a user's coordinate in a fully distributed manner. Each user is assigned a random coordinate when he/she logs into the system; afterwards, a daemon incrementally maintains each user's coordinate.

For maintaining the coordinates, the neighbor-management component selects neighbors from online users having non-zero NCF values. This neighbor-management component sends heartbeat messages to these users and pulls back online users' coordinates via the piggyback messages.

The Bloom-filter component represents each user's friend list with a Bloom filter. It uses a bit array to represent a list of items. Each item is hashed to a number of bits according to a set of hash functions. The storage and transmission overhead are reduced by a constant factor compared to the friend lists. To test whether an element is in the friend list, we compute the bits in the bit array using the same set of hash functions, and test whether these bits are all set to ones. If yes, then the Bloom filter returns that the element is in the friend list.

[Fig. 2. The main components of the NCF computation method: Top-k Ranking, NCF prediction, Bloom-filter maintenance, Coordinate maintenance, Neighbor management.]

5.3. Exchanging Bloom filters for privacy protection

We select the Bloom filter as the sketch structure. Each Bloom filter has a certain probability of false positives, i.e., it may claim a user that is not in the friend list to be in the list. The false positives are often treated as one drawback of the Bloom filter. However, each user can state that he/she is not represented by the Bloom filter because of the uncertainty caused by false positives. Therefore, we can use the false positives to protect the privacy of users' friend lists, since a curious user cannot exactly infer who is in the friend list by querying the sketch.

Suppose that we represent each friend list with a Bloom filter, two users can exchange their Bloom filters to each other without worrying about the loss of privacy of the friend lists. In Section 8.2, we prove that the Bloom filter provides differential privacy for the friend lists.
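The Bloom-filter sketch of Sections 5.2 and 5.3, as an added example that is not code from the paper: one user counts how many of its own friends test positive in a peer's filter and corrects the count for false positives. The double-hashing scheme, the correction formula, the filter sizes and the friend identifiers are assumptions made for this illustration; the sample friend lists are constructed.

```python
import hashlib


class BloomFilter:
    """Bit-array sketch of a friend list: m bits, k hash functions."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # a Python int used as the bit array

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest (double hashing,
        # an assumption of this example, not the paper's hash family).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step size
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # Member only if every one of the k bits is set to one.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def false_positive_rate(self):
        # Estimated from the observed fill ratio: (set bits / m) ** k.
        fill = bin(self.bits).count("1") / self.m
        return fill ** self.k


def sketch(friends, m_bits, k_hashes):
    """Build the Bloom filter a user would hand to a non-friend."""
    bf = BloomFilter(m_bits, k_hashes)
    for friend in friends:
        bf.add(friend)
    return bf


def estimate_ncf(my_friends, peer_filter):
    """Return (raw hits, hits corrected for expected false positives)."""
    hits = sum(1 for f in my_friends if f in peer_filter)
    fp = peer_filter.false_positive_rate()
    if fp >= 1.0:  # saturated filter carries no information
        return hits, 0.0
    # hits ~ ncf + (n - ncf) * fp  =>  ncf ~ (hits - n * fp) / (1 - fp)
    corrected = (hits - len(my_friends) * fp) / (1.0 - fp)
    return hits, max(0.0, corrected)


# Constructed sample data: two friend lists of 100 identifiers, 30 shared.
ALICE = [f"id-{n:04d}" for n in range(0, 100)]
BOB = [f"id-{n:04d}" for n in range(70, 170)]
STRANGERS = [f"id-{n:04d}" for n in range(1000, 2000)]  # in neither list
TRUE_NCF = len(set(ALICE) & set(BOB))


def run(m_bits, k_hashes):
    """Alice estimates her NCF with Bob from Bob's filter alone."""
    bob_filter = sketch(BOB, m_bits, k_hashes)
    hits, corrected = estimate_ncf(ALICE, bob_filter)
    # Fraction of non-friends that the filter also reports as members:
    # this is the uncertainty a listed user can point to.
    stranger_hits = sum(1 for s in STRANGERS if s in bob_filter)
    return {
        "hits": hits,
        "corrected": corrected,
        "stranger_hit_rate": stranger_hits / len(STRANGERS),
    }


if __name__ == "__main__":
    for m_bits, k_hashes in ((4096, 4), (256, 3)):
        print(m_bits, k_hashes, TRUE_NCF, run(m_bits, k_hashes))
```

The small filter shows the trade-off the section describes: many non-friends also test positive, so a positive answer does not identify a friend, while the NCF estimate becomes noisier.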
5.4. Scale the NCF computation with decentralized prediction

The transmission and computation overhead of the Bloom filter are still linear with the sizes of friend lists. For controlling these overhead, we predict NCF values with decentralized coordinates.

5.4.1. Overview

We compute the NCF values with the Bloom filters for a small fraction of users, then we predict the NCF values for the rest of pairs of two-hop users with low-dimensional coordinates. Accordingly, each user has a low-dimensional coordinate, the NCF value between a pair of users amounts to the coordinate distance between these two users. Estimating NCF values with decentralized coordinates has at least two desirable advantages:

- Scalability: The coordinates incur fixed computation and communication costs, which are independent of the sizes of the friend lists.
- Privacy-protection: Since coordinates are low-dimensional real-valued vectors, exchanging coordinates only reveals the NCF values without the information of the friend lists.

Let $Y$ denote a $N$-by-$N$ matrix that denotes the pairwise NCF values for pairs of users. For correctly predicting NCF values, the coordinates must be optimized with respect to an objective function. Selecting an appropriate objective functions is non-trivial:

- Sparse: The NCF matrix $Y$ may be quite sparse, because $Y_{ij}$ amounts to zero for any pair $(i, j)$ of users that are more than two hops away on the social graph.
- Distributed: Since users are decentralized in DOSN, optimizing the coordinates has to be performed in a distributed manner, where each user adjusts his/her own coordinate adaptively with respect to a number of neighbors.

Predicting NCF values is analogous to completing the rating scores by users in the collaborative filtering field, since both are discrete integers. We thus propose to predict NCF values for decentralized pairs of users by borrowing the success of the collaborative filtering field [27]. The MMMF method (see Section 2.3) is robust to missing items, but we have to transform the centralized MMMF

The matrix $\hat{X}$ is represented by a linear combination of

users. For each user i, we set its coordinate as two low-dimensional vectors $(\vec{u}_i, \vec{v}_i)$ and a vector of threshold values $\vec{\theta}_i$, where $\vec{u}_i$ denotes ith row vector of $U$, $\vec{v}_i$ denotes the ith column vector of $V$, and $\vec{\theta}_i$ represents the ith row vector of $\theta$. Accordingly, the coordinate distance $\hat{X}_{ij}$ between two users i and j amounts to the dot product $\hat{X}_{ij} = \vec{u}_i \vec{v}_j$.

Second, CommonFinder fully utilizes the low-dimen

tion overhead. Therefore, CommonFinder scales well with decentralized users on the DOSNs.

Finally, as NCF has been a very popular metric for recommending friends on centralized OSNs, CommonFinder can serve as a replacement for existing NCF computation in centralized OSNs, which can improve the scalability and increase the responsiveness to users' requests.

6. NCF Prediction with decentralized coordinates

We next present how to estimate the NCF between any pair of users combining the Bloom filters and the

5.4.3. NCF prediction

To predict NCF values to another user j, each user i maps the coordinate distance $\hat{X}_{ij}$ to a discrete NCF value with his/her own threshold vector $\vec{\theta}_i$. Algorithm 1 summarizes key steps to compute the NCF values with coordinates. Step 2 computes the distance $\hat{X}_{ij}$. Steps 3–7 determine the index of the interval that contains $\hat{X}_{ij}$. We can see that user i does not need user j's threshold vector. Fig. 3 shows an example of computing NCF values by users.
To predict NCF values to another user j, each user imapsthe coordinate distance bXij to a discrete NCF value with his/her own threshold vector ~hi. Algorithm 1 summarizes keysteps to compute the NCF values with coordinates. Step 2computes the distance bXij. Steps 37 determine the indexof the interval that contains bXij. We can see that user i doesnot need user js threshold vector. Fig. 3 shows an exampleof computing NCF values by users.
Algorithm 1. Mapping the coordinate distance to the NCFvalue.
1 MapXtoY (~ui, ~v j, ~hi)input: ~ui: user is coordinate component, ~v j: user jscoordinate component,~hi: User is threshold vector.output: y
2 bXij ~ui ~v j;3 y = 1;4 for l 1! L 1 do5 y bXij P hil?1 : 0;6 end7 return y;two lowrank matrices: bX U V , where U is a N dmatrix, V is a d N matrix, and d N. Each row vectorof the matrix bX is represented by the inner product ofvectors from U and V , i.e., bXij Pdm1uimvmj. The threshold matrix h hilf g, where i 2 S; l 2 1; L 1 ,is learnt from MMMFs optimization process.
For predicting NCF values with the MMMF method, weassign row vectors of the matrices bX and h to decentralizedmethod to a decentralized MMMF method since users aredecentralized in the DOSN context. We solve the distributed MMMF method with a novel decentralized conjugateoptimization method that converges quickly to stablepositions.
5.4.2. Coordinate structureWe next introduce the structure of the coordinates.
Let S denote a set of DOSN users. Let N be the size ofset S. Let d be a constant parameter (d N). Recall thatthe MMMF method computes a NbyN matrix bX and aNbyL 1 threshold matrix h for predicting discreterating scores. The ith row vector of matrix h is used toseparate real values of the ith row vector in matrix bXto discrete values.fore exchanging coordinates among nonfriend users doesnot disclose users friend lists.sional coordinates for calculating pairwise NCF values thathave constant transmission bandwidth costs and computa5.4.4. Coordinate disseminationIn order to predict pairwise NCF values, users need to
exchange the lowdimensional vectors corresponding tothe matrix factorization model, but hides each usersthreshold vector from other users. For example, each useri can request any other user js coordinate components~uj;~v j. However, we do not allow each user to requestother users threshold vector. As a result, each user is onlyable to predict NCF values from himself/herself to otherusers, but cannot determine the NCF values of an arbitrarypair of users.
Sending the threshold vectors could leak users friendship information. Since knowing the nonzero NCF valuesof a user pair indicates that two users are either friendsor friends of friends, thus the social contacts of users canbe easily derived. Consequently, users privacy settingcould be violated and even worse, users otherwise secretinformation could be inferred as discussed in Section 3.2.
5.5. Benets of decentralized NCF prediction
We next summarize the benets of the decentralizedNCF prediction process.
First, CommonFinder protects users privacy via twocomplement data structures. Bloom lters anonymizeusers identiers and hide the existence of any specicusers with a degree of false positives. Coordinates do notcontain any information about who is in a friend list, there
l=1
0.2 0.3
l=2 l=3
>0.3(0.2,0.3]
thresholdsCoordinate distance
levels
Fig. 3. Mapping the coordinate distance to NCF values.
decentralized coordinates. We rst measure a small number of NCF values with the Bloom lters as inputs to maintain the coordinates. We next present the neighborhoodmanagement process with respect to which the coordi
Alice queries Bobs Bloom lter with her friend list, and
After establishing several links with friends on the social
his/her own coordinate with a set of neighbors coordi
Y. Fu et al. / Computer Networks 64 (2014) 369389 377graph, user i puts these friends into his/her neighbor setSi, otherwise, this user will restart the search processlater.1
1 For bootstrapping users online friends, the DOSN may also advertisepopular users proles to each users.vice versa. The subset of friends represented in theBloom lter is returned as the estimated commonfriends.
Assume that an attacker maintains a dictionary of identiers crawled over the DOSN. The attackers cannot exactlyknow whether an identier is indeed in the set, since theBloom lter is a probabilistic data structure with a false positive probability for a query. The false positives are statistical valid for any query, therefore a user is always able todeny that he/she is a friend of another user due to the falsepositives.
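The exchange-and-query procedure of Section 6.1 and the dictionary query discussed above, as an added Python example that is not the paper's own code. The double-hashing scheme, the user names and the friend lists are assumptions constructed for illustration; m = 2000, c = 100 and the false positive formula are the ones the paper gives in Section 9.2.

```python
import hashlib
import math


def friend_id(name: str) -> bytes:
    """160-bit identifier of a user; the paper derives identifiers with SHA-1."""
    return hashlib.sha1(name.encode("utf-8")).digest()


class DynamicBloomFilter:
    """DBF: a list of standard Bloom filters (SBFs), each holding at most c items."""

    def __init__(self, m: int = 2000, c: int = 100):
        self.m, self.c = m, c
        # k chosen to minimise the false positive probability of one full SBF.
        self.k = max(1, round(m / c * math.log(2)))
        self.sbfs = []     # each SBF is a Python int used as an m-bit array
        self.counts = []   # number of items stored in each SBF

    def _positions(self, ident: bytes):
        # Assumption of this example: k positions by double hashing the digest.
        h1 = int.from_bytes(ident[:8], "big")
        h2 = int.from_bytes(ident[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, ident: bytes) -> None:
        if not self.sbfs or self.counts[-1] >= self.c:
            self.sbfs.append(0)    # the last SBF is full: open a new one
            self.counts.append(0)
        for p in self._positions(ident):
            self.sbfs[-1] |= 1 << p
        self.counts[-1] += 1

    def __contains__(self, ident: bytes) -> bool:
        pos = self._positions(ident)
        return any(all((bits >> p) & 1 for p in pos) for bits in self.sbfs)


def dbf_false_positive(m: int, k: int, c: int, n: int) -> float:
    """False positive probability of a DBF holding n items (Eq. (21))."""
    full = (1 - math.exp(-k * c / m)) ** k
    last = (1 - math.exp(-k * (n - c * (n // c)) / m)) ** k
    return 1 - (1 - full) ** (n // c) * (1 - last)


def build_filter(friends, m: int = 2000, c: int = 100) -> DynamicBloomFilter:
    dbf = DynamicBloomFilter(m, c)
    for name in sorted(friends):
        dbf.add(friend_id(name))
    return dbf


def estimate_common_friends(my_friends, their_filter):
    """Query the other user's filter with one's own friend list."""
    return {f for f in my_friends if friend_id(f) in their_filter}


def demo() -> dict:
    # Constructed sample data: 250 friends each, 40 of them shared.
    shared = {f"user{i}" for i in range(40)}
    alice = shared | {f"a{i}" for i in range(210)}
    bob = shared | {f"b{i}" for i in range(210)}
    alice_bf, bob_bf = build_filter(alice), build_filter(bob)  # exchanged
    by_alice = estimate_common_friends(alice, bob_bf)
    by_bob = estimate_common_friends(bob, alice_bf)
    # Dictionary query by a curious user: 2000 identifiers that are NOT Bob's
    # friends, tested against the default filter and an undersized one.
    strangers = [friend_id(f"stranger{i}") for i in range(2000)]
    tiny_bf = build_filter(bob, m=128, c=100)
    return {
        "true_ncf": len(alice & bob),
        "ncf_seen_by_alice": len(by_alice),
        "ncf_seen_by_bob": len(by_bob),
        "no_false_negatives": shared <= by_alice and shared <= by_bob,
        "sbfs_per_user": len(bob_bf.sbfs),
        "filter_bytes": bob_bf.m * len(bob_bf.sbfs) // 8,
        "stranger_hits_default": sum(s in bob_bf for s in strangers),
        "stranger_hits_tiny": sum(s in tiny_bf for s in strangers),
        "eq21_default": dbf_false_positive(2000, bob_bf.k, 100, 250),
    }


if __name__ == "__main__":
    for key, value in demo().items():
        print(f"{key}: {value}")
```

With the defaults m = 2000 and c = 100, Eq. (21) gives a false positive probability of about 1.3 × 10^-4 for a 250-friend list, so a dictionary query against such a filter returns almost only true members; the undersized filter shows the opposite end of the accuracy and deniability trade-off.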
nates are optimized. We then present a distributed algorithm to adjust each user's coordinate adaptively and scalably.
6.1. Measuring NCF values based on Bloom filters
We represent each friend with an identifier created by cryptographic hash functions like SHA-1 for data routing in the DOSN. Then each friend list is represented by a Bloom filter. We use the Dynamic Bloom filter (DBF) [28] to represent a dynamic set of friends because of its simplicity and low computational complexity. The DBF [28] includes multiple standard Bloom filters where each standard Bloom filter stores only c items.
Each user then obtains other nodes' Bloom filters and queries them to estimate the set of common friends with his/her own friend list:
Alice and Bob exchange their Bloom filters.
6.2. Neighbor management
Each user i maintains a set $S_i$ of neighbors with a maximal capacity $S_m$ ($S_m = 20$ by default). Each neighbor is represented with three fields: the identifier, the Bloom filter and the coordinate.
The neighbor set $S_i$ covers user i's friends. If a user has too few friends, we also include a randomized set of two-hop users as neighbors. But we deliberately set users' friends to have a higher priority than two-hop users since friends are trusted. The two-hop users help if there are too few friends. Since we probe NCF values with Bloom filters, users' privacy is still preserved.
When a user i joins the DOSN system, he/she first finds real-life friends that have already registered in the DOSN.
Each online user has a daemon process that triggers a periodical procedure to sample two-hop users on the social graph for user i. The daemon process randomly selects one of user i's friends, say user j, as the counterpart for communication. Then user i's daemon requests the daemon process at user j to send the contact address of a friend of user j. The daemon process at user j then selects a user q uniformly at random from his own friend list and piggybacks user q's data to the daemon process at user i. Finally, the daemon process at user i puts user q's data into set $S_i$. If user j does not respond within 20 s, the daemon process at user i removes user j from his/her neighbor set.
6.3. Decentralized MMMF optimization
We next formulate the objective function of optimizing users' coordinates for accurate NCF prediction. We formulate the problem of the NCF prediction with the MMMF method. Readers are referred to Section 2.3 for the background of the MMMF method.
We formulate the objective of predicting NCFs with MMMF's optimization function in Eq. (6) as:
$$J(u, v, \theta) = \sum_{r=1}^{L-1} \sum_{(i,j) \in \Omega} h\left(T_{ij}^{r}(r, Y_{ij}) \cdot (\theta_{ir} - \vec{u}_i \vec{v}_j)\right) + \frac{\lambda}{2} \sum_{i=1}^{N} \left(\|\vec{u}_i\|_F^2 + \|\vec{v}_i\|_F^2\right) \qquad (10)$$
where $\Omega$ denotes the set of pairs of users with observed NCF values and $\lambda$ is a regularized constant. We can see that Eq. (10) requires all users' information; as a result, solving Eq. (10) belongs to the centralized optimization problem where an entity collects the NCF matrix Y and computes the coordinates for all users. Unfortunately, such a centralized formulation not only leaks the friendship information of all users, but also does not adapt to the decentralization of DOSN users where users may leave the system dynamically.
As a result, we need a decentralized optimization process that is able to adapt to the dynamics of DOSN users. To that end, we decompose Eq. (10) to a set of distributed optimization problems that are solved by decentralized users. The key idea is to let each user iteratively optimize
nates in a fully decentralized manner. We decompose Eq. (10) into N sub-objectives consisting of coordinates and NCF measurements among each user i and his/her neighbors $S_i$:
$$J_Y(\vec{u}_i, \vec{v}_i, \vec{\theta}_i) = \sum_{r=1}^{L-1} \sum_{j \in S_i} h\left(T_{ij}^{r}(r, Y_{ij}) \cdot (\theta_{ir} - \vec{u}_i \vec{v}_j)\right) + \frac{\lambda}{2} \left(\|\vec{u}_i\|_F^2 + \|\vec{v}_i\|_F^2\right) \qquad (11)$$
The objective $J_Y(\vec{u}_i, \vec{v}_i, \vec{\theta}_i)$ seeks to optimize user i's coordinates. Since Eq. (11) is defined for each user i separately, each user i is able to optimize Eq. (11) in a
distributed manner based on the local information of its neighbors.
6.4. Distributed coordinate maintenance
We adjust each user's coordinate in a fully distributed manner via a novel implementation of the PRCG method. We first present how we transform the centralized PRCG method to a distributed algorithm for adjusting users' coordinates. We next adapt the coordinate maintenance process to the dynamics of decentralized coordinates.
6.4.1. Decentralized conjugate gradient optimization
PRCG is an iterative optimization algorithm. To minimize a nonlinear function $f(x)$, it iteratively updates the vector $\vec{x}$ according to the conjugate direction of $\vec{x}$ until reaching a local minimum.
Transforming the iterative PRCG process to a distributed PRCG method (DPRCG) algorithm is straightforward. Each user i optimizes Eq. (11) in separate rounds. For each round, user i adjusts the vector $\vec{x}_i$ once by one round of the PRCG method. Further, every two adjacent rounds are separated by $\tau$ seconds.
(i) Setting Parameters.
We define the vector $\vec{x}_i$ for user i with the concatenation of coordinate components of i, i.e.,
$$\vec{x}_i = [\vec{u}_i, \vec{v}_i, \vec{\theta}_i]$$
We next derive the gradient of the vector $\vec{x}_i$ as
$$\nabla_x J_Y(\vec{x}_i) = \left[\frac{\partial J_Y}{\partial u_i}, \frac{\partial J_Y}{\partial v_i}, \frac{\partial J_Y}{\partial \theta_i}\right] \qquad (12)$$
where $\frac{\partial J_Y}{\partial u_i}$, $\frac{\partial J_Y}{\partial v_i}$ and $\frac{\partial J_Y}{\partial \theta_i}$ are partial derivatives with respect to the coordinate components $\vec{u}_i$, $\vec{v}_i$ and $\vec{\theta}_i$:
$$\frac{\partial J_Y}{\partial u_{iz}} = \lambda u_{iz} - \sum_{j \in S_i} \sum_{r=1}^{L-1} T_{ij}^{r}(r, Y_{ij}) \, h'\left(T_{ij}^{r}(r, Y_{ij}) \cdot (\theta_{ir} - \vec{u}_i \vec{v}_j)\right) v_{jz}$$
and
$$\frac{\partial J_Y}{\partial v_{iz}} = \lambda v_{iz} - \sum_{j \in S_i} \sum_{r=1}^{L-1} T_{ji}^{r}(r, Y_{ji}) \, h'\left(T_{ji}^{r}(r, Y_{ji}) \cdot (\theta_{jr} - \vec{u}_j \vec{v}_i)\right) u_{jz}$$
for $z \in [1, d]$, and
$$\frac{\partial J_Y}{\partial \theta_{ir}} = \sum_{j \in S_i} T_{ij}^{r}(r, Y_{ij}) \, h'\left(T_{ij}^{r}(r, Y_{ij}) \cdot (\theta_{ir} - \vec{u}_i \vec{v}_j)\right)$$
for $r \in [1, L-1]$.
(ii) Bootstrapping.
First, we initialize necessary parameters for bootstrapping the optimization process. Each user i's coordinate is initialized as a randomized vector, where each dimension is selected from the interval $[-10, 10]$. Then, user i adjusts his coordinate periodically, separated by $\tau$ seconds.
We set the steepest direction and the conjugate direction for each user i:
Compute the steepest direction $\Delta x$ as the reverse direction of the gradient $\nabla_x J_Y(\vec{x}_i)$:
$$\Delta x = -\nabla_x J_Y(\vec{x}_i)$$
since the reverse of $\nabla_x J_Y(\vec{x}_i)$ indicates the direction of the maximum decrease for $f(x)$.
Set the conjugate direction $\Lambda x_1$ with the steepest direction $\Delta x$ in the first round.
(iii) Incremental Adjustment.
Using the steepest direction to adjust the coordinate is sensitive to input noises such as the coordinate oscillation or erroneous NCF values. We therefore use the conjugate direction to adjust the coordinates. At each round l > 1, we update the conjugate direction $\Lambda x$ of vector $\vec{x}$ as the weighted sum of the steepest direction and the conjugate direction of the last round:
$$\Lambda x_l = \Delta x_l + \beta_l \Lambda x_{l-1} \qquad (13)$$
where $\beta_l > 0$ controls the capacity of the conjugate direction in the current round. We use a revised Polak-Ribière scalar:
$$\beta_l = \max\left\{0, \frac{\Delta x_l^{T} (\Delta x_l - \Delta x_{l-1})}{\Delta x_{l-1}^{T} \Delta x_{l-1}}\right\} \qquad (14)$$
that automatically resets the conjugate direction. The parameter $\beta_l$ trades off the robustness and the convergence speed: decreasing $\beta_l$ favors the steepest direction, but becomes more sensitive to noises, while increasing $\beta_l$ reduces the effect of the steepest direction, but slows down the convergence speed of the optimization process.
We determine the step $\alpha_l$ of moving the vector $\vec{x}$. We use the well-studied line-search method [21] to be aware of the convergence status of the vector $\vec{x}$. The line search method minimizes the objective function (11). The more accurate the vector $\vec{x}$ is, the smaller the movement step becomes. Therefore, the coordinates become more stable with increasing rounds of adjustments.
We then adjust the vector $\vec{x}$ at a small distance with the conjugate direction $\Lambda x_l$ and the movement step $\alpha_l$:
$$\vec{x} \leftarrow \vec{x} + \alpha_l \Lambda x_l \qquad (15)$$
Algorithm 2 summarizes the above process for adjusting the coordinates. Step 2 calculates the concatenation $\vec{x}_i$ of the coordinate components of user i. Then step 3 computes the new steepest direction $\Delta$ as the reverse direction of the current gradient of $\vec{x}_i$. Step 4 then computes the Polak-Ribière scalar $\beta$ that trades off well between the robustness of the optimization process and the convergence speed. Step 5 next updates the conjugate direction with the steepest direction $\Delta$ and the conjugate gradient of the last round. Then step 6 calculates the moving step. Step 7 next adds the vector $\vec{x}_i$ with $\alpha_i$ times of conjugate gradient direction. Finally, steps 8-12 cache the steepest direction and the conjugate direction and then reconstruct user i's coordinate components based on the updated vector $\vec{x}_i$.
From Algorithm 2, the lengths of the steepest direction and the conjugate direction amount to twice the size of the coordinate. The overall space overhead for Algorithm 2 is $O(|S_i| (2d + L))$. The communication complexity is linear with the number of neighbors. In practice, we limit the number $|S_i|$ of friends and the length d of the coordinate vector to be around 20, so the communication complexity and the storage overhead of Algorithm 2 are quite modest.
6.4.2. Minibatch based coordinate update
As users' coordinates are modified at each round by Algorithm 2, we have to adapt to the dynamics of users' coordinates. One simple approach is to let each user i probe the NCF values to neighbors in set $S_i$ and to request these users' coordinates at each round. However, frequently measuring NCF values or requesting coordinates to all neighbors incurs heavy traffic at nodes with large friend lists. Therefore, we have to adapt to the coordinate modifications of neighbors and control the measurement traffic.
We follow the well-studied minibatch approach [29-31] proposed in the network-coordinate field. The minibatch approach reuses historical positions of most neighbors' coordinates, since historical coordinates become closer to the current positions as coordinates converge to stable positions.
Each user caches the NCF values and coordinates of all neighbors in the memory; at each round, each user refreshes this cache by probing the NCF value to one neighbor and requests this neighbor's coordinate, and then updates his coordinate with respect to all cached neighbors according to recorded historical records. The minibatch approach works quite well in practice. For instance, CommonFinder's coordinates converge within twenty rounds.
Algorithm 2. Distributed Polak-Ribière nonlinear conjugate gradient optimization algorithm.
1 Update($\vec{u}_i$, $\vec{v}_i$, $\vec{\theta}_i$, $\Delta x_i$, $\Lambda x_i$, $S_i$, Y)
  input: a user i's current coordinate $\vec{u}_i$, $\vec{v}_i$, $\vec{\theta}_i$, a user i's steepest direction $\Delta x_i$, a user i's conjugate direction $\Lambda x_i$, the set of neighbors $S_i$, the NCF values from a user i to its neighbors in $S_i$, the coordinates of neighbors in $S_i$.
2 $x_i \leftarrow [\vec{u}_i, \vec{v}_i, \vec{\theta}_i]$;
3 $\Delta \leftarrow -\nabla_x J_Y(x_i)$;
4 $\beta \leftarrow \frac{\Delta^{T} (\Delta - \Delta x_i)}{\Delta x_i^{T} \Delta x_i}$;
5 $\Lambda \leftarrow \Delta + \beta \Lambda x_i$;
6 $\alpha_i \leftarrow \arg\min_{\alpha_i} J_Y(x_i + \alpha_i \Lambda)$;
7 $x_i \leftarrow x_i + \alpha_i \Lambda$;
8 $\Delta x_i \leftarrow \Delta$;
9 $\Lambda x_i \leftarrow \Lambda$;
10 $\vec{u}_i \leftarrow x_i(1 : d)$;
11 $\vec{v}_i \leftarrow x_i(d + 1 : 2d)$;
12 $\vec{\theta}_i \leftarrow x_i(2d + 1 : 2d + L)$;
nate distances. Unfortunately, since we only care about the
lists of top-k users from his/her friends. The top-k users in the merged list will be returned as the top-k recommended friends. The initiator initializes an empty list List and then merges each newly received list $\widehat{List}_j^k$ sent from his/her friend j. Both $List_i$ and $\widehat{List}_j^k$ are represented with the linked-list data structure for flexible expansion.
Theorem 2 confirms that merging the local top-k users yields the correct results. Extending to cases where some NCF values are identical is straightforward, since users having the same NCF values with the initiator are exchangeable with each other.
Theorem 2. For ease of presentation, we assume that pairwise NCF values are different. Suppose that an initiator i seeks the top-k ranking list of users. By merging local ranking lists from friends of the initiator i, user i has the correct top-k ranking list.
Proof. Let (i, j) and (j, q) be two pairs of friends in the social graph. Assume that user q is among the top-k ranking list for i, but is pruned from the local merging step by user j. We will prove by contradiction.
Since user q is pruned by user j, user j's top-k ranking list for i must exclude user q. In other words, user j has at least k friends having larger NCF values with i than user q. As a result, user q must not be among the top-k ranking list for user i. Thus, a contradiction happens. Therefore, user q must be included in the top-k ranking list of the initiator i, and the correct top-k list is preserved. □
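The spreading and merging steps of Section 7, the claim of Theorem 2 and the level mapping of Algorithm 1, as an added example that is not the authors' code. The graph, the coordinates and the helper names are constructed; as an assumption, friends rank candidates by the coordinate distance to the initiator's $\vec{u}_i$, so the initiator's threshold vector never leaves the initiator.

```python
import heapq

# Constructed sample data. FRIENDS is a symmetric friendship graph, COORDS maps
# a user to (u, v), THETA_I holds the initiator's two thresholds from Fig. 3.
FRIENDS = {
    "i": {"a", "b", "c"},
    "a": {"i", "p", "q", "r"},
    "b": {"i", "q", "s", "t"},
    "c": {"i", "r", "t", "w"},
    "p": {"a"}, "q": {"a", "b"}, "r": {"a", "c"},
    "s": {"b"}, "t": {"b", "c"}, "w": {"c"},
}
COORDS = {
    "i": ((1.0, 0.5), (0.30, 0.30)),
    "a": ((0.4, 0.2), (0.40, 0.40)),
    "b": ((0.3, 0.6), (0.35, 0.45)),
    "c": ((0.5, 0.1), (0.45, 0.35)),
    "p": ((0.2, 0.2), (0.10, 0.10)),
    "q": ((0.6, 0.3), (0.30, 0.20)),
    "r": ((0.1, 0.4), (0.20, 0.10)),
    "s": ((0.3, 0.3), (0.05, 0.02)),
    "t": ((0.7, 0.2), (0.25, 0.20)),
    "w": ((0.2, 0.5), (0.02, 0.30)),
}
THETA_I = (0.2, 0.3)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def map_x_to_y(u_i, v_j, theta_i):
    """Algorithm 1: map the coordinate distance to a discrete NCF level."""
    x_ij = dot(u_i, v_j)
    y = 1
    for theta in theta_i:          # the L-1 thresholds of user i
        y += 1 if x_ij >= theta else 0
    return y


def local_top_k(initiator, j, k, friends, coords):
    """Spreading step, run by friend j on behalf of the initiator."""
    u_i = coords[initiator][0]     # shipped with the ranking task (assumption)
    candidates = friends[j] - friends[initiator] - {initiator}
    scored = [(dot(u_i, coords[q][1]), q) for q in candidates]
    return heapq.nlargest(k, scored)


def merged_top_k(initiator, k, friends, coords):
    """Merging step, run by the initiator over the lists sent back."""
    best = {}
    for j in friends[initiator]:
        for score, q in local_top_k(initiator, j, k, friends, coords):
            best[q] = score        # the same user may arrive via several friends
    return heapq.nlargest(k, ((s, q) for q, s in best.items()))


def naive_top_k(initiator, k, friends, coords):
    """Baseline: collect every two-hop user's coordinate and sort centrally."""
    u_i = coords[initiator][0]
    two_hop = set().union(*(friends[j] for j in friends[initiator]))
    two_hop -= friends[initiator] | {initiator}
    return heapq.nlargest(k, ((dot(u_i, coords[q][1]), q) for q in two_hop))


def demo(k=2):
    """Top-k recommendation for user i with the NCF level of each result."""
    merged = merged_top_k("i", k, FRIENDS, COORDS)
    u_i = COORDS["i"][0]
    levels = {q: map_x_to_y(u_i, COORDS[q][1], THETA_I) for _, q in merged}
    return [q for _, q in merged], levels


if __name__ == "__main__":
    print(demo())
    print(naive_top_k("i", 2, FRIENDS, COORDS))
```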
7. Top-k ranking
Having presented the decentralized coordinates, we next show how to use coordinates to select top-k users having the highest NCF values as recommended friends on the DOSN.
A naive approach is to aggregate all coordinates of two-hop users and to compute the descending order of coordi
top-k users, most of the communication is wasteful.
We present a divide-and-conquer based method to reduce the transmission bandwidth costs.
Spreading: A user (called the initiator) spreads the ranking task to all of his/her friends. Upon receiving the ranking task, each user j filters out those who have already been the initiator's friends and selects k users from the rest of friends having the largest NCF values with the initiator. The NCF values are computed with the decentralized coordinates. Finally, user j piggybacks the initiator with the top-k users.
Merging: The initiator then aggressively merges the
8. Privacy analysis
8.1. Coordinates hide users' friend lists
Recall that in CommonFinder, each user i can send his coordinate component $(\vec{u}_i, \vec{v}_i)$ to any other user. The $(\vec{u}_i, \vec{v}_i)$ components help other users compute the number of common friends with user i (see Algorithm 1).
Intuitively, $(\vec{u}_i, \vec{v}_i)$ are two low-dimensional real-valued vectors, which do not encode any information about user i's friend list. Therefore, disclosing the $(\vec{u}_i, \vec{v}_i)$ components of user i to his non-friend does not leak user i's friend list.
CommonFinder requires to hide each user's threshold vector $\vec{\theta}$. This is because for each user i, his/her threshold vector $\vec{\theta}$ is only useful for computing the numbers of common friends between i and other users. Further, disclosing $\vec{\theta}$ leaks users' friend links. This is because for each user i, a malicious user can compute the set of users having some common friends with user i, indicating that these users are at most two hops away from user i on the social graph. Therefore, these users are either user i's friends or friends of user i's friends. Consequently, the friendship links are disclosed to malicious users.
Suppose a malicious user collects user i's coordinate components $(\vec{u}_i, \vec{v}_i)$, but does not know user i's threshold vector. We can see that this malicious user cannot determine the number of common friends from i to other users, since only the threshold vector uniquely determines the NCF values.
8.2. Bloom filters yield differential privacy
We next show that the Bloom filters also protect the privacy of users' friend lists. Intuitively, a Bloom filter may introduce false positives, i.e., the estimation may not coincide with the ground-truth common friends. For around 99% of all cases, the Bloom filter is able to predict correct common friends. Moreover, we can turn the false positives into a way of protecting the privacy: a user can deny that a user is his (or her) friend, since the Bloom filter could fail due to the false positives. Increasing the size of the Bloom filter may increase the accuracy of estimation, but still fails to distinguish whether a user is in the Bloom filter with 100% certainty.
8.2.1. Sensitivity of a Bloom filter
Let $m_I$ and $k_I$ be the length of the bit array I and the number of hash functions for a Bloom filter. Let $f : S \to BF_S$ be a function that maps a set of identifiers S to a Bloom filter $BF_S$. Given two sets $S_1 = \{x_1, \ldots, x_n\}$ and $S_2 = \{x_1, \ldots, x_n, x_{n+1}\}$ that differ in only one identifier $x_{n+1}$, we compute the L1 difference of the bit arrays of the two Bloom filters $BF_{S_1}$ and $BF_{S_2}$:
$$g(S_1, S_2) = \sum_{i=1}^{m_I} \left| I_{BF_{S_1}}(i) - I_{BF_{S_2}}(i) \right| \qquad (16)$$
in order to reveal the difference of the bit arrays by hashing $S_1$ and $S_2$ into two Bloom filters. Theorem 3 states the expected number of different bits for two Bloom filters representing two sets $S_1$ and $S_2$ that differ in one item.
Theorem 3. Given two sets $S_1$ and $S_2$ satisfying that $S_2 = S_1 \cup \{x_{n+1}\}$. The expected number of different bits between the Bloom filters $BF_{S_1}$ and $BF_{S_2}$ amounts to $E[g(S_1, S_2)] = k_I e^{-n k_I / m_I}$.
Proof. Let $I_1$ and $I_2$ denote the bit arrays for the Bloom filters $BF_{S_1}$ and $BF_{S_2}$, respectively.
We first compute the probability $P_i$ that two ith bits in the bit arrays $I_1$ and $I_2$ are different, i.e., $I_1(i) = 0$ and $I_2(i) = 1$. In other words, $P_i$ is the probability that the ith bit in $BF_{S_2}$ does not change after inserting the item $\{x_{n+1}\}$, which amounts to the probability that $I_1(i)$ is still zero after inserting n keys:
$$P_i = e^{-n k_I / m_I} \qquad (17)$$
since, after inserting n keys into a sized-$m_I$ Bloom filter with $k_I$ hash functions, the probability that a bit is still zero amounts to $\left(1 - \frac{1}{m_I}\right)^{n k_I} \approx e^{-n k_I / m_I}$.
We next show the expected L1 difference between the Bloom filters $BF_{S_1}$ and $BF_{S_2}$. Let the $k_I$ hashed positions for the item $x_{n+1}$ be $a_1, \ldots, a_{k_I}$.
Let $Z_l$ be the indicator function denoting whether two $a_l$-th bits in $BF_{S_1}$ and $BF_{S_2}$ are identical:
$$Z_l = \begin{cases} 1 & I_1(a_l) \neq I_2(a_l) \\ 0 & \text{else} \end{cases}$$
where $a_l \in [1, m_I]$. Let Z be the sum of $k_I$ indicator functions:
$$Z = \sum_{l=1}^{k_I} Z_l \qquad (18)$$
We compute the expected number $E[Z]$ of different bits for $BF_{S_1}$ and $BF_{S_2}$ as:
$$E[Z] = E\left[\sum_{l=1}^{k_I} Z_{a_l}\right] = \sum_{l=1}^{k_I} E[Z_{a_l}] = \sum_{l=1}^{k_I} 1 \cdot P_{a_l} = k_I e^{-n k_I / m_I} \qquad (19)$$
We next plot the expected number $E[Z]$ of different bits as the number of items n increases. From Fig. 4, the number of different bits is mostly lower than four. Further, the expected number of different bits drops quickly as the number of items increases.
Fig. 4. Expected number of different bits with increasing items n. [Plot: n from 0 to 1000 on the horizontal axis, E[different bits] on the vertical axis; curves for m=5000, k=3; m=5000, k=6; m=10000, k=3; m=10000, k=6.]
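Theorem 3 can be checked empirically. The following added example (not from the paper) models the $k_I$ hash functions as uniform random choices, an assumption that matches the proof, and compares the simulated number of flipped bits with $k_I e^{-n k_I / m_I}$ for the four (m, k) settings of Fig. 4.

```python
import math
import random


def expected_changed_bits(m: int, k: int, n: int) -> float:
    """Theorem 3: E[g(S1, S2)] = k * exp(-n*k/m)."""
    return k * math.exp(-n * k / m)


def simulate_changed_bits(m, k, n, trials=3000, seed=1):
    """Average number of bits that flip when item n+1 is inserted."""
    rng = random.Random(seed)
    # Idealised hash functions: every item sets k uniformly chosen positions.
    bits = bytearray(m)
    for _ in range(n * k):
        bits[rng.randrange(m)] = 1
    flipped = 0
    for _ in range(trials):
        # A fresh candidate item x_{n+1}; the filter for S1 is left untouched,
        # so every trial measures the L1 distance g(S1, S1 + {x_{n+1}}).
        new_positions = {rng.randrange(m) for _ in range(k)}
        flipped += sum(1 for p in new_positions if not bits[p])
    return flipped / trials


def figure4_table(ns=(0, 200, 400, 600, 800, 1000)):
    """Theory versus simulation for the four (m, k) settings of Fig. 4."""
    rows = []
    for m, k in ((5000, 3), (5000, 6), (10000, 3), (10000, 6)):
        for n in ns:
            rows.append((m, k, n, expected_changed_bits(m, k, n),
                         simulate_changed_bits(m, k, n)))
    return rows


if __name__ == "__main__":
    print("    m  k     n  theory  simulated")
    for m, k, n, theory, sim in figure4_table():
        print(f"{m:5d} {k:2d} {n:5d}  {theory:6.3f}  {sim:9.3f}")
```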
8.2.2. Differential privacy of Bloom filters
Having shown that the Bloom filters are slightly modified by adding or removing an item, we next derive the capability of the differential privacy for the Bloom filter (see Section 2.4 for a brief introduction).
Theorem 4. Given a sized-$m_I$ Bloom filter with $k_I$ hash functions. Let n be the number of items inserted into this Bloom filter. This Bloom filter provides $k_I \ln\left(1 - \exp(-n k_I / m_I)\right)$-differential privacy.
Proof. For two sets $S_2$ and $S_2'$ that differ in one item, we compute the ratio of the probabilities that two Bloom filters representing sets $S_2$ and $S_2'$ have the same bit array:
$$\frac{\Pr[I_{S_2} = I]}{\Pr[I_{S_2'} = I \mid S_2' = S_2 \cup \{y\}]} = \Pr(y \text{ does not change the bit array of } I_{S_2}) = \prod_{i=1}^{k_I} \Pr\left(I_{S_2}(h_i(y)) = 1\right) = \left(1 - \exp(-n k_I / m_I)\right)^{k_I} = \exp\left(k_I \ln\left(1 - \exp(-n k_I / m_I)\right)\right)$$
We see that
$$\Pr[I_{S_2} = I] \leq \Pr[I_{S_2'} = I \mid S_2' = S_2 \cup \{y\}] \cdot \exp\left(k_I \ln\left(1 - \exp(-n k_I / m_I)\right)\right)$$
Therefore, representing two sets that differ in one item with the Bloom filters provides $k_I \ln\left(1 - \exp(-n k_I / m_I)\right)$-differential privacy. □
Theorem 4 confirms that using the Bloom filter representing a friend list provides differential privacy for users in this friend list. As a result, a malicious user cannot be certain about who is a user's friends.
9. Evaluation
We next test CommonFinder's performance with extensive simulations.
9.1. Experimental setup
9.1.1. Evaluation process
Our evaluation tries to ask whether CommonFinder can estimate the number of common friends accurately and scalably: (1) How does the Bloom filter based NCF estimation method scale for real-world social networks? (2) Is the decentralized coordinate more accurate than existing NCF estimation methods? (3) How does CommonFinder scale for real-world social networks?
The simulation seeks to be general enough for any existing DOSN proposals. Therefore, rather than integrating with specific DOSNs, the simulation focuses on predicting NCF values for pairs of users. The simulation takes the social graph among users as the input and then predicts NCF values among two-hop users with CommonFinder and related methods.
Unfortunately, since most DOSNs are still under development, no DOSN data sets are publicly available yet. Since DOSNs have consistent social functionalities with the centralized OSNs, we use the widely adopted data sets of centralized OSNs introduced in Section 4. These data sets correspond to static social graphs, which do not reveal when each user adds friends. As a result, our simulation assumes that each user has all friends that are available on the social graph.
9.1.2. Comparison methods
We compare CommonFinder with two representative NCF prediction methods, LandmarkMDS, a geometric coordinate based method [32], and ProximityEmbed, a matrix-factorization based method [33]. We configure the parameters of LandmarkMDS and ProximityEmbed according to the recommended parameters.
We represent each user with a 160-bit randomized string generated with a SHA-1 cryptographic hash function. We set CommonFinder's default parameters in Table 3. We choose the coordinate dimension to be five, since further increasing the dimension does not improve the accuracy. We set the default number of neighbors for adjusting the coordinate to 20, which is modest compared to the sizes of most users' friend lists. We configure the regularized parameter $\lambda$ to 0.3, which yields stable accuracy for optimizing the decentralized coordinates. We select neighbors randomly for each user from the whole set of candidate users due to its robustness. Finally, we evaluate CommonFinder's accuracy after each user's coordinate is updated in 60 rounds. In fact, all users' coordinates have converged to stable positions within 20 rounds.
All experiments are performed on a laptop with a 2.13 GHz CPU and 2 GB of RAM. We have implemented all related methods with Matlab 7.0. The experiments are repeated ten times and we report the average results and the corresponding standard deviations.
Table 3. CommonFinder parameters.
Parameter | Value
Coordinate dimension d | 5
Number of neighbors for adjusting coordinates | 20
Regularized parameter $\lambda$ | 0.3
Neighbor-selection method | Random
Rounds of coordinate updates | 60
9.1.3. Performance metrics
To quantify the accuracy of estimations, we use the popular metric Normalized Mean Average Error (NMAE), which is defined as
$$\mathrm{NMAE} = \frac{\sum_{i,j: Y_{ij} > 0} |Y_{ij} - \hat{Y}_{ij}|}{\sum_{i,j: Y_{ij} > 0} Y_{ij}} \qquad (20)$$
where $Y_{ij}$ is the ground-truth value and $\hat{Y}_{ij}$ is the estimated value. Smaller NMAE values correspond to higher prediction accuracy.
9.2. Baseline performance of measuring pairwise NCF values
To tune the coordinate positions, CommonFinder uses the DBF to probe the NCF values from a user to his friends. Moreover, DBF also lists the set of common friends that are not used in the coordinate computation process. The obtained NCF values to a set of neighbors are then treated as the inputs to adjust each user's coordinates. If the NCF measurements by the DBF are inaccurate, the coordinate adjustment will be impaired accordingly. In other words, DBF sets up the upper bound for the prediction accuracy of the NCF values.
We therefore rst characterize the baseline performance of measuring the NCF values for the CommonFinder. We focus on DBFs performance with n 6 1000, since
9.2.1. Sensitivity to Bloomlter parameters
items per SBF also decreases the degradation of the falsepositive probability of DBF.
Second, from Fig. 5(b), the storage size of the DBF increases with decreasing number c of items inserted into aSBF or the SBF size m. For c 6 50, the storage size increasesquickly with increasing SBF size m. As a result, we need tochoose the number c of items to be bigger than 50 in orderto control the increment of the storage size.
We next show the detailed variation of performance byxing c or m. From Fig. 6(a) and (b), we see that increasingthe upper bound c exponentially increases the false positive probability of DBF, but decreases the storage size signicantly. As a result, there is a natural trade off betweenthe false positive probability and the storage size. FromFig. 6(c) and (d), increasing the SBF size m exponentiallydecreases the false positive probability, but also increasesthe storage size. As a result, there are no optimal choicesfor DBFs parameters, we need to select c and m balancingthe false positive probability and the storage size. For the
we next test DBFs accuracy in measuring the pairwise NCF
exchanging DBFs between pairs of twohop users. From
iesnd
upper bound .
(bfi
andability
382 Y. Fu et al. / Computer Networks 64 (2014) 369389Fig. 5. Variation of the performance of DBF as we vary the upper bound cnumber k of hash functions is selected to optimize the false positive probTo investigate DBFs sensitivity to parameter settings,we compute DBFs false positive probability and calculateits storage size with increasing SBF size m and the numberc of items inserted into each SBF.
First, from Fig. 5(a), increasing the SBF size m from 0 to1000 bits signicantly decreases the false positive probability of DBF, compared to DBFs with SBF size m > 1000.As a result, we need to select a SBF size m bigger than1000 bits in order to control the false positive probability.Besides, decreasing the upper bound c of the number of
050
100150
200
01000
20003000
4000
00.20.40.60.8
1
cm
FP
(a) The expected false positive probabilitas a function of the Bloom filter size amost of the number of online userss friends is below1000 as shown in Section 4.1. To scale well with dynamicitems, Dynamic Bloom Filter (DBF) [28] includes multipleCBF where each CBF stores only c items. After having c inserted items, a new CBF is added to store subsequent insertions. The false positive probability of a DBF is:
fm;k;c;n1 1 1ekc=m k bn=cc
1 1ekncbn=cc=m k 21Fig. 7, we see that the storage sizes of all social networktopologies are quite modest, implying that the Bloom lters are quite practical for measuring the common friends.
050100150200
01000
20003000
4000
0
20
40
60
cm
Stor
age
(KB)
) Storage sizes as a function of the Bloomlter size and upper bound
the SBF size m. The number of items inserted into the DBF n 1000. Theof each SBF.values for twohop users on the social graph. We computethe Percentage of incorrect intersection (PII for short) asthe percentage of correct NCF results by DBF.
Table 4 shows the results on three social network topologies. We see that more than 99% of all tests are correct,which implies that the Bloom lter based NCF measurements are quite accurate.
9.2.3. DBF's bandwidth requirements
We next compute the bandwidth requirements of
rest of the paper, we choose c = 100, m = 2000 as the default parameters of the DBF.
9.2.2. DBF's accuracy
Having determined the default parameters for the DBF,
Fig. 6. The expected false positive probability and the storage size of DBF as we vary the upper bound c and the SBF size m, respectively. The default parameters for the experiments are as follows: n = 1000, c = 100, m = 2000. The number k of hash functions is selected to optimize the false positive probability of each SBF. Panels: (a) the expected false positive probability as a function of the upper bound (x-axis: Items Per SBF (c)); (c) the false positive probabilities as a function of the SBF size (x-axis: SBF size (m)); y-axis: FP; curves: n = 10000, n = 1000, n = 500, n = 100.
Table 4. Percentages of incorrect intersection results on three data sets. DBF is configured with the default parameters m = 2000, c = 100.
Data sets    PII
Facebook     0.000015
YouTube      0.000012
Flickr       0.00842
Fig. 7. The storage sizes of DBF for three social network topologies (x-axis: Storage (KB), log scale; y-axis: Pr(X > x); curves: Facebook, YouTube, Flickr).
Fig. 6, panels (b) and (d): (b) storage size as a function of the upper bound (x-axis: Items Per SBF (c)); (d) storage size as a function of the SBF size (x-axis: SBF size (m)); y-axis: Storage (KBytes); curves: n = 10000, n = 1000, n = 500, n = 100.
9.3. Convergence of the coordinates
We next test the convergence of the NCF estimations in rounds of coordinate updates. Each user selects at most twenty friends for the coordinate update in Algorithm 2 in each round.
Fig. 8 shows the dynamics of estimation accuracy with increasing rounds of coordinate updates. The estimation accuracy is quickly improved as the coordinate update rounds increase. The coordinates have converged to stable positions after ten rounds of coordinate updates. Further coordinate updates do not significantly improve the estimation accuracy.
Furthermore, we also measure the computation time of CommonFinder for each user. The averaged computation time of each user on Facebook, YouTube and Flickr is less than 0.1 s, which indicates that the coordinate-update process is quite efficient.
Fig. 8. Convergence of CommonFinder on three data sets (x-axis: Round; y-axis: NMAE; curves: Facebook, YouTube, Flickr).
9.4. Comparison of coordinates based NCF prediction methods
Having confirmed the convergence of the coordinate computation, we next compare CommonFinder's coordinate based NCF prediction method with existing methods. From Fig. 9, we can see that CommonFinder outperforms LandmarkMDS and ProximityEmbed by several times. CommonFinder is also consistently accurate on different social networks. However, we also see that the performance on different data sets varies. This is because different social networks have varying distributions of common friends.
9.4.1. Robustness to incomplete measurements
We next test the robustness of the NCF estimations to incomplete measurements because of offline users. We randomly set a fraction of pairwise NCF values to be zero. Then we compare the NCF estimation results of different methods with the available NCF observations. We set CommonFinder's parameters to be the default values in Table 3.
Fig. 10 plots the dynamics of the accuracy with increasing missing measurements. CommonFinder is stably accurate when the missing fraction is below 0.8, and gracefully degrades the accuracy as the missing fraction increases to 0.9. Finally, CommonFinder is significantly inaccurate after 90% of NCF probing results are missing. On the other hand, the proximity embedding approach degrades the accuracy significantly as the missing fraction increases. We can see that CommonFinder is quite robust to a wide range of missing measurements.
Fig. 9. Comparing the estimation accuracy of different methods on three data sets (x-axis: Data Sets (Facebook, YouTube, Flickr); y-axis: NMAE; bars: CommonFinder, LandmarkMDS, ProximityEmbed).
9.4.2. Robustness to incorrect measurements
Since some NCF-probing results could be incorrect due to the false positives of the Bloom filters, we next test the NCF-estimation accuracy with increasing percentage of incorrect NCF measurements. We select a fraction of NCF probing results uniformly at random and add an error constant e (e > 0) to the ground-truth NCF values. We vary the percentage of erroneous NCF results and the error constant e to verify the robustness against erroneous NCF observations. We set CommonFinder's parameters to be the default values in Table 3.
Fig. 11 shows the results. CommonFinder gracefully degrades the accuracy as the error constant e increases. On the other hand, ProximityEmbed significantly degrades the accuracy. Furthermore, CommonFinder and ProximityEmbed are less robust on the Facebook data set than on the other two data sets. This is because the NCF values on the Facebook data set are much lower than those on the other two data sets, which makes the NCF estimation results on the Facebook data set more sensitive to erroneous NCF probing results than on the other two data sets.
9.4.3. Accuracy of top-k prediction
We next compare the accuracy of estimating top-k highest NCF values. For each user i and a constant parameter k, we calculate the multiplicative ratio $\frac{\sum_{y \in \hat{S}_k} Y_{iy}}{\sum_{x \in S_k} Y_{ix}}$, where $S_k$ denotes the set of ground-truth top-k users with the highest NCF values, and $\hat{S}_k$ denotes the set of estimated top-k users with the NCF prediction methods. We can see that the ratio values are at most 1 and larger ratios lead to better prediction results.
From Fig. 12, we see that CommonFinder significantly outperforms both ProximityEmbed and LandmarkMDS methods. The multiplicative ratios for CommonFinder are mostly larger than 0.9 and become closer to one as we increase the number k of recommended users. This is because CommonFinder predicts pairwise NCF values much more accurately than ProximityEmbed and LandmarkMDS. Further, the ProximityEmbed is much more accurate than the LandmarkMDS. This is because ProximityEmbed uses the matrix factorization to adapt to the triangle inequality violations of the NCF values, while LandmarkMDS assumes the triangle inequality to hold for all triples, which however does not hold for social data sets.
9.5. Resilience to decentralized nodes
We next test whether CommonFinder is sensitive to erroneous coordinates due to the decentralization of users. We divide the overall set of users in the data sets into two equal halves. Only half of the users in the data sets are added to the simulator at the first round; some users' friends are in the second half of users and thus may not be added yet. The other half of the users join the system after 40 rounds. After that cycle, all links between friends in the data sets are present in the simulator. As a result, the erroneous coordinates are injected into the system after 40 rounds.
To quantify the resilience of the coordinate, we calculate the Coordinate Drift with the L1 norm defined as
$$\sum_{l=1}^{2dL} \left| x_i(l) - \tilde{x}_i(l) \right| \tag{22}$$
Fig. 10. Accuracy comparison by varying the percentage of missing measurements to neighbors. Panels: (a) Facebook, (b) YouTube, (c) Flickr (x-axis: Percent of Missing Measurements; y-axis: NMAE; curves: CommonFinder, ProximityEmbed, LandmarkMDS).
Fig. 11. Accuracy comparison by varying the percent of incorrect measurements. Panels: (a) Facebook, (b) YouTube, (c) Flickr (x-axis: Percent of Incorrect Measurements; y-axis: NMAE; curves: ProximityEmbed, CommonFinder and LandmarkMDS, each with e = 3 and e = 10).
Fig. 12. Accuracy of selecting two-hop users with the top-k NCF values. Panels: (a) Facebook, (b) YouTube, (c) Flickr (x-axis: Number of Recommendations; y-axis: Ratio; curves: ProximityEmbed, LandmarkMDS, CommonFinder).
Fig. 13. The effect of high-error coordinates on the rate of convergence. Panels: (a) Facebook, (b) YouTube, (c) Flickr (x-axis: Round; y-axes: NMAE and Drift; curves: Accuracy, Drift).
where $x_i$ and $\tilde{x}_i$ denote the updated coordinate and the last-round coordinate, respectively. We plot the mean coordinate drifts and the mean/standard deviations of the estimation errors in Fig. 13.
The first half of the users converge to stable coordinates within 20 rounds and keep steady until 40 rounds. Furthermore, the coordinate drifts are reduced to be below 0.1 after the coordinates are stabilized after 20 rounds.
When the other half of the users join the system after 40 rounds, the overall coordinate errors increase sharply, since the coordinates of newly-joined users are randomly initialized and thus may incur high errors. Furthermore, the coordinate drifts also increase due to the similar reasons. However, the whole set of coordinates converge within the next 20 rounds to stable positions. The newly stabilized coordinates have the similar accuracy as those before 40 rounds. Furthermore, the coordinate drifts also decrease to similar degrees as those before 40 rounds. As a result, the coordinate-update process is quite resilient to erroneous coordinates.
9.6. Parameter sensitivity of optimizing decentralized coordinate
We then test whether the coordinate adjustment process is sensitive to parameter settings. We vary one parameter and set the rest of CommonFinder's parameters to be the default values in Table 3. We report the mean NMAE values and the standard deviations.
9.6.1. Coordinate dimension
Fig. 14(a) plots the accuracy with increasing coordinate dimensions. CommonFinder accurately predicts NCFs even with low coordinate dimensions. The standard deviations are also quite low, which implies that the NCF estimation is quite stable.
9.6.2. Number of neighbors
Fig. 14(b) shows the dynamics of accuracy with increasing neighbors. CommonFinder becomes more accurate with increasing neighbors, but keeps stably accurate when the number of neighbors exceeds 20. Therefore, CommonFinder is able to predict accurate NCFs with a modest number of neighbors.
9.6.3. Regularized constant k
Fig. 14(c) plots the accuracy of CommonFinder by increasing regularized constant k. The accuracy decreases with increasing k for the YouTube and Flickr data sets, but keeps stably accurate when we vary k for the Facebook data set. Therefore, we should choose a smaller k for updating coordinates.
9.6.4. Choice of neighbors
We next evaluate the choice of neighbors on the accuracy of CommonFinder. We test four popular neighbor selection rules: (i) Random, we choose neighbors uniformly at random from the whole set of friends. (ii) Closest, we choose neighbors as friends that have the lowest NCFs. (iii) Farthest, we choose neighbors as friends that have the highest NCFs. (iv) Hybrid, we select half of the neighbors using the Closest based selection and the other half using the Farthest based selection. Fig. 14(d) shows the results. FB, YT and FR denote the Facebook, YouTube and Flickr data sets, respectively. The Random policy provides the highest accuracy, while the other three policies incur much higher errors than the Random policy, since the randomness in neighbors brings much higher diversity in the distribution of NCFs, which makes the optimization process robust to bad local minima.
Fig. 14. NMAE of CommonFinder as we vary its parameters. Panels: (a) NMAE as a function of the coordinate dimension; (b) NMAE as a function of the neighborhood size; (c) NMAE as a function of the parameter; (d) NMAE as a function of the choice of neighbors (Random, Closest, Farthest, Hybrid). Curves in (a)-(c): Facebook, YouTube, Flickr.
10. Extending the NCF computation to Sybil accounts
10.1. Sybil accounts
Prior work [34,35] broadly groups those fake accounts for spamming or performing privacy attacks as Sybil accounts. To launch the attacks, the Sybil accounts generally seek to integrate with the social graph by establishing friend links with normal users. Therefore, they have strong motivations to use fake friend lists to obtain information about other users. Further, these Sybil accounts usually collude together to establish friend links between themselves.
If Sybil accounts do not establish friend links with normal users, we can see that the CommonFinder protocol will not be affected, since each user adjusts its coordinate with friends. Therefore, our analysis assumes that Sybil accounts have established several friend links with normal users. We call these normal users infected ones.
10.2. Learn information about other users
We argue that the CommonFinder protocol does not leak sensitive privacy information to Sybil accounts. First, since we enforce the one-hop trustiness, non-infected users' friend lists will not be disclosed to Sybil accounts. Second, during the coordinate adjustment process, disclosing the Bloom filter and the coordinate of a user does not leak this user's privacy information due to the randomization brought by these two data structures.
10.3. Destabilize coordinate system
The coordinate system, however, may be sensitive to Sybil accounts when they inject fake information into the coordinate computation process. In order to destabilize the coordinates of normal users, the Sybil accounts need to have friend links with some normal users and then create fake Bloom filters and/or generate fake coordinate positions.
10.3.1. Fake Bloom filters
Fake Bloom filters bring incorrect NCF measurements. However, as discussed in the simulation section, CommonFinder is robust to the incorrect NCF measurements because of the inaccuracy brought by the Bloom filters. This is because the decentralized MMMF method used by CommonFinder mitigates the problem of overfitting by regularizing the coordinates with nuclear norms that is robust to noises [36].
10.3.2. Fake coordinates
Fake coordinates generally refer to those that are created by not following the CommonFinder protocol. Suppose that a Sybil user sends a fake coordinate to a friend B that is a normal user. As a normal user always has some other normal users as friends, user B will move its coordinates with both Sybil and normal neighbors. When there is only a small fraction of Sybil accounts for a normal user, this normal user's coordinate will keep converging to stable positions since the conjugate gradient optimization tries to balance the coordinates from Sybil and normal users.
As it is difficult to model the Sybil nodes' behaviors, in this paper we use a simple model to represent uncertain coordinates: to simulate the randomness of coordinate errors in decentralized settings, we randomize the coordinates of a set of online nodes. For each round of the coordinate update process of a node, we randomly choose a fraction p of neighbors to have random coordinates. By varying the percent p of random coordinates, we are able to study the impact of the degree of node dynamics on the coordinates' accuracy.
Fig. 15 plots the dynamics of CommonFinder's accuracy as we increase the percent of randomized coordinates. Increasing randomized coordinates only smoothly degrades the accuracy.
Fig. 15. The accuracy of the NCF prediction process as we increase the percent of random coordinates for each user. Panels: (a) Facebook, (b) YouTube, (c) Flickr (x-axis: Percent of Random Coordinates; y-axis: NMAE).
11. Related work
Talash [10] calculates the common friends by exchanging two friend lists, which however, does not scale well with increasing friends and also discloses sensitive personal information to unknown entities.
The Private Set Intersection (PSI) or private matching method [11-14] computes the intersection of two sets with privacy protection. Generally, each set is represented by a list of coefficients of a polynomial. Estimating the intersection of two sets is implemented as the arithmetic operations on the coefficients of two polynomials. To improve the privacy protection, the homomorphic encryption is enforced over the arithmetic operations in order to hide the coefficients from the other user. Unfortunately, the PSI methods need a large amount of bandwidth cost and computation cost, which renders them less practical for bandwidth-limited end hosts.
Song et al. [33] estimate diverse proximity metrics between users through a matrix factorization process that is computed in a centralized manner, which is impractical for distributed online social networks. Dong et al. [37] propose a secure and privacy-preserving dot product protocol for static coordinates computed by [33], in order to estimate the social proximity in mobile social networks.
Finally, network coordinate methods such as Vivaldi [29], LandmarkMDS [32] can be used to predict NCF values. Network coordinates embed hosts into a geometric space and estimate the pairwise distance of two nodes based on the corresponding coordinate distance. However, these decentralized coordinate approaches select neighbors that are used for updating coordinates uniformly at random from the whole set of nodes, which could leak the coordinate positions to non-friend users. On the contrary, CommonFinder selects neighbors from the friend list of each user, which avoids the disclosure of coordinates to unknown users.
12. Conclusion
To estimate common friends in distributed social networks scalably with privacy protection, we propose a distributed common-friends prediction method that combines the privacy-preserving common-friend measurement by using Bloom filters and the distributed common-friend estimation with decentralized coordinates. Simulation results show that our coordinate based method estimates the NCF values scalably and accurately with modest transmission costs. Besides, CommonFinder is also resilient to incomplete NCF measurements and noisy NCF measurements. We plan to implement our method as a plug-in for distributed social networks.
Acknowledgements
This work was supported by the National Grand Fundamental Research 973 Program of China (Grant No. 2011CB302601), the National Natural Science Foundation of China (Grant No. 61379052), the National High Technology Research and Development 863 Program of China (Grant No. 2013AA01A213), the Natural Science Foundation for Distinguished Young Scholars of Hunan Province (Grant No. S2010J5050), Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20124307110015).
References
[1] L.A. Cutillo, R. Molva, T. Strufe, Safebook: Feasibility of Transitive Cooperation for Privacy on a Decentralized Social Network, in: Proc. of IEEE WOWMOM'09, pp. 1-6.
[2] L.M. Aiello, G. Ruffo, LotusNet: tunable privacy for distributed online social network services, Comput. Commun. 35 (2012) 75-88.
[3] T. Xu, Y. Chen, J. Zhao, X. Fu, Cuckoo: Towards Decentralized, Socio-aware Online Microblogging Services and Data Measurements, in: Proc. of HotPlanet '10, pp. 4:1-4:6.
[4] Diaspora, Diaspora Alpha, 2011.
[5] S. Buchegger, D. Schiöberg, L.H. Vu, A. Datta, PeerSoN: P2P Social Networking: Early Experiences and Insights, in: Proc. of SNS '09, pp. 46-52.
[6] A. Shakimov, H. Lim, R. Cáceres, L.P. Cox, K.A. Li, D. Liu, A. Varshavsky, Vis-à-Vis: Privacy-preserving Online Social Networking via Virtual Individual Servers, in: Proc. of COMSNETS'11, pp. 1-10.
[7] D. Liben-Nowell, J. Kleinberg, The Link Prediction Problem for Social Networks, in: Proc. of CIKM '03, pp. 556-559.
[8] A. Mislove, B. Viswanath, K.P. Gummadi, P. Druschel, You are who you know: inferring user profiles in online social networks, in: Proceedings of the third ACM international conference on Web search and data mining, WSDM '10, pp. 251-260.
[9] L. Bilge, T. Strufe, D. Balzarotti, E. Kirda, All your contacts are belong to us: automated identity theft attacks on social networks, in: Proceedings of the 18th international conference on World wide web, WWW '09, pp. 551-560.
[10] R. Dhekane, B. Vibber, Talash: Friend Finding In Federated Social Networks, in: Proc. of LDOW2011.
[11] M.J. Freedman, K. Nissim, B. Pinkas, Efficient private matching and set intersection, in: Proc. of EUROCRYPT 2004, pp. 1-19.
[12] E.D. Cristofaro, P. Gasti, G. Tsudik, Fast and private computation of set intersection cardinality, Cryptology ePrint Archive, Report 2011/141, 2011.
[13] L. Kissner, D. Song, Privacy-Preserving Set Operations, in: Proc. of CRYPTO 2005, pp. 241-257.
[14] D. Dachman-Soled, T. Malkin, M. Raykova, M. Yung, Efficient robust private set intersection, in: Applied Cryptography and Network Security, vol. 5536, 2009, pp. 125-142.
[15] C. Dwork, Differential privacy: a survey of results, in: TAMC, pp. 1-19.
[16] F. McSherry, K. Talwar, Mechanism design via differential privacy, in: Proc. of FOCS '07, pp. 94-103.
[17] Y. Fu, Y. Wang, On the feasibility of common-friend measurements for the distributed online social networks, in: The 7th International ICST Conference on Communications and Networking in China (CHINACOM), 2012, pp. 24-29.
[18] Y. Fu, Y. Wang, BCE: A privacy-preserving common-friend estimation method for distributed online social networks without cryptography, in: The 7th International ICST Conference on Communications and Networking in China (CHINACOM), 2012, pp. 212-217.
[19] J.D.M. Rennie, N. Srebro, Fast maximum margin matrix factorization for collaborative prediction, in: Proc. of ICML'05, pp. 713-719.
[20] G.H. Golub, C.F. Van Loan, Matrix Computations, third ed., Johns Hopkins University Press, Baltimore, MD, USA, 1996.
[21] J.R. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, Technical Report, Pittsburgh, PA, USA, 1994.
[22] O. Goldreich, Foundations of Cryptography, Basic Applications, 2, Cambridge University Press, 2004.
[23] J. Becker, H. Chen, Measuring privacy risk in online social networks, in: Proc. of W2SP 2009: Web 2.0 Security and Privacy.
[24] B. Viswanath, A. Mislove, M. Cha, K.P. Gummadi, On the evolution of user interaction in Facebook, in: Proc. of WOSN '09, pp. 37-42.
[25] M. Cha, A. Mislove, K.P. Gummadi, A Measurement-driven analysis of information propagation in the Flickr Social Network, in: Proc. of WWW '09, pp. 721-730.
[26] A. Mislove, M. Marcon, K.P. Gummadi, P. Druschel, B. Bhattacharjee, Measurement and analysis of online social networks, in: Proc. of IMC'07.
[27] X. Su, T.M. Khoshgoftaar, A survey of collaborative filtering techniques, Adv. Artif. Intell. (2009).
[28] D. Guo, J. Wu, H. Chen, Y. Yuan, X. Luo, The dynamic bloom filters, IEEE Trans. Knowl. Data Eng. 22 (2010) 120-133.
[29] F. Dabek, R. Cox, F. Kaashoek, R. Morris, Vivaldi: a Decentralized Network Coordinate System, in: Proc. of ACM SIGCOMM 2004, pp. 15-26.
[30] J. Ledlie, P. Gardner, M.I. Seltzer, Network Coordinates in the Wild, in: Proc. of 4th USENIX Conference on Networked Systems Design and Implementation, USENIX Association, Cambridge, MA, 2007, pp. 22-22.
[31] Y. Liao, W. Du, P. Geurts, G. Leduc, DMFSGD: A Decentralized Matrix Factorization Algorithm for Network Distance Prediction, CoRR abs/1201.1174, 2012.
[32] B. Eriksson, P. Barford, R.D. Nowak, Estimating hop distance between arbitrary host pairs, in: Proc. of IEEE INFOCOM'09, pp. 801-809.
[33] H.H. Song, T.W. Cho, V. Dave, Y. Zhang, L. Qiu, Scalable proximity estimation and link prediction in online social networks, in: Proc. of IMC '09, pp. 322-335.
[34] G. Danezis, P. Mittal, Sybilinfer: Detecting sybil nodes using social networks, in: NDSS.
[35] Z. Yang, C. Wilson, X. Wang, T. Gao, B.Y. Zhao, Y. Dai, Uncovering social network sybils in the wild, in: Proceedings of the 2011 ACM SIGCOMM conference on Internet measurement conference, IMC '11, pp. 259-268.
[36] E.J. Candès, Y. Plan, Matrix completion with noise, Proc. IEEE 98 (2010) 925-936.
[37] W. Dong, V. Dave, L. Qiu, Y. Zhang, Secure friend discovery in mobile social networks, in: Proc. of IEEE INFOCOM'11.
Yongquan Fu received the B.S. degree in computer science and technology from the School of Computer of Shandong University, China, in 2005, and received the M.S. in Computer Science and technology in the School of Computer Science of National University of Defense Technology, China, in 2008. He is currently an assistant professor in the School of Computer Science of National University of Defense Technology. He is a member of CCF and ACM. His current research interests lie in the areas of cloud management, distributed online social networks and distributed algorithms.
Yijie Wang received the PhD degree from the National University of Defense Technology, China in 1998. She was a recipient of the National Excellent Doctoral Dissertation (2001), a recipient of Fok Ying Tong Education Foundation Award for Young Teachers (2006) and a recipient of the Natural Science Foundation for Distinguished Young Scholars of Hunan Province (2010). Now she is a Professor in the National Key Laboratory for Parallel and Distributed Processing, National University of Defense Technology. Her research interests include network computing, massive data processing, parallel and distributed processing.
Wei Peng received the M.Sc. and the Ph.D. degrees in Computer Science from the National University of Defense Technology, China, in 1997 and 2000 respectively. From 2001 to 2002, he was a research assistant at the School of Computer from the National University of Defense Technology, China, then a research fellow from 2003. He was a visiting scholar of University of Essex, UK, from May 2007 to April 2008. His major interests are internet routing, network science, and mobile wireless networks.