walking in facebook: a case study of unbiased sampling of osns 2010.6.9 junction
Post on 12-Jan-2016
231 Views
Preview:
TRANSCRIPT
WALKING IN FACEBOOK:A CASE STUDY OF UNBIASED SAMPLING OF OSNS
2010.6.9 junction
Outline
Motivation and Problem Statement Sampling Methodology Evaluation of Sampling Techniques Facebook Data Analysis Conclusion
Online Social Networks (OSNs) A network of declared
friendships between users, allowing users to maintain relationships
Many popular OSNs with different focus Facebook, LinkedIn, Flickr, …
C
A
EGF
BD
H
Facebook More than 400 million active users 50% of them log on to Facebook in any given day Average user has 130 friends People spend over 500 billion minutes per month on
Facebook more than 100 million mobile users Mobile user are twice more active than non-mobile users.
Social Graph• undirected graph• G = (V, E)• V: nodes (users)• E: edges (relationships)• kv : node degree
Why Sample OSNs?
Representative samples desirable study properties test algorithms
Obtaining complete dataset difficult companies usually unwilling to share data tremendous overhead to measure all
(~100TB for Facebook)
Problem Statement
Obtain a representative sample of users in a given OSN by exploration of the social graph. Uniform sample of Facebook users explore graph using various crawling
techniques
Outline
Motivation and Problem Statement Sampling Methodology
Crawling Methods Convergence Evaluation Data Collection
Evaluation of Sampling Techniques Facebook Data Analysis Conclusion
Crawling Methods
Crawling Methods Breadth First Search (BFS) Random Walk (RW) Re-Weighted Random Walk (RWRW) Metropolis-Hastings Random Walk (MHRW) Uniform Sampling (UNI)
Breadth First Search (BFS)
Early measurement studies of OSNs use BFS as primary sampling technique
Starting from a seed, explores all neighbor nodes.
As this method discovers all nodes within some distance from the starting point, an incomplete BFS is likely to densely cover only some specific region of the graph.
BFS leads to bias towards high degree nodes
C
A
EGF
BD
H
Unexplored
Explored
Visited
Random Walk (RW)
Explores graph one node at a time with replacement
In the stationary distribution
biased towards higher degree nodes (πv ~ kv)
C
A
EGF
BD
H
1/3
1/3
1/3
Next candidate
Current node
,
1RWwP k
Degree of node υ
2
k
E
Number of edges
Re-Weighted Random Walk (RWRW)
Corrects for degree bias at the end of collection Without re-weighting, the probability distribution
for node property A is: (e.g. the degree, network size ...)
Re-Weighted probability distribution :1/
( )1/iu A u
iu V u
kp A
k
Degree of node u
15
2)
5
1
3
1
3
1(1 MH
AAP
Metropolis-Hastings Random Walk (MHRW)
Explore graph one node at a time with replacement
In the stationary distribution
Exactly the uniform distribution
C
A
EGF
BD
H
1/3
1/5
1/3
Next candidate
Current node
2/15
5
1
5
3
3
1MH
ACP
,
,
1min(1, ) if neighbor of
1 if =
MH ww
MHy
y
kw
k kPP w
1
V
Uniform Sampling (UNI)
As a basis for comparison (ground truth) Rejection sampling
uniform sampling of on the 32-bit IDs discarding the non-existing ones yields a uniform sample of the existing user IDs in
Facebook for any allocation policy (i.e. even if the userIDs are not evenly allocated in the 32-bit address space)
UNI not a general solution for sampling OSNs userID space must not be sparse names instead of numbers must be supported by the systems
Convergence Detection
Number of samples (iterations) to loose dependence from starting points?
Convergence Evaluation
Using Multiple Parallel Walks to improve convergence avoid getting trapped in certain region starting from 28 different randomly chosen initial
nodes Detecting Convergence with Online Diagnostics
sampling longer and discard a number of initial “burn-in” iterations Consumed BW (TB) and measurement time (days) Crucial to decide appropriate ‘burn-in’ and total running
time Grweke Diagnostic Gelman-Rubin Diagnostic
Geweke Diagnostic
Detect the convergence of a single Markov chain
With increasing number of iterations, Xa and Xb move further apart, which limits the correlation between them.
according to the law of large numbers, the z values become normally distributed ~ (0, 1)
Declare convergence when most values fall in the [-1,1] interval
Xa Xb ( ) ( )
( ) ( )a b
a b
E X E Xz
Var X Var X
Walk 1
Walk 2
Walk 3
1 1n m BR
n mn W
Between walks variance
Within walks variance
Gelman-Rubin Diagnostic
Detects convergence for m>1 walks (m: # of chains)
Compare the empirical distributions of individual chains with the empirical distribution of all sequences together
if they are similar enough (R,1.02) , declare convergence
Data Collection
Information collected
UserIDName NetworksPrivacy settings
Friend ListUserIDName
NetworksPrivacy Settings
UserIDName NetworksPrivacy settings
u
1111
Profile PhotoAdd as Friend
Regional School/Workplace
UserIDName NetworksPrivacy settings
View FriendsSend Message
Data Collection
Summary of data set• 28 x 81K = 2.26 M• 28 initial starting nodes• crawl until exactly 81K samples are collected
• repeat the same node in a walk• # of rejected nodes without repetition : 645 K
• 18.53M nodes picked uniform from [1, 232]• only 1216 K users existed• 228 K users had zero friends
• RW: 97 % nodes are unique• BFS: 97 % nodes are unique• confirms that the random seeding chose different areas of FB
Outline
Motivation and Problem Statement Sampling Methodology Evaluation of Sampling Techniques
Convergence Analysis Methods Comparison Unbiased Estimation
Facebook Data Anaylsis Conclusion
What is a fair way to compare the results of MHRW with RW and BFS? MHRW visits fewer unique nodes than RW and
BFS MHRW stays at some nodes for relatively long
time/iterations Happens usually at some low degree node
An appropriate practical comparison should be based on the number of visited unique nodes
Convergence Analysis
,
,
1min(1, ) if neighbor of
1 if =
MH ww
MHy
y
kw
k kPP w
3
2
3
11
3
1)
3
1,1min(
1
1
,MHCAp
Node Degree
Convergence Test
When does it reach equilibrium?
Burn-in determined to be 3K -> discard 6K
• converge when all 28 values fall in the [-1, 1] interval• 500 iterations• converge when all R scores drop below 1.02• (0,1): not in / in• 3000 iterations
Methods Comparison
MHRW, RWRW produce good in estimating the probability of a node degree The degree
distribution will converge fast to a good uniform sample
Poor performance for BFS, RW
28 crawls
Unbiased Estimation (BFS, RW) Node degree distribution
introduce a strong bias towards the high degree nodes
the low-degree nodes are under-represented
Unbiased Estimation (MHRW) Degree distribution identical to UNI (MHRW,
RWRW)
Outline
Motivation and Problem Statement Sampling Methodology Evaluation of Sampling Techniques Facebook Data Analysis Conclusion
FB Social Graph – degree distribution
Degree distribution not a power law
a2=3.3
8
a1=1.3
2
FB Social Graph - Assortativity Assortativity
nodes tend to connect to similar or different nodes?
positive correlation: high degree nodes tend to connect to other high degree nodes
FB Social Graph – Privacy Awareness
Outline
Motivation and Problem Statement Sampling Methodology Evaluation of Sampling Techniques Facebook Data Analysis Conclusion
Conclusion
Compared graph crawling methods MHRW, RWRW performed remarkably well BFS, RW lead to substantial bias
Practical recommendations correct for bias usage of online convergence diagnostics proper use of multiple chains
top related