minas gjoka, uc irvinewalking in facebook 1 walking in facebook: a case study of unbiased sampling...
TRANSCRIPT
![Page 1: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/1.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 1
Walking in Facebook:A Case Study of Unbiased
Sampling of OSNs
Minas Gjoka, Maciej Kurant ‡, Carter Butts, Athina Markopoulou
UC Irvine, EPFL ‡
![Page 2: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/2.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 2
Outline
• Motivation and Problem Statement• Sampling Methodology• Data Analysis• Conclusion
![Page 3: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/3.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 3
Online Social Networks (OSNs)
• A network of declared friendships between users
• Allows users to maintain relationships
• Many popular OSNs with different focus– Facebook, LinkedIn, Flickr, …
C
A
EGF
BD
H
Social Graph
![Page 4: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/4.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 4
Why Sample OSNs?
• Representative samples desirable– study properties– test algorithms
• Obtaining complete dataset difficult– companies usually unwilling to share data – tremendous overhead to measure all (~100TB
for Facebook)
![Page 5: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/5.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 5
Problem statement
• Obtain a representative sample of users in a given OSN by exploration of the social graph. – in this work we sample Facebook (FB)– explore graph using various crawling
techniques
![Page 6: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/6.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 6
Related Work
• Graph traversal (BFS)– A. Mislove et al, IMC 2007– Y. Ahn et al, WWW 2007– C. Wilson, Eurosys 2009
• Random walks (MHRW, RDS)– M. Henzinger et al, WWW 2000– D. Stutbach et al, IMC 2006– A. Rasti et al, Mini Infocom 2009
![Page 7: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/7.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 7
Our Contributions• Compare various crawling techniques in FB’s social
graph – Breadth-First-Search (BFS)– Random Walk (RW)– Metropolis-Hastings Random Walk (MHRW)
• Practical recommendations– online convergence diagnostic tests– proper use of multiple parallel chains– methods that perform better and tradeoffs
• Uniform sample of Facebook users– collection and analysis– made available to researchers
![Page 8: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/8.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 8
Outline
• Motivation and Problem Statement• Sampling Methodology
– crawling methods– data collection– convergence evaluation– method comparisons
• Data Analysis• Conclusion
![Page 9: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/9.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 9
(1) Breadth-First-Search (BFS)
C
A
EGF
BD
H
Unexplored
Explored
Visited
• Starting from a seed, explores all neighbor nodes. Process continues iteratively without replacement.
• BFS leads to bias towards high degree nodes
Lee et al, “Statistical properties of Sampled Networks”, Phys Review E, 2006
• Early measurement studies of OSNs use BFS as primary sampling techniquei.e [Mislove et al], [Ahn et al], [Wilson et al.]
![Page 10: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/10.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 10
(2) Random Walk (RW)
C
A
EGF
BD
H
1/3
1/3
1/3
Next candidate
Current node
• Explores graph one node at a time with replacement
• In the stationary distribution
,
1RWwP k
2
k
E
Degree of node υ
Number of edges
![Page 11: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/11.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 11
(3) Re-Weighted Random Walk (RWRW)
Hansen-Hurwitz estimator • Corrects for degree bias at the end of collection
• Without re-weighting, the probability distribution for node property A is:
• Re-Weighted probability distribution :
1/( )
1/iu A u
iu V u
kp A
k
1 | |( )
1 | |iu A i
iu V
Ap A
V
Subset of sampled nodes with value i
All sampled nodes
Degree of node u
![Page 12: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/12.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 12
(4) Metropolis-Hastings Random Walk (MHRW)
• Explore graph one node at a time with replacement
• In the stationary distribution
,
,
1min(1, ) if neighbor of
1 if =
MH ww
MHy
y
kw
k kPP w
C
A
EGF
BD
H
1/3
1/5
1/3
Next candidate
Current node
2/15
1
V
5
1
5
3
3
1MH
ACP 15
2)
5
1
3
1
3
1(1 MH
AAP
![Page 13: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/13.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 13
Uniform userID Sampling (UNI)
• As a basis for comparison, we collect a uniform sample of Facebook userIDs (UNI)– rejection sampling on the 32-bit userID space
• UNI not a general solution for sampling OSNs– userID space must not be sparse– names instead of numbers
![Page 14: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/14.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 14
Summary of DatasetsSampling method MHRW RW BFS UNI
#Valid Users 28x81K28x81K
28x81K 984K
# Unique Users 957K 2.19M 2.20M 984K
• Egonets for a subsample of MHRW- local properties of nodes
• Datasets available at: http://odysseas.calit2.uci.edu/research/osn.html
![Page 15: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/15.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 15
Data CollectionBasic Node Information
• What information do we collect for each sampled node u?
UserIDName NetworksPrivacy settings
Friend ListUserIDName
NetworksPrivacy Settings
UserIDName NetworksPrivacy settings
u
1111
Profile PhotoAdd as Friend
Regional School/Workplace
UserIDName NetworksPrivacy settings
View FriendsSend Message
![Page 16: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/16.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 16
Detecting Convergence
• Number of samples (iterations) to loose dependence from starting points?
![Page 17: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/17.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 17
Online Convergence Diagnostics
Geweke• Detects convergence for a single walk. Let X be a
sequence of samples for metric of interest.
J. Geweke, “Evaluating the accuracy of sampling based approaches to calculate posterior moments“ in Bayesian Statistics 4, 1992
Xa Xb
( ) ( )
( ) ( )a b
a b
E X E Xz
Var X Var X
![Page 18: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/18.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 18
Online Convergence Diagnostics
Gelman-Rubin• Detects convergence for m>1 walks
A. Gelman, D. Rubin, “Inference from iterative simulation using multiple sequences“ in Statistical Science Volume 7, 1992
Walk 1
Walk 2
Walk 3
1 1n m BR
n mn W
Between walksvariance
Within walksvariance
![Page 19: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/19.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 19
When do we reach equilibrium?
Burn-in determined to be 3K
Node Degree
![Page 20: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/20.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 20
Methods ComparisonNode Degree
• Poor performance for BFS, RW
• MHRW, RWRW produce good estimates – per chain– overall
28 crawls
![Page 21: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/21.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 21
Sampling BiasBFS
• Low degree nodes under-represented by two orders of magnitude
• BFS is biased
![Page 22: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/22.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 22
Sampling Bias MHRW, RW, RWRW
• Degree distribution identical to UNI (MHRW,RWRW)• RW as biased as BFS but with smaller variance in each
walk
![Page 23: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/23.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 23
Practical Recommendations for Sampling Methods
• Use MHRW or RWRW. Do not use BFS, RW.
• Use formal convergence diagnostics– assess convergence online– use multiple parallel walks
• MHRW vs RWRW– RWRW slightly better performance– MHRW provides a “ready-to-use” sample
![Page 24: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/24.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 24
Outline
• Motivation and Problem Statement• Sampling Methodology• Data Analysis• Conclusion
![Page 25: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/25.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 25
FB Social GraphDegree Distribution
• Degree distribution not a power law
a2=3.38
a1=1.32
![Page 26: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/26.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 26
FB Social GraphTopological Characteristics
• Our MHRW sample – Assortativity Coefficient = 0.233– range of Clustering Coefficient [0.05, 0.35]
• [Wilson et al, Eurosys 09] – Assortativity Coefficient = 0.17– range of Clustering Coefficient [0.05, 0.18]
• More details in our paper and technical report
![Page 27: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/27.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 27
Conclusion• Compared graph crawling methods
– MHRW, RWRW performed remarkably well– BFS, RW lead to substantial bias
• Practical recommendations– correct for bias– usage of online convergence diagnostics– proper use of multiple chains
• Datasets publicly available– http://odysseas.calit2.uci.edu/research/osn.html
![Page 28: Minas Gjoka, UC IrvineWalking in Facebook 1 Walking in Facebook: A Case Study of Unbiased Sampling of OSNs Minas Gjoka, Maciej Kurant ‡, Carter Butts,](https://reader030.vdocuments.mx/reader030/viewer/2022032704/56649d825503460f94a6734b/html5/thumbnails/28.jpg)
Minas Gjoka, UC Irvine Walking in Facebook 28
Thank you Questions?