overview and comparison of random walk based …cgormley/costnet/ribnomeeting...tour based...
TRANSCRIPT
Overview and comparison of random walk
based techniques for estimating network
averages
Konstantin Avrachenkov (Inria, France)
Ribno COSTNET Conference, 21 Sept. 2016
Motivation
Analysing (online) social networks one would like to know:
I How young is given social network?
I How many friends has an average network member?
I What proportion of population supports some political party?
I etc
Motivation
All such questions are related to the problem of estimating anaverage of a function f (·) defined on the network nodes.
Let G = (V , E), with |V| = n, |E| = m, be an undirectedgraph representing a social network.
Then, we are interested to estimate
µ(G) =1
|V|∑v∈V
f (v). (1)
Constraints
Clearly, we are interested in the case when the topology of thenetwork is not known and complete crawl of the network is notpossible.
E.g., in online social networks, crawling is subject to an APIlimit on the number of requests per minute.
A standard Twitter account can make no more than onerequest per minute.
With this rate, we would crawl the entire Twitter socialnetwork in 950 years...
Snowball vs Random Walk
There are two main approaches to organize network sampling:
I Snowball sampling: sample all neighbours of a newlydiscovered node;
I Random Walk sampling: sample just one neighbour of anewly discovered node;
Our focus will be on the random walk based methods, sincesnowball sampling quickly requires excessive amount ofresources and is biased by the principal eigenvector of theadjacency matrix.(M. Newman, Networks, 2010; A. Maiya & T. Berger-Wolf,2010)
Random Walk on Graph (Background)
The (discrete-time) Standard Random Walk {Xk , k = 0, 1, ...}is defined by the transition probabilities
P(Xk+1 = j | Xk = i) = pij =
{1/di , if j is a neighbour of i,0, otherwise.
where di is the degree of node i .
Random Walk on Graph (Background)
Assuming the network is connected, the stationary distributionof the standard Random Walk has a simple expression
πi =di
2m.
This stationary distribution is achieved roughly after therelaxation time
trel =1
1− |λ2|,
where λ2 is the second largest by modulus the eigenvalue ofthe transition matrix P .
We also have
||P(Xk = i)− πi || ≤ C |λ2|k , k = 1, 2, ...
Metropolis-Hastings Sampling
Since the standard random walk is biased towards large degreenodes, it might not be a good idea to use directly theestimator
µ(N) =1
N
N∑k=1
f (Xk).
One way around this problem is to use Metropolis-Hastingschain with the following transition matrix PMH :
PMHij =
{1
max(di ,dj )if j 6= i
1−∑k 6=i1
max(di ,dk )if j = i .
Metropolis-Hastings Sampling
By using the CLT for MCs (see e.g., Bremaud 1999), one canshow the following central limit theorem for MH Chain.
Proposition(Central Limit Theorem for MH) For MH Markov chain, itholds that
√N(µ(N)MH(G)− µ(G)
)D−→ N (0, σ2
MH), as N →∞,
where σ2MH = σ2
ff = 2nfTZf − 1
nfT f −
(1nfT1)2
and whereZ = [I− P + 1πT ]−1 is the fundamental matrix.
In the context of online social networks, the use of MHestimator was first proposed in (M. Gjoka et al, 2010).
Respondent Driven Sampling (RDS)
MH approach is known to be not very efficient because offrequent resampling.
D. Heckathorn and co-authors in early 2000’s proposed to usethe standard random walk but to unbias the estimator in thefollowing way:
µ(N)RDS (G) =
∑Nt=1 f (Xt)/d(Xt)∑N
t=1 1/d(Xt):=
∑Nt=1 f
′(Xt)∑N
t=1 g(Xt), (2)
Respondent Driven Sampling (RDS)
Using 2D CLT for MCs from (E. Nummelin, 2002), we canshow (our CSoNet’16 paper) that the RDS estimator isasymptotically consistent with a given asymptotic variance.
PropositionThe RDS estimate µRDS (G) satisfies
√N(µ
(N)RDS (G)− µ(G))
D−→ N (0, σ2RDS ),
with σ2RDS given by
σ2RDS = d2
av
(σ21 + σ2
2µ2(G)− 2µ(G)σ2
12
),
where σ21 = 1
|E |fTZf
′ − 12|E |∑
xf (x)2
d(x)−(
12|E |f
T1)2,
σ22 = σ2
gg = 1|E |1
TZg − 12|E |g
T1− ( 1dav
)2 and
σ212 = 1
2|E |fTZg + 1
2|E |1TZf
′ − 12|E |f
Tg − 1dav
12|E |1
T f.
RDS Variance vs MH Variance
0 2000 4000 6000 8000 10000
Budget B
0
1
2
3
4
5M
SE×B
MH-MCMC
Asymp. variance of MH-MCMC
RDS
Asymp. variance of RDS
RL technique
Tour based estimators
One very fruitful idea is to use tours for the construction ofestimators. This idea goes back to the works (L. Massoulie etal, 2006) and (C. Cooper et al, 2013).
For instance, suppose we would like to estimate the number ofedges in the network.
Consider the first return time to node i
T+i = min{t > 0 : St = i & S0 = i}.
The expected value of the first return time is given by
E [T+i ] =
1
πi=
2m
di.
Tour based estimators
Let Rk =∑k
j=1 Tk be the time of the k-th return to node i .Then, we can use the following estimator for the number ofedges
m =diRk
2k.
This idea can be easily extended to estimate a large variety ofnetwork characteristics.
Twitter as example
Twitter as example
Assuming that a rough estimation of the number of users is500 · 106 and the average number of followers per user is 10,the expected return time from the nodes like “Katy Perry” or“Justin Bieber” is about 2 · 10 · 500 · 106/50 · 106 = 200.
To obtain a decent error (≤ 5%), we need about 1000samples, and hence in total about 200000 operations. This isorders of magnitude less than the size of the Twitter followergraph!
Tour based estimators
In the tour-based estimators we return just to one node. Ofcourse, hitting a set of several nodes should be much easier.
The problem is that the process becomes not Markovian...
Fortunately, there are at least two solutions to this problem.
Super-node idea
The first solution is... to change the problem...
ba
cd
e
f
gi
h
kl
mn
pq
r
Figure: Original network
⇒
ba
d
e
f
i
h
l
mn
p
r
Figure: Modified network with asuper-node
Super-node idea
Using the results from (F. Chung, 1997), we can show that
PropositionThe random walk on the modified graph with super-node hasa smaller mixing time with respect to the walk on the originalgraph.
Super-node idea
The super-node can be grown in time (with some care!) toinclude for instance large degree nodes found during the tours.
More details in (K.A., B. Ribeiro and J. Sreedharan, ACMSigmetrics 2016).
Super-node idea
The second idea is based on reinforcement learning (ourCSoNet 2016 paper).
Define Yn := Xτn for τn := successive times to visit some setof nodes V0.
Then {(Yn, τn)} is a semi-Markov process on V0.
Super-node idea
Let ξ := min{n > 0 : Xn ∈ V0} and define
Ti := Ei [ξ],
h(i) := Ei
[ξ∑
m=1
f (Xm)
], i ∈ V0.
Then (see e.g., Ross, 2013), the Poisson equation for thesemi-Markov process (Yn, τn) is
V (i) = h(i)− βTi +∑j∈V0
pY (j |i)V (j), i ∈ V0. (3)
Here β is the desired stationary average of f .
Super-node idea
The reinforcement learning algorithm for the solution of thePoisson equation works as follows:
Let {z} be IID uniform on V0. For each n ≥ 1, generate anindependent copy {X n
m} of {Xm} with X n0 = z for
0 ≤ m ≤ ξ(n) := the first return time to V0.
A reinforcement learning step is then
Vn+1(i) = Vn(i) + a(n)I{z = i}× ξ(n)∑m=1
f (X nm)
− Vn(i0)ξ(n) + Vn(X nξ(n))− Vn(i)
,(4)
where a(n) > 0 are stepsizes satisfying∑n a(n) =∞, ∑n a(n)2 <∞.
Super-node idea
0 2000 4000 6000 8000 10000
Budget B
0.0
0.2
0.4
0.6
0.8
1.0N
RM
SEMH-MCMC
RDS
RL technique: |V0| = 5000, a(n) = 1/(d n1000e)
RL technique: |V0| = 2500, a(n) = 1/(d n1000e)
RL technique: |V0| = 1000, a(n) = 1/(d n1000e)
More details in (K.A., V. Borkar, A. Kadavankandy and J.Sreedharan, CSoNet 2016).
Tour estimator with super-node
A very promising idea is to combine tour-type RDS estimatorswith super-node:
0 2000 4000 6000 8000 10000
Budget B
0.00
0.01
0.02
0.03
0.04
0.05
0.06
NR
MS
E
Tour technique
RDS
0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
×104
0
1
2
3×10−3
Friendster graph:
Estimate1
|V |∑
u∈VI{d(u) > 50}
RDS
Historical interlude
Conclusion
Unfortunately, there is yet no unifying theory for networksampling.
Fortunately, there is a number of interesting open questions:
I Can we do a theoretical analysis for NMSE which includesboth bias and asymptotic variance?
I Can we analyse main sampling schemes on some typicalrandom graph models? Is there the best sampling algorithmfor a given random graph model?
I Is there a substantial effect of network function on theperformance of an estimator?
I Is there a scheduling approach better than snowball andrandom walk? Some compromise or generalization?
I Is there a better use of information in case of subsampling?
References
I Avrachenkov, K., Ribeiro, B. and Sreedharan, J.K. Inferencein OSNs via Lightweight Partial Crawls. In Proceedings ofthe 2016 ACM Sigmetrics 2016.
I Avrachenkov, K., Borkar, V.S., Kadavankandy, A. andSreedharan, J.K. Comparison of Random Walk BasedTechniques for Estimating Network Averages. In Proceedingsof CSoNet 2016.
I Avrachenkov, K., Neglia, G. and Tuholukova, A. Subsamplingfor Chain-Referral Methods. In Proceedings of ASMTA 2016.
I Bremaud, P. Markov Chains: Gibbs Fields, Monte CarloSimulation, and Queues. Springer, 1999.
References
I Chung, F.R., 1997. Spectral Graph Theory. American Math.Soc., 1997.
I Cooper, C., Radzik, T. and Siantos Y. Fast Low-CostEstimation of Network Properties Using Random Walks. InProceedings of WAW 2013.
I Gjoka, M., Kurant, M., Butts, C.T. and Markopoulou, A.Walking in Facebook: A Case Study of Unbiased Sampling ofOSNs. In Proceedings of IEEE Infocom 2010.
I Granovetter, M. Network Sampling: Some First Steps.American Journal of Sociology, 81(6), pp.1287-1303, 1976.
References
I Massoulie, L., Le Merrer, E., Kermarrec, A.M. and Ganesh,A. Peer Counting and Sampling in Overlay Networks:Random Walk Methods. In Proceedings of PODC 2006.
I Maiya, A.S. and Berger-Wolf, T.Y. Sampling CommunityStructure. In Proceedings of WWW 2010.
I Newman, M. Networks: An Introduction. Oxford universitypress, 2010.
I Nummelin, E. MC’s for MCMC’ists. International StatisticalReview, 70(2), pp.215-240, 2002.
I Salganik, M.J. and Heckathorn, D.D. Sampling andEstimation in Hidden Populations Using Respondent-DrivenSampling. Sociological Methodology, 34(1), pp.193-240,2004.
Thank you!
Any questions and suggestions are welcome.