Stochastic Blockmodel and Community Detection
Lin Zhang
September 25, 2015
Lin Zhang SBM and CD September 25, 2015 1 / 24
Outline
1 Introduction to Stochastic Blockmodel
2 Community number detection methods
  Network cross-validation
  Likelihood-based model selection
  Likelihood-based cross-validation method
Introduction to SBM
Stochastic Block Model
Network analysis has quickly become a key tool for modern data analysis, with wide applications from neuroscience to sociology to biochemistry.
The Stochastic Block Model (SBM) gives a simple community structure.
Holland, Laskey and Leinhardt (1983): "A stochastic model is proposed for social networks in which the actors (vertices) in a network are partitioned into subgroups called blocks (communities)."
SBM is a probabilistic model that assigns a connection probability to each pair of objects (i, j) in the network.
Why SBM?
1 SBM is usually the first model to use because of its simplicity.
2 The basic idea is that objects with similar (or identical) connectivity can be grouped together.
3 The initial graph can be reduced to a simpler one without losing too much information.
All of the community detection discussed here is based on SBM.
A Toy graph for SBM
Community Detection
What is community detection? The procedure for finding the community structure (clustering) of a network.
1 Community Recovery
2 Community Number Detection
The goal is to illuminate the overall network structure.
The two tasks are two sides of the same coin, but they differ in importance.
Basic Setting of SBM
The most basic version of SBM is defined by a scalar value and two simple data structures:
Graph G = (V, E): vertex set V and edge set E.
K: community number.
A: adjacency matrix, an n × n symmetric binary matrix.
φ: labeling function with φ(i) ∈ {1, ..., K}.
B: connectivity matrix, a K × K stochastic block matrix, where B_{φ(i)φ(j)} gives the probability that a vertex i is connected to a vertex j.
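A minimal sketch of this setting, assuming independent edges drawn as Bernoulli(B_{φ(i)φ(j)}) with no self-loops. The function name `sample_sbm`, the seed argument and the example values of B are illustrative choices, and labels are 0-based here rather than 1, ..., K.

```python
import random

def sample_sbm(labels, B, seed=0):
    """Draw a symmetric binary adjacency matrix A from an SBM.

    labels[i] is the community phi(i), coded 0..K-1 (0-based, an assumption);
    B[k][m] is the probability of an edge between communities k and m.
    """
    rng = random.Random(seed)
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only, so no self-loops
            if rng.random() < B[labels[i]][labels[j]]:
                A[i][j] = A[j][i] = 1  # mirror to keep A symmetric
    return A

labels = [0] * 5 + [1] * 5   # K = 2 communities of 5 nodes each
B = [[0.8, 0.1],
     [0.1, 0.8]]             # illustrative within- and cross-group densities
A = sample_sbm(labels, B, seed=1)
```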
Basic Setting (Cont.)
Common assumptions for community detection under SBM:
Statistical equivalency.
Cross-group density q and within-group density p, with p − q := δ > 0.
n_min := min_{1≤l≤K} |φ^{-1}(l)|: minimum community size.
Consistency of the estimators depends on conditions on δ and n_min.
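The two quantities can be computed directly from a labeling and a connectivity matrix. The sketch below is only illustrative: the helper names are hypothetical, and when B has unequal entries it takes p as the smallest diagonal entry and q as the largest off-diagonal entry, which is an assumption and not something the slides specify. It needs K ≥ 2.

```python
from collections import Counter

def min_community_size(labels):
    """n_min: size of the smallest community that appears in labels."""
    return min(Counter(labels).values())

def density_gap(B):
    """delta = p - q, with p the smallest diagonal entry of B and q the
    largest off-diagonal entry (a conservative convention, assumed here)."""
    K = len(B)
    p = min(B[k][k] for k in range(K))
    q = max(B[k][m] for k in range(K) for m in range(K) if k != m)
    return p - q

n_min = min_community_size([0, 0, 0, 1, 1, 2, 2, 2, 2])
delta = density_gap([[0.8, 0.1],
                     [0.1, 0.7]])
```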
Types of SBM
SBM gives a versatile way to infer the underlying structure of a graph.
1 The graph can be directed or undirected.
2 The graph can be binary or weighted.
3 The model can rely on a labeling function or assume unknown labels.
The SBM discussed here is undirected, binary and has a labeling function. Extensions include the degree-corrected blockmodel.
Community Detection
Two Focuses of Community Detection
1 Community recovery: for a fixed community number K, construct a variety of algorithms to estimate (φ, B).
2 Community number detection: find the best K under certain criteria: BIC, CL-BIC, maximized likelihood or minimized loss function.
Community Recovery Algorithms
Greedy algorithm (a.k.a. hierarchical agglomeration): computationally feasible, but low statistical accuracy.
Profile likelihood method: consistent when K is fixed, but NP-hard.
Spectral clustering: popular algorithm, fast computation, easy implementation, works well for both dense and sparse graphs.
Spielman and Teng (1996) and Goldenberg et al. (2010) provide comprehensive reviews of spectral clustering and SBM, respectively.
Community Number Detection
1 K = 1: Erdős-Rényi graph. Hypothesis test: K = 1 vs K > 1.
2 Composite likelihood BIC (CL-BIC). D. F. Saldana, Y. Yu and Y. Feng (2014) propose composite likelihood BIC to select the number of communities.
Community Number Detection
1 Network cross-validation. K. Chen & J. Lei (2014) propose a block-wise edge splitting technique. This technique can be combined with an integrated step of community recovery using sub-blocks of the adjacency matrix.
2 Likelihood-based model selection for SBM. Y. X. Rachel Wang and Peter J. Bickel (2015) consider an approach based on the log likelihood ratio statistic and analyze its asymptotic properties under model misspecification.
I. Network Cross-validation Algorithm
Input: adjacency matrix A, K, training block size n1.
1 Randomly split A into blocks
$$A = \begin{pmatrix} A^{(11)} & A^{(12)} \\ A^{(21)} & A^{(22)} \end{pmatrix},$$
where $A^{(11)}$ contains edges between training nodes, $A^{(22)}$ contains edges between test nodes, and $A^{(12)}$ (with $A^{(21)}$ its transpose) contains edges between training and test nodes.
2 Estimate model parameters $(\hat\phi, \hat B)$ using the rectangular sub-matrix $A^{(1)} = (A^{(11)}, A^{(12)})$.
3 Output is the predictive loss evaluated on $A^{(22)}$:
$$L(\hat\phi, \hat B) = \sum_{i,j \in A^{(22)},\ i \neq j} -\ell(\hat\phi, \hat B)$$
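The splitting step can be sketched as follows, assuming a uniformly random choice of n1 training nodes. The function name `split_blocks` and the small path-graph example are hypothetical and only illustrate how the three blocks are indexed.

```python
import random

def split_blocks(A, n1, seed=0):
    """Pick n1 training nodes at random and return the node index sets
    together with the sub-blocks A11, A12 and A22 of A."""
    rng = random.Random(seed)
    n = len(A)
    perm = list(range(n))
    rng.shuffle(perm)
    train, test = sorted(perm[:n1]), sorted(perm[n1:])
    A11 = [[A[i][j] for j in train] for i in train]  # training x training
    A12 = [[A[i][j] for j in test] for i in train]   # training x test
    A22 = [[A[i][j] for j in test] for i in test]    # test x test
    return train, test, A11, A12, A22

# Toy input: a path graph on 6 nodes (illustrative only).
n = 6
A = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
train, test, A11, A12, A22 = split_blocks(A, n1=4, seed=0)
```

Only A11 and A12 are used for fitting; A22 is held out for the predictive loss.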
NCV Algorithm Explanation
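Given a candidate value $\tilde K$ of K and the training set $N_1$:
1 Perform a singular value decomposition on $A^{(1)}$.
2 Estimate φ by applying K-means clustering (with $\tilde K$ clusters) to the rows of the $n \times \tilde K$ matrix formed by the $\tilde K$ leading right singular vectors. Then estimate B by
$$\hat B_{k,k'} = \begin{cases} \dfrac{\sum_{i \in N_{1,k},\ j \in N_{1,k'} \cup N_{2,k'}} A_{ij}}{n_{1,k}(n_{1,k'} + n_{2,k'})}, & k \neq k'; \\[2ex] \dfrac{\sum_{i,j \in N_{1,k},\ i<j} A_{ij} + \sum_{i \in N_{1,k},\ j \in N_{2,k}} A_{ij}}{(n_{1,k}-1)\,n_{1,k}/2 + n_{1,k}\,n_{2,k}}, & k = k'. \end{cases}$$
Here $N_{j,k}$ is the set of nodes in $N_j$ assigned to community k, and $n_{j,k}$ is its size.
3 Use the predictive loss on the testing set $N_2$:
$$\hat L(A, \tilde K) = \sum_{i,j \in N_2,\ i \neq j} \ell(A_{ij}, \hat P_{ij}), \qquad \hat P_{ij} = \hat B_{\hat\phi_i \hat\phi_j}.$$
Consistency holds under assumptions on the connectivity matrix, the smallest community size and the training block size.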
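Steps 2 and 3 can be sketched as below once a labeling is available. This is a simplified illustration, not the full procedure: the labels are passed in directly (standing in for the SVD and K-means step), the squared-error loss is one possible choice of ℓ and is an assumption here, and the names `estimate_B` and `predictive_loss` are hypothetical.

```python
def estimate_B(A, labels, train, test, K):
    """Block estimate of B that uses only pairs involving a training node."""
    N1 = [[i for i in train if labels[i] == k] for k in range(K)]
    N2 = [[i for i in test if labels[i] == k] for k in range(K)]
    B = [[0.0] * K for _ in range(K)]
    for k in range(K):
        for m in range(K):
            if k != m:
                cols = N1[m] + N2[m]
                edges = sum(A[i][j] for i in N1[k] for j in cols)
                pairs = len(N1[k]) * len(cols)
            else:
                within = sum(A[i][j] for i in N1[k] for j in N1[k] if i < j)
                across = sum(A[i][j] for i in N1[k] for j in N2[k])
                edges = within + across
                n1k, n2k = len(N1[k]), len(N2[k])
                pairs = (n1k - 1) * n1k // 2 + n1k * n2k
            B[k][m] = edges / pairs if pairs > 0 else 0.0  # guard empty blocks
    return B

def predictive_loss(A, labels, test, B):
    """Squared-error loss summed over ordered test pairs i != j."""
    return sum((A[i][j] - B[labels[i]][labels[j]]) ** 2
               for i in test for j in test if i != j)

# Toy graph: two disjoint triangles {0,1,2} and {3,4,5}.
groups = [0, 0, 0, 1, 1, 1]
A = [[1 if i != j and groups[i] == groups[j] else 0 for j in range(6)]
     for i in range(6)]
train, test = [0, 1, 3, 4], [2, 5]

B2 = estimate_B(A, groups, train, test, K=2)     # candidate with 2 blocks
loss2 = predictive_loss(A, groups, test, B2)
one_block = [0] * 6                              # candidate with 1 block
B1 = estimate_B(A, one_block, train, test, K=1)
loss1 = predictive_loss(A, one_block, test, B1)
```

On this toy graph the two-block candidate fits the held-out pair exactly, while the one-block candidate incurs a positive loss, so the smaller loss points to two communities.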
II. Likelihood Ratio Method
1 Different models can be separated using the log likelihood ratio
$$L_{K,K'} = \log \frac{\sup_{\theta \in \Theta_{K'}} \ell(A; \theta)}{\sup_{\theta \in \Theta_{K}} \ell(A; \theta)},$$
which compares the correct K-block model with a fitted, misspecified K'-block model.
2 The log likelihood ratio statistic has an asymptotic normal distribution when a smaller model with fewer blocks is specified:
$$n^{-3/2} L_{K,K-1} - \sqrt{n}\,\mu_1(\theta^*) \xrightarrow{D} N(0, \sigma_1^2(\theta^*)).$$
3 In the case of misfitting a larger model, they obtain the convergence rate of $L_{K,K'}$.
4 The likelihood-based model selection criterion is asymptotically consistent.
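The suprema in the ratio require optimizing over all labelings, which is not attempted here. As a building block only, the sketch below evaluates a Bernoulli log-likelihood for a fixed labeling and a fixed B, and compares two plug-in fits. It conditions on the labels and ignores community proportions, which is a simplifying assumption; the function name and the parameter values are illustrative.

```python
import math

def sbm_loglik(A, labels, B):
    """Log-likelihood of A given fixed labels and B, over pairs i < j.
    Requires every entry of B that is used to lie strictly in (0, 1)."""
    total = 0.0
    n = len(A)
    for i in range(n):
        for j in range(i + 1, n):
            p = B[labels[i]][labels[j]]
            total += math.log(p) if A[i][j] == 1 else math.log(1.0 - p)
    return total

# Toy graph: two disjoint triangles {0,1,2} and {3,4,5}.
groups = [0, 0, 0, 1, 1, 1]
A = [[1 if i != j and groups[i] == groups[j] else 0 for j in range(6)]
     for i in range(6)]

ll_two = sbm_loglik(A, groups, [[0.9, 0.1], [0.1, 0.9]])  # 2-block plug-in
ll_one = sbm_loglik(A, [0] * 6, [[0.4]])                  # 1-block, density 6/15
plug_in_ratio = ll_one - ll_two   # analogue of L_{K,K'} with K = 2, K' = 1
```

The negative value of `plug_in_ratio` reflects that the one-block fit explains this graph worse than the two-block fit.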
Model Selection Criteria
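$$\beta(K') = \sup_{\theta \in \Theta_{K'}} \log \ell(A, \theta) - \lambda \frac{K'(K'+1)}{2} \log n$$
Variational log likelihood
$$J(q, \theta; A) = -D_{KL}\big(q \,\|\, f(Z \mid A; \theta)\big) + \log \ell(A; \theta)$$
where $q(z) \in D'_K = \prod_{i=1}^{k} q_i(z)$ and $Z = (z_1, \cdots, z_n)$ is the latent variable.
The variational estimate is given by
$$\hat\theta^{var}_{K'} = \arg\max_{\theta \in \Theta_{K'}} \max_{q \in D'_K} J(q, \theta; A).$$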
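A sketch of how the penalized criterion β(K') could be used for selection, assuming the maximized log-likelihood for each candidate K' is already available (for example from a variational fit) and that the selected K' is the maximizer of β. The function names are hypothetical and the log-likelihood values below are made-up placeholders, not results.

```python
import math

def penalized_criterion(max_loglik, K, n, lam):
    """beta(K') = maximized log-likelihood - lam * K'(K'+1)/2 * log n."""
    return max_loglik - lam * K * (K + 1) / 2 * math.log(n)

def select_K(max_logliks, n, lam):
    """Return the candidate K' with the largest beta(K').
    max_logliks maps each candidate K' to its maximized log-likelihood."""
    return max(max_logliks,
               key=lambda K: penalized_criterion(max_logliks[K], K, n, lam))

# Placeholder log-likelihoods for K' = 1..4 on a network with n = 100 nodes.
max_logliks = {1: -520.0, 2: -430.0, 3: -425.0, 4: -423.0}
K_hat = select_K(max_logliks, n=100, lam=1.0)
```

With these placeholder values the gain from K' = 2 to K' = 3 is smaller than the extra penalty, so the criterion settles on K' = 2.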
Real World Networks Application
Facebook Ego Networks
Facebook users categorize their friends by social circles.
An ego network is created by extracting subgraphs formed on theneighbors of a central (ego) node.
Facebook Ego Networks (Cont.)
There are no isolated nodes. The actual sizes of the networks and the number of communities selected by the three methods are shown in Table 1.
Facebook Ego Networks (Cont.)
The PL and VB produce comparable community numbers.
NCV (3-fold network cross-validation) favors small community numbers.
My Research Topics
Network Likelihood Cross Validation
Use a straightforward cross-validation method to determine community numbers.
1 Training set ⇒ $\hat\phi$, $\hat B$, $\hat p$.
2 Test set ⇒ given $\phi^*$, $p^*$ (best estimation),
$$P\left( \ell(A; \hat\phi) - \ell(A; \phi^*) \,\big|_{N_2} > 0 \right) \to 0.$$
Under some proper assumptions, consistency of the estimate will be established.
Computation comparison with VB, VLH, and NCV.
Hoeffding’s Inequality
If the $X_i$ are independent and strictly bounded by the intervals $[a_i, b_i]$, then the sum $S_n = X_1 + \cdots + X_n$ satisfies
$$P(|S_n - E S_n| \ge t) \le 2 \exp\left( -\frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right).$$
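A small numerical illustration of the bound. The function name is hypothetical, and capping the result at 1 is a convenience added here (any probability is at most 1), not part of the inequality itself.

```python
import math

def hoeffding_bound(t, intervals):
    """Upper bound on P(|S_n - E S_n| >= t) for independent X_i in [a_i, b_i].
    intervals is a list of (a_i, b_i) pairs; the result is capped at 1."""
    denom = sum((b - a) ** 2 for a, b in intervals)
    return min(1.0, 2.0 * math.exp(-2.0 * t ** 2 / denom))

# Example: 100 independent edge indicators, each in [0, 1], deviation t = 10.
bound = hoeffding_bound(10.0, [(0.0, 1.0)] * 100)   # equals 2 * exp(-2)
```

For binary edge indicators every interval is [0, 1], so the denominator is simply the number of node pairs in the sum.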
Main References
1 T. Tony Cai and Xiaodong Li (2014). Robust and Computationally Feasible Community Detection in the Presence of Arbitrary Outlier Nodes. To appear in Ann. Statist.
2 Kehui Chen and Jing Lei (2014). Network Cross-Validation for Determining the Number of Communities in Network Data. eprint arXiv:1411.1715
3 D. F. Saldana, Y. Yu and Y. Feng (2014). How Many Communities Are There? eprint arXiv:1412.1684
4 Celisse, A., Daudin, J. and Pierre, L. (2012). Consistency of maximum-likelihood and variational estimators in the Stochastic Block Model. Electron. J. Statist. 6 1847-1899.
5 Y. X. Rachel Wang and Peter J. Bickel (2015). Likelihood-based Model Selection for Stochastic Block Models. Submitted to the Annals of Statistics.