literature9-23 v2

Stochastic Blockmodel and Community Detection

Lin Zhang

September 25, 2015


Outline


1 Introduction to Stochastic blockmodel
2 Community number detection methods
    Network cross-validation
    Likelihood based model selection
    Likelihood based cross-validation method


Introduction to SBM

Stochastic Block Model

Network analysis has quickly become a key tool for modern data analysis, with wide applications from neuroscience to sociology to biochemistry. The Stochastic Block Model (SBM) gives a simple community structure.

Holland, Laskey and Leinhardt (1983): "A stochastic model is proposed for social networks in which the actors (vertices) in a network are partitioned into subgroups called blocks (communities)."

SBM is a probabilistic model, which assigns a probability value to each pair (i, j) of objects in the network.


http://www.stat.cmu.edu/~brian/780/bibliography/04%20Blockmodels/Holland%20-%201983%20-%20Stochastic%20blockmodels,%20first%20steps.pdf


Why SBM?

1 SBM is usually the first model to use because of its simplicity.
2 The basic idea is that objects with similar (or the same) connectivity can be grouped together.
3 The initial graph can be reduced to a simpler one without losing too much information.

All of the community detection discussed here is based on SBM.



A Toy graph for SBM



Community Detection

What is community detection? The procedure for finding the community structure (clustering).

1 Community Recovery

2 Community Number Detection

The goal is to illuminate the overall network structure. The two tasks are two sides of the same coin, but with different importance.



Basic Setting of SBM

The most basic version of SBM is defined by a scalar value and two simple data structures:

Graph G = G(V, E): vertices V and edges E.
K: Community number.
A: Adjacency matrix. An n × n symmetric binary matrix.
φ: Labeling function with φ(i) ∈ {1, ..., K}.
B: Connectivity matrix. A K × K stochastic block matrix, where $B_{\phi(i)\phi(j)}$ gives the probability that a vertex i is connected to a vertex j.
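The setting above can be illustrated by drawing an adjacency matrix A from given labels and a connectivity matrix B. The following is a minimal sketch using only the Python standard library; the function name `sample_sbm`, the 0-based labels, and the numerical values of B are illustrative assumptions, not part of the original slides.

```python
import random

def sample_sbm(labels, B, seed=None):
    """Draw a symmetric binary adjacency matrix from an SBM.

    labels[i] is the community of vertex i (0-based, i.e. phi(i) - 1);
    B[k][l] is the connection probability between communities k and l.
    """
    rng = random.Random(seed)
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only, so no self-loops
            if rng.random() < B[labels[i]][labels[j]]:
                A[i][j] = A[j][i] = 1  # mirror to keep A symmetric
    return A

# Two communities of 5 vertices each (illustrative values):
# within-group density 0.8, cross-group density 0.1.
labels = [0] * 5 + [1] * 5
B = [[0.8, 0.1], [0.1, 0.8]]
A = sample_sbm(labels, B, seed=0)
```

Each unordered pair is sampled once and mirrored, which matches the undirected, binary version of the model considered here.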



Basic Setting (Cont.)

Common assumptions for community detection under SBM:
Statistical equivalence.
Cross-group density q and within-group density p, with p − q := δ > 0.
$n_{\min} := \min_{1 \le l \le K} |\phi^{-1}(l)|$: minimum community size.

Consistency of the estimators depends on conditions on δ and $n_{\min}$.



Types of SBM

SBM gives a versatile way to infer the underlying structure of a graph.

1 The graph can be directed or undirected.
2 The graph can be binary or weighted.
3 The model can rely on a labeling function or assume unknown labels.

The SBM discussed here is undirected, binary and has a labeling function. Extensions include the degree-corrected blockmodel.


Community Detection

Two Focuses of Community Detection

1 Community recovery: for a fixed community number K, construct a variety of algorithms to estimate (φ, B).

2 Community number detection: find the best K under certain criteria: BIC, CL-BIC, maximized likelihood or minimized loss function.



Community Recovery Algorithms

Greedy algorithm (a.k.a. hierarchical agglomeration): computationally feasible, but low statistical accuracy.
Profile likelihood method: consistent when K is fixed, but NP-hard.
Spectral clustering: popular algorithm, fast computation, easy implementation, works well for both dense and sparse graphs.

Spielman and Teng (1996) and Goldenberg et al. (2010) provide comprehensive reviews of spectral clustering and SBM, respectively.


http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.717&rep=rep1&type=pdf

http://arxiv.org/pdf/0912.5410.pdf


Community Number Detection

1 K = 1: Erdős-Rényi graph. Hypothesis test: K = 1 vs K > 1.
2 Composite likelihood BIC (CL-BIC). D. F. Saldana, Y. Yu and Y. Feng (2014) propose a composite likelihood BIC to select the number of communities.


http://arxiv.org/pdf/1412.1684v2.pdf



Community Number Detection

1 Network cross-validation. K. Chen and J. Lei (2014) propose a block-wise edge splitting technique. This technique can be combined with an integrated step of community recovery using sub-blocks of the adjacency matrix.

2 Likelihood-based model selection for SBM. Y. X. Rachel Wang and Peter J. Bickel (2015) consider an approach based on the log likelihood ratio statistic and analyze its asymptotic properties under model misspecification.






I. Network Cross-validation Algorithm

Input: Adjacency matrix A, K, training block size $n_1$.

1 Randomly split A into blocks

$$A = \begin{pmatrix} A^{(11)} & A^{(12)} \\ A^{(21)} & A^{(22)} \end{pmatrix}, \qquad A^{(21)} = (A^{(12)})^{\top},$$

where $A^{(11)}$ contains edges between training nodes, $A^{(22)}$ contains edges between test nodes, and $A^{(12)}$ contains edges between training and test nodes.

2 Estimate model parameters $(\hat\phi, \hat B)$ using the rectangular sub-matrix $A^{(1)} = (A^{(11)}, A^{(12)})$.

3 Output the predictive loss evaluated on $A^{(22)}$:

$$L(\hat\phi, \hat B) = \sum_{(i,j) \in A^{(22)},\ i \neq j} -\ell(\hat\phi, \hat B).$$



NCV Algorithm Explanation

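Given a candidate value $\tilde K$ of K and the training set $N_1$:

1 Perform a singular value decomposition on $A^{(1)}$.

2 Estimate φ by applying k-means clustering to the rows of the $n \times \tilde K$ matrix whose columns are the top $\tilde K$ right singular vectors. Then estimate B by

$$\hat B_{k,k'} =
\begin{cases}
\dfrac{\sum_{i \in N_{1,k},\ j \in N_{1,k'} \cup N_{2,k'}} A_{ij}}{n_{1,k}\,(n_{1,k'} + n_{2,k'})}, & k \neq k', \\[2ex]
\dfrac{\sum_{i,j \in N_{1,k},\ i<j} A_{ij} + \sum_{i \in N_{1,k},\ j \in N_{2,k}} A_{ij}}{(n_{1,k}-1)\,n_{1,k}/2 + n_{1,k}\,n_{2,k}}, & k = k',
\end{cases}$$

where $N_{1,k}$ and $N_{2,k}$ are the training and test nodes with estimated label k, and $n_{1,k}$, $n_{2,k}$ are their sizes.

3 Use the predictive loss on the testing set $N_2$:

$$L(A, \tilde K) = \sum_{i,j \in N_2,\ i \neq j} \ell(A_{ij}, \hat P_{ij}), \qquad \hat P_{ij} = \hat B_{\hat\phi_i \hat\phi_j}.$$

Consistency holds under assumptions on the connectivity matrix, the smallest community size and the training block size.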
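The estimation of B from the training rows and the predictive loss on the test block can be sketched as follows. This is a minimal illustration under stated assumptions: the labels are taken as already estimated (the SVD and k-means step is skipped), the loss ℓ is chosen to be squared error, and the names `estimate_B` and `predictive_loss` are hypothetical helpers, not code from the cited paper.

```python
def estimate_B(A, labels, train, test, K):
    """Block-density estimate of B using only rows of training nodes."""
    N1 = [[i for i in train if labels[i] == k] for k in range(K)]
    N2 = [[i for i in test if labels[i] == k] for k in range(K)]
    B = [[0.0] * K for _ in range(K)]
    for k in range(K):
        for l in range(K):
            if k != l:
                cols = N1[l] + N2[l]
                num = sum(A[i][j] for i in N1[k] for j in cols)
                den = len(N1[k]) * len(cols)
            else:
                # training-training pairs counted once, plus training-test pairs
                num = sum(A[i][j] for i in N1[k] for j in N1[k] if i < j)
                num += sum(A[i][j] for i in N1[k] for j in N2[k])
                n1, n2 = len(N1[k]), len(N2[k])
                den = (n1 - 1) * n1 // 2 + n1 * n2
            B[k][l] = num / den if den > 0 else 0.0
    return B

def predictive_loss(A, labels, B, test):
    """Squared-error loss over ordered pairs of distinct test nodes."""
    return sum((A[i][j] - B[labels[i]][labels[j]]) ** 2
               for i in test for j in test if i != j)

# Toy network: two disjoint triangles {0,1,2} and {3,4,5}.
A = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
train, test = [0, 1, 3, 4], [2, 5]

labels2 = [0, 0, 0, 1, 1, 1]          # candidate with 2 communities
B2 = estimate_B(A, labels2, train, test, K=2)
loss2 = predictive_loss(A, labels2, B2, test)

labels1 = [0] * 6                     # candidate with 1 community
B1 = estimate_B(A, labels1, train, test, K=1)
loss1 = predictive_loss(A, labels1, B1, test)
```

On this toy network the two-community candidate attains a smaller test loss than the one-community candidate, which is the comparison the cross-validation procedure relies on when choosing the community number.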



II. Likelihood Ratio Method

1 Different models can be separated using the log likelihood ratio

$$L_{K,K'} = \log \frac{\sup_{\theta \in \Theta_{K'}} \ell(A; \theta)}{\sup_{\theta \in \Theta_{K}} \ell(A; \theta)},$$

which compares the correct K-block model with a fitted, misspecified K′-block model.

2 The log likelihood ratio statistic has an asymptotic normal distribution when a smaller model with fewer blocks is specified:

$$n^{-3/2} L_{K,K-1} - \sqrt{n}\,\mu_1(\theta^*) \xrightarrow{D} N\big(0, \sigma_1^2(\theta^*)\big).$$

3 In the case of misfitting a larger model, they obtain the convergence rate of $L_{K,K'}$.

4 The likelihood-based model selection criterion is asymptotically consistent.



Model Selection Criteria

$$\beta(K') = \sup_{\theta \in \Theta_{K'}} \log \ell(A, \theta) - \lambda\,\frac{K'(K'+1)}{2}\,\log n$$
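The criterion above can be evaluated for several candidate values of K′ and the maximizer selected. The sketch below assumes the maximized log-likelihoods are already available; the numbers in `logliks`, the tuning value `lam = 1.0`, and the helper names are purely illustrative assumptions.

```python
import math

def penalized_criterion(max_loglik, K, n, lam):
    """beta(K') = maximized log-likelihood - lam * K'(K'+1)/2 * log n."""
    return max_loglik - lam * K * (K + 1) / 2 * math.log(n)

def select_K(logliks, n, lam):
    """Return the candidate K' with the largest penalized criterion."""
    return max(logliks,
               key=lambda K: penalized_criterion(logliks[K], K, n, lam))

# Hypothetical maximized log-likelihoods for K' = 1..4 on an n = 100 network.
logliks = {1: -3200.0, 2: -2900.0, 3: -2895.0, 4: -2893.0}
best_K = select_K(logliks, n=100, lam=1.0)
```

With these made-up values the likelihood gain beyond two blocks is smaller than the growth of the penalty, so the criterion stops at K′ = 2.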

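Variational log likelihood

$$J(q, \theta; A) = -D_{KL}\big(q \,\|\, f(Z \mid A; \theta)\big) + \log \ell(A; \theta),$$

where $q \in D'_K$ factorizes as $q(z) = \prod_{i=1}^{n} q_i(z_i)$ and $Z = (z_1, \dots, z_n)$ is the latent variable.

The variational estimate is given by

$$\hat\theta^{\mathrm{var}}_{K'} = \arg\max_{\theta \in \Theta_{K'}} \max_{q \in D'_K} J(q, \theta; A).$$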


Real World Networks Application

Facebook Ego Networks

Facebook users categorize their friends by social circles.

An ego network is created by extracting subgraphs formed on the neighbors of a central (ego) node.



Facebook Ego Networks (Cont.)

There are no isolated nodes. The actual sizes of the networks and the number of communities selected by the three methods are shown in Table 1.



Facebook Ego Networks (Cont.)

PL and VB produce comparable community numbers.
NCV (3-fold network cross-validation) favors small community numbers.


My Research Topics

Network Likelihood Cross Validation

Use a straightforward cross-validation method to determine the community number.

1 Training set ⇒ $\hat\phi$, $\hat B$, $\hat p$.
2 Test set ⇒ given $\phi^*$, $p^*$ (best estimate),

$$P\Big(\big[\ell(A; \hat\phi) - \ell(A; \phi^*)\big]\Big|_{N_2} > 0\Big) \to 0.$$

Under suitable assumptions, consistency of the estimate will be established.
Computational comparison with VB, VLH, and NCV.



Hoeffding’s Inequality

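Let $X_1, \dots, X_n$ be independent random variables with $X_i$ bounded in the interval $[a_i, b_i]$. Then the sum $S_n = X_1 + \cdots + X_n$ satisfies

$$P\big(|S_n - E S_n| \geq t\big) \leq 2 \exp\left(-\frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right).$$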
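As a quick numerical sanity check of the inequality, the bound can be compared with a Monte Carlo estimate of the tail probability. The sketch below assumes fair 0/1 coin flips (so each $X_i$ lies in [0, 1]); the choice n = 100, t = 10, the number of trials, and the function names are illustrative assumptions.

```python
import math
import random

def hoeffding_bound(t, intervals):
    """Two-sided Hoeffding bound 2 * exp(-2 t^2 / sum (b_i - a_i)^2)."""
    denom = sum((b - a) ** 2 for a, b in intervals)
    return 2.0 * math.exp(-2.0 * t ** 2 / denom)

def empirical_tail(n, t, trials, seed=0):
    """Monte Carlo estimate of P(|S_n - E S_n| >= t) for fair 0/1 coin flips."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(rng.randint(0, 1) for _ in range(n))
        if abs(s - n / 2) >= t:  # E S_n = n / 2 for fair coin flips
            hits += 1
    return hits / trials

n, t = 100, 10
bound = hoeffding_bound(t, [(0, 1)] * n)   # equals 2 * exp(-2)
freq = empirical_tail(n, t, trials=2000)
```

The empirical frequency `freq` should fall well below `bound`, since Hoeffding's inequality is not tight for this distribution.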





Main References

1 T. Tony Cai and Xiaodong Li (2014). Robust and Computationally Feasible Community Detection in the Presence of Arbitrary Outlier Nodes. To appear in Ann. Statist.

2 Kehui Chen and Jing Lei (2014). Network Cross-Validation for Determining the Number of Communities in Network Data. eprint arXiv:1411.1715

3 D. F. Saldana, Y. Yu and Y. Feng (2014). How Many Communities Are There? eprint arXiv:1412.1684

4 Celisse, A., Daudin, J.-J. and Pierre, L. (2012). Consistency of maximum-likelihood and variational estimators in the Stochastic Block Model. Electron. J. Statist. 6 1847-1899.

5 Y. X. Rachel Wang and Peter J. Bickel (2015). Likelihood-based Model Selection for Stochastic Block Models. Submitted to the Annals of Statistics.

