Page 1: Literature9-23 v2

Stochastic Blockmodel and Community Detection

Lin Zhang

September 25, 2015

Lin Zhang SBM and CD September 25, 2015 1 / 24

Page 2: Literature9-23 v2

Outline

Outline

1 Introduction to the stochastic blockmodel
2 Community number detection methods
    Network cross-validation
    Likelihood-based model selection
    Likelihood-based cross-validation method


Page 3: Literature9-23 v2

Introduction to SBM

Stochastic Block Model

Network analysis is quickly becoming a key tool for modern data analysis, with wide applications from neuroscience to sociology to biochemistry. The Stochastic Block Model (SBM) gives a simple model of community structure.

Holland, Laskey and Leinhardt (1983): "A stochastic model is proposed for social networks in which the actors (vertices) in a network are partitioned into subgroups called blocks (communities)."

SBM is a probabilistic model that assigns a connection probability to each pair of objects (i, j) in the network.


Page 4: Literature9-23 v2

Introduction to SBM

Why SBM?

1 SBM is usually the first model to use because of its simplicity.
2 The basic idea is that objects with similar (or the same) connectivity can be grouped together.
3 The initial graph can be reduced to a simpler one without losing too much information.

All of the community detection discussed here is based on SBM.


Page 5: Literature9-23 v2

Introduction to SBM

A Toy graph for SBM


Page 6: Literature9-23 v2

Introduction to SBM

Community Detection

What is community detection? The procedure for finding the community structure of a network (clustering).

1 Community recovery

2 Community number detection

Both aim to illuminate the overall network structure. They are two sides of the same coin, but differ in importance.


Page 7: Literature9-23 v2

Introduction to SBM

Basic Setting of SBM

The most basic version of SBM is defined by a scalar value and two simple data structures:

Graph G = (V, E): vertices V and edges E.
K: number of communities.
A: adjacency matrix, an n × n symmetric binary matrix.
φ: labeling function, with φ(i) ∈ {1, · · · , K} for each vertex i.
B: connectivity matrix, a K × K stochastic block matrix, where $B_{\phi(i)\phi(j)}$ gives the probability that vertex i is connected to vertex j.
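As an illustration of this setting, the sketch below draws an adjacency matrix A from given labels φ and a connectivity matrix B. It is a minimal sketch, not code from the talk: the function name `sample_sbm`, the 0-based labels (the slides use 1, ..., K) and the example values of B are assumptions made for illustration.

```python
import random

def sample_sbm(labels, B, seed=0):
    """Draw a symmetric binary adjacency matrix without self-loops.

    labels[i] is the (0-based) community of vertex i; B[k][l] is the
    probability of an edge between a vertex in block k and one in block l.
    """
    rng = random.Random(seed)
    n = len(labels)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Each pair (i, j) is connected independently with
            # probability B[phi(i)][phi(j)].
            if rng.random() < B[labels[i]][labels[j]]:
                A[i][j] = A[j][i] = 1
    return A

# Hypothetical toy example: two communities of three vertices each.
labels = [0, 0, 0, 1, 1, 1]
B = [[0.9, 0.1],
     [0.1, 0.9]]
A = sample_sbm(labels, B)
```

Only the upper triangle is sampled and then mirrored, which keeps A symmetric as required for the undirected model.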


Page 8: Literature9-23 v2

Introduction to SBM

Basic Setting (Cont.)

Common assumptions for community detection under SBM:
Statistical equivalence.
Cross-group density q and within-group density p, with p − q := δ > 0.
$n_{\min} := \min_{1 \le l \le K} |\phi^{-1}(l)|$: minimum community size.

Consistency of the estimators depends on conditions on δ and $n_{\min}$.


Page 9: Literature9-23 v2

Introduction to SBM

Types of SBM

SBM gives a versatile way to infer the underlying structure of a graph.

1 The graph can be directed or undirected.
2 The graph can be binary or weighted.
3 The model can rely on a labeling function or assume unknown labels.

The SBM discussed here is undirected, binary and has a labeling function. Extensions include the degree-corrected blockmodel.


Page 10: Literature9-23 v2

Community Detection

Two Focuses of Community Detection

1 Community recovery: for a fixed community number K, construct algorithms to estimate (φ, B).

2 Community number detection: find the best K under a chosen criterion: BIC, CL-BIC, maximized likelihood or minimized loss function.


Page 11: Literature9-23 v2

Community Detection

Community Recovery Algorithms

Greedy algorithm (a.k.a. hierarchical agglomeration): computationally feasible, but low statistical accuracy.
Profile likelihood method: consistent when K is fixed, but NP-hard.
Spectral clustering: popular algorithm, fast to compute, easy to implement, works well for both dense and sparse graphs.

Spielman and Teng (1996) and Goldenberg et al. (2010) provide comprehensive reviews of spectral clustering and SBM, respectively.
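To make the spectral clustering idea concrete, here is a minimal sketch using NumPy. It embeds each vertex using the K eigenvectors of A with the largest absolute eigenvalues and then runs a small k-means on the rows. This is a generic textbook variant, not the specific algorithm of any paper cited here; the function name `spectral_labels` and the farthest-point initialisation are choices made for illustration.

```python
import numpy as np

def spectral_labels(A, K, n_iter=50):
    """Cluster vertices by k-means on the leading K eigenvectors of A."""
    A = np.asarray(A, dtype=float)
    # eigh is for symmetric matrices; eigenvalues come back in ascending order.
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[-K:]   # K largest eigenvalues in magnitude
    U = vecs[:, top]                      # n x K spectral embedding

    # Farthest-point initialisation keeps the example deterministic.
    centers = [U[0]]
    for _ in range(1, K):
        d = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[int(np.argmax(d))])
    centers = np.array(centers)

    # Plain Lloyd iterations: assign rows to nearest center, then update centers.
    for _ in range(n_iter):
        dist = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = U[labels == k].mean(axis=0)
    return labels

# Two disjoint cliques of four vertices each: an idealised two-block graph.
A = [[1 if (i != j and (i < 4) == (j < 4)) else 0 for j in range(8)]
     for i in range(8)]
labels = spectral_labels(A, 2)
```

The returned labels are only defined up to a permutation of the community names, so comparisons with a ground truth should ignore relabeling.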


Page 12: Literature9-23 v2

Community Detection

Community Number Detection

1 K = 1: Erdős-Rényi graph. Hypothesis test: K = 1 vs. K > 1.
2 Composite likelihood BIC (CL-BIC). D. F. Saldana, Y. Yu and Y. Feng (2014) propose a composite likelihood BIC to select the number of communities.


Page 13: Literature9-23 v2

Community Detection

Community Number Detection

1 Network cross-validation. K. Chen and J. Lei (2014) propose a block-wise edge splitting technique. This technique can be combined with an integrated step of community recovery using sub-blocks of the adjacency matrix.

2 Likelihood-based model selection for SBM. Y. X. Rachel Wang and Peter J. Bickel (2015) consider an approach based on the log likelihood ratio statistic and analyze its asymptotic properties under model misspecification.


Page 14: Literature9-23 v2

Community Detection

I. Network Cross-validation Algorithm

Input: adjacency matrix A, K, training block size $n_1$.
1 Randomly split $A = (A^{(11)}, A^{(12)}; A^{(21)}, A^{(22)})$, where $A^{(11)}$ contains edges between training nodes, $A^{(22)}$ contains edges between test nodes, and $A^{(12)}$ contains edges between training and test nodes ($A^{(21)}$ is the transpose of $A^{(12)}$ by symmetry).

2 Estimate model parameters $(\hat\phi, \hat B)$ using the rectangular sub-matrix $A^{(1)} = (A^{(11)}, A^{(12)})$.

3 Output is the predictive loss evaluated on $A^{(22)}$:

$$L(\hat\phi, \hat B) = \sum_{(i,j) \in A^{(22)},\ i \neq j} -\ell(\hat\phi, \hat B)$$
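The block-wise split in step 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `split_nodes` and `submatrix` are hypothetical, A is stored as a list of lists, and a single random split is shown rather than the repeated folds a full cross-validation would use.

```python
import random

def split_nodes(n, n1, seed=0):
    """Randomly choose n1 training nodes; the remaining nodes form the test set."""
    rng = random.Random(seed)
    nodes = list(range(n))
    rng.shuffle(nodes)
    return sorted(nodes[:n1]), sorted(nodes[n1:])

def submatrix(A, rows, cols):
    """Extract the block of A indexed by the given rows and columns."""
    return [[A[i][j] for j in cols] for i in rows]

# Hypothetical 6-node graph made of two triangles.
A = [[1 if (i != j and (i < 3) == (j < 3)) else 0 for j in range(6)]
     for i in range(6)]
train, test = split_nodes(len(A), n1=4)
A1 = submatrix(A, train, train + test)   # rectangular block (A11, A12): n1 x n
A22 = submatrix(A, test, test)           # held-out block between test nodes
```

Only the rectangular block A1 is used for fitting, so the edges between test nodes in A22 stay untouched for evaluation.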


Page 15: Literature9-23 v2

Community Detection

NCV Algorithm Explanation

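Given a candidate value $\tilde K$ of K and the training set $N_1$:
1 Perform a singular value decomposition on $A^{(1)}$.

2 Estimate φ by applying K-means clustering to the rows of the $n \times \tilde K$ matrix formed by the top $\tilde K$ right singular vectors. Writing $N_{1,k}$ and $N_{2,k}$ for the training and test nodes with estimated label k, and $n_{1,k}$, $n_{2,k}$ for their sizes, estimate B by

$$\hat B_{k,k'} = \begin{cases} \dfrac{\sum_{i \in N_{1,k},\ j \in N_{1,k'} \cup N_{2,k'}} A_{ij}}{n_{1,k}(n_{1,k'} + n_{2,k'})}, & k \neq k'; \\[2ex] \dfrac{\sum_{i,j \in N_{1,k},\ i<j} A_{ij} + \sum_{i \in N_{1,k},\ j \in N_{2,k}} A_{ij}}{(n_{1,k} - 1)n_{1,k}/2 + n_{1,k} n_{2,k}}, & k = k'. \end{cases}$$

3 Use the predictive loss on the testing set $N_2$:

$$\hat L(A, \tilde K) = \sum_{i,j \in N_2,\ i \neq j} \ell(A_{ij}, \hat P_{ij}), \qquad \hat P_{ij} = \hat B_{\hat\phi_i \hat\phi_j}.$$

Consistency holds under assumptions on the connectivity matrix, the smallest community size and the training block size.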
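The block-wise estimate of B and the predictive loss on the test set can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the labels are taken as given (0-based) instead of coming from the SVD and K-means step, the function names are hypothetical, and squared error is assumed as the loss ℓ.

```python
def estimate_B(A, labels, train, test, K):
    """Plug-in estimate of B using only rows of A that belong to training nodes."""
    n1 = [[i for i in train if labels[i] == k] for k in range(K)]
    n2 = [[j for j in test if labels[j] == k] for k in range(K)]
    B = [[0.0] * K for _ in range(K)]
    for k in range(K):
        for l in range(K):
            if k != l:
                # Training rows of block k against all columns of block l.
                cols = n1[l] + n2[l]
                edges = sum(A[i][j] for i in n1[k] for j in cols)
                pairs = len(n1[k]) * len(cols)
            else:
                # Within-block: training pairs (i < j) plus training-test pairs.
                within = sum(A[i][j] for i in n1[k] for j in n1[k] if i < j)
                across = sum(A[i][j] for i in n1[k] for j in n2[k])
                edges = within + across
                pairs = (len(n1[k]) - 1) * len(n1[k]) // 2 + len(n1[k]) * len(n2[k])
            B[k][l] = edges / pairs if pairs > 0 else 0.0
    return B

def predictive_loss(A, labels, test, B):
    """Squared-error loss over ordered pairs of distinct test nodes."""
    return sum((A[i][j] - B[labels[i]][labels[j]]) ** 2
               for i in test for j in test if i != j)

# Hypothetical example: two triangles, nodes 2 and 5 held out for testing.
A = [[1 if (i != j and (i < 3) == (j < 3)) else 0 for j in range(6)]
     for i in range(6)]
train, test = [0, 1, 3, 4], [2, 5]
loss_k2 = predictive_loss(A, [0, 0, 0, 1, 1, 1], test,
                          estimate_B(A, [0, 0, 0, 1, 1, 1], train, test, 2))
loss_k1 = predictive_loss(A, [0] * 6, test,
                          estimate_B(A, [0] * 6, train, test, 1))
```

In this toy graph the two-block fit predicts the held-out pair exactly, while the one-block fit does not, so comparing the two losses favours two communities.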


Page 16: Literature9-23 v2

Community Detection

II. Likelihood Ratio Method

1 Different models can be separated using the log likelihood ratio

$$L_{K,K'} = \log \frac{\sup_{\theta \in \Theta_{K'}} \ell(A; \theta)}{\sup_{\theta \in \Theta_{K}} \ell(A; \theta)},$$

which compares the correct K-block model with a fitted, misspecified K′-block model.

2 The log likelihood ratio statistic has an asymptotic normal distribution when a smaller model with fewer blocks is specified:

$$n^{-3/2} L_{K,K-1} - \sqrt{n}\,\mu_1(\theta^*) \xrightarrow{D} N(0, \sigma_1^2(\theta^*)).$$

3 In the case of misfitting a larger model, they obtain the convergence rate of $L_{K,K'}$.

4 The likelihood-based model selection criterion is asymptotically consistent.


Page 17: Literature9-23 v2

Community Detection

Model Selection Criteria

$$\beta(K') = \sup_{\theta \in \Theta_{K'}} \log \ell(A; \theta) - \lambda \frac{K'(K'+1)}{2} \log n$$
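A small sketch of how such a penalized criterion selects the number of communities: each candidate K′ is scored by its maximized log-likelihood minus the penalty λ · K′(K′+1)/2 · log n as read from the criterion above, and the candidate with the highest score wins. The function names and the log-likelihood values below are hypothetical and only illustrate the trade-off between fit and model size.

```python
import math

def penalized_score(max_loglik, K, n, lam):
    """Maximized log-likelihood minus lam * K(K+1)/2 * log(n)."""
    return max_loglik - lam * K * (K + 1) / 2 * math.log(n)

def select_K(max_logliks, n, lam):
    """Return the candidate K' with the highest penalized score.

    max_logliks maps each candidate K' to its maximized log-likelihood.
    """
    return max(max_logliks,
               key=lambda K: penalized_score(max_logliks[K], K, n, lam))

# Made-up log-likelihoods: the fit improves a lot from 1 to 2 blocks,
# but only slightly from 2 to 3 blocks.
candidates = {1: -520.0, 2: -410.0, 3: -405.0}
best_K = select_K(candidates, n=100, lam=1.0)
```

With these numbers the small gain from a third block does not cover the extra penalty, so two blocks are selected. The tuning constant λ controls how strongly larger models are discouraged.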

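Variational log likelihood

$$J(q, \theta; A) = -D_{KL}\big(q \,\|\, f(Z \mid A; \theta)\big) + \log \ell(A; \theta),$$

where $q \in D'_K$, the class of product distributions $q(z) = \prod_{i=1}^{n} q_i(z_i)$, and $Z = (z_1, \cdots, z_n)$ is the latent variable.

The variational estimate is given by

$$\hat\theta^{var}_{K'} = \arg\max_{\theta \in \Theta_{K'}} \max_{q \in D'_K} J(q, \theta; A).$$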


Page 18: Literature9-23 v2

Real World Networks Application

Facebook Ego Networks

Facebook users categorize their friends by social circles.

An ego network is created by extracting the subgraph formed on the neighbors of a central (ego) node.
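The extraction step can be sketched as an induced subgraph on the ego's neighbors. This is an illustration only: the adjacency-list representation, the function name `ego_network` and the choice to leave the ego node itself out of the subgraph are assumptions, and the toy graph is made up.

```python
def ego_network(adj, ego):
    """Induced subgraph on the neighbours of `ego` (the ego itself is excluded).

    adj maps each node to a list of its neighbours in an undirected graph.
    """
    nbrs = set(adj[ego])
    # Keep only edges whose two endpoints are both neighbours of the ego.
    return {u: sorted(v for v in adj[u] if v in nbrs) for u in sorted(nbrs)}

# Hypothetical friendship graph with ego node 0.
adj = {
    0: [1, 2, 3],
    1: [0, 2],
    2: [0, 1],
    3: [0, 4],
    4: [3],
}
sub = ego_network(adj, 0)
```

Here node 4 is dropped because it is not a friend of the ego, and node 3 has no remaining edges inside the ego network.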


Page 19: Literature9-23 v2

Real World Networks Application

Facebook Ego Networks (Cont.)

There are no isolated nodes. The actual sizes of the networks and the number of communities selected by the three methods are shown in Table 1.


Page 20: Literature9-23 v2

Real World Networks Application

Facebook Ego Networks(Cont.)

PL and VB produce comparable community numbers.
NCV (3-fold network cross-validation) favors small community numbers.


Page 21: Literature9-23 v2

My Research Topics

Network Likelihood Cross Validation

Use a straightforward cross-validation method to determine the number of communities.

1 Training set ⇒ $\hat\phi$, $\hat B$, $\hat p$.
2 Test set ⇒ given $\phi^*$, $p^*$ (best estimation),

$$P\big(\ell(A; \hat\phi) - \ell(A; \phi^*)\big|_{N_2} > 0\big) \to 0.$$

Under suitable assumptions, consistency of the estimate will be established.
Computational comparison with VB, VLH, and NCV.


Page 22: Literature9-23 v2

My Research Topics

Hoeffding’s Inequality

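Let $X_1, \cdots, X_n$ be independent random variables with each $X_i$ bounded in the interval $[a_i, b_i]$. Then the sum $S_n = X_1 + \cdots + X_n$ satisfies, for every t > 0,

$$P(|S_n - E S_n| \ge t) \le 2 \exp\left( -\frac{2t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right).$$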
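A quick numerical sketch of the bound, to give a feel for its size. The function name `hoeffding_bound` is hypothetical. The example treats 100 independent variables in [0, 1], as would be the case for independent binary edge indicators.

```python
import math

def hoeffding_bound(t, intervals):
    """Two-sided Hoeffding bound on P(|S_n - E S_n| >= t).

    intervals is a list of (a_i, b_i) pairs bounding each independent X_i.
    The result is capped at 1 because it bounds a probability.
    """
    width_sq = sum((b - a) ** 2 for a, b in intervals)
    return min(1.0, 2.0 * math.exp(-2.0 * t ** 2 / width_sq))

# 100 independent variables, each in [0, 1], and a deviation of t = 10:
# the bound is 2 * exp(-2 * 10**2 / 100) = 2 * exp(-2).
bound = hoeffding_bound(10, [(0, 1)] * 100)
```

The bound depends on the variables only through their ranges, so it applies without knowing the individual edge probabilities.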


Page 23: Literature9-23 v2

My Research Topics


Page 24: Literature9-23 v2

My Research Topics

Main References

1 T. Tony Cai and Xiaodong Li (2014). Robust and Computationally Feasible Community Detection in the Presence of Arbitrary Outlier Nodes. To appear in Ann. Statist.

2 Kehui Chen and Jing Lei (2014). Network Cross-Validation for Determining the Number of Communities in Network Data. eprint arXiv:1411.1715.

3 D. F. Saldana, Y. Yu and Y. Feng (2014). How Many Communities Are There? eprint arXiv:1412.1684.

4 Celisse, A., Daudin, J. and Pierre, L. (2012). Consistency of maximum-likelihood and variational estimators in the Stochastic Block Model. Electron. J. Statist. 6 1847-1899.

5 Y. X. Rachel Wang and Peter J. Bickel (2015). Likelihood-based Model Selection for Stochastic Block Models. Submitted to the Annals of Statistics.
