MOSES: Community finding using Model-based Overlapping Seed ExpanSion

47
Model-based Overlapping Seed ExpanSion (MOSES) Aaron McDaid and Neil Hurley. This research was supported by Science Foundation Ireland (SFI) Grant No. 08/SRC/I1407. Clique: Graph & Network Analysis Cluster School of Computer Science & Informatics University College Dublin, Ireland

Upload: aaronmcdaid

Post on 29-Aug-2014


DESCRIPTION

Presented at ASONAM 2010 by Aaron McDaid, describing a new model and algorithm for overlapping community finding. Location: University of Sour

TRANSCRIPT

Page 1: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Model-based Overlapping Seed ExpanSion (MOSES)

Aaron McDaid and Neil Hurley. This research was supported by Science Foundation Ireland (SFI) Grant No. 08/SRC/I1407.

Clique: Graph & Network Analysis Cluster, School of Computer Science & Informatics

University College Dublin, Ireland

Page 2: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Overview

• Community finding
• The MOSES model
• The MOSES algorithm
• Evaluation
• Scalability
• Other/future work

August 7, 2010 2

Page 3: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Communities


Page 4: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Facebook

• Traud et al. Community Structure in Online Collegiate Social Networks
• M. Salter-Townshend and T.B. Murphy. Variational Bayesian Inference for the Latent Position Cluster Model
• Marlow et al. Maintained relationships on Facebook


Page 5: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Communities

• Some nodes assigned to multiple communities.
• Most edges assigned to just one community.
• Multiple researchers have found Facebook members being in 6 or 7 communities.


Page 6: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Communities

• A partition will break some of the communities in that simple example.
• Graclus breaks synthetic communities with low levels of overlap. (A. Lancichinetti and S. Fortunato, Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities.)
• Graclus breaks communities found by MOSES in Facebook networks. (Traud et al., Community Structure in Online Collegiate Social Networks)
• Modularity has known problems, but we need to go further and move on from partitioning.


Page 7: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Facebook

• Traud et al.'s five university networks.
• Average of 7 communities per node.


Page 8: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Community finding

A general-purpose community finding algorithm must allow:

• Each node to be assigned to any number of communities.
• Pervasive overlap. (Ahn et al., Link communities reveal multiscale complexity in networks, Nature.)
• The intersection (number of shared nodes) between a pair of communities can vary. It can be small, even when the number of communities per node is high.


Page 9: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

MOSES

• MOSES deals only with undirected, unweighted networks.
• No attributes or weights are associated with nodes or edges.


Page 10: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

The MOSES model

A model in which:

• Every pair of nodes has a chance of having an edge.
• Edges are independent for each pair of nodes, given the communities, but the probability is higher for pairs that share one or more communities.
• (This is an OSBM, an overlapping stochastic block model: Latouche et al., Annals of Applied Statistics, http://www.imstat.org/aoas/next_issue.html.)


Page 11: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

MOSES model

Ignoring the observed edges for now. Just consider the nodes and a (proposed) set of communities.


Page 12: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

MOSES model

These communities create probabilities for the edges.

$P(v_1 \sim v_2) = p_{out}$ where the two vertices do NOT share a community.

$P(v_1 \sim v_2) = 1 - (1 - p_{out})(1 - p_{in})$ where the two vertices do share 1 community.


Page 13: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

MOSES model

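These communities create probabilities for the edges. Writing $q_{out} = 1 - p_{out}$ and $q_{in} = 1 - p_{in}$, the probability that two vertices are NOT joined by an edge is:

$P(v_1 \not\sim v_2) = q_{out}$ where the two vertices do NOT share a community.

$P(v_1 \not\sim v_2) = q_{out} \, q_{in}$ where the two vertices do share 1 community.

$P(v_1 \not\sim v_2) = q_{out} \, q_{in}^{s(v_1, v_2)}$ in general, where $s(v_1, v_2)$ is the number of communities shared by $v_1$ and $v_2$.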
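
The edge probabilities on this slide can be written as a few lines of Python. This is only an illustrative sketch: the function names `shared_communities` and `edge_probability` are hypothetical, and communities are assumed to be plain Python sets of node identifiers.

```python
def shared_communities(v1, v2, communities):
    """s(v1, v2): the number of communities that contain both vertices."""
    return sum(1 for c in communities if v1 in c and v2 in c)


def edge_probability(shared, p_in, p_out):
    """P(v1 ~ v2) for a pair of vertices sharing `shared` communities.

    The pair is NOT connected with probability q_out * q_in ** shared,
    where q_out = 1 - p_out and q_in = 1 - p_in.
    """
    q_out = 1.0 - p_out
    q_in = 1.0 - p_in
    return 1.0 - q_out * q_in ** shared


# Example: two overlapping communities on five nodes.
communities = [{0, 1, 2}, {1, 2, 3, 4}]
s = shared_communities(1, 2, communities)  # nodes 1 and 2 share both
p = edge_probability(s, p_in=0.4, p_out=0.005)
```

With `shared = 0` the function reduces to `p_out`, and with `shared = 1` it reduces to `1 - (1 - p_out)(1 - p_in)`, matching the two special cases on the previous slide.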


Page 14: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

MOSES model

• We now have a model that, for a given set of communities, assigns probabilities for edges.
• $P(g \mid z, p_{in}, p_{out})$
• $g$ is the observed graph of nodes and edges. $z$ is the proposed set of communities.
• How do we match that with the observed edges to get a good estimate of the set of communities?
• Naive approach: find $(z, p_{in}, p_{out})$ that maximizes $P(g \mid z, p_{in}, p_{out})$.
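
To make the quantity $P(g \mid z, p_{in}, p_{out})$ concrete, here is a minimal sketch of its logarithm, assuming edges are independent given the communities as stated earlier. The function name `log_likelihood` and the data layout (edges as a set of `frozenset` pairs, communities as a list of sets) are assumptions made for illustration; the loop visits every pair of nodes, so it is only suitable for small graphs.

```python
import math
from itertools import combinations


def log_likelihood(nodes, edges, communities, p_in, p_out):
    """log P(g | z, p_in, p_out), summed over all unordered node pairs.

    nodes:       list of hashable node ids
    edges:       set of frozenset({u, v}) pairs (undirected, unweighted)
    communities: list of sets of nodes (the proposed z)
    """
    q_in, q_out = 1.0 - p_in, 1.0 - p_out
    total = 0.0
    for u, v in combinations(list(nodes), 2):
        shared = sum(1 for c in communities if u in c and v in c)
        p_no_edge = q_out * q_in ** shared
        p = 1.0 - p_no_edge if frozenset((u, v)) in edges else p_no_edge
        if p <= 0.0:
            # The observed pair is impossible under these parameters.
            return float("-inf")
        total += math.log(p)
    return total


# Example: a triangle explained by a single community.
nodes = [0, 1, 2]
edges = {frozenset(pair) for pair in [(0, 1), (0, 2), (1, 2)]}
value = log_likelihood(nodes, edges, [{0, 1, 2}], p_in=0.5, p_out=0.01)
```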


Page 16: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

MOSES model

• $P(g \mid z, p_{in}, p_{out})$ is maximized when $p_{in} = 1$, $p_{out} = 0$, and $z$ is defined as exactly one community around each edge.
• That solution merely restates the edge list, i.e. we don't want to maximize $P(g \mid z, p_{in}, p_{out})$.


Page 17: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

MOSES model

• Instead, consider the posterior: $P(z, p_{in}, p_{out} \mid g)$


Page 18: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

MOSES model

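• Apply Bayes' Theorem:
• $P(z, p_{in}, p_{out} \mid g) \propto P(g \mid z, p_{in}, p_{out}) \, P(z) \, P(p_{in}, p_{out})$
• $P(z) \sim k! \prod_{1 \le i \le k} \left( \frac{1}{N+1} \cdot \frac{1}{\binom{N}{n_i}} \right)$
• where $k$ is the number of communities, and $n_i$ is the number of nodes in community $i$.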
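
As a rough illustration of how this prior behaves, the sketch below evaluates $\log P(z)$ up to an additive constant. It assumes that $N$ is the total number of nodes in the network, which the slide does not state explicitly; the names `log_binomial` and `log_prior` are hypothetical.

```python
import math


def log_binomial(n, r):
    """log of the binomial coefficient C(n, r), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)


def log_prior(community_sizes, num_nodes):
    """Unnormalized log P(z) = log k! - sum_i [log(N + 1) + log C(N, n_i)].

    community_sizes: the sizes n_1, ..., n_k of the proposed communities
    num_nodes:       N, assumed here to be the number of nodes in the graph
    """
    k = len(community_sizes)
    total = math.lgamma(k + 1)  # log k!
    for n_i in community_sizes:
        total -= math.log(num_nodes + 1)
        total -= log_binomial(num_nodes, n_i)
    return total


# Example: each extra community costs prior probability, so a community
# is only worth adding if it raises the likelihood by more than that cost.
one = log_prior([20], num_nodes=2000)
two = log_prior([20, 20], num_nodes=2000)
```

Working in log space with `math.lgamma` avoids overflow in the factorials and binomial coefficients for realistic values of $N$.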


Page 20: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

MOSES model

• We can correctly integrate out the number of communities, $k$, and search across the resulting varying-dimensional space.
• No need for model selection, e.g. BIC.


Page 21: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

MOSES Algorithm

• For the MOSES algorithm, we chose to look at the joint distribution over $(z, p_{in}, p_{out})$ and aim to maximize it.
• The algorithm is a heuristic, approximate algorithm, and we do not claim that it finds the maximum.


Page 22: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

MOSES Algorithm

• Choose an edge at random to form a seed, and expand.
• Accept those expanded seeds that contribute positively to the objective; reject the rest.
• Update $p_{in}$, $p_{out}$ based on the graph and the current set of communities.
• Delete communities that don't make a positive contribution to the objective.
• Final fine-tuning that moves nodes one at a time.
• It is not a Markov Chain, nor an EM algorithm. We can make no such guarantees.
• The algorithm reaches at best a local maximum, and may well have strong biases.
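
The first two steps above (seed, expand, accept or reject) can be sketched as follows. This is a heavily simplified illustration and not the authors' implementation: $p_{in}$ and $p_{out}$ are held fixed, the update, deletion and fine-tuning steps are omitted, the objective is recomputed from scratch over all node pairs, seeds are sampled without replacement, and $N$ in the prior is assumed to be the number of nodes. All function names are hypothetical, node ids are assumed to be sortable (e.g. integers), and both probabilities must lie strictly between 0 and 1.

```python
import math
import random
from itertools import combinations


def log_objective(nodes, edges, communities, p_in, p_out):
    """Unnormalized log posterior for fixed p_in, p_out (both in (0, 1))."""
    n = len(nodes)
    q_in, q_out = 1.0 - p_in, 1.0 - p_out
    total = math.lgamma(len(communities) + 1)  # log k!
    for c in communities:
        size = len(c)
        log_binom = (math.lgamma(n + 1) - math.lgamma(size + 1)
                     - math.lgamma(n - size + 1))
        total -= math.log(n + 1) + log_binom
    for u, v in combinations(nodes, 2):
        shared = sum(1 for c in communities if u in c and v in c)
        p_no_edge = q_out * q_in ** shared
        is_edge = frozenset((u, v)) in edges
        total += math.log(1.0 - p_no_edge if is_edge else p_no_edge)
    return total


def expand_seed(seed_edge, nodes, edges, communities, p_in, p_out):
    """Greedily grow a community from one edge.

    Returns the grown community if adding it improves the objective,
    otherwise None (the seed is rejected).
    """
    neighbours = {v: set() for v in nodes}
    for e in edges:
        u, v = tuple(e)
        neighbours[u].add(v)
        neighbours[v].add(u)
    baseline = log_objective(nodes, edges, communities, p_in, p_out)
    community = set(seed_edge)
    best = log_objective(nodes, edges, communities + [community], p_in, p_out)
    while True:
        frontier = set().union(*(neighbours[v] for v in community)) - community
        scored = [(log_objective(nodes, edges, communities + [community | {w}],
                                 p_in, p_out), w) for w in frontier]
        if not scored:
            break
        score, w = max(scored, key=lambda pair: pair[0])
        if score <= best:
            break  # no single node improves the objective any further
        community.add(w)
        best = score
    return community if best > baseline else None


def find_communities(nodes, edges, p_in=0.5, p_out=0.01, n_seeds=20, seed=0):
    """Try up to n_seeds randomly chosen edges as seeds."""
    rng = random.Random(seed)
    edge_list = sorted(tuple(sorted(e)) for e in edges)
    communities = []
    for seed_edge in rng.sample(edge_list, min(n_seeds, len(edge_list))):
        found = expand_seed(seed_edge, nodes, edges, communities, p_in, p_out)
        if found is not None and found not in communities:
            communities.append(found)
    return communities
```

On a toy graph made of two disjoint 4-cliques, each clique is grown from whichever of its edges is tried first, and later seeds inside an already-found clique are rejected because a second copy of the community costs more under the prior than it gains in likelihood.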


Page 24: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Evaluation

Synthetic benchmarks

• Networks created randomly by software.
• Ground truth communities are built in to these networks.
• Check if the algorithms can discover the correct communities when fed the network.
• To measure the similarity between the found communities and the ground truth communities, overlapping NMI is used. (Lancichinetti et al., Detecting the overlapping and hierarchical community structure in complex networks)


Page 25: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Evaluation

• 2000 nodes
• Define hundreds of communities.
• Each community contains 20 nodes chosen at random from the 2000 nodes.
• Some nodes may be assigned to many communities. Some may not be assigned to a community.
• $p_{in} = 0.4$. About 40% of the pairs of nodes that share a community are then joined.
• $p_{out} = 0.005$. Finally, a small amount of background noise is added.
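
A benchmark of this kind could be generated along the following lines. This is a sketch under assumptions rather than the software used for the talk: the function name `make_benchmark` is hypothetical, and the default of 200 communities is just one arbitrary choice for "hundreds". Each community that a pair shares gives it an independent chance $p_{in}$ of being joined, and every pair gets a further independent chance $p_{out}$.

```python
import random
from itertools import combinations


def make_benchmark(n_nodes=2000, n_communities=200, community_size=20,
                   p_in=0.4, p_out=0.005, seed=0):
    """Return (communities, edges) for one random synthetic network.

    communities: list of sets of node ids (the ground truth)
    edges:       set of (u, v) tuples with u < v
    """
    rng = random.Random(seed)
    nodes = range(n_nodes)
    # Each community is community_size nodes drawn at random, so nodes
    # can land in many communities or in none at all.
    communities = [set(rng.sample(nodes, community_size))
                   for _ in range(n_communities)]
    edges = set()
    # Join pairs inside each community with probability p_in.
    for c in communities:
        for u, v in combinations(sorted(c), 2):
            if rng.random() < p_in:
                edges.add((u, v))
    # Background noise: join any pair with probability p_out.
    for u, v in combinations(nodes, 2):
        if rng.random() < p_out:
            edges.add((u, v))
    return communities, edges


communities, edges = make_benchmark(n_nodes=200, n_communities=20)
average_overlap = sum(len(c) for c in communities) / 200
```

The average overlap (communities per node) is `n_communities * community_size / n_nodes`, so varying `n_communities` sweeps the quantity on the x-axis of the evaluation plot.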


Page 26: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Evaluation

[Figure: NMI (y-axis, 0.0 to 1.0) against average overlap (x-axis, 1 to 15) for 20-node communities with $p_{in} = 0.4$, $p_{out} = 0.005$. Methods compared: MOSES, LFM (default), LFM (last collection), GCE, Louvain method, COPRA, 5-clique percolation, 4-clique percolation (dashed), Iterative Scan (dashed).]


Page 27: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Evaluation, LFR benchmarks

[Figure: NMI (y-axis, 0.0 to 1.0) against communities per node (x-axis, 1 to 10); degree = 15, 15 ≤ c ≤ 60. Methods compared: MOSES, LFM2-firstCol, LFM2-lastCol, GCE, SCP-3, Louvain method, COPRA, SCP-4.]


Page 28: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Evaluation, LFR benchmarks

[Figure: NMI (y-axis, 0.0 to 1.0) against communities per node (x-axis, 1 to 10); degree ∼ 15, maxdegree = 45, 15 ≤ c ≤ 60. Methods compared: MOSES, LFM2-firstCol, LFM2-lastCol, GCE, Louvain method, COPRA, SCP-4.]


Page 29: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Facebook

[Figure: density (y-axis, 0.0 to 0.4) against degree (x-axis, 1 to 500).]


Page 30: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Facebook

[Figure: density (y-axis, 0.0 to 0.5) against communities per person (x-axis, 1 to 100).]


Page 31: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Facebook

[Figure: density (y-axis, 0.0 to 0.6) against size of community (x-axis, 1 to 500). Legend: Oklahoma, Princeton, UNC, Georgetown, Caltech.]


Page 32: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Facebook

[Figure: communities per node (y-axis, 0 to 70) against degree (x-axis, 0 to 1200), with a colour scale of counts running from 1 to 1142.]


Page 33: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Facebook

Table: Summary of Traud et al.'s five university Facebook datasets, and of MOSES's output.

                     Caltech  Princeton  Georgetown     UNC  Oklahoma
Edges                  16656     293320      425638  766800    892528
Nodes                    769       6596        9414   18163     17425
Average Degree          43.3       88.9        90.4    84.4     102.4
Communities found         62        832        1284    2725      3073
Average Overlap         3.29       6.28        6.67    6.96      7.46
MOSES runtime (s)         41        553         839    1585      2233
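
As a quick consistency check on the table, the average degree of an undirected network is twice the number of edges divided by the number of nodes. The short sketch below recomputes that row from the Edges and Nodes rows; the dictionary layout and the name `average_degree` are just illustrative choices.

```python
# (edges, nodes) for each dataset, copied from the table above.
datasets = {
    "Caltech": (16656, 769),
    "Princeton": (293320, 6596),
    "Georgetown": (425638, 9414),
    "UNC": (766800, 18163),
    "Oklahoma": (892528, 17425),
}


def average_degree(n_edges, n_nodes):
    """Each undirected edge contributes to the degree of two nodes."""
    return 2 * n_edges / n_nodes


for name, (n_edges, n_nodes) in datasets.items():
    print(f"{name}: {average_degree(n_edges, n_nodes):.1f}")
```

Rounded to one decimal place, the recomputed values agree with the Average Degree row.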


Page 34: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Scalability

[Figure: running time in seconds (y-axis, log scale, 1e-02 to 1e+02) against communities per node (x-axis, 1 to 10); degree ∼ 15, maxdegree = 45, 15 ≤ c ≤ 60. Methods compared: MOSES, LFM2-firstCol, LFM2-lastCol, GCE, Louvain method (labelled "blondel"), COPRA, SCP-4.]


Page 35: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Scalability

• In general, community finding means overlapping community finding (in my interpretation).
• Partitioning breaks communities.
• So, partitioning is scalable, but partitioning doesn't help with community finding.
• Challenge: a very scalable algorithm that can credibly claim to be a community-finding algorithm.


Page 39: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Other/future research

• Markov Chain Monte Carlo
• Working with Prof. Brendan Murphy on an MCMC method.
• Very different algorithm, which allows us to investigate the model directly.
• MOSES algorithm may have many biases we'll never fully grasp.
• Different model (still an OSBM) where each community has its own internal-connection probability.
• MOSES breaks down on synthetic data if the communities are not equally dense ($p_{in}$).
• Draw from this distribution: $P(z, p_{out}, p_1, p_2, p_3, \ldots \mid g)$
• Multiple MCMC chains, where chains propose splits/merges to each other.
• (Modern) statisticians are innovative about scalability, e.g. Hybrid Monte Carlo.


Page 44: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Take home messages

• Community finding should be about discovering structure, not forcing the structure. Overlapping, hierarchy, et cetera.
• MOSES is a proof of concept: we show that quality results, overlapping communities, and scalability are not incompatible.
• Very scalable community finding algorithms don't exist. This is an interesting challenge.


Page 47: MOSES: Community finding using Model-based Overlapping Seed ExpanSion

Acknowledgments

This research was supported by Science Foundation Ireland (SFI) Grant No. 08/SRC/I1407.

• http://clique.ucd.ie/software
• http://www.aaronmcdaid.com
• [email protected], [email protected]
