
Generative Network Models and Community Detection Algorithms on Social Networks

Jordan Hartzell

October 31, 2019

1 Introduction

The field of network science is rooted in the study of graph theory, made popular by Leonhard Euler’s “Königsberg bridge problem.” In this problem, Euler sought to prove that there did not exist a path between four land masses which crossed the seven bridges between them exactly one time each [1]. Although graph theory provided interesting insights in the field of topology, it wasn’t until the twentieth century that it was revived (and network theory arose) in mathematics literature.

Today, network science tackles much larger, more complex systems across varying fields: the dawn of the Internet Age, the creation of large-scale social networking platforms, and the availability of in-depth biological data have allowed for both the imposition of structure upon and the discovery of inherent structure within new and sprawling systems. These studies seek to find “patterns of connections between components” in a complex system, which is distinct from the perhaps narrower task of studying either the individual components of a system or the nature of the connections between components [3].

[Figure: (a) Euler bridge. (b) Very large network.]

In this project, we will study patterns in the structure of social networks. So far, we have explored network formation algorithms including the Erdős-Rényi random graph, the small-world model, and the preferential attachment model. We are particularly interested in social networks; we will next learn about algorithms for community detection in complex social networks.

2 Background

We have discussed three models of network formation in order to better understand both the history of network science and the motivations for designing effective community detection algorithms. These include the Erdős-Rényi model, the small-world model, and the preferential attachment model.

2.1 Erdős-Rényi Model

The Erdős-Rényi model of random graphs is defined in the 1959 paper “On Random Graphs I” [2]. The model defines a random graph as one in which there exist n vertices, and the probability that an undirected edge (i, j) (a link between vertices i and j) appears in the graph is equal to p, independently for each edge. This model is particularly interesting because as the number of vertices grows large, the degree distribution of the network tends toward a Poisson distribution. The degree distribution provides a rough idea of the network’s local topology. For finite n, the degree of a vertex in a random graph follows a Binomial distribution B(n − 1, p), where n is the number of nodes in the network and p is the probability that an edge will be present.
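As a minimal sketch of this definition (our own illustration, not code from [2]), the following Python samples a graph by flipping an independent coin with bias p for every pair of vertices. The function names `gnp_random_graph` and `mean_degree`, the adjacency-set representation, and the parameter values are assumptions made for the example. The observed mean degree should land close to p(n − 1).

```python
import random
from itertools import combinations


def gnp_random_graph(n, p, seed=None):
    """Sample an undirected random graph on n vertices.

    Each of the n*(n-1)/2 possible edges is included independently
    with probability p. Returns a dict mapping vertex -> set of neighbors.
    """
    rng = random.Random(seed)
    adjacency = {v: set() for v in range(n)}
    for i, j in combinations(range(n), 2):
        if rng.random() < p:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def mean_degree(adjacency):
    """Average number of neighbors per vertex."""
    return sum(len(nbrs) for nbrs in adjacency.values()) / len(adjacency)


# Illustrative parameters (assumed, not from the text).
n, p = 500, 0.02
graph = gnp_random_graph(n, p, seed=1)
expected = p * (n - 1)          # theoretical mean degree p(n - 1)
observed = mean_degree(graph)   # empirical mean degree of this sample
print(f"expected mean degree {expected:.2f}, observed {observed:.2f}")
```

The double loop over vertex pairs costs O(n²) time, which is fine for a demonstration but would need a sparser sampling scheme for very large n.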

In seeking to better understand the motivation for the development of other network models, we proved the convergence of the degree distribution in a random graph from a Binomial distribution to a Poisson distribution. This result motivates the study of other network models because the Poisson distribution is not often found in the study of large-scale, real-world networks. Note: it is helpful to use the following fact in this proof: ⟨k⟩ = p(n − 1), where ⟨k⟩ denotes the mean degree.

Proposition 1. The degree distribution of a random graph G(n, p) tends toward a Poisson distribution as n grows to infinity with the mean degree ⟨k⟩ = p(n − 1) held fixed.

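Proof.

\begin{align*}
\lim_{n\to\infty} P(|N(i)| = k)
  &= \lim_{n\to\infty} \binom{n-1}{k} p^{k} (1-p)^{n-1-k} \\
  &= \lim_{n\to\infty} \binom{n-1}{k} \left(\frac{\langle k\rangle}{n-1}\right)^{k} \left(1-\frac{\langle k\rangle}{n-1}\right)^{n-1-k} \\
  &= \lim_{n\to\infty} \frac{(n-1)(n-2)\cdots(n-k)}{k!} \left(\frac{\langle k\rangle}{n-1}\right)^{k} \left(1-\frac{\langle k\rangle}{n-1}\right)^{n-1-k} \\
  &= \lim_{n\to\infty} \frac{n^{k} + O(n^{k-1})}{(n-1)^{k}} \, \frac{\langle k\rangle^{k}}{k!} \left(1-\frac{\langle k\rangle}{n-1}\right)^{n-1-k} \\
  &= (1) \lim_{n\to\infty} \frac{\langle k\rangle^{k}}{k!} \left(1-\frac{\langle k\rangle}{n-1}\right)^{n-1-k} \\
  &= \frac{\langle k\rangle^{k}}{k!} \, e^{-\langle k\rangle}
\end{align*}

□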

Using the limit definition of the exponential function, \lim_{m\to\infty} (1 − x/m)^m = e^{−x}, we see that the final expression is the probability mass function of a Poisson distribution with mean ⟨k⟩. Thus we have proven the above proposition stating that the degree distribution of a random graph follows a Poisson distribution in the limit of a large number of vertices.
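The convergence can also be checked numerically. The sketch below is our own illustration, with assumed function names and an assumed mean degree of 4. It holds ⟨k⟩ fixed, sets p = ⟨k⟩/(n − 1), and measures the largest pointwise gap between the Binomial B(n − 1, p) and Poisson probability mass functions for increasing n.

```python
import math


def binomial_pmf(k, trials, p):
    """P(X = k) for X ~ Binomial(trials, p)."""
    return math.comb(trials, k) * p**k * (1 - p) ** (trials - k)


def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean)."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def max_pmf_gap(n, mean_degree, k_max=30):
    """Largest |Binomial - Poisson| pmf difference over k = 0..k_max.

    The edge probability is chosen so that p(n - 1) equals mean_degree.
    Requires k_max <= n - 1.
    """
    p = mean_degree / (n - 1)
    return max(
        abs(binomial_pmf(k, n - 1, p) - poisson_pmf(k, mean_degree))
        for k in range(k_max + 1)
    )


# Illustrative values (assumed): mean degree 4, three network sizes.
gaps = {n: max_pmf_gap(n, mean_degree=4.0) for n in (50, 500, 5000)}
for n, gap in gaps.items():
    print(f"n = {n:5d}: max pmf gap = {gap:.2e}")
```

The gap shrinks as n grows, which is what Proposition 1 predicts.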

Two other network models were particularly relevant in our initial research: the small-world model and the preferential attachment model.

2.2 Small-World Model

The small-world model captures another interesting characteristic of real-world networks: a high clustering coefficient. A graph’s clustering coefficient measures the density of connections between vertices by calculating the probability that the neighbors of a vertex are also connected to one another by an edge [3]. Understandably, this is an important characteristic in detecting communities in very large networks. Psychologist Stanley Milgram famously conducted the “small-world experiment” to test whether the distance between two vertices, or actors, was actually small in real-world networks, as theory predicted. In this experiment, Milgram mailed “passports” to 96 people in Omaha, Nebraska, including instructions that asked the recipient to mail the passport to someone they knew on a first-name basis who would have the best chance of getting the passport back to Boston, where Milgram studied. The passport contained only the end target’s name, address, and occupation. Of the 18 returned passports, the average number of steps taken from Omaha to Boston (as marked in the passport) was around 6. Thus this experiment supported the idea that very large networks (including the network of acquaintances in America) have a small distance between any two nodes (people), or that there are “6 degrees of separation” between any two people in the world [3].

The following procedure yields a small-world network: Start with a lattice structure with periodic boundary conditions in which each vertex has degree c. Go through each edge in the graph and, with probability p, rewire the edge by moving one of its ends to a vertex chosen uniformly at random. For small but nonzero p, this procedure produces a network that has both a high clustering coefficient, as is present in lattices, and a short average path length, as is characteristic of an Erdős-Rényi random graph. The small-world model’s expression of both of these attributes makes it an interesting model to study.
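The rewiring procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: we take the lattice to be a one-dimensional ring with even c, and when rewiring we exclude self-loops and duplicate edges, a detail the description above leaves open. The names `small_world_graph` and `average_clustering` are our own.

```python
import random
from itertools import combinations


def small_world_graph(n, c, p, seed=None):
    """Ring lattice on n vertices (each linked to its c nearest neighbors),
    with every lattice edge rewired with probability p."""
    if c % 2 != 0 or c >= n:
        raise ValueError("c must be even and smaller than n")
    rng = random.Random(seed)
    adjacency = {v: set() for v in range(n)}
    lattice_edges = []
    # Periodic boundary: vertex v links to v+1, ..., v+c/2 (mod n).
    for v in range(n):
        for offset in range(1, c // 2 + 1):
            w = (v + offset) % n
            adjacency[v].add(w)
            adjacency[w].add(v)
            lattice_edges.append((v, w))
    # Visit each original edge once; with probability p move its far end.
    for v, w in lattice_edges:
        if rng.random() < p:
            # Assumption: forbid self-loops and duplicate edges.
            candidates = [u for u in range(n) if u != v and u not in adjacency[v]]
            if not candidates:
                continue
            new_w = rng.choice(candidates)
            adjacency[v].discard(w)
            adjacency[w].discard(v)
            adjacency[v].add(new_w)
            adjacency[new_w].add(v)
    return adjacency


def average_clustering(adjacency):
    """Mean over vertices of the fraction of neighbor pairs that are linked.
    Vertices with fewer than two neighbors contribute zero."""
    total = 0.0
    for nbrs in adjacency.values():
        if len(nbrs) < 2:
            continue
        linked = sum(1 for a, b in combinations(nbrs, 2) if b in adjacency[a])
        total += linked / (len(nbrs) * (len(nbrs) - 1) / 2)
    return total / len(adjacency)


# Illustrative parameters (assumed): 200 vertices, degree 4.
lattice = small_world_graph(200, 4, 0.0, seed=0)   # p = 0: pure lattice
rewired = small_world_graph(200, 4, 0.1, seed=7)   # a few random shortcuts
print(average_clustering(lattice), average_clustering(rewired))
```

Because every rewiring removes one edge and adds one edge, the total number of edges (n·c/2) is preserved, so only the placement of edges changes between the two graphs.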

2.3 Preferential Attachment Model

As the number of nodes in many real-world networks grows very large, the degree distribution follows a power-law distribution. In other words, “real networks typically have right-skewed degree distributions, with most vertices having low degree but with a small number of high-degree hubs in the tail of the distribution” [3]. Random graphs do not reproduce this skew, nor do they account for the creation of community structure in networks, since each edge is placed with equal probability. The preferential attachment model of network formation addresses the first of these shortcomings. Derek J. de Solla Price built upon economist Herbert Simon’s study of economic inequality, in which Simon observed power-law distributions in economic data that confirm the accumulation of wealth and the deepening of economic disparity; in other words, how “the rich get richer.” Price called Simon’s idea cumulative advantage and applied it to networks, formulating a model of network generation that yields a network with a degree distribution following a power law in the limit. Price’s model was further developed by Barabási and Albert in 1999; their model produced an undirected network and became known as the preferential attachment model [3].
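To make the “rich get richer” mechanism concrete, here is a small Python sketch in the spirit of the undirected Barabási-Albert variant. The details are our assumptions for illustration, not taken from [3]: the network starts from a complete graph on m + 1 vertices, and each new vertex attaches to m distinct existing vertices chosen with probability proportional to their current degree. The name `preferential_attachment_graph` is hypothetical.

```python
import random


def preferential_attachment_graph(n, m, seed=None):
    """Grow an undirected graph to n vertices; each new vertex adds m edges
    whose targets are drawn with probability proportional to degree."""
    if m < 1 or n < m + 1:
        raise ValueError("need m >= 1 and n >= m + 1")
    rng = random.Random(seed)
    adjacency = {v: set() for v in range(n)}
    # 'endpoints' lists every vertex once per incident edge, so a uniform
    # draw from it selects a vertex with probability proportional to degree.
    endpoints = []
    # Assumed seed network: complete graph on the first m + 1 vertices.
    for i in range(m + 1):
        for j in range(i + 1, m + 1):
            adjacency[i].add(j)
            adjacency[j].add(i)
            endpoints.extend((i, j))
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:          # redraw until m distinct targets
            targets.add(rng.choice(endpoints))
        for t in targets:
            adjacency[new].add(t)
            adjacency[t].add(new)
            endpoints.extend((new, t))
    return adjacency


# Illustrative parameters (assumed): 2000 vertices, 2 edges per new vertex.
pa_graph = preferential_attachment_graph(2000, 2, seed=3)
degrees = sorted((len(nbrs) for nbrs in pa_graph.values()), reverse=True)
print("five largest degrees:", degrees[:5])
print("median degree:", degrees[len(degrees) // 2])
```

In a typical run, a handful of early vertices end up with degrees far above the median, which is the right-skewed pattern the quotation from [3] describes.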

3 Proposed Methodology

This semester, we plan to follow Newman’s text, Networks: An Introduction, to further explore the theory of network formation. We have already gained some important insights about the ways in which the structure of small networks and the rules of network growth can affect the characteristics of a late-stage network. However, we would like to delve into some of the complex proofs of generative network model characteristics, focusing on those characteristics that relate to the creation of community.

In our analysis of social networks, we are especially interested in both the ways in which communities form among groups of vertices and methods to find and differentiate between these communities in very large networks. This problem is known as community detection.

One algorithm that effectively accomplishes these tasks is the “Louvain algorithm.” This algorithm uses a modularity score to measure how densely connected the vertices within a community are, compared with what a random assignment of edges would produce. A low modularity score corresponds to a random assignment of edges between vertices in the network, whereas a high score means that there exists a clustered community between the nodes [4]. Next, we’ll read “Fast unfolding of communities in large networks” by Blondel et al., a classical paper in the scope of the problem at hand [5]. Then, we’ll analyze “Finding community structure in very large networks” by Clauset et al., which presents a hierarchical agglomeration algorithm for community detection [6].
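As a hedged illustration of the modularity score itself (not of the Louvain algorithm, which searches for a partition that maximizes it), the sketch below assumes the standard Newman-Girvan definition for an undirected, unweighted graph: for each community, the fraction of edges that fall inside it minus the squared fraction of edge endpoints attached to it. The function name `modularity` and the toy graph of two triangles joined by one edge are our own choices.

```python
def modularity(adjacency, community_of):
    """Modularity Q = sum over communities of L_c/m - (d_c / 2m)^2, where
    m is the number of edges, L_c the number of edges inside community c,
    and d_c the total degree of its vertices. Vertex labels must be
    comparable (integers here) so each edge is counted once."""
    m = sum(len(nbrs) for nbrs in adjacency.values()) / 2
    inside_edges = {}
    degree_sum = {}
    for v, nbrs in adjacency.items():
        c = community_of[v]
        degree_sum[c] = degree_sum.get(c, 0) + len(nbrs)
        for w in nbrs:
            if v < w and community_of[w] == c:
                inside_edges[c] = inside_edges.get(c, 0) + 1
    return sum(
        inside_edges.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Toy graph (assumed): triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
two_triangles = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
q_split = modularity(two_triangles, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
q_single = modularity(two_triangles, {v: "all" for v in two_triangles})
q_mixed = modularity(two_triangles, {0: "a", 2: "a", 4: "a", 1: "b", 3: "b", 5: "b"})
print(q_split, q_single, q_mixed)
```

Splitting the graph into its two triangles scores highest of the three partitions, placing every vertex in one community scores zero, and the partition that cuts across both triangles scores below zero.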

References

[1] JESUS NAJERA. “Graph Theory – History and Overview.” https://towardsdatascience.com/graph-theory-history-overview-f89a3efc0478

[2] PAUL ERDŐS, ALFRÉD RÉNYI. 1959. “On random graphs I.”

[3] M.E.J. NEWMAN. Networks: An Introduction, Oxford University Press, 2010.

[4] “6.1 The Louvain Algorithm,” Neo4j Graph Algorithms library. https://neo4j.com/docs/graph-algorithms/current/algorithms/louvain/

[5] VINCENT D. BLONDEL, JEAN-LOUP GUILLAUME, RENAUD LAMBIOTTE, ETIENNE LEFEBVRE. 2008. “Fast unfolding of communities in large networks.” University of Louvain. https://arxiv.org/pdf/0803.0476.pdf

[6] AARON CLAUSET, M.E.J. NEWMAN, CRISTOPHER MOORE. 2004. “Finding community structure in very large networks.” Phys. Rev. E 70, 066111. https://arxiv.org/abs/cond-mat/0408187
