[Lecture Notes in Computer Science] Advances in Social Network Mining and Analysis, Volume 5498


Lecture Notes in Computer Science 5498
Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany


Lee Giles, Marc Smith, John Yen, Haizheng Zhang (Eds.)

Advances in Social Network Mining and Analysis

Second International Workshop, SNAKDD 2008, Las Vegas, NV, USA, August 24-27, 2008. Revised Selected Papers


Volume Editors

Lee Giles, John Yen
Pennsylvania State University, College of Information Science and Technology, University Park, PA 16802, USA
E-mail: {giles, jyen}@ist.psu.edu

Marc Smith
Microsoft Research, One Microsoft Way, Redmond, WA 98002, USA
E-mail: [email protected]

Haizheng Zhang
Amazon.com, Seattle, WA, USA
E-mail: [email protected]

Library of Congress Control Number: 2010931753

CR Subject Classification (1998): H.3, H.4, I.2, H.2.8, C.2, H.2

LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues

ISSN 0302-9743
ISBN-10 3-642-14928-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-14928-3 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2010
Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper 06/3180


Preface

This year's volume of Advances in Social Network Mining and Analysis contains the proceedings of the Second International Workshop on Social Network Mining and Analysis (SNAKDD 2008). The annual workshop is co-located with the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). The second SNAKDD workshop was held with KDD 2008 and received more than 32 submissions on social network mining and analysis topics. We accepted 11 regular papers and 8 short papers. Seven of the papers are included in this volume.

In recent years, social network research has advanced significantly, thanks to the prevalence of online social websites and instant messaging systems as well as the availability of a variety of large-scale offline social network systems. These social network systems are usually characterized by complex network structures and rich accompanying contextual information. Researchers are increasingly interested in addressing a wide range of challenges posed by these disparate social network systems, including identifying common static topological properties and dynamic properties during the formation and evolution of these networks, and determining how contextual information can help in analyzing the pertinent social networks. These issues have important implications for community discovery, anomaly detection, and trend prediction, and can enhance applications in multiple domains such as information retrieval, recommendation systems, and security.

The second SNAKDD workshop focused on knowledge discovery and data mining in social networks, including contextual community discovery, link analysis, the growth and evolution of social networks, algorithms for large-scale graphs, techniques for recovering and constructing social networks from online social systems, search on social networks, multi-agent-based social network simulation, trend prediction of social network evolution, and related applications in other domains such as information retrieval and security. The workshop was concerned with interdisciplinary and cross-domain studies spanning a variety of areas in computer science, including graph and data mining, machine learning, computational organizational and multi-agent studies, information extraction and retrieval, and security, as well as other disciplines such as information science and social science.

In the first paper, "Leveraging Label-Independent Features for Classification in Sparsely Labeled Networks: An Empirical Study," Brian Gallagher and Tina Eliassi-Rad study the problem of within-network classification in sparsely labeled networks. The authors present an empirical study showing that the use of label-independent (LI) features produces classifiers that are less sensitive to specific label assignments and can lead to significant performance improvements.


In the second paper, "Community Detection Using a Measure of Global Influence," Rumi Ghosh and Kristina Lerman define "influence" as the number of paths, of any length, that exist between two nodes, and argue that this gives a better measure of network connectivity. The authors use the influence metric to partition a network into groups or communities by looking for regions of the network where nodes have more influence over each other than over nodes outside the community.

The third paper, "Communication Dynamics of Blog Networks," by Mark Goldberg, Malik Magdon-Ismail, Stephen Kelley, Konstantin Mertsalov, and William Wallace, studies the communication dynamics of blog networks by exploring the Russian section of LiveJournal. The paper addresses two fundamental questions: (1) what models adequately describe such dynamic communication behavior; and (2) how does one detect changes in the nature of the communication dynamics. The paper leverages stable statistics to uncover the dynamics of the networks and characterize their locality properties.

In the fourth paper, "Finding Spread Blockers in Dynamic Networks," Habiba, Yintao Yu, Tanya Berger-Wolf, and Jared Saia extend standard structural network measures to dynamic networks. The authors also compare the blocking ability of individuals ranked by the new dynamic measures. They found that, overall, simple ranking according to a node's static degree, or the dynamic version of a node's degree, performed consistently well. Surprisingly, the dynamic clustering coefficient seems to be a good indicator, while its static version performs worse than a random ranking.

The fifth paper, "Social Network Mining with Nonparametric Relational Models," by Zhao Xu, Volker Tresp, Achim Rettinger, and Kristian Kersting, discusses how the infinite hidden relational model (IHRM) can be used to model and analyze social networks, where each edge is associated with a random variable and the probabilistic dependencies between variables are specified by the model based on the relational structure. The hidden variables are able to transport information such that non-local probabilistic dependencies can be obtained. The IHRM provides effective relationship prediction and cluster analysis for social networks. The experiments demonstrate that this model can provide good prediction accuracy and capture the inherent relations among social actors.

In the sixth paper, "Using Friendship Ties and Family Circles for Link Prediction," the authors (Elena Zheleva, Lise Getoor, Jennifer Golbeck, and Ugur Kuter) investigate how networks can be overlaid and propose a feature taxonomy for link prediction. The authors show that the accuracy of link prediction can be improved when there are tightly knit family circles in a social network. Their experiments demonstrate significantly higher prediction accuracy compared to using traditional features such as descriptive node attributes and structural features.

The last paper, "Information Theoretic Criteria for Community Detection," by Karl Branting, studies the resolution limit problem with two compression-based algorithms that were designed to overcome such limits. The author identifies the aspect of each approach that is responsible for the resolution limit and proposes a variant, SGE, that addresses this limitation. The paper demonstrates on three artificial data sets that (1) SGE does not exhibit a resolution limit on graphs on which other approaches do, and that (2) modularity and the compression-based algorithms, including SGE, behave similarly on graphs not subject to the resolution limit.

We would like to thank the authors of all submitted papers for both the joint workshop and this proceedings volume. We are further indebted to the Program Committee members for their rigorous and timely reviewing. They allowed us to make this workshop a major success.

Lee Giles
Marc Smith
John Yen
Haizheng Zhang


Organization

Program Chairs

Lee Giles, Pennsylvania State University, USA
Marc Smith, Microsoft, USA
John Yen, Pennsylvania State University, USA
Haizheng Zhang, Amazon.com, USA

Program Committee

Lada Adamic
Aris Anagnostopoulos
Arindam Banerjee
Tanya Berger-Wolf
Yun Chi
Aaron Clauset
Isaac Councill
Tina Eliassi-Rad
Lise Getoor
Mark Goldberg
Larry Holder
Andreas Hotho
Gueorgi Kossinets
Kristina Lerman
Wei Li
Yi Liu
Ramesh Nallapati
Jennifer Neville
Cheng Niu
Dou Shen
Bingjun Sun
Jie Tang
Andrea Tapia
Alessandro Vespignani
Xuerui Wang
Michael Wurst
Xiaowei Xu


Referees

Vladimir Barash
Mustafa Bilgic
Matthias Broecheler
Guihong Cao
Bin Cao
Sanmay Das
Anirban Dasgupta
Robert Jäschke
Liu Liu
Galileo Namata
Evan Xiang
Xiaowei Xu
Limin Yao
Jing Zhang
Yi Zhang


Table of Contents

Leveraging Label-Independent Features for Classification in Sparsely Labeled Networks: An Empirical Study . . . . . . 1
Brian Gallagher and Tina Eliassi-Rad

Community Detection Using a Measure of Global Influence . . . . . . 20
Rumi Ghosh and Kristina Lerman

Communication Dynamics of Blog Networks . . . . . . 36
Mark Goldberg, Stephen Kelley, Malik Magdon-Ismail, Konstantin Mertsalov, and William (Al) Wallace

Finding Spread Blockers in Dynamic Networks . . . . . . 55
Habiba, Yintao Yu, Tanya Y. Berger-Wolf, and Jared Saia

Social Network Mining with Nonparametric Relational Models . . . . . . 77
Zhao Xu, Volker Tresp, Achim Rettinger, and Kristian Kersting

Using Friendship Ties and Family Circles for Link Prediction . . . . . . 97
Elena Zheleva, Lise Getoor, Jennifer Golbeck, and Ugur Kuter

Information Theoretic Criteria for Community Detection . . . . . . 114
L. Karl Branting

Author Index . . . . . . 131


Leveraging Label-Independent Features for Classification in Sparsely Labeled Networks: An Empirical Study

Brian Gallagher and Tina Eliassi-Rad

Lawrence Livermore National Laboratory
P.O. Box 808, L-560, Livermore, CA 94551, USA

{bgallagher,eliassi}@llnl.gov

Abstract. We address the problem of within-network classification in sparsely labeled networks. Recent work has demonstrated success with statistical relational learning (SRL) and semi-supervised learning (SSL) on such problems. However, both approaches rely on the availability of labeled nodes to infer the values of missing labels. When few labels are available, the performance of these approaches can degrade. In addition, many such approaches are sensitive to the specific set of nodes labeled. So, although average performance may be acceptable, the performance on a specific task may not be. We explore a complementary approach to within-network classification, based on the use of label-independent (LI) features, i.e., features calculated without using the values of class labels. While previous work has made some use of LI features, the effects of these features on classification performance have not been extensively studied. Here, we present an empirical study in order to better understand these effects. Through experiments on several real-world data sets, we show that the use of LI features produces classifiers that are less sensitive to specific label assignments and can lead to performance improvements of over 40% for both SRL- and SSL-based classifiers. We also examine the relative utility of individual LI features and show that, in many cases, it is a combination of a few diverse network-based structural characteristics that is most informative.

Keywords: Statistical relational learning; semi-supervised learning; social network analysis; feature extraction; collective classification.

1 Introduction

In this paper, we address the problem of within-network classification. We are given a network in which some of the nodes are "labeled" and others are "unlabeled" (see Figure 1). Our goal is to assign the correct labels to the unlabeled nodes from among a set of possible class labels (i.e., to "classify" them). For example, we may wish to identify cell phone users as either 'fraudulent' or 'legitimate.'

Fig. 1. Portion of the MIT Reality Mining call graph. We know the class labels for the black (dark) nodes, but do not have labels for the yellow (light) nodes.

Cell phone fraud is an example of an application where networks are often very sparsely labeled. We may have a handful of known fraudsters and a handful of known legitimate users, but for the vast majority of users, we do not know the correct label. For such applications, it is reasonable to expect that we may have access to labels for fewer than 10%, 5%, or even 1% of the nodes. In addition, cell phone networks are generally anonymized. That is, nodes in these networks often contain no attributes besides class labels that could be used to identify them. It is this kind of sparsely labeled, anonymized network that is the focus of this work. Put another way, our work focuses on univariate within-network classification in sparsely labeled networks.

Relational classifiers have been shown to perform well on network classification tasks because of their ability to make use of dependencies between class labels (or attributes) of related nodes [1]. However, because of their dependence on class labels, the performance of relational classifiers can substantially degrade when a large proportion of neighboring instances are also unlabeled. In many cases, collective classification provides a solution to this problem, by enabling the simultaneous classification of a number of related instances [2]. However, previous work has shown that the performance of collective classification can also degrade when there are too few labels available, eventually to the point where classifiers perform better without it [3].

In this paper, we explore another source of information present in networks that does not depend on the availability or accuracy of node labels. Such information can be represented using what we call label-independent (LI) features. The main contribution of this paper is an in-depth examination of the effects of label-independent features on within-network classification. In particular, we address the following questions:

1. Can LI features make up for a lack of information due to sparsely labeled data? Answer: Yes.


2. Can LI features provide information above and beyond that provided by the class labels? Answer: Yes.

3. How do LI features improve classification performance? Answer: Because they are less sensitive to the specific labeling assigned to a graph, classifiers that use label-independent features produce more consistent results across prediction tasks.

4. Which LI features are the most useful? Answer: A combination of a few diverse network-based structural characteristics (such as node and link counts plus betweenness) is the most informative.

Section 2 covers related work. Section 3 describes our approach for modeling label-independent characteristics of networks. Sections 4 and 5, respectively, present our experimental design and results. We conclude the paper in Section 6.

2 Related Work

In recent years, there has been a great deal of work on models for learning and inference in relational data (i.e., statistical relational learning or SRL) [3,4,5,6,7]. All SRL techniques make use of label-dependent relational information. Some use label-independent information as well.

Relational Probability Trees (RPTs) [8] use label-independent degree-based features (i.e., neighboring node and link counts). However, existing RPT studies do not specifically consider the impact of label-independent features on classifier performance.

Perlich and Provost [9] provide a nice study on aggregation of relational attributes, based on a hierarchy of relational concepts. However, they do not consider label-independent features.

Singh et al. [10] use descriptive attributes and structural properties (i.e., node degree and betweenness centrality) to prune a network down to its 'most informative' affiliations and relationships for the task of attribute prediction. They do not use label-independent features directly as input to their classifiers.

Neville and Jensen [11] use spectral clustering to group instances based on their link structure (where link density within a group is high and between groups is low). This group information is subsequently used in conjunction with attribute information to learn classifiers on network data.

There has also been extensive work on overcoming label sparsity through techniques for label propagation. This work falls into two research areas: (1) collective classification [2,3,7,12,13,14] and (2) graph-based semi-supervised learning (SSL) [15,16].

Previous work confirms our observation that the performance of collective classification can suffer when labeled data is very sparse [3]. McDowell et al. [14] demonstrate that "cautious" collective classification procedures produce better classification performance than "aggressive" ones. They recommend only propagating information about the top-k most confident predicted labels.


The problem of within-network classification can be viewed as a semi-supervised learning problem. The graph-based approaches to semi-supervised learning are particularly relevant here. In their study, Macskassy and Provost [7] compare the SSL Gaussian Random Field (GRF) model [15] to an SRL weighted-vote relational neighbor (wvRN) model that uses relaxation labeling for collective classification (wvRN+RL). They conclude that the two models are nearly identical in terms of accuracy, although GRF produces slightly better probability rankings. Our results with wvRN+RL and GRF are consistent with this conclusion. The "ghost edge" approach of Gallagher et al. [17] combines aspects of both SRL and SSL, and compares favorably with both wvRN+RL and GRF.

3 Label-Dependent vs. Label-Independent Features

Relational classifiers leverage link structure to improve performance. Most frequently, links are used to incorporate attribute information from neighboring nodes. However, link structure can also be used to extract structural statistics of a node (e.g., the number of adjacent links). We can divide relational features into two categories: label-dependent and label-independent.

Label-dependent (LD) features use both structure and attributes (or labels) of nodes in the network. The most commonly used LD features are aggregations of the class labels of nodes one link away (e.g., the number of neighbors with the class label 'fraudulent'). LD features are the basis for incorporating relational information in many SRL classifiers.

Label-independent (LI) features are calculated using network structure, but not attributes or class labels of nodes. An example of a simple LI feature is the degree of a node (i.e., the number of neighboring nodes). Of course, we assume that there is an underlying statistical dependency between the class label of a node and its LI features. Otherwise, LI features would be of no value in predicting a node's class. However, because they are calculated based only on network structure, LI feature values do not directly depend on the current class label assignments of nodes in a network. This means that, unlike LD features, LI features may be calculated with perfect accuracy regardless of the availability of class label information and are impervious to errors in class label assignments.

3.1 Extracting Label-Independent Features

We consider four LI features on nodes: (1) the number of neighboring nodes, (2) the number of incident links, (3) betweenness centrality, and (4) clustering coefficient. Features 1 and 2, respectively, are node-based and link-based measures of degree. Note that in multigraphs, these two are different. Betweenness centrality measures how "central" a node is in a network, based on the number of shortest paths that pass through it. Clustering coefficient measures neighborhood strength, based on how connected a node's neighbors are to one another. For details, we refer the reader to a study by Mark Newman [18].
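As a concrete illustration (not the authors' code), the following sketch computes these four LI features with networkx; the function name and the use of a simple-graph projection for betweenness and clustering are our own choices.

```python
import networkx as nx

def label_independent_features(G):
    """Return {node: [neighbor_count, link_count, betweenness, clustering_coef]}."""
    H = nx.Graph(G)                              # simple-graph projection (collapses parallel links)
    betweenness = nx.betweenness_centrality(H)   # shortest-path betweenness centrality
    clustering = nx.clustering(H)                # local clustering coefficient
    features = {}
    for v in G.nodes():
        neighbor_count = len(list(G.neighbors(v)))  # number of neighboring nodes
        link_count = G.degree(v)                    # number of incident links (counts parallel links)
        features[v] = [neighbor_count, link_count, betweenness[v], clustering[v]]
    return features
```

In a multigraph, neighbor_count and link_count differ exactly as described in the text; on a simple graph they coincide.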

The success of network-based structural characteristics as predictors of class relies on two assumptions. First, members of different classes play different roles in a network. Second, these roles can be differentiated by structural characteristics. The second assumption is met in many cases. For instance, "popular" nodes can be identified by degree and "important" nodes can be identified by centrality measures. Whether the first assumption is met depends on the class label. Suppose that executives tend to be more popular and central than an average employee in a company's communication network, and that employees with a particular job title tend to have similar popularity and centrality, regardless of department. Then, we would expect structural features to be more useful for identifying executives than members of a particular department.

4 Experimental Design

We have designed our experiments to answer the following questions:

1. Can LI features make up for a lack of information due to sparsely labeled data?

2. Can LI features provide information above and beyond that provided by the class labels?

3. How do LI features improve classification performance?
4. Which LI features are the most useful?

To avoid confounding effects as much as possible, we focus on univariate binary classification tasks, and extend simple classifiers to incorporate label-independent features.

4.1 Classifiers

On each classification task, we ran ten individual classifiers: four variations of a link-based classifier [5], four variations of a relational neighbor classifier [19,7], and two variations of the Gaussian Random Field classifier [15]. We describe each of them below.

nLB is the network-only link-based classifier [5]. It uses logistic regression to model a node's class given the classes of neighboring nodes. To generate features, a node's neighborhood is summarized by the link-weighted count of each class label. For example, given a binary classification task, two features will be generated: (1) the link-weighted count of a node's neighbors with the positive class and (2) the link-weighted count of a node's neighbors with the negative class.
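A minimal sketch of these two label-dependent features, under assumed names; taking the link weight between two nodes to be the number of parallel links is our simplifying assumption.

```python
def nlb_features(G, labels, node):
    """Two label-dependent features for a binary task.

    labels: dict mapping labeled nodes to +1 / -1 (unlabeled nodes absent).
    """
    pos_weight = neg_weight = 0.0
    for nbr in G.neighbors(node):
        if nbr not in labels:
            continue                              # skip unlabeled neighbors
        w = G.number_of_edges(node, nbr)          # assumed link weight: number of parallel links
        if labels[nbr] > 0:
            pos_weight += w
        else:
            neg_weight += w
    return [pos_weight, neg_weight]               # input features for logistic regression
```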

nLBLI is composed of two logistic regression models: (1) nLB with its two LD features and (2) logLI, which uses the four LI features (see Section 3.1). The nLBLI classifier calculates the probability of each class as:

$$P(C) = w \cdot P_{nLB}(C) + (1 - w) \cdot P_{logLI}(C) \qquad (1)$$

where w is calculated based on the individual performance of nLB and logLI over 10-fold cross-validation on the training data. We calculate the area under the ROC curve (AUC) for each fold and then obtain an average AUC score for each classifier, $AUC_{LD}$ and $AUC_{LI}$. We then set w as follows:

$$w = \frac{AUC_{LD}}{AUC_{LD} + AUC_{LI}} \qquad (2)$$
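The combination in Equations (1) and (2) can be sketched as follows; auc_ld and auc_li stand for the averaged 10-fold AUC scores of the two component models, computed elsewhere.

```python
def combined_probability(p_nlb, p_logli, auc_ld, auc_li):
    """Weighted combination of the nLB and logLI class-probability estimates."""
    w = auc_ld / (auc_ld + auc_li)           # Equation (2)
    return w * p_nlb + (1.0 - w) * p_logli   # Equation (1)
```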

nLB+ICA uses the nLB classifier, but performs collective classification using the ICA algorithm described in Section 4.2.

nLBLI+ICA uses the nLBLI classifier, but performs collective classification using the ICA algorithm described in Section 4.2.

wvRN is the weighted-vote relational neighbor classifier [19,7]. It is a simple non-learning classifier. Given a node i and a set of neighboring nodes, N, the wvRN classifier calculates the probability of each class for node i as:

$$P(C_i = c \mid N) = \frac{1}{L_i} \sum_{j \in N} \begin{cases} w_{i,j} & \text{if } C_j = c \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

where $w_{i,j}$ is the number of links between nodes i and j, and $L_i$ is the number of links connecting node i to labeled nodes. When node i has no labeled neighbors, we use the prior probabilities observed in the training data.
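A sketch of Equation (3); the function and variable names are ours, and the link weight is again taken as the number of links between the two nodes.

```python
def wvrn_probabilities(G, labels, priors, i):
    """labels: node -> class for labeled nodes; priors: class -> training prior."""
    votes = {c: 0.0 for c in priors}
    total = 0.0
    for j in G.neighbors(i):
        if j in labels:
            w_ij = G.number_of_edges(i, j)           # number of links between i and j
            votes[labels[j]] += w_ij
            total += w_ij
    if total == 0.0:
        return dict(priors)                          # no labeled neighbors: fall back to priors
    return {c: v / total for c, v in votes.items()}  # Equation (3)
```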

wvRNLI combines the LI features with wvRN in the same way that nLBLI does with nLB (i.e., using a weighted sum of wvRN and logLI).

wvRN+ICA uses the wvRN classifier, but performs collective classification using the ICA algorithm described in Section 4.2.

wvRNLI+ICA uses wvRNLI, but performs collective classification using the ICA algorithm described in Section 4.2.

GRF is the semi-supervised Gaussian Random Field approach of Zhu et al. [15]. We made one modification to accommodate disconnected graphs. Zhu computes the graph Laplacian as L = D − cW, where c = 1. We set c = 0.9 to ensure that L is diagonally dominant and thus invertible. We observed no substantial impact on performance in connected graphs due to this change.
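For illustration, a sketch of the resulting GRF solve with the modified Laplacian; the dense matrix solve and the explicit labeled/unlabeled index partitioning are simplifying assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def grf_scores(W, labeled_idx, unlabeled_idx, f_labeled, c=0.9):
    """W: symmetric adjacency matrix; f_labeled: 0/1 labels of the labeled nodes."""
    D = np.diag(W.sum(axis=1))
    L = D - c * W                                   # modified Laplacian, diagonally dominant for c < 1
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    # Harmonic solution with labeled values clamped: f_u = L_uu^{-1} (c * W_ul * f_l)
    return np.linalg.solve(L_uu, c * W_ul @ f_labeled)
```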

GRFLI combines the LI features with GRF as nLBLI does with nLB (i.e., using a weighted sum of GRF and logLI). We also tried the approach of Zhu et al. [15], where one attaches a "dongle" node to each unlabeled node and assigns it a label using the external LI classifier. The transition probability from node i to its dongle is η and all other transitions from i are discounted by 1 − η. This approach did not yield any improvements. So, we use the weighted sum approach (i.e., Equation 1) for consistency.

4.2 Collective Classification

To perform collective classification, we use the iterative classification algorithm (ICA) [7], with up to 1000 iterations. We chose ICA because (1) it is simple, (2) it performs well on a variety of tasks, and (3) it tends to converge more quickly than other approaches. We also performed experiments using relaxation labeling (RL) [7]. Our results are consistent with previous research showing that the accuracy of wvRN+RL is nearly identical to GRF, but GRF produces higher AUC values [7]. We omit these results due to the similarity to GRF. For a comparison of wvRN+RL and GRF on several of the same tasks used here, see Gallagher et al. [17]. Overall, ICA slightly outperforms RL for the nLB classifier.

Several of our data sets have large amounts of unlabeled data since ground truth is simply not available. In these cases, there are two reasonable approaches to collective classification: (1) perform collective classification over the entire graph and (2) perform collective classification over the core set of nodes only (i.e., nodes with known labels).

In our experiments, attempting to perform collective classification over the entire graph produced results that were often dramatically worse than the non-collective base classifier. We hypothesize that this is due to an inadequate propagation of known labels across vast areas of unlabeled nodes in the network. Note that for some of our experiments, fewer than 1% of nodes are labeled. Other researchers have also reported cases where collective classification hurts performance due to a lack of labeled data [3,11]. We found that the second approach (i.e., using a network of only the core nodes) outperformed the first approach in almost all cases, despite disconnecting the network in some cases. Therefore, we report results for the second approach only.
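A sketch of the ICA loop as described above; predict stands in for the base classifier (e.g., nLB or nLBLI) applied to one node given the current label assignment, and is an assumption of this sketch.

```python
def ica(G, known_labels, unlabeled, predict, max_iter=1000):
    """predict(G, current_labels, node) -> class label, using the base classifier."""
    current = dict(known_labels)                                 # known labels are never changed
    bootstrap = {v: predict(G, current, v) for v in unlabeled}   # bootstrap from known labels only
    current.update(bootstrap)
    for _ in range(max_iter):
        changed = False
        for v in unlabeled:                                      # re-predict with the current assignment
            new_label = predict(G, current, v)
            if new_label != current[v]:
                current[v] = new_label
                changed = True
        if not changed:                                          # stop once labels are stable
            break
    return current
```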

4.3 Experimental Methodology

Each data set has a set of core nodes for which we know the true class labels. Several data sets have additional nodes for which no ground truth is available. Classifiers have access to the entire graph for both training and testing. However, we hide labels for 10%-90% of the core nodes. Classifiers are trained on all labeled core nodes and evaluated on all unlabeled core nodes.

For each proportion labeled, we run 30 trials. For each trial, we choose a class-stratified random sample containing 100 × (1.0 − proportion labeled)% of the core nodes as a test set and the remaining core nodes as a training set. Note that a single node will necessarily appear in multiple test sets. However, we carefully choose test sets to ensure that each node in a data set occurs in the same number of test sets over the course of our experiments, and therefore carries the same weight in the overall evaluation. Labels are kept on training nodes and removed from test nodes. We use identical train/test splits for each classifier. For more on experimental methodologies for relational classification, see Gallagher and Eliassi-Rad [20].
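A rough sketch of this trial setup using scikit-learn's stratified splitter; note that plain stratified random splits do not by themselves enforce the balanced test-set membership the authors describe, so this is only an approximation of their protocol.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def make_trials(core_nodes, core_labels, proportion_labeled, n_trials=30, seed=0):
    """Yield (labeled_nodes, test_nodes) pairs for one proportion-labeled setting."""
    nodes = np.asarray(core_nodes)
    splitter = StratifiedShuffleSplit(n_splits=n_trials,
                                      train_size=proportion_labeled,
                                      random_state=seed)
    for train_idx, test_idx in splitter.split(nodes, core_labels):
        yield nodes[train_idx], nodes[test_idx]   # labels kept on train, hidden on test
```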

We use the area under the ROC curve (AUC) to compare classifiers because it is more discriminating than accuracy. In particular, since most of our tasks have a large class imbalance (see Section 4.4), accuracy cannot adequately differentiate between classifiers.


4.4 Data Sets

We present results on four real-world data sets: political book purchases [21], Enron emails [22], Reality Mining (RM) cell phone calls [23], and high-energy physics publications (HEP-TH) from arXiv [24]. Our five tasks are to identify neutral political books, Enron executives, Reality Mining students, Reality Mining study participants, and HEP-TH papers with the topic "Differential Geometry." Table 1 summarizes the prediction tasks. The Sample column describes the method used to obtain a data sample for our experiments: use the entire set (full), use a time slice (time), or sample a continuous subgraph via breadth-first search (BFS). The Task column indicates the class label we try to predict. The |V|, |L|, and |E| columns indicate counts of total nodes, labeled nodes, and total edges in each network. The P(+) column indicates the proportion of labeled nodes that have the positive class label (e.g., 12% of the political books are neutral). For Enron, Reality Mining students, and HEP-TH, we have labels for only a subset of nodes (i.e., the "core" nodes) and can only train and test our classifiers on these nodes. However, unlabeled nodes and their connections to labeled nodes are exploited to calculate LI features of the labeled nodes.

Table 1. Summary of Data Sets and Prediction Tasks

Data Set         Sample  Task                     |V|   |L|   |E|   P(+)
Political Books  Full    Neutral?                 105   105   441   0.12
Enron            Time    Executive?               9K    1.6K  50K   0.02
Reality Mining   BFS     Student?                 1K    84    32K   0.62
Reality Mining   BFS     In Study?                1K    1K    32K   0.08
HEP-TH           BFS     Differential Geometry?   3K    284   36K   0.06

5 Experimental Results

In this section, we discuss our results. We assess significance using paired t-tests (p-values ≤ 0.05 are considered significant).¹

¹ It is an open issue whether the standard significance tests for comparing classifiers (e.g., t-tests, Wilcoxon signed-rank) are applicable for within-network classification, where there is typically some overlap in test sets across trials. It remains to be seen whether the use of such tests produces a bias and the extent of any errors caused by such a bias. This is an important area for future study that will potentially affect a number of published results.

5.1 Effects of Learning Label Dependencies

Figures 2 and 3 show results for statistical relational learning and semi-supervised learning approaches on all of our classification tasks. Supervised learning approaches, like nLB, use labeled nodes as training data to build a dependency model over neighboring class labels. The non-learning wvRN and GRF assume that class labels of neighboring nodes tend to be the same (i.e., high label consistency). GRF performs well on the Enron and RM student tasks, which have high label consistency between neighbors. On the RM study task, where neighboring labels are inversely correlated (i.e., low label consistency), wvRN and GRF perform poorly, whereas nLB can learn the correct dependencies.


[Figure 2: line plots of AUC (y-axis) versus proportion of core nodes labeled (0.1-0.9, x-axis) for the learning algorithms (nLB, nLB+ICA, nLBLI, nLBLI+ICA) and the non-learning algorithms (wvRN, wvRN+ICA, wvRNLI, wvRNLI+ICA) on the five tasks: Political Books, Enron Executives, Reality Mining Students, Reality Mining Study Participants, and HEP-TH Differential Geometry Papers.]

Fig. 2. Classification results for statistical relational learning approaches on our data sets. For details on classifiers, see Section 4.1. Note: Due to differences in the difficulty of classification tasks, the y-axis scales are not consistent across tasks. However, for a particular classification task, the y-axis scales are consistent across the algorithms shown both in this figure and in Figure 3.


[Figure 3: line plots of AUC (y-axis) versus proportion of core nodes labeled (0.1-0.9, x-axis) for the semi-supervised learning algorithms (GRF, GRFLI) on the same five tasks: Political Books, Enron Executives, Reality Mining Students, Reality Mining Study Participants, and HEP-TH Differential Geometry Papers.]

Fig. 3. Classification results for semi-supervised learning approaches on our data sets. For details on classifiers, see Section 4.1. Note: Due to differences in the difficulty of classification tasks, the y-axis scales are not consistent across tasks. However, for a particular classification task, the y-axis scales are consistent across the algorithms shown both in this figure and in Figure 2.



5.2 Effects of Label-Independent Features

Figures 2 and 3 illustrate several effects of LI features. In general, the performance of the LI classifiers degrades more slowly than that of the corresponding base classifiers as fewer nodes are labeled. At ≤ 50% labeled, the LI features produce a significant improvement in 36 of 45 cases. The exceptions mainly occur for GRF on Enron, RM Student, and HEP-TH, where (in most cases) we have a statistical tie. In general, the information provided by the LI features is able to make up, at least in part, for information lost due to missing labels. Note that there are three separate effects that lower performance as the number of labels decreases. (1) Fewer labels available for inference lead to lower-quality LD features at inference time, but do not impact the quality of LI features. (2) Fewer labels at training time mean that (labeled) training examples have fewer labeled neighbors. This impacts the quality of the LD features available at training time and the quality of the resulting model. LI features are not affected. (3) Fewer labels mean less training data. This impacts model quality for both LD and LI features. Note that wvRN and GRF are affected only by (1), since they do not rely on training data.

In general, the LI models outperform the corresponding base models, leading to significant improvements in 49 out of 75 cases across all proportions of labeled data. There is only one case where the use of LI features significantly degrades performance: using GRF on the Enron task at 0.3 labeled. The GRF classifier does so well on this task that the LI features simply add complexity without additional predictive information. However, the degradation here is small compared to the gains on other tasks.

Another effect demonstrated in Figures 2 and 3 is the interaction between LI features and label propagation (i.e., ICA or GRF). In several cases, combining the two significantly outperforms either on its own (e.g., GRFLI on political books and the RM tasks). However, the benefit is not consistent across all tasks.

The improved performance due to LI features on several tasks at 90% labeled (i.e., political books and both RM tasks) suggests that LI features can provide information above and beyond that provided by class labels. Recall that political books and RM study are the only data sets that are fully labeled to begin with. This indicates that LI features may have more general applicability beyond sparsely labeled data.

Figure 4 shows the sensitivity of classifiers to the specific nodes that are initially labeled. For each classifier and task, we measure the variance in AUC across 30 trials. For each trial, a different 50% of nodes is labeled. ICA has very little impact on the sensitivity of nLB to labeling changes. However, the LI features decrease the labeling sensitivity of nLB dramatically for all but one data set. The results for wvRN are qualitatively similar. LI features also decrease sensitivity for GRF in most cases. Since GRF has low sensitivity to begin with, the improvements are less dramatic. The observed reduction in label sensitivity is not surprising, since LI features do not rely on class labels. However, it suggests that LI features make classifiers more stable. So, even in cases where average classifier performance does not increase, we expect an increase in the worst case due to the use of LI features.


[Figure 4: bar chart of labeling sensitivity at 50% labeled; x-axis: classifier (nLB, nLB+ICA, nLBLI, GRF, GRFLI); y-axis: variance in AUC; one bar series per data set (Political Books, Enron, Reality Mining In-Study, Reality Mining Student, HEP-TH).]

Fig. 4. Sensitivity of classifiers to specific assignments of 50% known labels across data sets

5.3 Performance of Specific LI Features

To understand which LI features contribute to the observed performance gains, we re-ran our experiments using subsets of the LI features. We used logistic regression with different combinations of the four LI features: each feature alone (4 classifiers), leave one feature out (4 classifiers), degree-based features only (1 classifier), non-degree-based features only (1 classifier), and all features (1 classifier).² This yields 11 classifiers. We present results for 50% of nodes labeled. Results for other proportions labeled are similar.

² Degree-based features are node (or neighbor) count and link (or edge) counts. Non-degree-based features are betweenness and clustering coefficient.
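The eleven feature subsets can be enumerated as in the following sketch (feature names are our own labels):

```python
from itertools import combinations

FEATURES = ["node_count", "link_count", "betweenness", "clustering_coef"]

def feature_subsets():
    subsets = [(f,) for f in FEATURES]                    # each feature alone (4)
    subsets += list(combinations(FEATURES, 3))            # leave one feature out (4)
    subsets += [("node_count", "link_count"),             # degree-based only
                ("betweenness", "clustering_coef"),       # non-degree-based only
                tuple(FEATURES)]                          # all four features
    return subsets                                        # 11 subsets in total
```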

Figure 5 shows AUC using each LI feature alone vs. all features together. This demonstrates the utility of each feature in the absence of any other information.

[Figure 5: bar chart of individual label-independent features at 50% labeled; x-axis: data set (Enron, HEP-TH, P. Books, RM Students, RM Study); y-axis: AUC; bars: All, Node count, Link count, Betweenness, Clust. coef.]

Fig. 5. Performance of LI features in isolation

Figure 6 shows the increase in AUC due to adding the specified feature to a classifier that already has access to all other LI features. The y-axis is the AUC of a classifier that uses all LI features minus the AUC of a classifier that uses all except the specified feature. This demonstrates the power of each feature when combined with the others.


[Figure 6: bar chart of combined label-independent features at 50% labeled; x-axis: data set (Enron, HEP-TH, P. Books, RM Students, RM Study); y-axis: increase in AUC; bars: Degree-based, Non-degree, Node count, Link count, Betweenness, Clust. coef.]

Fig. 6. Performance of LI features in combination. Degree-based features are node and link count. Non-degree features are betweenness and clustering coefficient.


All features appear to be useful for some tasks. Clustering coefficient is the least useful overall, improving AUC slightly on two tasks and degrading AUC slightly on three. For all tasks, a combination of at least three features yields the best results. Interestingly, features that perform poorly on their own can be combined to produce good results. On the RM student task, node count, betweenness, and clustering coefficient produce AUCs of 0.57, 0.49, and 0.48 alone, respectively. When combined, these three produce an AUC of 0.78. Betweenness, which performs worse than random (AUC < 0.5) on its own, provides a boost of 0.32 AUC to a classifier using node count and clustering coefficient.

For most tasks, performance improves due to using all four LI features. On Enron, however, clustering coefficient appears to mislead the classifier to the point where it is better to use either node or link count individually than to use all features. This is one case where we might benefit from a more selective classifier. Figure 7 compares logistic regression with a random forest classifier [25], both using the same four LI features. As expected, the random forest is better able to make use of the informative features without being misled by the uninformative ones.

[Figure 7: line plot for Enron Executives; x-axis: proportion of core nodes labeled (0.1-0.9); y-axis: AUC; curves: Logistic, Rand Forest.]

Fig. 7. Comparison of logistic regression and random forest classifiers with all four LI features


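A sketch of this comparison with scikit-learn defaults; the data-handling details (feature matrices, train/test split) are assumed and are not part of the original study.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def compare_li_classifiers(X_train, y_train, X_test, y_test):
    """Rows of X_* hold the four LI feature values for one node each."""
    results = {}
    for name, clf in [("logistic", LogisticRegression(max_iter=1000)),
                      ("random_forest", RandomForestClassifier(n_estimators=100))]:
        clf.fit(X_train, y_train)
        scores = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
        results[name] = roc_auc_score(y_test, scores)
    return results
```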

To get a feel for why some LI features make better predictors than others, we examine the distribution of each feature by class for each prediction task. Table 2 summarizes these feature distributions by their mean and standard deviation. In general, we expect features that cleanly separate the classes to provide the most predictive power. As mentioned previously, clustering coefficient appears to be the least powerful feature overall for our set of prediction tasks. One possible explanation for clustering coefficient's generally poor performance is that it does not vary enough from node to node; therefore, it does not help to differentiate among instances of different classes.

Table 2. Mean and standard deviation (SD) of feature values by class and data set. The larger mean value for each feature (i.e., row) is marked with *.

Data Set / Feature    Mean (SD) for the '+' Class   Mean (SD) for the '-' Class

Political Books       Neutral                       Other
  Node Count          5.8 (3.3)                     8.8 (5.6) *
  Link Count          5.8 (3.3)                     8.8 (5.6) *
  Betweenness         0.027 (0.030) *               0.019 (0.029)
  Clust. Coef.        0.486 (0.25)                  0.489 (0.21) *

Enron                 Executive                     Other
  Node Count          22 (27) *                     9.6 (20)
  Link Count          61 (100) *                    25 (66)
  Betweenness         0.0013 (0.0037) *             0.00069 (0.0025)
  Clust. Coef.        0.91 (0.77)                   1.75 (4.5) *

RM Student            Student                       Other
  Node Count          19 (27)                       22 (38) *
  Link Count          471 (774)                     509 (745) *
  Betweenness         0.027 (0.050) *               0.022 (0.056)
  Clust. Coef.        15 (22) *                     8.0 (7.0)

RM Study              In-study                      Out-of-study
  Node Count          18 (30) *                     1.4 (2.8)
  Link Count          418 (711) *                   30 (130)
  Betweenness         0.022 (0.048) *               0.00086 (0.022)
  Clust. Coef.        10 (17) *                     5.8 (51)

HEP-TH                Differential Geometry         Other
  Node Count          14 (9.0)                      21 (26) *
  Link Count          14 (9.0)                      21 (26) *
  Betweenness         0.000078 (0.00010)            0.0011 (0.0056) *
  Clust. Coef.        0.42 (0.19) *                 0.40 (0.23)


[Figure 8: bar chart of feature variability; x-axis: LI feature (neighborCount, edgeCount, betweenness, clusteringCoefficient); y-axis: coefficient of variation; one bar series per prediction task (PoliticalBook, EnronTitle, RealityMiningInStudy, RealityMiningPosition, HepthArea).]

Fig. 8. Degree of variability for each LI feature on each prediction task

Figure 8 shows the degree of variability of each LI feature across the five prediction tasks. To measure variability, we use the coefficient of variation, a normalized measure of the dispersion of a probability distribution. The coefficient of variation is defined as:

$$c_v(dist) = \frac{\sigma}{\mu} \qquad (4)$$

where $\mu$ is the mean of the probability distribution dist and $\sigma$ is the standard deviation. A higher coefficient of variation indicates a feature with more varied values across instances in the data set.
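Equation (4) can be computed per feature as in the following short sketch:

```python
import numpy as np

def coefficient_of_variation(values):
    """Coefficient of variation of one feature's values over all nodes."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()   # c_v = sigma / mu, Equation (4)
```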

The variability of the clustering coefficient appears comparable to that of the degree features (i.e., node and link count); see Figure 8. We even observe that the degree of variability of the clustering coefficient for the Enron task is higher than the degree of variability for the neighbor count feature, even though neighbor count provides much more predictive power (see Figure 5). So, clustering coefficient appears to have sufficient variability over the nodes in the graph. However, it is possible that the clustering coefficient exhibits similar variability for nodes of both classes, and thus still fails to adequately distinguish between nodes of different classes. Therefore, we wish to quantify the extent to which the feature distributions can be separated from one another by class.

Figure 9 shows how well each LI feature separates the two classes for each prediction task. We measure class separation by calculating the distance between the empirical distributions of the LI feature values for each class. Specifically, we use the Kolmogorov-Smirnov (K-S) statistic to measure the distance between two empirical (cumulative) distribution functions:

$$D(F_n(x), G_n(x)) = \max_x |F_n(x) - G_n(x)| \qquad (5)$$
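Equation (5) corresponds to the two-sample Kolmogorov-Smirnov statistic, which can be computed as in this sketch using scipy; the helper name and input layout are our assumptions.

```python
from scipy.stats import ks_2samp

def class_separation(feature_values, labels, positive_class):
    """K-S distance between the per-class empirical distributions of one feature."""
    pos = [v for v, y in zip(feature_values, labels) if y == positive_class]
    neg = [v for v, y in zip(feature_values, labels) if y != positive_class]
    statistic, _ = ks_2samp(pos, neg)   # max_x |F_n(x) - G_n(x)|, Equation (5)
    return statistic
```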

We observe from Figure 9 that, on average, the per-class distributions of values for the clustering coefficient are more similar to one another than those for other LI features. So, although the clustering coefficient does vary from node to node, the values do not differ consistently based on class. Therefore, clustering coefficient has a hard time distinguishing between instances of different classes, and exhibits poor predictive power overall. The exception is the Reality Mining study-participant task, where we observe a high K-S distance (Figure 9) and a correspondingly high classification performance (Figure 5). In fact, the K-S distances in Figure 9 generally correspond quite well to the classification performance we observe in Figure 5.


[Figure 9: bar chart of class separation; x-axis: LI feature (neighborCount, edgeCount, betweenness, clusteringCoefficient); y-axis: Kolmogorov-Smirnov distance; one bar series per prediction task (PoliticalBook, EnronTitle, RealityMiningInStudy, RealityMiningPosition, HepthArea).]

Fig. 9. Degree of class separation for each LI feature on each prediction task


5.4 Observations about Our Problem Domains and Data Sets

Table 2 highlights a number of interesting characteristics of our data sets. An examination of these characteristics provides insights into the underlying problem domains. We describe several such insights here.

More politically extreme books tend to have higher degree (neighboring nodes and adjacent edges) and clustering coefficient, but lower betweenness than the neutral books. This tells us that there are two very different types of readers represented in our network: (1) party loyalists that tend to have more extreme viewpoints, strong agreement with others inside their party, and strong disagreement with outsiders, and (2) political moderates who are more inclined to consider multiple differing perspectives on an issue.

Enron executives tend to have higher degree and betweenness, but lower clustering coefficients than others. So, as we would expect, executives maintain more relationships than other employees and are positioned in a place of maximal control over information flow. The lower clustering coefficient suggests that executives maintain ties to multiple communities within the company and are less closely tied to a particular community.

Reality Mining students tend to have higher betweenness and clustering coefficient, but lower degree than others. This indicates that students tend to be more cliquey, with many of their communications made within a single strong community. It also indicates that students play an important role in keeping information flowing between more distant parts of the network.

Reality Mining study participants tend to have higher degree, betweenness, and clustering coefficient than non-participants. These findings may reveal more about how the data were collected than about the underlying problem domain, but they are interesting nonetheless. Because their phones were instrumented with special software, we have information on all calls of study participants. However, we have only a partial view of the calls made and received by non-participants. More specifically, the only calls we observe for non-participants are those to or from a study participant. The result of this is that we end up with a central community of interconnected study participants, surrounded by a large but diffuse periphery of non-participants. Thus, the participants appear to have more neighbors, higher centrality, and a more closely knit surrounding community.

In HEP-TH, differential geometry papers tend to have higher clustering coefficient, but lower degree and betweenness than other topics. This indicates that differential geometry papers play a relatively isolated and peripheral role among high-energy physics papers, at least in our subset of the arXiv data.

6 Conclusion

We examined the utility of label-independent features in the context of within-network classification. Our experiments revealed a number of interesting findings: (1) LI features can make up for large amounts of missing class labels; (2) LI features can provide information above and beyond that provided by class labels alone; (3) the effectiveness of LI features is due, at least in part, to their consistency and their stabilizing effect on network classifiers; (4) no single label-independent feature dominates, and there is generally a benefit to combining a few diverse LI features. In addition, we observed a benefit to combining LI features with label propagation, although the benefit is not consistent across tasks.

Our findings suggest a number of interesting areas for future work. These include:

– Combining attribute-based (LD) and structure-based (LI) features of a network to create new informative features for node classification. For instance, will the number of short paths to nodes of a certain label, or the average path length to such nodes, improve classification performance?

– Exploring the relationship between attributes and network structure in time-evolving networks, where links appear and disappear and attribute values change over time. For example, in such a dynamic network, could we use a time series of LI feature values to predict the values of class labels at a future point in time?

Acknowledgments

We would like to thank Luke McDowell for his insightful comments. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48 and No. DE-AC52-07NA27344 (LLNL-JRNL-411529).


References

1. Taskar, B., Abbeel, P., Koller, D.: Discriminative probabilistic models for relational data. In: Proceedings of the 18th Conference on Uncertainty in AI, pp. 485–492 (2002)
2. Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., Eliassi-Rad, T.: Collective classification in network data. AI Magazine 29(3), 93–106 (2008)
3. Neville, J., Jensen, D.: Relational dependency networks. Journal of Machine Learning Research 8, 653–692 (2007)
4. Getoor, L., Friedman, N., Koller, D., Taskar, B.: Learning probabilistic models of link structure. Journal of Machine Learning Research 3, 679–707 (2002)
5. Lu, Q., Getoor, L.: Link-based classification. In: Proceedings of the 20th International Conference on Machine Learning, pp. 496–503 (2003)
6. Neville, J., Jensen, D., Gallagher, B.: Simple estimators for relational Bayesian classifiers. In: Proceedings of the 3rd IEEE International Conference on Data Mining, pp. 609–612 (2003)
7. Macskassy, S., Provost, F.: Classification in networked data: A toolkit and a univariate case study. Journal of Machine Learning Research 8, 935–983 (2007)
8. Neville, J., Jensen, D., Friedland, L., Hay, M.: Learning relational probability trees. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 625–630 (2003)
9. Perlich, C., Provost, F.: Aggregation-based feature invention and relational concept classes. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 167–176 (2003)
10. Singh, L., Getoor, L., Licamele, L.: Pruning social networks using structural properties and descriptive attributes. In: Proceedings of the 5th IEEE International Conference on Data Mining, pp. 773–776 (2005)
11. Neville, J., Jensen, D.: Leveraging relational autocorrelation with latent group models. In: Proceedings of the 5th IEEE International Conference on Data Mining, pp. 322–329 (2005)
12. Chakrabarti, S., Dom, B., Indyk, P.: Enhanced hypertext categorization using hyperlinks. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 307–318 (1998)
13. Jensen, D., Neville, J., Gallagher, B.: Why collective inference improves relational classification. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 593–598 (2004)
14. McDowell, L., Gupta, K., Aha, D.: Cautious inference in collective classification. In: Proceedings of the 22nd AAAI Conference on Artificial Intelligence, pp. 596–601 (2007)
15. Zhu, X., Ghahramani, Z., Lafferty, J.: Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the 20th International Conference on Machine Learning, pp. 912–919 (2003)
16. Zhu, X.: Semi-supervised learning literature survey. Technical Report CS-TR-1530, University of Wisconsin, Madison, WI (December 2007), http://pages.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf
17. Gallagher, B., Tong, H., Eliassi-Rad, T., Faloutsos, C.: Using ghost edges for classification in sparsely labeled networks. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 256–264 (2008)
18. Newman, M.: The structure and function of complex networks. SIAM Review 45, 167–256 (2003)
19. Macskassy, S., Provost, F.: A simple relational classifier. In: Notes of the 2nd Workshop on Multi-relational Data Mining at KDD 2003 (2003)
20. Gallagher, B., Eliassi-Rad, T.: An examination of experimental methodology for classifiers of relational data. In: Proceedings of the 7th IEEE International Conference on Data Mining Workshops, pp. 411–416 (2007)
21. Krebs, V.: Books about U.S. politics (2004), http://www.orgnet.com/divided2.html
22. Cohen, W.: Enron email data set, http://www.cs.cmu.edu/~enron/
23. Eagle, N., Pentland, A.: Reality mining: sensing complex social systems. Journal of Personal and Ubiquitous Computing 10(4), 255–268 (2006), http://reality.media.mit.edu
24. Jensen, D.: Proximity HEP-TH database, http://kdl.cs.umass.edu/data/hepth/hepth-info.html
25. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)

Page 30: [Lecture Notes in Computer Science] Advances in Social Network Mining and Analysis Volume 5498 ||

Community Detection Using a Measure of Global Influence

Rumi Ghosh and Kristina Lerman

USC Information Sciences Institute, Marina del Rey, California 90292
{ghosh,lerman}@isi.edu

Abstract. The growing popularity of online social networks gave researchers access to large amounts of network data and renewed interest in methods for automatic community detection. Existing algorithms, including the popular modularity-optimization methods, look for regions of the network that are better connected internally, e.g., have a higher than expected number of edges within them. We believe, however, that edges do not give the true measure of network connectivity. Instead, we argue that influence, which we define as the number of paths, of any length, that exist between two nodes, gives a better measure of network connectivity. We use the influence metric to partition a network into groups or communities by looking for regions of the network where nodes have more influence over each other than over nodes outside the community. We evaluate our approach on several networks and show that it often outperforms the edge-based modularity algorithm.

Keywords: community, social networks, influence, modularity.

1 Introduction

Communities and social networks have long interested researchers [5,13]. However, one of the main problems faced by early researchers was the difficulty of collecting empirical data from human subjects [5]. The advent of the internet and the growing popularity of online social networks changed that, giving researchers access to huge amounts of social interaction data. This, coupled with the ever increasing computation speed, storage capacity, and data mining capabilities, led to the reemergence of interest in social networks in general, and in community detection specifically.

Many existing community finding algorithms look for regions of the network that are better connected internally and have fewer connections to nodes outside the community [4]. Graph partitioning methods [7,27], for example, attempt to minimize the number of edges between communities. Modularity maximization-based methods, on the other hand, identify groups of nodes that have a higher than expected number of edges within them [22,21,24,23]. We believe, however, that edges do not give the true measure of network connectivity. We generalize the notion of network connectivity to be the number of paths, of any length, that exist between two nodes (Section 2). We argue that this metric, called influence by sociologists [13], because it measures the ability of one node to affect (e.g., send information to) another, gives a better measure of connectivity between nodes. We use the influence metric to partition a (directed or undirected) network into groups or communities by looking for regions of the network where nodes have more influence over each other than over nodes outside the community. In addition to discovering natural groups within a network, the influence metric can also help identify the most influential nodes within the network, as well as the “weak ties” who bridge different communities. We formalize our approach by describing a general mathematical framework for representing network structure (Section 3). We show that the metrics used for detecting communities in random walk models, modularity-based approaches, and influence-based modularity are special cases of this general framework. We evaluate our approach (in Section 4) on the standard data sets used in the literature, and find performance at least as good as that of the edge-based modularity algorithm.

2 A Measure of Global Influence

If a network has N nodes and E links, it can be graphically represented by G(N, E), where N is the number of vertices and E is the number of edges. Edges are directed; however, if there exists an edge from vertex i to j and also from j to i, it is represented as an undirected edge. A path p is an n-hop path from i to j if there are n vertices between i and j along the path. We allow paths to be non-self-avoiding, meaning that the same edge may be traversed more than once. The graph G(N, E) can be represented by an adjacency matrix A, whose elements are defined as $A_{ij} = 1$ if there exists an edge from vertex i to j; otherwise, $A_{ij} = 0$. A is symmetric for undirected graphs.

The Oxford English Dictionary defines influence as “the capacity to have an effect on the character, development, or behavior of someone or something, or the effect itself.” The measure of influence that we adopt is along the lines of Pool and Kochen [5], who state that “influence in large part is the ability to reach a crucial man through the right channels, and the more the channels in reserve the better.” This metric depends not only on direct edges between nodes, but also on the number of ways an effect or a message can be transmitted through other nodes. Therefore, the capacity of node i to influence node j can be measured by the weighted sum of the number of n-hop paths from i to j. The underlying hypothesis is that the more paths there are from one node to another, the greater the capacity to influence. This model is analogous to the independent cascade model of information spread [11,14].

The strength of the effect via longer paths is likely to be lower than via shorter paths. We model the attenuation using parameters $\alpha_i$, where $\alpha_i$ ($0 \le \alpha_i \le 1$) is the probability of transmission of effect in the $(i-1)$-th hop. Let us consider transmitting an effect or a message from node b to node c in the network shown in Figure 1. The probability of transmission from an immediate neighbor of c, such as e or g, to c is $\alpha_1$. The probability of transmission over a 1-hop path, such as from b to c via e, is $\alpha_1 \alpha_2$. In general, the probability of a transmission along an n-hop path is $\prod_{i=1}^{n+1} \alpha_i$. The total influence of b on c thus depends on the number of (attenuated) channels between b and c, or the sum of all the weighted paths from b to c. This definition of influence makes intuitive sense, because the greater the number of paths between b and c, the more opportunities there are for b to transmit messages to c or to affect c.

Fig. 1. A directed graph representing a network

For ease of computation we simplify this model by taking $\alpha_1 = \beta$ and $\alpha_i = \alpha$ for all $i \neq 1$. β is called the direct attenuation factor and is the probability of transmission of effect directly between adjacent nodes. α is the indirect attenuation factor and is the probability of transmission of effect through intermediaries. If α = β, i.e., the probability of transmission of effect through all links is the same, then this index reduces to the metric used to find the Katz status score [13].

The number of paths from i to j with n intermediaries, denoted $i \overset{n}{\rightsquigarrow} j$, is given by the (i, j) entry of $A_n = \overbrace{A \cdot A \cdots A}^{n+1\ \text{times}} = A_{n-1} \cdot A$. Adding weights to take into account the attenuation of effect, we get the weighted total capacity of i to affect j as $i \rightsquigarrow j = \beta\,(i \overset{0}{\rightsquigarrow} j) + \cdots + \beta\alpha^{n}\,(i \overset{n}{\rightsquigarrow} j) + \cdots$. We represent this weighted total capacity to influence by the influence matrix P:

$$P = \beta A + \beta\alpha A_1 + \cdots + \beta\alpha^{n} A_n + \cdots = \beta A (I - \alpha A)^{-1}. \qquad (1)$$

As mentioned by Katz [13], the equation holds while $\alpha < 1/\lambda$, where λ is the largest characteristic root of A [6].
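As a concrete illustration, the influence matrix of Eq. (1) can be computed with standard dense linear algebra. The sketch below is ours, not the authors' code; it assumes an unweighted adjacency matrix that is small enough to invert directly, and checks the convergence condition $\alpha < 1/\lambda$ first.

```python
import numpy as np

def influence_matrix(A, alpha, beta=1.0):
    """Compute P = beta * A * (I - alpha * A)^{-1} (Eq. 1).

    A     : (N, N) adjacency matrix, A[i, j] = 1 if there is an edge i -> j.
    alpha : indirect attenuation factor; must satisfy alpha < 1 / lambda_max(A).
    beta  : direct attenuation factor (the communities found do not depend on beta).
    """
    A = np.asarray(A, dtype=float)
    lam = max(abs(np.linalg.eigvals(A)))          # largest characteristic root of A
    if alpha >= 1.0 / lam:
        raise ValueError("alpha must be smaller than 1/lambda for the series to converge")
    N = A.shape[0]
    return beta * A @ np.linalg.inv(np.eye(N) - alpha * A)
```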

We use the influence matrix to help find community structure in a network. We claim that a community is composed of individuals who have a greater capacity to influence others within their community than outsiders. As a result, the actions of community members will tend to become correlated with time, whether by adopting a new fashion trend or vocabulary, watching a movie, or buying a product. Armed with this alternative definition of community, we adapt the modularity maximization-based approach to identifying communities.

2.1 Influence-Based Modularity

The objective of the algorithms proposed by Newman and coauthors is to discover “community structure in networks — natural divisions of network nodes into densely connected subgroups” [25]. They proposed modularity as a measure for evaluating the strength of the discovered community structure. Algorithmically, their approach is based on finding groups with a higher than expected number of edges within them and a lower than expected number of edges between them [22,21,23]. The modularity Q optimized by the algorithm is given by: Q = (fraction of edges within communities) − (expected fraction of such edges). Thus, Q is used as a numerical index to evaluate a division of the network. The underlying idea, therefore, is that the connectivity of nodes within a community is greater than that of nodes belonging to different communities, and they take the number of edges as the measure of connectivity. However, we claim that path-based, rather than edge-based, connectivity is the true measure of network connectivity. Consider again the graph in Figure 1, where there exists an edge from a to c but not from b to c. Clearly, however, c is not unconnected from b, as there are several distinct paths from b to c. We use the influence matrix, which gives the global connectivity of the network, to identify communities.

We redefine modularity as Q = (connectivity within the community) − (expected connectivity within the community) and adopt the influence matrix P as the measure of connectivity. This definition implies that in the best division of the network, the influence of nodes within their community is greater than their influence outside their community. A division of the network into communities, therefore, maximizes the difference between the actual capacity to influence and the expected capacity to influence, given by the capacity to influence in an equivalent random graph.

Let us denote the expected capacity to influence by an N × N matrix $\bar{P}$. We round off the values of $P_{ij}$ to the nearest integer values. Modularity Q can then be expressed as

$$Q = \sum_{ij} \left[ P_{ij} - \bar{P}_{ij} \right] \delta(s_i, s_j), \qquad (2)$$

where $s_i$ is the index of the community that node i belongs to, and $\delta(s_i, s_j) = 1$ if $s_i = s_j$; otherwise, $\delta(s_i, s_j) = 0$. When all the vertices are placed in a single group, then axiomatically Q = 0. Therefore $\sum_{ij} [P_{ij} - \bar{P}_{ij}] = 0$. Hence, the total capacity to influence W is

$$W = \sum_{ij} P_{ij} = \sum_{ij} \bar{P}_{ij}. \qquad (3)$$

Hence the null model has the same number of vertices N as the original model, and in it the expected influence of the entire network equals the actual influence of the original network. We further restrict the choice of null model to one where the expected influence on a vertex j, $W^{\mathrm{in}}_j$, is equal to the actual influence on the corresponding vertex in the real network:

$$W^{\mathrm{in}}_j = \sum_i P_{ij} = \sum_i \bar{P}_{ij}. \qquad (4)$$

Similarly, we also assume that in the null model, the expected capacity of a vertex i to influence others, $W^{\mathrm{out}}_i$, is equal to the actual capacity to influence of the corresponding vertex in the real network:

$$W^{\mathrm{out}}_i = \sum_j P_{ij} = \sum_j \bar{P}_{ij}. \qquad (5)$$

In order to compute the expected influence, we reduce the original graph G to a new graph G′ that has the same number of nodes as G and a total number of edges W, such that each edge has weight 1 and the number of edges between nodes i and j in G′ is $P_{ij}$. The expected influence between nodes i and j in graph G can then be taken as the expected number of edges between nodes i and j in graph G′, and the actual influence between nodes i and j in graph G can be taken as the actual number of edges between nodes i and j in graph G′. The equivalent random graph G′′ is used to find the expected number of edges from node i to node j. In this graph the edges are placed at random, subject to the following constraints:

– The total number of edges in G′′ is W.
– The out-degree of a node i in G′′ = the out-degree of node i in G′ = $W^{\mathrm{out}}_i$.
– The in-degree of a node j in G′′ = the in-degree of node j in G′ = $W^{\mathrm{in}}_j$.

Thus in G′′ the probability that an edge will emanate from a particular vertex i depends only on the out-degree of that vertex; the probability that an edge is incident on a particular vertex j depends only on the in-degree of that vertex; and the probabilities of the two vertices being the two ends of a single edge are independent of each other. In this case, the probability that an edge exists from i to j is given by P(emanates from i) · P(incident on j) = $(W^{\mathrm{out}}_i / W)(W^{\mathrm{in}}_j / W)$. Since the total number of edges in G′′ is W, the expected number of edges between i and j is $W \cdot (W^{\mathrm{out}}_i / W)(W^{\mathrm{in}}_j / W) = \bar{P}_{ij}$, the expected influence between i and j in G.
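Putting Eqs. (1)–(3) together, the influence modularity of a given partition can be evaluated directly. The sketch below is our own reading of the construction above (the function and argument names are ours): it rounds the influence matrix, builds the null-model matrix $\bar{P}_{ij} = W^{\mathrm{out}}_i W^{\mathrm{in}}_j / W$, and sums $P_{ij} - \bar{P}_{ij}$ over pairs of nodes assigned to the same community.

```python
import numpy as np

def influence_modularity(P, labels):
    """Influence-based modularity Q (Eq. 2) for a given node-to-community assignment.

    P      : (N, N) influence matrix (Eq. 1); it is rounded to the nearest integers here.
    labels : length-N array; labels[i] is the community index of node i.
    """
    P = np.rint(np.asarray(P, dtype=float))       # treat P_ij as edge counts in G'
    W = P.sum()                                   # total capacity to influence (Eq. 3)
    w_out = P.sum(axis=1)                         # W^out_i
    w_in = P.sum(axis=0)                          # W^in_j
    P_bar = np.outer(w_out, w_in) / W             # expected influence under the null model
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]     # delta(s_i, s_j)
    return ((P - P_bar) * same).sum()
```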

2.2 Detecting Community Structure

Once we have derived Q, we have to select an algorithm to divide the network into communities so as to optimize Q. Brandes et al. [3] have shown that the decision version of modularity maximization is NP-complete. Like others [23,17], we use the leading eigenvector method to obtain an approximate solution. In [9] we applied this approach to the standard data sets used in the literature, and found performance at least as good as that of the edge-based modularity algorithm. As can be derived mathematically from the formulation, the communities detected are independent of the value of β. So, henceforth, without loss of generality, we shall assume β = 1. In Section 4 we use this approach to partition several example networks into communities.
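A minimal sketch of the leading eigenvector step, under our own assumptions rather than the authors' exact implementation: nodes are split into two groups according to the signs of the leading eigenvector of the (symmetrized) influence-modularity matrix $B = P - \bar{P}$; recursively bisecting the resulting groups until Q stops improving would then yield the final communities.

```python
import numpy as np

def bisect_by_leading_eigenvector(P):
    """Split nodes into two groups using the leading eigenvector of B = P - P_bar.

    Returns a boolean array: True for one group, False for the other.
    """
    P = np.rint(np.asarray(P, dtype=float))
    W = P.sum()
    P_bar = np.outer(P.sum(axis=1), P.sum(axis=0)) / W
    B = P - P_bar
    B = (B + B.T) / 2.0                    # symmetrize so the spectrum is real (our choice)
    vals, vecs = np.linalg.eigh(B)
    leading = vecs[:, np.argmax(vals)]     # eigenvector of the largest eigenvalue
    return leading >= 0
```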

3 A Generalized Model of Influence

In this section, we present a mathematical framework that generalizes the notion of influence. In algebraic topology, a k-simplex, with k ≥ 0, is the convex hull σ of k + 1 linearly independent points $v_0, v_1, \ldots, v_k$, and has dimension k. The points $v_i$ are called the vertices of σ. Let $\sigma = \{v_0, v_1, \ldots, v_k\}$ be a k-simplex and let $\omega = \{w_0, w_1, \ldots, w_l\}$ be a nonempty subset of σ, where $w_i \neq w_j$ if $i \neq j$. Then ω is called an l-dimensional face of σ. A simplicial complex K is a finite collection of simplices in some $\mathbb{R}^n$ satisfying:

– If σ ∈ K, then all faces of σ belong to K.
– If $\sigma_1, \sigma_2 \in K$, then either $\sigma_1 \cap \sigma_2 = \emptyset$ or $\sigma_1 \cap \sigma_2$ is a common face of $\sigma_1$ and $\sigma_2$.

The dimension of K is defined to be −1 if K = ∅ and the maximum of the dimensions of its simplices otherwise. An undirected graph can then be viewed as a simplicial complex with a single-element set per vertex and a two-element set per edge.

Suppose we are given finite sets $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$, and a binary relation $\gamma \subseteq X \times Y$ between elements of X and elements of Y (X and Y could be the same). Then the relation γ may be expressed as an n × m incidence matrix $A'(\gamma) = (A'_{ij})$, where $A'_{ij} = 1$ if $(x_i, y_j) \in \gamma$ and 0 otherwise.

Each row of the incidence matrix $A'(\gamma)$ may be viewed as a simplex in the following way. Let Y be a set of vertices. The i-th row of $A'(\gamma)$ can be identified with a k-dimensional simplex $\{y_{j1}, y_{j2}, \ldots, y_{j(k+1)}\} = \sigma_k(x_i)$ on the vertices Y (those $y_j$ for which $A'_{ij} = 1$). Thus each $x_i \in X$ determines (with γ) a row of $A'(\gamma)$, and each row of $A'(\gamma)$ can be identified with a simplex. The set of simplices is a simplicial complex denoted by $K_X(\gamma, Y)$. Since an arbitrary element $x_i$ is γ-related to exactly k + 1 elements $y_j$, $\sigma_k(x_i)$ is distinguished as a named simplex. If we let d denote the maximum dimension of $K_X(\gamma, Y)$, we immediately see that d ≤ m − 1.

Let σ and τ be two simplices in $K_X(\gamma, Y)$. Then σ and τ are q-near if they have a common q-face, i.e., their intersection contains at least q + 1 elements. (This q-face need not be an element of the simplex family.) Then τ and σ are q-connected if there exists a sequence $\sigma_1, \sigma_2, \ldots, \sigma_p$ of simplices in $K_X(\gamma, Y)$ such that $\sigma_1 = \sigma$, $\sigma_p = \tau$, and $\sigma_i$ is q-near to $\sigma_{i+1}$ for all $1 \le i \le p - 1$. Thus q-connectivity is the transitive closure of q-nearness. Q-analysis using q-nearness and q-connectivity was used by Atkin [2] to deal with pairs of sets and sets of contextual relations.

As Legrand [16] points out, q-nearness and q-connectivity are not necessarily a true measure of how similar the vertices are to each other; the length of the sequence establishing q-connectivity should be the true indicator. We therefore take the length of sequences into account by calculating how q-near vertex i is to vertex j, making the measure dependent on the length of the path between them.


Therefore the adjacency matrix A, with $A_{ij} = q_{1ij}$, shows whether two simplices are zero-near to one another via a 0-hop path. The product $A^2 = A \times A$ gives the value $q_{2ij}$ such that $A^2_{ij} = q_{2ij}$, i.e., vertex i and vertex j, when separated by a one-hop path, are q-near each other with $q = q_{2ij} - 1$. In the same way, $A^3 = A \times A \times A = (A^3_{ij}) = (q_{3ij})$ shows that vertices i and j connected by a two-hop path are $q_{3ij} - 1$ near each other. We then take the length of the sequence into account to calculate the expected q-nearness of one vertex to another by taking a weighted average of the q-nearness over paths of varying length. The expected value of $q_{ij}$ between two elements i and j, such that they are expected to be $q_{ij} - 1$ near each other, with $q_{kij} = A^k_{ij}$, is:

$$E(q_{ij}) = \frac{W_1 \cdot q_{1ij} + W_2 \cdot q_{2ij} + \cdots + W_n \cdot q_{nij} + \cdots}{\sum_{i=1}^{\infty} W_i} \qquad (6)$$

This expected value can be used to find out how connected two vertices are to each other, taking paths of all lengths into account. Note that $W_i$ can be a scalar or a vector.

This formulation allows us to generalize different network models for community detection and scoring, such as the random walk model [28,29,31], the Katz model [13] of status score, and the influence-based model. In random walk models, a particle starts a random walk from node i. The particle iteratively transitions to its neighbors with probability proportional to the corresponding edge weights. Also, at each step, the particle returns to node i with some restart probability (1 − c). The proximity score from node i to node j is defined as the steady-state probability $r_{i,j}$ that the particle will be on node j [29]. These models can be shown to be special cases of the formulation of the expected q-nearness (without loss of generality we assume that T is an n × n matrix); a small illustrative sketch follows the list:

1. If $W_k = c^{k-1} \cdot D^{-(k-1)}$, where c is a constant and D is an n × n diagonal matrix with $D_{ii} = \sum_{j=1}^{n} A_{ij}$ and $D_{ij} = 0$ for $i \neq j$, then the expected q-nearness score reduces to the proximity score in the random walk model [28,29].
2. If $W_i = \prod_{j=1}^{i} \alpha_j$, where the scalar $\alpha_j$ is the attenuation factor of the (j − 1)-th hop in an (i − 1)-hop path, then the expected q-nearness reduces to the metric used to find the influence score, represented by the influence matrix. For ease of computation of the influence matrix, we have taken $\alpha_1 = \beta$ and $\alpha_i = \alpha$ for all $i \neq 1$. As stated before, $\alpha < 1/\lambda$, where λ is the largest characteristic root of A. Gershgorin's Circle Theorem (1931) gives the simple sufficient condition $\alpha < 1/\max_i(D_{ii})$.
3. When β = α, this in turn reduces to the metric used to find the Katz status score [13], with α as the attenuation factor.
4. When $\alpha_1 = 1$ and $\alpha_2 = \cdots = \alpha_n = \cdots = 0$, the expected q-nearness is the q-nearness of the 0-hop path, which is the metric used to calculate similarity in edge-based modularity approaches [22].
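To make the correspondence concrete, the following small sketch (ours, not the authors') evaluates the unnormalized weighted path-count, i.e., the numerator of Eq. (6) truncated at a maximum path length. Ignoring the normalizing denominator, the weight choice $W_k = \beta\alpha^{k-1}$ approximates the influence matrix of Eq. (1) (case 2 above with $\alpha_1 = \beta$), while $W_k = \alpha^{k}$ approximates the Katz status score (case 3); the variable names and the truncation depth are our own.

```python
import numpy as np

def weighted_path_sum(A, weights):
    """Truncated weighted sum of powers of A: sum_k weights[k-1] * A^k.

    This is the (unnormalized) numerator of Eq. (6); paths longer than
    len(weights) hops are ignored.
    """
    A = np.asarray(A, dtype=float)
    total = np.zeros_like(A)
    power = np.eye(A.shape[0])
    for w in weights:
        power = power @ A                 # A^k for k = 1, 2, ...
        total += w * power
    return total

# Example weight choices (cases 2 and 3 of the list above), truncated at 20 hops:
alpha, beta, K = 0.05, 1.0, 20
influence_weights = [beta * alpha ** (k - 1) for k in range(1, K + 1)]   # ~ beta*A*(I - alpha*A)^-1
katz_weights = [alpha ** k for k in range(1, K + 1)]                     # ~ Katz status score
```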

In summary, the capacity to influence is a measure of the expected q-nearness between vertices. Liben-Nowell and Kleinberg [20] have shown that the Katz measure is the most effective measure for the link prediction task. The influence score, which is a generalization of the Katz score, can then be used to find communities as described in Section 2.1.

4 Evaluation

We applied the influence-based community finding method to small networks studied previously in the literature, as well as to the friendship network extracted from the social photosharing site Flickr. On all the data sets we studied, the performance of the influence-based modularity optimization algorithm was at least as good as that of edge-based modularity (the α = 0 case). In several cases, the influence-based approach led to purer groups.

4.1 Zachary’s Karate Club

The karate club data represents the friendship network of members of a karate club studied by Zachary [30]. During the course of the study, a disagreement developed between the administrator and the club's instructor, resulting in the division of the club into two factions, represented by circles and squares in Figure 2. We used this data to study the communities detected for different values of α and to compare the performance of the influence-based modularity approach to Newman's community-finding algorithms (which is the special case where α = 0) [21].

Using α < 1/λ (Section 2), we get the upper bound α ≤ 0.29. When both Newman's edge-based modularity maximization approach and our method (0 ≤ α ≤ 0.29) are used to bisect the network into just two communities, we recover the two factions observed by Zachary (Figure 2(c)). However, when the algorithms run until a termination condition is reached (no more bisections are possible), different values of α lead to different results, as shown in Figure 2. As stated, when α = 0, the method reduces to Newman's edge-based modularity maximization approach [21], and we get four communities (Figure 2(a)). For 0 < α < 0.14 the number of communities reduces to three (Figure 2(b)). As α is increased further (0.14 ≤ α ≤ 0.29) we get two communities (Figure 2(c)), which are the same as the factions found in Zachary's study.


Fig. 2. Zachary's karate club data. Circles and squares represent the two actual factions, while colors stand for the discovered communities as the strength of ties increases: (a) α = 0, (b) 0 < α < 0.14, (c) 0.14 ≤ α ≤ 0.29.
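For readers who want to reproduce this kind of experiment, the pieces above can be chained together on Zachary's karate club graph, which ships with NetworkX. The following sketch is our own illustration, not the authors' experimental code: it reuses the hypothetical influence_matrix and bisect_by_leading_eigenvector helpers from the earlier sketches, performs only a single bisection, and picks arbitrary α values below the 1/λ bound.

```python
import networkx as nx
import numpy as np

G = nx.karate_club_graph()                        # Zachary's karate club [30]
A = nx.to_numpy_array(G)                          # adjacency matrix
lam = max(abs(np.linalg.eigvals(A)))              # largest eigenvalue of A

for alpha in (0.0, 0.5 / lam, 0.9 / lam):         # alpha = 0 recovers edge-based modularity
    P = influence_matrix(A, alpha)                # Eq. (1) with beta = 1; equals A when alpha = 0
    split = bisect_by_leading_eigenvector(P)      # one bisection into two groups
    print(f"alpha={alpha:.3f}: group sizes {split.sum()} / {(~split).sum()}")
```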


4.2 College Football

We also ran our approach on the US College football data from Girvan et al. [10].1

The network represents the schedule of Division 1 games for the 2000 season, where the vertices represent teams and the edges represent the regular season games between the two teams they connect. The teams are divided into “conferences” (or communities) containing 8 to 12 teams each. Games are more frequent between members of the same conference than between members of different conferences. Inter-conference games, however, are not uniformly distributed, with teams that are geographically closer likely to play more games with one another than teams separated by large geographic distances. However, some conferences have teams playing nearly as many games against teams in other conferences as against teams within their own conference. This leads to the intuition that conferences may not be the natural communities; the natural communities may actually be bigger in size than conferences, with teams that play many games against one another being put into the same community.

Fig. 3. Purity of the communities predicted with different values of α and β in the (a) college football and (b) political books data sets. We see that purity increases with α and is independent of β. When α = 0, the method reduces to the eigenvector-based modularity maximization method postulated by Newman [23].

We measure the quality of the discovered communities in terms of purity. The purity of a community is the fraction of all pairs of teams assigned to that community that actually belong to the same conference. The quality of a network division produced by an algorithm is the average purity of the discovered communities. Figure 3 shows the purity of the discovered communities as α is varied. Purity is independent of β, the weight of direct edges, but increases with α, reaching ∼90% near α = 0.1 (the upper bound on α is determined by the reciprocal of the largest eigenvalue of the adjacency matrix). When α = 0, where the modularity reduces to the edge-based modularity studied by Newman [23], the purity is around 72%. The number of predicted groups changes from eight at α = 0 to four at higher values of α.

1 http://www-personal.umich.edu/∼mejn/netdata/
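The purity measure used here is straightforward to compute. The sketch below is our reading of the definition in the preceding paragraph: within each discovered community, count the fraction of member pairs that share a ground-truth conference label, then average over communities. The exact counting convention of the authors is an assumption on our part.

```python
from itertools import combinations

def community_purity(members, true_label):
    """Fraction of pairs within one community that share the same ground-truth label."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 1.0
    same = sum(1 for a, b in pairs if true_label[a] == true_label[b])
    return same / len(pairs)

def average_purity(communities, true_label):
    """Quality of a division: the average purity of the discovered communities."""
    return sum(community_purity(c, true_label) for c in communities) / len(communities)
```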


4.3 Political Books

We evaluated our approach on the political books data compiled by V. Krebs.2

In this network the nodes represent books about US politics sold by the online bookseller Amazon. Edges represent frequent co-purchasing of books by the same buyers, as indicated by the “customers who bought this book also bought these other books” feature of Amazon. This feature influences the book purchasing decisions of customers. The nodes were given labels liberal, neutral, or conservative by Mark Newman based on a reading of the descriptions and reviews of the books posted on Amazon.3 49 of the books were marked as conservative, 43 books were marked as liberal, and 13 books were marked as neutral. We use our algorithm to find the existing community structure in the network by varying the parameters as shown in Figure 3. Purity is independent of the value of β, and, similarly to the football data, as α increases the number of communities decreases (from four at α = 0 to two at α = 0.08). Also, the purity of the communities increases from 60% at α = 0 to 92% at α = 0.08. Again, α = 0 corresponds to Newman's modularity method. Another observation is that when α = 0.08, leading to the formation of two groups, only the neutral books are split between the groups, which indicates the possibility that some of the 13 neutral books were conservatively inclined and some liberally inclined.

4.4 Flickr Social Network

We also ran our algorithm on the social network data collected from Flickr for the image search personalization study [18]. Flickr is a social photosharing site that allows users to upload images, tag them with descriptive keywords, known as tags, and to join social networks by adding other users as contacts. We believe that the network structure, created by independent decisions to add another photographer as a contact, captures social knowledge, including knowledge about users' photography interests. Thus, users who are interested in a particular topic are more likely to be connected than users interested in different topics.

Since the actual social network on Flickr is rather vast, we sampled it by identifying users who were broadly interested in one of three topics: child and family portraiture, wildlife photography, and technology. For each topic, we used the Flickr API to perform a tag search using a keyword relevant to that topic, and retrieved the 500 ‘most interesting’ images. We then extracted the names of users who submitted these images to Flickr. These users were added to our data set. The keywords used for image search were (a) newborn for the portraiture topic, (b) tiger and beetle for the wildlife topic, and (c) apple for the technology topic. Each keyword is ambiguous. Tiger, for example, could mean a big cat of the panthera genus, but also a flower (tiger lily), a Mac operating system (OS X Tiger), or a famous golfer (Tiger Woods), while beetle could describe a bug or a car. The keyword newborn could refer to human babies just as well as to kittens and puppies, while apple could mean the computer maker or a fruit.

2 http://www.orgnet.com/
3 Available at http://www-personal.umich.edu/~mejn/netdata/


From the set of users in each topic, we identified four (eight for the wildlife topic) who were interested in the topics we identified: i.e., wildlife for the tiger and beetle query terms, portraiture for the newborn query, and technology for the apple query. We studied each user's profile to confirm that the user was indeed interested in that topic. Specifically, we looked at group membership and the user's most common tags. Thus, groups such as “Big Cats”, “Zoo”, “The Wildlife Photography”, etc. pointed to a user's interest in the wildlife topic. In addition to group membership, the tags that users attached to their images could also help identify their interests. For example, users who used the tags nature and macro were probably interested in wildlife rather than technology. Similarly, users interested in human, rather than animal, portraiture tagged their images with baby and family. We used the Flickr API to retrieve the contacts of each of the users we identified, as well as their contacts' contacts. We labeled users by the topic through which they were discovered. In other words, users who uploaded one of the 500 most interesting images retrieved by the query tiger were labeled wildlife, whether or not they were interested in wildlife photography. The contacts and contacts' contacts of the four users within this set identified as being interested in wildlife photography were also labeled wildlife. Although we did not verify that all the labeled users were indeed interested in the topic, we use these soft labels to evaluate the discovered communities.

Once we retrieved the social networks of the target set of users, we reduced them to an undirected network containing mutual contacts only. In other words, every link in the network between two nodes, say A and B, implies that A lists B as a contact and vice versa. This resulted in a network of 5747 users. Of these, 1620 users were labeled technology, while 1337 and 2790 users were labeled portraiture and wildlife, respectively. We ran our community finding algorithm for different values of α on this data set. For α = 0, we found four groups, while for higher values of α (α < 0.01), we found three groups. Figure 4 shows the composition of the discovered groups in terms of soft labels. Group 1 is composed mainly of technology users, group 2 mainly of wildlife users, and group 3 is almost exclusively portraiture. The fourth group, found at α = 0.0, has 932 members, of which 497 are labeled wildlife, 242 technology, and 193 portraiture. Except for the portraiture group (group 3), the groups become purer as α increases.

Fig. 4. Composition of groups discovered in the Flickr social network for different values of α

5 Related Research

There has been some work on motif-based communities in complex networks [1], which, like our work, extends the traditional notion of modularity introduced by Girvan and Newman [10]. The underlying motivation for motif-based community detection is that “the high density of edges within a community determines correlations between nodes going beyond nearest-neighbours,” which is also our motivation for applying the influence-based modularity metric to community detection. Though the motivation of this method is to determine the correlations between nodes beyond nearest neighbors, it does impose a limit on the proximity of neighbors to be taken into consideration, dependent on the size of the motifs. The method we propose, on the other hand, imposes no such limit on proximity. On the contrary, it considers the correlation between nodes in a more global sense. The measure of global correlation evaluated using the influence metric would be equal to the weighted average of correlations when motifs of different sizes are taken. The influence matrix enables the calculation of this complex term in a quick and efficient manner.

The resolution limit is one of the main limitations of the original modularity detection approach [8]. It can account for the comment by Leskovec et al. [19] that they “observe tight but almost trivial communities at very small scales, the best possible communities gradually ‘blend in’ with rest of the network and thus become less ‘community-like’.” However, that study is based on the hypothesis that communities have “more and/or better-connected ‘internal edges’ connecting members of the set than ‘cut edges’ connecting to the rest of the world.” Hence, like most graph partitioning and modularity-based approaches to community detection, their process depends on the local property of connectivity of nodes to neighbors via edges and not on the structure of the network as a whole. Therefore, it does not take into account the characteristics of node types, that is, ‘who’ the nodes that a node is connected to are and how influential these nodes are. In their paper on motif-based community detection, Arenas et al. [1] state that the extended quality functions for motif-based modularity also obey the principle of the resolution limit, but this limit is now motif-dependent, and several resolutions of substructures can be achieved by changing the motif. However, it would be difficult to verify which resolution of substructures is closest to the natural communities. In influence-based modularity, on the other hand, the resolution limit would depend on the probability of transmission of the effect between nodes, i.e., the strength of ties. The probability of transmission of effect can indeed be calculated from the graph, by, say, observing the dynamics of the spread of an idea within a graph at different times.

As stated before, Liben-Nowell and Kleinberg [20] have shown that the Katz measure is the most effective measure for the link prediction task, better than hitting time, PageRank [26], and its variants. Thus we use the influence score, which is a generalization of the Katz score, to detect communities and compute rankings of individuals.

Recently, researchers have used probabilistic models, e.g., mixture models, for community discovery. These models can probabilistically assign a node to more than one community, as it has been observed that “objects can exhibit several distinct identities in their relational patterns” [15]. This may indeed be true, but whether the nodes in the network should be divided into distinct communities, or the probabilities with which each node belongs to a community should be discovered, really depends on the specific application. By this we mean that if the application we are interested in is finding the natural communities, say in the karate club data, and we use a probabilistic method (say [15]), we would be assigning the nodes to the groups to which their probability of belonging is the highest, and the communities thus formed do not necessarily portray the observed division of the network into natural communities.


6 Conclusion and Future Work

We have proposed a new definition of community in terms of the capacity of nodes to influence each other. We gave a mathematical formulation of this effect in terms of the number of paths of any length that link two nodes, and redefined modularity in terms of the influence metric. We use the new definition of modularity to partition a network into communities. We applied this framework to networks well studied in the literature and found that it produces results at least as good as the edge-based modularity approach.

Although the formulation developed in this paper applies equally well to directed graphs, we have only implemented it on undirected ones. Hence future work includes implementation of the algorithm on directed graphs, which are common on social networking sites, as well as applying it to bigger networks. The influence matrix approximates the capacity to influence along the lines of the independent cascade model of information spread. Future work includes approximating the capacity to influence along other models of information spread, such as the threshold influence model. Leskovec et al. [19] state that they “observe tight but almost trivial communities at very small scales, the best possible communities gradually ‘blend in’ with rest of the network and thus become less ‘community-like’.” However, the hypothesis that they employ to detect communities is that communities have “more and/or better-connected ‘internal edges’ connecting members of the set than ‘cut edges’ connecting to the rest of the world.” Hence, like most graph partitioning and modularity-based approaches to community detection, their process depends on the local property of connectivity of nodes to neighbors via edges and is not dependent on the structure of the network as a whole. Besides, it does not take into account the heterogeneity of node types, that is, ‘who’ the nodes that a node is connected to are and how influential these nodes are. Therefore, we argue that a global property, such as the measure of influence, is a better approach to community detection. It remains to be seen whether communities will similarly ‘blend in’ with the larger network if one uses the influence metric to discriminate them.

Acknowledgements

This research is based on work supported in part by the National Science Foundation under Award Nos. IIS-0535182, BCS-0527725, and IIS-0413321.

References

1. Arenas, A., Fernandez, A., Fortunato, S., Gomez, S.: Motif-based communities in complex networks. Mathematical Systems Theory 41, 224001 (2008)
2. Atkin, R.: From cohomology in physics to q-connectivity in social science. International Journal of Man-Machine Studies 4, 341–362 (1972)
3. Brandes, U., Delling, D., Gaertler, M., Gorke, R., Hoefer, M., Nikoloski, Z., Wagner, D.: On modularity clustering. IEEE Trans. on Knowl. and Data Eng. 20(2), 172–188 (2008)
4. Clauset, A.: Finding local community structure in networks. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics) 72(2) (2005)
5. de Sola Pool, I., Kochen, M.: Contacts and influence. Social Networks 1(1), 39–40 (1978–1979)
6. Ferrar, W.L.: Finite Matrices. Oxford University Press, Oxford (1951)
7. Fiedler, M.: Algebraic connectivity of graphs. Czech. Math. J. 23, 298–305 (1973)
8. Fortunato, S., Barthelemy, M.: Resolution limit in community detection. Proc. Natl. Acad. Sci. USA 104, 36 (2007)
9. Ghosh, R., Lerman, K.: Community detection using a measure of global influence. In: Proc. of the 2nd KDD Workshop on Social Network Analysis, SNAKDD 2008 (2008)
10. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99, 7821 (2002)
11. Goldenberg, J., Libai, B., Muller, E.: Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters (2001)
12. Granovetter, M.: The strength of weak ties. The American Journal of Sociology (May 1973)
13. Katz, L.: A new status index derived from sociometric analysis. Psychometrika 18, 39–40 (1953)
14. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through a social network. In: KDD 2003: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 137–146. ACM, New York (2003)
15. Koutsourelakis, P.S., Eliassi-Rad, T.: Finding mixed-memberships in social networks. In: AAAI Spring Symposium on Social Information Processing (2008)
16. Legrand, J.: How far can q-analysis go into social systems understanding? In: Fifth European Systems Science Congress (2002)
17. Leicht, E.A., Newman, M.E.J.: Community structure in directed networks. Physical Review Letters 100, 118703 (2008)
18. Lerman, K., Plangprasopchok, A., Wong, C.: Personalizing results of image search on Flickr. In: AAAI Workshop on Intelligent Techniques for Web Personalization (2007)
19. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of community structure in large social and information networks. In: Proceedings of the World Wide Web Conference (2008)
20. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58(7), 1019–1031 (2007)
21. Newman, M.E.J.: Detecting community structure in networks. The European Physical Journal B 38, 321–330 (2004)
22. Newman, M.E.J.: Fast algorithm for detecting community structure in networks. Physical Review E 69, 066133 (2004)
23. Newman, M.E.J.: Finding community structure in networks using the eigenvectors of matrices. Physical Review E 74, 036104 (2006)
24. Newman, M.E.J.: Modularity and community structure in networks. Proc. Natl. Acad. Sci. USA 103, 8577 (2006)
25. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Physical Review E 69, 026113 (2004)
26. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project (1998)
27. Pothen, A., Simon, H., Liou, K.P.: Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl. 11, 430–452 (1990)
28. Tong, H., Faloutsos, C., Pan, J.: Fast random walk with restart and its applications. In: Sixth International Conference on Data Mining, ICDM 2006, pp. 613–622 (2006)
29. Tong, H., Papadimitriou, S., Yu, P.S., Faloutsos, C.: Proximity tracking on time-evolving bipartite graphs. In: SDM, pp. 704–715. SIAM, Philadelphia (2008)
30. Zachary, W.W.: An information flow model for conflict and fission in small groups. Journal of Anthropological Research 33, 452–473 (1977)
31. Zhou, H.: Network landscape from a Brownian particle's perspective. Physical Review E 67 (2003)


Communication Dynamics of Blog Networks

Mark Goldberg1, Stephen Kelley1, Malik Magdon-Ismail1, Konstantin Mertsalov1, and William (Al) Wallace2

1 CS Department, RPI, 110 8th Street, Troy, NY
{goldberg,kelles,magdon,mertsk2}@cs.rpi.edu

2 DSES Department, RPI, 110 8th Street, Troy, NY

Abstract. We study the communication dynamics of Blog networks, focusing on the Russian section of LiveJournal as a case study. Communication (blogger-to-blogger links) in such online communication networks is very dynamic: over 60% of the links in the network are new from one week to the next, though the set of bloggers remains approximately constant. Two fundamental questions are: (i) what models adequately describe such dynamic communication behavior; and (ii) how does one detect the phase transitions, i.e., the changes that go beyond the standard high-level dynamics? We approach these questions through the notion of stable statistics. We give strong experimental evidence to the fact that, despite the extreme amount of communication dynamics, several aggregate statistics are remarkably stable. We use stable statistics to test our models of communication dynamics, postulating that any good model should produce values for these statistics which are both stable and close to the observed ones. Stable statistics can also be used to identify phase transitions, since any change in a normally stable statistic indicates a substantial change in the nature of the communication dynamics. We describe models of the communication dynamics in large social networks based on the principle of locality of communication: a node's communication energy is spent mostly within its own “social area,” the locality of the node.

1 Introduction

The structure of large social networks, such as the WWW, the Internet, and the Blogosphere, has been the focus of intense research during the last decade (see [1], [7], [8], [12], [17], [19], [20], [21], [22]). One of the main foci of this research has been the development of dynamic models of network creation ([2], [11], [22], [18]), which incorporate two fundamental elements: network growth, with nodes arriving one at a time; and some form of preferential attachment, in which an arriving node is more likely to attach itself to a more prominent existing node than to a less prominent one (the rich get richer).

Once a network has grown and stabilized in size, how does it evolve? Such an evolution is governed by the communication dynamics of the network: links being broken and formed as social groups form, evolve, and disappear. The communication dynamics of these networks have been studied much less, partially because the typical networks studied (the WWW, the Internet, collaboration networks) mainly exhibit growth dynamics and not communication dynamics. Clearly, as a network matures, growth (the addition of new users) becomes a minor ingredient of the total change (see Figure 1). Further, links in a socially dynamic network such as the Blogosphere should not be interpreted as static. The posts made by a blogger a week ago may not be reflective of his/her current interests and social groups. In fact, blog networks display extreme communication dynamics. Over the 20 week period shown in Figure 1, in a typical week, 510,000 pairs of bloggers communicated via blog comments. Out of those, about 380,000 are between pairs of bloggers who did not communicate the week before, i.e., over 70% of the communications are new. What models adequately describe the dynamics of the communications in such networks, which have more or less stabilized in terms of growth?

Fig. 1. Edge and vertex dynamics (weekly vertex growth and fraction of new edges). Clearly the rate of growth is decreasing; however, the fraction of new edges which appear in a week remains approximately constant at over 70%.

To begin to address this question, one must first develop methods for testing the validity of a model. In such an environment of extreme stochastic dynamics, one cannot hope to replicate the dynamics of the individual communications; this explains our focus on the evolution of interesting macroscopic properties of the communication dynamics. Particularly interesting ones are those which are time invariant. We refer to such properties as stable statistics. As we demonstrate, even in such an active environment, certain statistics are remarkably stable. For example: the power-law coefficient for the in-degree distribution, the clustering coefficient, and the size of the giant component (see Table 1).

1.1 Our Contributions

In this paper, we first demonstrate by experimentation that a certain set of statistics is stable in the graph of the Blogosphere. The stability implies that these (and conceivably some other) statistics can be used to validate models, characterize networks, and identify their phases. Second, we present several models for communication dynamics that are based on the principle of locality of communication. We show via simulation that our models stabilize to an equilibrium in which the aggregate statistics of the communication dynamics are stable. Furthermore, among the set of models we tested, we select the one whose stable statistics best reproduce the statistics of the observed network.

Stable Statistics. Our case study was the Russian section of LiveJournal. During the observed period of 20 weeks, close to 153,000 users were active in any one-week period. The size of this set is quite stable (it typically changes by 1 to 2%), although the makeup of the set changes drastically from week to week. Surprisingly, many aggregated statistics computed for the Blogograph show strong stability. Among those stable statistics are: the distribution of the in-degrees and the out-degrees of the nodes; the (overlapping) coalition distribution, as described by the cluster density and size; and the size of the giant component.

The nodes of the Blogograph represent bloggers, and the directed edges between them represent all pairs {A, B} where blogger A visited the blog of B during the week in question and left a comment on a specific post already in the blog. We consider the following four types of stable statistics (a sketch of computing representatives of them follows the list):

(i) Individual Statistics: properties of individual nodes, such as the in-degree and out-degree distributions of the graph

(ii) Relational Statistics: properties of edges in the graph, such as the persistence of edges and clustering coefficients

(iii) Global Statistics: properties reflecting global information, such as the size and diameter of the largest component and the total density

(iv) Community Statistics: properties relating to group structure, such as the community size and density distributions
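A minimal sketch (ours, not the authors' toolkit) of how representatives of the first three types could be computed for one weekly snapshot with NetworkX; community statistics are omitted here because they depend on the clustering procedure of Section 2, and the dictionary keys are our own naming.

```python
import networkx as nx

def snapshot_statistics(G):
    """Compute a few of the stable statistics for one weekly blogograph snapshot G (a DiGraph)."""
    stats = {}
    # (i) individual statistics: in- and out-degree distributions
    stats["in_degrees"] = sorted(d for _, d in G.in_degree())
    stats["out_degrees"] = sorted(d for _, d in G.out_degree())
    # (ii) relational statistics: average clustering coefficient (on the undirected version)
    stats["clustering"] = nx.average_clustering(G.to_undirected())
    # (iii) global statistics: size of the giant (weakly connected) component and total density
    giant = max(nx.weakly_connected_components(G), key=len)
    stats["giant_component_size"] = len(giant)
    stats["density"] = nx.density(G)
    return stats
```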

The purpose of collecting these statistics is two-fold. First, they create a baseline which describes the normal behavior of individuals, communities, and the network as a whole. Once this base has been established, anomalous behavior at each of these levels can be identified and investigated further. Second, stable statistics can be used for testing any model of the network dynamics, as any model which attempts to replicate the communication dynamics must, in particular, be able to reproduce these statistics. Furthermore, the quality of a model can be measured by how well the statistics computed from the network generated by the model (in equilibrium) replicate those observed in the real-life network.

Locality based Models of Communication Dynamics. Existing growth-based models fail to adequately replicate the observed stable statistics, as they do not capture communication dynamics. We consider models for communication dynamics which take as input: (a) the current (observed) communication graph; and (b) each user's out-degree (communication energy) at the next time step (or a distribution over the user's out-degree). These two inputs are standard for existing growth models (such as the preferential attachment growth model). Such models are only applicable when the communications are open (observable to all nodes). The output is the communication graph at the next time step, based on the model for probabilistic attachment of each node's out-edges.

Page 49: [Lecture Notes in Computer Science] Advances in Social Network Mining and Analysis Volume 5498 ||

Communication Dynamics of Blog Networks 39

We discuss intuitive extensions of growth models for modeling communication dynamics and illustrate that these extensions are inadequate for modeling the observed stable statistics. We present a locality based model which relies on two fundamental principles to more accurately reflect the observed communication dynamics. First, our concept of locality reduces the set of nodes a node can attach to in the next time step (a week in our case). This locality is based on structural properties of the current (observable to all) communication graph. The locality represents a semi-stable set of “neighbor” nodes that an individual is highly likely to connect to, and it can be interpreted as that individual's view of the communities she belongs to. We test various structural (graph theoretic) definitions of a node's social locality, ranging from trivial localities such as the entire graph to notions of a node's neighborhood (e.g., the 2-neighborhood; the clusters to which a node belongs). Second, after obtaining a node's locality, one must specify the attachment mechanism, i.e., the mechanism used by the individual to select the nodes in her locality to which she will connect at the next time step. We test a number of different attachment mechanisms, ranging from uniform attachment to some form of preferential attachment. Thus, we present results using each of the various choices for the locality and attachment mechanism.
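To make the two ingredients, locality and attachment mechanism, concrete, here is a hedged sketch of one simulation step. It is our own illustration, not the authors' simulator: it fixes the locality to the node's 2-neighborhood in the previous week's graph and the attachment mechanism to preferential attachment by in-degree, which are just two of the options discussed above, and the input format is assumed.

```python
import random
import networkx as nx

def next_week(G_prev, out_degree):
    """One step of a locality-based communication model.

    G_prev     : last week's blogograph (nx.DiGraph).
    out_degree : dict mapping each blogger (a node of G_prev) to the number of
                 comments (out-edges) she will place next week, her communication energy.
    """
    G_next = nx.DiGraph()
    G_next.add_nodes_from(G_prev.nodes())
    undirected = G_prev.to_undirected()
    for node, k in out_degree.items():
        # locality: nodes within two hops of `node` in last week's graph
        locality = set(nx.ego_graph(undirected, node, radius=2).nodes())
        locality.discard(node)
        if not locality:
            continue
        # preferential attachment within the locality, weighted by last week's in-degree
        candidates = list(locality)
        weights = [G_prev.in_degree(c) + 1 for c in candidates]   # +1 avoids zero weights
        targets = set()
        while len(targets) < min(k, len(candidates)):
            targets.add(random.choices(candidates, weights=weights, k=1)[0])
        G_next.add_edges_from((node, t) for t in targets)
    return G_next
```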

Such probabilistic models are Markov chains, and we test a model's performance by comparing the values it produces for the stable statistics after it has equilibrated. We find experimentally that the mixing times are small and the equilibrium statistics are independent of the starting state (the chains are ergodic), hence the equilibrium distribution is unique. Our results indicate that our locality based model, with locality defined as the union of clusters to which a node belongs and a preferential attachment mechanism, produces the best values for the stable statistics.

2 Clusters

The notion of a social community is crucial to our model of a Blog network. The underlying idea of our model is that every user selects the nodes to visit (to leave a comment) from the set of nodes that belong to a relatively small “area” of that node. Our experiments with different definitions of the local area of a node show that the best approximation to the observed statistics is achieved if the area is taken as the union of the clusters containing the given node. Our definition of network clusters is borrowed from [4], [5], [6], with an important specification of the notion of the density of a set of nodes in a network.

Definition. Given a graph G(V, E), let a function D, called the density, be defined on the set of all subsets of V. Then a set C ⊆ V is called a cluster if it is locally maximal w.r.t. D in the following sense: for every vertex x ∈ C (resp. x ∉ C), removing x from C (resp. adding x to C) creates a set whose density is smaller than D(C).

The idea of the definition matches the common understanding of a social community as a set of members that forge more communication links within the set than with those outside it.


The function D is not specified by the definition, but its precise formulation is crucial in "catching" the nature of social communities. The density function considered in [3] is as follows:

    D(C) = w_in / (w_in + w_out),                                        (1)

where w_in is the number of edges xy with x, y ∈ C and w_out is the number of edges xy with either x ∈ C and y ∉ C, or x ∉ C and y ∈ C (to allow for directed graphs). The main deficiency of this definition of a cluster as a computational representation of a social community is that it is easy to find examples of networks that permit very large and loosely connected clusters which intuitively do not represent any community. The idea of our modification of (1) is to introduce an additional term which represents the edge probability within the set:

    D(C) = w_in / (w_in + w_out) + λ · 2 w_in / (|C|(|C| − 1)),          (2)

where the parameter λ depends on the specific network under consideration and is to be selected by the researcher. For our experiments, we selected λ = 0.125.
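
For concreteness, the modified density can be evaluated directly from an edge list. The following Python sketch is our illustration (the function and argument names are not from the paper):

    def modified_density(C, edges, lam=0.125):
        """Sketch of the modified cluster density D(C) of Eq. (2).

        C     -- iterable of vertices forming the candidate cluster
        edges -- iterable of (u, v) pairs (directed edges of the blogograph)
        lam   -- the parameter lambda of Eq. (2)
        """
        C = set(C)
        if len(C) < 2:
            return 0.0
        w_in = sum(1 for u, v in edges if u in C and v in C)
        w_out = sum(1 for u, v in edges if (u in C) != (v in C))
        if w_in + w_out == 0:
            return 0.0
        edge_prob = 2.0 * w_in / (len(C) * (len(C) - 1))
        return w_in / (w_in + w_out) + lam * edge_prob

A cluster in the sense of the definition above is then any set C for which removing any member, or adding any outside vertex, lowers this value; a local search started from a seed edge can grow such sets.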

3 Data

We define the blogograph as a directed, unweighted graph representing the communication of the blog network within a fixed time period. There is a vertex in the blogograph representing each blogger and a directed edge from the author of any comment to the owner of the blog where the comment was made during the observed time period. Parallel edges are not allowed, and a comment is ignored if the corresponding edge is already present in the graph. Loops, i.e. comments on a blogger's own blog, are ignored as well. To study the communication dynamics, we consider consecutive weekly snapshots of the network; the communication graph contains the bloggers that either posted or commented during a week, and the edges represent the comments that appeared during the week. We chose to split the graphs into one-week periods due to the highly cyclic nature of activity in the blogosphere (see Figure 3). An illustration of the blogograph's construction is given in Figure 2.
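
As an illustration, a weekly blogograph can be assembled from raw comment records in a few lines. The sketch below is ours; the record format, the optional list of posters, and the function name are assumptions rather than the authors' code:

    def build_blogograph(week_comments, week_posters=()):
        """Sketch: assemble one weekly blogograph.

        week_comments -- iterable of (commenter, blog_owner) pairs for comments
                         that appeared during the week
        week_posters  -- bloggers who posted during the week (they are vertices
                         even if they received no comments)
        Returns (vertices, edges); parallel edges and loops are discarded.
        """
        vertices = set(week_posters)
        edges = set()
        for commenter, owner in week_comments:
            vertices.add(commenter)
            vertices.add(owner)
            if commenter != owner:             # loops (own-blog comments) ignored
                edges.add((commenter, owner))  # a set keeps at most one parallel edge
        return vertices, edges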

The data used for our research was collected from the popular blogging service LiveJournal. As of May 2008, there were more than 15 million users for the whole network; the number of posts during a 24 hour period was approximately 191,000 (see http://www.livejournal.com/stats). Much of the communication in LiveJournal is public, which allows for open access. LiveJournal provides a real time RSS update feature that publishes all open posts that appear on any of the hosted blogs. We record the permanent addresses of the posts and wait for the comments to accumulate. In our experience, the overwhelming majority of comments appear on these posts within two weeks of the posting date.


Fig. 2. Blogograph generation example (threads on Alice's and Bill's blogs). Vertices are placed for every blogger who posted or commented, and edges are placed from the author of the comment to the author of the post (the blog owner). Parallel edges and loops are not allowed.

Table 1. Statistics for the observed blogograph: order of the graph (|V|), graph size (|E|), fraction of vertices that are part of the giant component (GC), clustering coefficient (C), average separation (d), power law exponent (α)

week  |V|      |E|      GC      C       d      α
49    155,615  530,160  95.88%  0.0639  5.333  2.63
50    156,026  532,189  95.91%  0.0644  5.327  2.66
51    155,093  527,364  95.62%  0.0635  5.316  2.65
52    151,559  516,483  95.62%  0.0635  5.316  2.71
1     118,979  327,356  93.55%  0.0573  5.777  2.92
2     142,478  444,457  95.14%  0.0587  5.392  2.68
3     159,436  559,506  96.16%  0.0629  5.268  2.68
4     158,429  550,436  95.60%  0.0631  5.224  2.67
5     156,144  534,917  95.49%  0.0627  5.293  2.72
6     156,301  526,194  95.70%  0.0615  5.338  2.72
7     154,846  523,235  95.44%  0.0622  5.337  2.69
8     156,064  528,363  95.59%  0.0609  5.320  2.69
9     156,362  524,441  95.58%  0.0602  5.377  2.68
10    154,820  523,304  95.48%  0.0593  5.368  2.68
11    155,267  516,280  95.13%  0.0600  5.356  2.68
12    156,872  514,269  95.20%  0.0590  5.367  2.63
13    155,338  510,070  95.42%  0.0601  5.342  2.71
14    155,099  506,892  95.19%  0.0607  5.309  2.73
15    153,440  504,850  95.32%  0.0601  5.303  2.73
16    154,012  512,094  95.34%  0.0599  5.298  2.60
17    151,427  503,802  95.30%  0.0611  5.288  2.75

Thus, our screen-scraping program visits the page of a post after it has been published for two weeks and collects the comment threads. We then generate the communication graph.

We have focused on the Russian section of LiveJournal as it is reasonably but not excessively large (currently close to 580,000 bloggers out of the total 15 million) and almost self-contained. We identify Russian blogs by the presence of Cyrillic characters in the posts.


Fig. 3. (a) Comments per day; (b) comments per hour. Number of comments per day that appeared between January 14, 2008 and April 6, 2008, and number of comments per hour during the week between March 24, 2008 and March 30, 2008. The periodic drops in the number of comments per day correspond to Saturdays and Sundays.

Technically this also captures the posts in other languages with a Cyrillic alphabet, but we found that the vast majority of the posts are in Russian. The network of Russian bloggers is very active. On average, 32% of all posts contain Cyrillic characters. LiveJournal blogging has become a cultural phenomenon in Russia. Discussion threads often contain intense and interesting discussions which encourage communication through commenting. Our work is based on data collected between December 2007 and April 2008. The basic statistics about the size of the obtained data are presented in Table 1. A simpler set of statistics on a smaller set of observed data is presented in [16].

4 Stable Statistics

The observed communication graph has interesting properties. The graph is very dynamic on the level of nodes and edges but has stable aggregated statistics. About 75% of active bloggers will also be active in the next week. Further, about 28% of edges that existed in a week will also be found in the next week. A large part of the network changes weekly, but a significant part is preserved. The stability of various statistics of the blogograph is presented in Table 1.


The giant component (GC) is the largest connected (not necessarily strongly connected) subgraph of the blogograph viewed as undirected. A giant component of similar size has been observed in other large social networks [18], [14]. The clustering coefficient (C) refers to the probability that the neighbors of a node are connected. The clustering coefficient of a node with degree k is the ratio of the number of edges between its neighbors and k(k − 1). The clustering coefficient of the graph is defined to be the average of the node clustering coefficients. The observed clustering coefficient is stable over multiple weeks and significantly different from the clustering coefficient in a random graph with the same out-degree distribution, which is 0.00029. The average separation (d) is the average shortest path between two randomly selected vertices of the graph. We computed it by sampling 10,000 random pairs of nodes and finding the undirected shortest path between them. The observed value in the blogograph is similar to what has been found in many other social networks ([18], [23]).
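
The average separation can be estimated exactly as described, by sampling node pairs and running breadth-first search on the undirected graph. The following sketch is our illustration (the adjacency-dictionary representation is an assumption):

    import random
    from collections import deque

    def bfs_distance(adj, source, target):
        """Undirected BFS distance between source and target; None if disconnected."""
        if source == target:
            return 0
        seen, frontier = {source}, deque([(source, 0)])
        while frontier:
            node, dist = frontier.popleft()
            for nbr in adj[node]:
                if nbr == target:
                    return dist + 1
                if nbr not in seen:
                    seen.add(nbr)
                    frontier.append((nbr, dist + 1))
        return None

    def average_separation(adj, samples=10000):
        """Estimate the average separation d by sampling random node pairs."""
        nodes = list(adj)
        total, found = 0, 0
        for _ in range(samples):
            u, v = random.sample(nodes, 2)
            dist = bfs_distance(adj, u, v)
            if dist is not None:          # skip pairs in different components
                total += dist
                found += 1
        return total / found if found else float('inf')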

Many large social networks ([2], [14]) display a power law in the degree distribution, P(k) ∝ k^(−α), where P(k) is the probability that a node has degree k. Figure 5 shows the mean in-degree distribution of the collected blogographs. In these graphs, we observed a power law tail with parameter α ≈ 2.70, which is stable from week to week. This value was computed using the maximum likelihood method described in [10] and Matlab code made available by Aaron J. Clauset.
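
For reference, the maximum likelihood method of [10] has, for a chosen cutoff k_min, a standard closed-form approximation; the short sketch below is ours and is not the Matlab code referred to above:

    import math

    def power_law_alpha(in_degrees, k_min):
        """Approximate MLE for the power-law exponent alpha (Clauset et al. [10]),
        fit to the tail of the distribution, i.e. to degrees >= k_min."""
        tail = [k for k in in_degrees if k >= k_min]
        n = len(tail)
        return 1.0 + n / sum(math.log(k / (k_min - 0.5)) for k in tail)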

To evaluate the dynamics in the observed communication we consider the change in the set of links, or edges, from one week to another. Figure 4 shows the distribution of the number of weeks a particular pair of bloggers communicated. It is evident from this plot that the vast majority of communication does not re-occur. Yet, some links reappear every week. We also look at the past relationship between the bloggers who communicated. We define the history of the edge (i, j) that appeared in time cycle t to be the shortest undirected distance between i and j in the graph of the time cycle t − 1. Figure 4 presents the distribution of the edge histories of all observed edges of all time cycles.

Fig. 4. (a) Edge stability: the distribution of the number of weeks an edge appeared; 60% of all edges appeared only once. (b) Edge history: the distribution of the shortest undirected distance between the end points of an edge in the previous time cycle.


Fig. 5. Average in-degree distribution P(x = k) versus k in the blogograph observed over 21 weeks between Dec. 03, 2007 and Apr. 28, 2008; the tail follows a power law with α = 2.70.

Table 2. 19 weeks of communities from the Russian section of LiveJournal. |C| is the number of communities, δ_avg is the average density, and ep is the average edge probability within the communities.

week  |C|    avg size  δ_avg     ep
51    19631  10.0183   0.456677  0.253212
52    19520  10.0615   0.453763  0.252101
1     23187  10.0915   0.473676  0.248130
2     20970  9.98412   0.458161  0.251843
3     17986  9.86184   0.448757  0.254203
4     18510  9.71891   0.453578  0.257481
5     18808  9.88255   0.455823  0.254305
6     19318  9.79242   0.454656  0.253901
7     19343  9.80381   0.456364  0.255236
8     19796  9.83113   0.453577  0.252818
9     20136  9.95401   0.473693  0.252607
10    19670  9.71678   0.45449   0.255778
11    20212  9.66842   0.456908  0.256098
12    20415  9.70331   0.461118  0.255819
13    20030  9.78058   0.455676  0.254681
14    19893  9.74936   0.455234  0.254384
15    19392  9.73407   0.455365  0.254687
16    19113  9.74787   0.454531  0.254721
17    18737  9.72333   0.455775  0.255658

The edge history distribution of the particular observed weeks is very close to the presented distribution (the variation at each point is less than 2%). As the figure suggests, the majority of communicating vertices were at most 3 hops apart in the network in the previous time cycle. This provides evidence for the strong locality of communication that occurs in the observed network.

In addition to looking for stability in structural statistics, it is also useful to examine stable community behavior. Using the notion of clusters discussed previously in this text, we find locally optimal communities using each edge in the graph as a seed. Once all seeds are optimized, duplicates and clusters of size 2 are removed. Statistics of the remaining clusters are shown in Table 2. A size vs. density plot is also given in Figure 6. The general shape and scale of this plot is replicated across all observed weeks.


Fig. 6. A size vs. density plot for week 5 of the observed data. The x-axis is a measure of the community size while the y-axis shows the value of δ. Each point represents a community.

5 Modeling

As previously stated, networks with such strong communication dynamics have not been well modeled. Much of the previous work aims to replicate the growth phase of a network's life-cycle, ignoring the evolution of communication once the network's size stabilizes. Models which replicate these dynamics would be useful as a sand-box within which social hypotheses on information diffusion, the emergence of leaders, and group formation and dissolution can be tested. To be considered useful, any model should create a set of graphs whose statistics come as close as possible to mirroring the statistics of the observed data presented previously.

Before delving into the creation of a new model, let us first consider the modification of a previously existing one. The simplest method of producing a set of evolving graphs is to grow each week's graph using a known network growth algorithm. Vertices can be assigned an out-degree based on the observed data and connected to each other via preferential attachment for each of the weeks. If done correctly, this would yield a set of graphs whose in-degree and out-degree distributions come close to matching the observed data's power law distributions.

Despite this initial positive result, examining the rest of the statistics demonstrates that the model is insufficient. Relational statistics such as edge stability, edge history, and clustering coefficient all significantly depart from the observed values, which we will show in detail further in the paper. This model's inability to recreate these statistics is expected, since it generates each graph independently.

Below, we propose a model which performs its edge connection within some locality in an effort to more closely mirror the edge stability, edge history, clustering coefficient, and community-based statistics of the network.


6 A Locality Based Model

The goal of our model is to produce a sequence of graphs which simulate the connection and reconnection of vertices. Our model specifies how nodes update their edges in response to the observed communication activity. In specifying this model of evolution, we take as input the out-degree distribution of the blogograph. The justification for this is that, while the out-degree distribution would be an interesting object to model, it mainly reflects the individual properties of the users in the network such as the level of energy and involvement of the user. Such quantities tend to be innate to a user. Different people have different social habits; some manage to communicate with hundreds of people while others interact with only a small group. Hence, out-degrees should be specified either ab initio (e.g. from social science theory) or extracted directly from the observed data. We take the latter approach to specifying the out-degree distribution when it comes to testing our model. An early version of this model with preliminary results is presented in [15].

Given the out-degrees for all nodes, the task is now to specify how to attach the out-edges of the nodes and to obtain the in-degree distribution. It is the in-degree distribution that characterizes the global communication structure of the network (for example, who is considered by others to be important). Clearly, the out-degree distribution of a graph alone does not determine its in-degree distribution. Algorithms for generating undirected random graphs with a prescribed degree distribution are well known (see [9], [13], [24]). However, even if those algorithms are expanded to the domain of directed graphs, they will still be insufficient for our purpose of modeling evolution, which requires repeated generation of the next graph given the previous one.

To summarize, we are interested in models which reproduce the observed evolution given the out-degrees of the nodes. Thus, all our locality models assume that a node, when deciding where to attach its communication links, has some fixed budget of emanating edges which it can attach. The main task of our model is to develop an evolution mechanism that re-creates an in-degree distribution close to the observed one.

We use standard graph theory terminology in describing our model (see for example [25]). The sequence of blogographs is represented by directed graphs G_0, G_1, G_2, ..., where at every time step t, G_t = (V, E_t). V is the common vertex set of all known bloggers, V = {v_1, ..., v_n}. An edge (v_i, v_j) is in the edge set E_t if blogger v_i commented on a post by v_j during the time period t. One time period covers one week, which appears to be the natural time scale in the blogosphere.

The input to the model is the set of out-degrees at time t for each vertex, {k_t^1, ..., k_t^n}, and G_{t−1}, the blogograph at time t − 1. The output of the model is G_t, the blogograph at time t. Our model is locality based. At time t, every node v_i identifies its area and assigns its out-edges with destinations in its area.

More formally, denote the area of v_i at time t by A_t^i ⊆ V. A_t^i represents the locality of node v_i at time t. Typically, a node's locality at time t will depend on G_{t−1}, the blogograph at time t − 1. The attachment mechanism is probabilistic for each node.


Algorithm 1. Evolution Model
1: Function Model(T, OutDeg, Area, Prob)
2: // Output: Blogographs G_1, ..., G_T
3: {k_0^1, ..., k_0^n} ← OutDeg
4: Initialize G_0 (e.g. to a random graph)
5: for t = 1 to T do
6:   E_t ← ∅; {k_t^1, ..., k_t^n} ← OutDeg
7:   for i = 1 to n do
8:     A_t^i ← Area(i, G_{t−1}); p_t^i ← Prob(i, A_t^i, G_{t−1})
9:     E_t^i ← Attach(i, A_t^i, p_t^i, k_t^i); E_t ← E_t ∪ E_t^i
10:  end for
11:  G_t ← (V, E_t)
12: end for

Algorithm 2. Edge attachment algorithm
1: Function Attach(i, A_t^i, p_t^i, k_t^i)
2: // Output: E_t^i, the edges in G_t originating at i
3: while k_t^i > 0 do
4:   if Σ_{v ∈ A_t^i} p_t^i(v) > 0 then
5:     Select node v ∈ A_t^i with probability p_t^i(v)
6:     p_t^i(v) ← 0; renormalize p_t^i
7:   else
8:     Select node v ∈ V \ A_t^i with uniform probability
9:   end if
10:  k_t^i ← k_t^i − 1
11:  E_t^i ← E_t^i ∪ {(i, v)}
12: end while

Node v_i attaches its k_t^i out-edges according to its own probability distribution p_t^i, where p_t^i(v) specifies the probability for node v_i to attach to node v, for v ∈ V. The probability distribution p_t^i may depend on A_t^i and G_{t−1} (e.g. higher degree nodes may get higher probabilities). In particular, we assume that Σ_{v ∈ A_t^i} p_t^i(v) = 1, which corresponds to the assumption that every node expends all its communication energy within its local area. Since we do not allow parallel edges, if k_t^i > |A_t^i|, it is not possible for node v_i to expend all its communication energy within its local area A_t^i. In this case, we assume that k_t^i − |A_t^i| edges are attached uniformly at random to nodes outside its area and the remaining edges are attached within its area. The precise algorithm for distributing the edges given the probability distribution p_t^i is given in Algorithm 2.

The evolution model is illustrated in Figure 7. In more detail, the model first obtains the out-degrees (which are exogenously specified). From G_{t−1}, it computes A_t^i and p_t^i for all nodes v_i ∈ V. For all nodes, it then attaches edges according to Algorithm 2. This entire process is iterated for a user-specified number of time steps. The entire process is given in Algorithm 1.


Fig. 7. Model execution flow: starting from an initial random graph with fixed out-degrees, at each iteration every node determines its local area, is assigned an out-degree, and places its edges within the area; the new graph is formed, and the process is iterated.

The inputs to the model are the procedure OutDeg which specifies the out-degrees (assumed to be exogenous), the procedure Area which identifies the local areas of the nodes given the previous graph, and the procedure Prob which specifies the attachment probabilities according to the attachment model. We will now discuss some approaches to defining the areas and the attachment probabilities. When testing our model, we will also need the procedure for obtaining the out-degrees, which will be discussed in Section 7.
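
To make the interplay of the two algorithms concrete, the following Python sketch is our illustration (not the authors' implementation); OutDeg, Area, and Prob are passed in as callables, and a blogograph is represented as a pair (V, E):

    import random

    def attach(i, area, prob, k, outside):
        """Sketch of Algorithm 2: place k out-edges of node i.
        area    -- nodes in i's locality (candidate destinations)
        prob    -- dict mapping each node of area to its attachment probability
        outside -- nodes not in the locality, used once the area is exhausted
        """
        prob = dict(prob)                      # local copy; used entries are zeroed
        edges = set()
        while k > 0:
            if sum(prob.values()) > 0:
                v = random.choices(list(prob), weights=list(prob.values()))[0]
                prob[v] = 0.0                  # equivalent to zeroing + renormalizing
            else:
                candidates = [u for u in outside if (i, u) not in edges]
                if not candidates:             # nowhere left to attach
                    break
                v = random.choice(candidates)
            edges.add((i, v))
            k -= 1
        return edges

    def evolve(V, T, out_deg, area_fn, prob_fn, G0):
        """Sketch of Algorithm 1: produce blogographs G_1, ..., G_T from G_0."""
        graphs, G_prev = [], G0                # G0 is a pair (V, E), e.g. a random graph
        for t in range(1, T + 1):
            E_t = set()
            for i in V:
                A = set(area_fn(i, G_prev)) - {i}        # no loops
                p = prob_fn(i, A, G_prev)
                outside = [u for u in V if u not in A and u != i]
                E_t |= attach(i, A, p, out_deg(i, t), outside)
            G_prev = (set(V), E_t)
            graphs.append(G_prev)
        return graphs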

6.1 Locality Models

A node expends the majority of its communication energy within its local area. This captures the intuition that people mostly communicate within a small group that contains friends, family, colleagues, etc. We propose the following definitions of an area:

1. Global. Every node v_i is aware of the whole network; the local area of v_i is A_t^i = V at every time period t.
2. k-neighborhood. The local area A_t^i of node v_i at time t consists of all v_j such that the undirected shortest distance δ_t(v_i, v_j) ≤ k (see the sketch after this list).
3. Clusters. This definition is based on the notion of a cluster, defined as follows (see also [5], [4], [6]). First, the notion of the density of a set is introduced, which generally can be any function defined on a subset of nodes of the graph (e.g. the ratio of the number of edges within the subset to the number of edges with at least one endpoint in the subset). For every density function, a cluster is defined to be any subset of nodes that is locally maximal w.r.t. its density. Our definition of a cluster permits clusters to overlap. In fact, our experiments show that clusters overlap quite frequently. This is expected in a graph of a social network, where the same member may belong to more than one community represented by clusters. Finally, using the definition of a cluster presented earlier in this paper, we define the local area of a blogger as the union of all clusters of which she is a member. Intuitively, this restricts a blogger's activity to the set of individuals in groups in which she has shown interest previously.
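
For concreteness, the k-neighborhood and cluster-union areas can be computed as in the following sketch (ours); adj is an undirected adjacency dictionary built from G_{t−1}, clusters is a precomputed collection of node sets, and excluding the node itself reflects the fact that loops are not allowed:

    from collections import deque

    def k_neighborhood(adj, v, k):
        """Area definition 2: all nodes within undirected distance k of v
        in the previous week's graph (adj is an undirected adjacency dict)."""
        area, frontier = {v}, deque([(v, 0)])
        while frontier:
            node, dist = frontier.popleft()
            if dist == k:
                continue
            for nbr in adj.get(node, ()):
                if nbr not in area:
                    area.add(nbr)
                    frontier.append((nbr, dist + 1))
        area.discard(v)          # loops are not allowed, so v is not a destination
        return area

    def cluster_union_area(clusters, v):
        """Area definition 3: union of all clusters containing v."""
        area = set()
        for C in clusters:
            if v in C:
                area |= set(C)
        area.discard(v)
        return area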


6.2 Attachment Models

Given the local area A_t^i of the node v_i at time t, the attachment model describes the probability p_{t+1}^i(v_j) of the occurrence of an edge (v_i, v_j) at time t + 1, for v_j ∈ V. We propose the following attachment models:

1. Uniform. Node v_i attaches to any v_j ∈ A_t^i with equal probability

       p_t^i(v_j) = 1 / |A_t^i|,                                        (3)

   and for v_j ∉ A_t^i, p_t^i(v_j) = 0.

2. Preferential Attachment. Node v_i attaches to any v_j ∈ A_t^i with probability

       p_t^i(v_j) ∝ indeg_{t−1}(v_j) + γ,                               (4)

   where indeg_{t−1}(v_j) is the in-degree of vertex v_j in graph G_{t−1} and γ is a constant.

3. Markov Chain. To obtain the attachment probabilities for vertex v_i, we simulate a particle traveling over the undirected edges of graph G_t, starting from the node v_i and randomly selecting edges to travel over until it arrives at the first node v_e ∉ A_t^i. Every time the particle arrives at some node v_j ∈ A_t^i, the counter c_j^i is incremented. After this simulation is repeated a number of times, without resetting the counters c_j^i, ∀ v_j ∈ A_t^i, we determine the attachment probability

       p_t^i(v_j) ∝ c_j^i.                                              (5)

4. Inverse distance. Node v_i attaches to some node v_j ∈ A_t^i with probability

       p_t^i(v_j) ∝ 1 / δ_{t−1}^ρ(i, j),                                (6)

   where δ_{t−1}(i, j) is the shortest undirected distance between vertices v_i and v_j in graph G_{t−1} and ρ is a constant.

The combination of the locality model and attachment model specifies the evolution model that, given the out-degree distribution, will produce a series of graphs that represent the blogograph at different time periods.
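
As an illustration, the Uniform and Preferential Attachment models can be supplied as Prob callables to the model sketch given earlier; the code below is ours, and the default value of γ is an assumption (the paper only states that γ is a constant):

    def uniform_prob(i, area, G_prev):
        """Attachment model 1 (Eq. (3)): equal probability over the area."""
        return {v: 1.0 / len(area) for v in area} if area else {}

    def preferential_prob(i, area, G_prev, gamma=1.0):
        """Attachment model 2 (Eq. (4)): probability proportional to the
        in-degree in the previous graph G_prev = (V, E), plus a constant gamma."""
        _, E = G_prev
        indeg = {v: 0 for v in area}
        for (_, dst) in E:
            if dst in indeg:
                indeg[dst] += 1
        weights = {v: indeg[v] + gamma for v in area}
        total = sum(weights.values())
        return {v: w / total for v, w in weights.items()} if total > 0 else {}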

7 Experiments and Results

In this section we present the results of executing a few of the models and the evaluation of their performance.


To evaluate the performance of the models, we compare the sequence of graphs produced by the model to the sequence of graphs produced by the observed communication in LiveJournal. In particular, we compare the clustering coefficient, the size of the giant component, the average separation between two nodes, the in-degree distributions, and the community size and density distributions. To compare the in-degrees, we compute the point-wise difference of the normalized distributions. Formally, for each graph G_i we compute the normalized distribution D_i(k) = n_k / |V_i|, where n_k is the number of vertices of degree k and |V_i| is the number of vertices in the graph. The difference between the distributions of the observed graph G_o and the generated graph G_g is

    E = Σ_d |D_o(d) − D_g(d)|.                                          (7)

Notice that E ∈ [0, 2] and a lower value of E corresponds to a closer match.

To compare community structure we compute gerr, the average error between the size and density distributions that represent the communities in the generated and observed graphs. This value is computed by splitting the size-density plot into bins, where the bin width on the density axis is 0.05 and on the size axis is 5. The resulting bin sizes are normalized with respect to the number of communities in the graph. The value of gerr is the sum of the differences between the sizes of corresponding bins in the size-density distributions of the observed and generated data.
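
Both comparison measures have a direct implementation; the sketch below is ours (the bin widths follow the text, and D is computed as the fraction of vertices with each in-degree):

    from collections import Counter

    def normalized_distribution(degrees):
        """D(k): fraction of vertices with in-degree k."""
        counts = Counter(degrees)
        n = float(len(degrees))
        return {k: c / n for k, c in counts.items()}

    def distribution_difference(observed_degrees, generated_degrees):
        """E of Eq. (7): point-wise L1 difference of the normalized distributions."""
        Do = normalized_distribution(observed_degrees)
        Dg = normalized_distribution(generated_degrees)
        return sum(abs(Do.get(d, 0.0) - Dg.get(d, 0.0)) for d in set(Do) | set(Dg))

    def community_error(observed, generated, size_bin=5, density_bin=0.05):
        """gerr: compare binned, normalized size-density histograms of communities.
        observed, generated -- lists of (size, density) pairs, one per community."""
        def histogram(points):
            h = Counter((int(s // size_bin), int(d // density_bin)) for s, d in points)
            n = float(len(points))
            return {b: c / n for b, c in h.items()}
        Ho, Hg = histogram(observed), histogram(generated)
        return sum(abs(Ho.get(b, 0.0) - Hg.get(b, 0.0)) for b in set(Ho) | set(Hg))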

To evaluate a particular model we execute enough iterations to let the model stabilize. We determine the stabilization by inspection of plots of the major parameters (including in-degree distribution, clustering coefficient, etc.). Then, we compare the sequence of the graphs produced by the model after the stabilization to the sequence mined from LiveJournal.

Table 3 contains the results of executing models with various combinations of local area and attachment mechanisms, compared to the average parameters of graphs of different observed weeks. Note that for the observed data, the average parameter E is computed by comparing the distributions of graphs corresponding to different observed weeks.

Table 3. The stable parameters of graphs generated by various models compared to the parameters of the observed data

Area       Attch       GC      C          d     E       gerr
Observed               0.9545  0.0613     5.34  0.0289  0.00144
Global     Uniform     0.9867  5.2×10^-6  7.86  1.075   0.04215
Global     P.A. (in)   -       -          -     -       -
Global     P.A. (out)  0.9688  0.00018    5.21  0.427   0.01189
3-Neighb.  Uniform     0.8939  0.00045    5.30  0.4331  0.01792
3-Neighb.  P.A. (in)   -       -          -     -       -
3-Neighb.  P.A. (out)  0.9776  0.00133    4.53  0.1412  0.03504
Clusters   Uniform     0.9646  0.00252    6.73  0.7267  0.03484
Clusters   P.A. (in)   0.9643  0.00149    6.88  0.1713  0.03811
Clusters   P.A. (out)  0.9523  0.03156    6.56  0.5320  0.02034


Figure 8 compares the observed degree distribution to the ones generated by some of the best area/attachment combinations. As defined in Section 4, edge history conveys information about how close the end points of the observed edge were in the previous time cycle and therefore measures the significance of locality in the communications. Figure 8 compares the observed edge history with the edge histories produced by the best models.

7.1 Global Area Model

First, we consider the model with global area, where vertices are aware of and can connect to any other vertex in the network.

In the case of a uniform attachment, the resulting model is very similar to the Erdos-Renyi model. The in-degree distribution and other parameters generated by such a model are predictably very different from the power law degree distribution in the observed graph.

Global area with preferential attachment strictly proportional to the in-degree of the vertices in the graph of the previous iteration results in the formation of a power house: a small set of vertices with very high in-degree that attract all of the out-degree. This effect is caused directly by preferential attachment; since vertices with zero in-degree will never be attached to, any vertex that receives no incoming edges at some iteration will not receive any incoming edges in any of the following iterations. Clearly, a graph with a small set of vertices that attract all of the in-degree is very different from the observed graph.

The combination of global area and preferential attachment proportional to the out-degree of the vertices in the graph of the previous iteration produced results that were more similar to the observed network than the other global models, but the results were also significantly worse compared to models with other area definitions (k-neighborhood and union of clusters). Since this model allows for random selection of the end points of edges from the whole graph, the edge history (Figure 8) is very different from the one observed in the real-life network.

7.2 k-Neighborhood Area Model

We experimented with different values of k (k ∈ {2, 3, 4}) and determined that k = 3 produced the best models.

The combination of 3-neighborhood area and uniform attachment produced a model that showed mediocre results when compared to the observed parameters. The combination of 3-neighborhood area and preferential attachment proportional to the in-degree produced a graph with a small power house in just a few iterations. The 3-neighborhood area with preferential attachment proportional to the out-degree produced a model that generated graphs with in-degree distributions very similar to the observed graph. In particular, the power law tail resembled the tail of the observed graph. The edge history of this model was quite different from the observed one, since most of the end points for new edges are selected such that their distance in the previous iteration's graph was 3.


7.3 Union of Clusters Area Model

An area constructed via the union of clusters in combination with preferential attachment proportional to the in-degree produced in-degree distributions more similar to the observed than other attachment mechanisms combined with this area definition. This model also produced a sequence of graphs with an edge history that was closest to the observed, as is evident from Figure 8.

Fig. 8. (a) In-degree distribution P(d ≥ k) versus k and (b) edge history distribution for the observed network and for the Global + P.A. (out), Clusters + P.A. (in), and 3-Neighb + P.A. (out) models.

Models with this area definition were the only ones that produced non-trivial edge stability, defined as the likelihood of a repetition of a recently observed edge. To evaluate this stability, we consider the number of edges that appeared more than once in 21 iterations of the model after stabilization. Models with global and 3-neighborhood area definitions that did not result in the formation of power houses produced sets of graphs in which less than 1% of edges appeared more than once. Models with an area defined by the union of clusters produced sets of graphs in which, on average, 14% of edges appear more than once in 21 iterations. In particular, a combination of this area definition with preferential attachment proportional to the in-degree produced a sequence of graphs in which 18% of edges appear more than once, while in the observed network, 40% of edges (see Figure 4) appear more than once during the 21 observed weeks.

After considering all of the parameters of the models, we determined the combination of an area defined by the union of clusters with preferential attachment proportional to the in-degrees of the vertices to be the best model to describe the dynamics of communication in the observed network.

8 Conclusion

We have presented a set of statistics which display strong stability even for a dynamic network such as the blogosphere.


Our list of stable statistics is not exhaustive. However, they represent a comprehensive set of interesting properties of a network that any model for communication dynamics should capture.

Our experiments have shown that the communication dynamics of large social networks are best explained as a result of local communication, where the majority of members communicate within their social locality, a relatively small set of nodes reflective of their interests or communities. The best approximation to this locality, among the models we evaluated on LiveJournal data, was the one determined by the union of clusters a node belonged to. Our notion of a cluster is a set of nodes which locally maximizes a cluster density. This notion of a cluster has the important property that it allows clusters to overlap, which is important if a cluster is to represent a social community or coalition.

Many possibilities exist for enhancing the definitions of locality and the attachment mechanisms. One direction which we intend to pursue as future research is the combination of local with global attachment mechanisms.

Acknowledgment. This material is based upon work partially supported by the U.S. National Science Foundation (NSF) under Grant Nos. IIS-0621303, IIS-0522672, IIS-0324947, CNS-0323324, NSF IIS-0634875 and by the U.S. Office of Naval Research (ONR) Contract N00014-06-1-0466 and by the U.S. Department of Homeland Security (DHS) through the Center for Dynamic Data Analysis for Homeland Security administered through ONR grant number N00014-07-1-0150 to Rutgers University. The content of this paper does not necessarily reflect the position or policy of the U.S. Government, and no official endorsement should be inferred or implied.

References

1. Albert, R., Barabasi, A.-L.: Statistical mechanics of complex networks. Reviews of Modern Physics 74, 47–97 (2002)
2. Barabasi, A.L., Jeong, J., Neda, Z., Ravasz, E., Shubert, A., Vicsek, T.: Evolution of the social network of scientific collaborations. Physica A 311, 590–614 (2002)
3. Baumes, J., Chen, H.-C., Francisco, M., Goldberg, M., Magdon-Ismail, M., Wallace, W.: Dynamics of bridging and bonding in social groups, a multi-agent model. In: Third Conference of the North American Association for Computational Social and Organizational Science (NAACSOS 2005), Notre Dame, Indiana, June 26–28 (2005)
4. Baumes, J., Goldberg, M., Krishnamoorthy, M., Magdon-Ismail, M., Preston, N.: Finding communities by clustering a graph into overlapping subgraphs. In: Proceedings of the IADIS International Conference, Applied Computing 2005, pp. 97–104 (2005)
5. Baumes, J., Goldberg, M., Magdon-Ismail, M.: Efficient identification of overlapping communities. In: IEEE International Conference on Intelligence and Security Informatics (ISI), May 2005, pp. 27–36 (2005)
6. Baumes, J., Goldberg, M., Magdon-Ismail, M., Wallace, W.: Identification of hidden groups in communications. In: Handbooks in Information Systems, National Security, vol. 2 (2007)
7. Berger-Wolf, T.Y., Saia, J.: A framework for analysis of dynamic social networks. DIMACS Technical Report 28 (2005)


8. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stat, R., Tomkins, A., Wiener, J.: Graph structure in the web. Computer Networks 33(1-6), 309–320 (2000)
9. Chung, F., Lu, L.: Connected components in random graphs with given degree sequence. Annals of Combinatorics 6, 125–145 (2002)
10. Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical data (2007)
11. Doreian, P., Stokman, E.F.N.: Evolution of social networks. Gordon and Breach (1997)
12. Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the internet topology. In: SIGCOMM, pp. 251–252 (1999)
13. Gkantsidi, C., Mihail, M., Zegura, E.: The Markov chain simulation methods for generating connected power law random graphs. In: Proc. of ALENEX 2003, pp. 16–50 (2003)
14. Goh, K.-I., Eom, Y.-H., Jeong, H., Kahng, B., Kim, D.: Structure and evolution of online social relationships: Heterogeneity in unrestricted discussions. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics) 73(6), 66123 (2006)
15. Goldberg, M., Kelley, S., Magdon-Ismail, M., Mertsalov, K.: A locality model for the evolution of blog networks. In: IEEE Information and Security Informatics (ISI) (2008)
16. Goldberg, M., Kelley, S., Magdon-Ismail, M., Mertsalov, K.: Stable statistics of the blogograph. In: Interdisciplinary Studies in Information Privacy and Security (2008)
17. Kleinberg, J.M., Lawrence, S.: The structure of the web. Science, 1849–1850 (2001)
18. Kossinets, G., Watts, D.J.: Empirical analysis of an evolving social network. Science 311, 88–90 (2006)
19. Kumar, R., Novak, J., Raghavan, P., Tomkins, A.: Structure and evolution of blogspace. Communications of the ACM 33(1-6), 309–320 (2004)
20. Kumar, R., Novak, J., Tomkins, A.: Structure and evolution of online social networks. In: KDD 2006 (2006)
21. Newman, M.: The structure and function of complex networks. SIAM Review 45(2), 167–256 (2003)
22. Newman, M., Barabasi, A.-L., Watts, D.: The structure and dynamics of networks. Princeton University Press, Princeton (2006)
23. Newman, M.E.J.: The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. USA 98, 404 (2001)
24. Stauffer, A.O., Barbosa, V.C.: A study of the edge-switching Markov chain method for the generation of random graphs. arXiv:cs.DM/0512105 (2006)
25. West, D.B.: Introduction to Graph Theory. Prentice Hall, Upper Saddle River (2003)


Finding Spread Blockers in Dynamic Networks

Habiba^1,*, Yintao Yu^2,**, Tanya Y. Berger-Wolf^1,***, and Jared Saia^3,†

1 University of Illinois at Chicago, {hhabib3, tanyabw}@uic.edu
2 University of Illinois at Urbana-Champaign, [email protected]
3 University of New Mexico, [email protected]

Abstract. Social interactions are conduits for various processes spreading through a population, from rumors and opinions to behaviors and diseases. In the context of the spread of a disease or undesirable behavior, it is important to identify blockers: individuals that are most effective in stopping or slowing down the spread of a process through the population. This problem has so far resisted systematic algorithmic solutions. In an effort to formulate practical solutions, in this paper we ask: Are there structural network measures that are indicative of the best blockers in dynamic social networks? Our contribution is two-fold. First, we extend standard structural network measures to dynamic networks. Second, we compare the blocking ability of individuals in the order of ranking by the new dynamic measures. We found that overall, simple ranking according to a node's static degree, or the dynamic version of a node's degree, performed consistently well. Surprisingly, the dynamic clustering coefficient seems to be a good indicator, while its static version performs worse than the random ranking. This provides simple, practical, and locally computable algorithms for identifying key blockers in a network.

1 Introduction

How can we stop a process spreading through a social network? This problem has applications to diverse areas such as preventing or inhibiting the spread of diseases [7, 26, 40], computer viruses^1 [8, 22], rumors, and undesirable fads or risky behaviors [23, 24, 37, 38].

* (No last name). Work supported in part by the Fulbright fellowship.
** Work performed in part while being a visiting student at the University of New Mexico.
*** Work supported in part by the NSF grant IIS-0705822 and NSF CAREER Award 0747369.
† Work supported in part by the NSF grant IIS-0705822, NSF CAREER Award 0644058, and an AFO MURI award.
^1 In particular, we are concerned with computer malware that spreads through social networks, such as email viruses and worms, cell-phone viruses, and other related malware such as the recent MySpace worm.



A common approach to spread inhibition is to identify key individuals whose removal will most dampen the spread. In the context of the spread of a disease, it is a question of finding individuals to be quarantined, inoculated, or vaccinated so that the disease is prevented from becoming an epidemic. We call this set of key individuals the blockers of the spreading process.

There has been significant previous work related to studying and controlling the spread of dynamic processes in a network [9, 10, 11, 16, 18, 22, 23, 26, 35, 40, 43, 44, 46, 47, 51, 54, 57, 59, 60, 67]. Unfortunately, these results have three properties rendering them ineffective for identifying good blockers in large networks. First, many proposed algorithms focus on a slightly different objective: they aim to identify nodes that will be most effective in starting the spread of a process rather than blocking it [44, 47]; or alternatively, nodes that would be most effective in sensing that a process has started to spread, and where the process initiated [9, 10, 11]. In this paper, we are focused specifically on identifying those nodes that are good blockers. Second, algorithms proposed in previous work all require computationally expensive calculations of some global properties over the entire network, or rely on expensive, repeated stochastic simulations of the spread of a dynamic process. In this paper, we present heuristics that identify good blockers quickly, based only on local information.

Finally, perhaps the most critical problem in previous work is the omission of the dynamic nature of social interactions. The very nature of a spreading process implies an explicit time axis [52]. For example, the flow of information through a social network depends on who starts out with the information when, and which individuals are in contact at the starting point with the information carrier [43]. In this paper, we consider explicitly dynamic networks, defined in Section 3.1. In these networks, we study the social interactions over a finite period of time, measured in discrete time steps.

The main contributions of this paper are summarized below.

– We formally define dynamic networks in Section 3.1. This representation of networks encompasses the traditional "aggregate" view of networks defined in Section 3.2 and adds the explicit temporal component to the interactions. The time axis is necessary since most spreading processes take place on networks that evolve over time.

– We formally define the problem of identifying key spread blockers in networks in Section 3.3.

– We modify various network measures, such as the centrality measures and clustering coefficient, to incorporate the dynamic nature of the networks (Section 3.4).

– We compare the reduction in the extent of spread based on removing individuals from a network in the ranking order imposed by various network measures. We identify measures that consistently give a good approximation of the best spread blockers.

– We compare the difference in the sets of top blockers identified by various measures.


– We extensively evaluate our methods on real networks (Section 5). We use the Enron email network dataset, the MIT Reality Mining dataset, the DBLP co-authorship network, and animal population networks of Grevy's zebras, Plains zebras, and onagers.

Ultimately, we show that the dynamics of interactions matters, and moreover that simple local measures, such as degree, are highly indicative of an individual's capacity to prevent the spread of a phenomenon in a population. The implication of our results is that there are practical, scalable heuristics for identifying quarantine and vaccination targets in order to prevent an epidemic.

2 Related Work

Dynamic phenomena such as opinions, information, fads, behavior, and disease spread through a network by contacts and interactions among the entities of the network. Such spreading phenomena have been studied in a number of domains including epidemiology [22, 26, 40, 51, 54, 57, 59], diffusion of technological innovations and adoption of new products [7, 16, 18, 23, 24, 38, 35, 44, 46, 60, 67], voting, strikes, rumors [36, 37, 53, 68], as well as spread of contaminants in distribution networks [8, 9, 10, 11, 46] and numerous others.

One of the fundamental questions about dynamic processes is: Which individuals, if removed from the network, would block the spread of such a process? Several previous results have addressed the problem of identifying such individuals [26, 40, 43]. Eubank et al. [26] experimentally show that global graph theoretic measures like expansion factor and overlap ratio are good indicators for devising vaccination strategies in static networks. Cohen et al. [21] propose another immunization strategy based on the aggregate network model. In particular, they propose an efficient method of picking high degree nodes in a network to immunize, thus inhibiting the spread of disease. Kempe et al. [43] show that a variant of the blocker identification problem is NP-hard. While these problems and suggested approaches are similar to finding good blockers in a network, unfortunately, there are critical differences that make these results inappropriate for our formulation. First of all, our objective is to minimize the expected extent of spread in a network. We do not make any assumption about the source of the spread. Second, almost all the above methods simplify the spreading process by ignoring the time ordering of interactions.

There has also been significant related work on the problem of determining where to place a small number of detectors in a network so as to minimize the time required to detect the spread of a dynamic process, and, ideally, also the location at which the spread began. Berger-Wolf et al. [9] give algorithms for the problem of minimizing the size of the infected population before an outbreak is detected. Berry et al. [10, 11] give algorithms to strategically place sensors in utility distribution networks to minimize worst case time until detection. In [47], Leskovec et al. demonstrate that many objectives of the detection problem exhibit the property of submodularity. They exploit this fact to develop efficient and elegant algorithms for placing detectors in a network.


While the detection problem is related to the problem of blocking a process, it is only concerned with detecting a spreading process once, whereas a good blocker prevents multiple spreading paths. Moreover, the algorithms proposed for the detection problem all require global information and work only for a stable, relatively unchanging network.

Another related problem is that of identifying nodes in a network that are most critical for spreading a dynamic process. Kempe et al. [44] show that identifying key spreaders – individuals that help to propagate the spread most – is NP-hard, but admits a simple greedy (1 − 1/e)-approximation. Later, Mossel and Roch [55] showed that the general case of finding a set of nodes with the largest "influence" is NP-hard, and has a (1 − 1/e − ε) approximation algorithm. Unfortunately, this approximation algorithm is computationally intensive. Strong inapproximability results for several variants of identifying nodes with high influence in social networks have been shown in [19]. Asur et al. in [5] present an event based characterization of critical behavior in interaction graphs for the purposes of modeling evolution, link prediction, and influence maximization.

Finally, Aspnes et al. [4] have studied the inoculation problem from a graph theoretic perspective. They show that finding an optimum inoculation strategy is related to the sum-of-squares partition problem. Moreover, they show that the social welfare of an inoculation strategy found when each node is a selfish agent can be significantly less than the social welfare of an optimal inoculation strategy.

3 Definitions

Populations of individuals interacting over time are often represented as networks, or graphs, where the nodes correspond to individuals and a pairwise interaction is represented as an edge between the corresponding individuals. The idea of representing societies as networks of interacting individuals dates back to Lewin's earlier work on group behavior [48]. Typically, there is a single network representing all interactions that have happened during the entire observation period. We call this representation an aggregate network (Section 3.2). In this paper we use an explicitly dynamic network representation (Section 3.1) that takes the history of interactions into account.

3.1 Dynamic Network

We represent a dynamic network as a series ⟨G_1, ..., G_T⟩ of static networks where each G_t is a snapshot of the individuals and their interactions at time t. For this work, we assume that the time during which the individuals are observed is finite. For simplicity, we also assume that the time period is divided into discrete steps {1, ..., T}. The nontrivial problem of appropriate time discretization is beyond the scope of this paper. We assume that an interaction between a pair of individuals takes place within one time step.

Definition 1. Let {1, ..., T} be a finite set of discrete time steps. Let V = {1, ..., n} be a set of individuals. Let G_t = (V_t, E_t) be a graph representing the snapshot of the network at time t.


V_t ⊆ V is the subset of individuals observed at time t. An edge (u_t, v_t) ∈ E_t if individuals u and v have interacted at time t. Further, for all v ∈ V and t ∈ {1, ..., T − 1}, the edges (v_t, v_{t+1}) ∈ E are directed self edges of individuals across time steps.

A dynamic network G_D = ⟨G_1, ..., G_T⟩ is the graph G_D = (V, E) of the time series of graphs G_t such that V = ∪_t V_t and E = (∪_t E_t) ∪ (∪_{t=1}^{T−1} (v_t, v_{t+1})).

The definition is equivalent to an undirected multigraph representation in [43]. Figure 1 shows an example of several dynamic networks that have the same unweighted aggregate network representation.

Fig. 1. Example of several dynamic networks (a)–(d) that have the same unweighted aggregate network representation (e). Figures (a)–(d) show dynamic networks of three individuals interacting over four time steps. The solid line edges represent interactions among individuals in a time step. Empty circles are individuals observed during a time step. While at any given time step some individuals may be unobserved, this particular example shows all the individuals being observed at all time steps. Figure (e) shows an unweighted aggregate network that has the same interactions as every dynamic network in the example. In figures (a)–(c) each edge of the aggregate representation has multiplicity two, while in figure (d) every edge has multiplicity four.

3.2 Aggregate Network

The aggregate network is the graph G_A = (V, E) of individuals V and their interactions E observed over a period of time. In this representation an edge exists between a pair of individuals if they have ever interacted during the observed time period. Multiple interactions between a pair of individuals over time are represented as a single, possibly weighted, edge or multiple edges between them. This representation provides an aggregate view of the population where the information about the timing and order of interactions is discarded. In this work we represent aggregate networks as multigraphs.

Definition 2. Let {1, ..., T} be a finite set of discrete time steps. Let V_t be the set of individuals observed at time t and let E_t be the set of interactions among individuals V_t at t. Then the aggregate graph G_A = (V, E) of such a network is the set of individuals V and interactions E such that V = ∪_t V_t and (u, v) ∈ E if ∃ (u_t, v_t) at some time step t ∈ {1, ..., T}.
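
In code, the two views differ only in whether the time stamps are kept. A minimal sketch (ours), with a dynamic network stored as a list of per-time-step edge sets:

    def aggregate(snapshots):
        """Collapse a dynamic network into its aggregate multigraph.

        snapshots -- list indexed by time step; snapshots[t] is the set of
                     interaction edges (u, v) observed at that step.
        Returns (V, E) where E is a list (multiset) of edges with time discarded.
        """
        V, E = set(), []
        for edges_t in snapshots:
            for (u, v) in edges_t:
                V.update((u, v))
                E.append((u, v))      # multigraph: repeated interactions are kept
        return V, E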


Using this aggregate network model, the structure and properties of many social networks have been studied from different perspectives [6, 13, 12, 15, 41]. However, as we have mentioned, this and other similar models do not explicitly consider the temporal aspect of the network.

3.3 Spread Blockers

We now formalize the notions of processes spreading in a network and individuals blocking this spread.

Spread(·) is a function that gives the overall average extent of spread in a network, that is, the expected number of individuals affected by a stochastic spreading process within a specified number of time steps. The estimate of the spread depends on the model of the spreading process, the structure of the network, and, of course, the number of time steps under consideration. Spread_v(·) is the expected spread in a network when the spreading process is initiated by the individual v. Given a model of a spreading process M and a distribution of the probability of infection 𝒳 : E → [0, 1], we define the spreading functions as follows:

    Spread_v : {Networks × Spread Models × Probability × Time} → R^+

    Spread(G, M, 𝒳, T) = (1/|V|) Σ_{v∈V} Spread_v(G, M, 𝒳, T)           (1)

The limit equilibrium state of spread is denoted by

    Spread(G, M, 𝒳) = Spread(G, M, 𝒳, ∞)                                 (2)

For a fixed spread model, probability distribution, and time period we will use the overloaded shorthand notation Spread(G).

We define Bl_X(·) as a function that measures the reduction in the expected spread size after removing the set X of individuals from the network. Hence, the blocking capacity of a single individual v, Bl_v(·), is the reduction in the expected spread size after removing individual v from the network.

    Bl_X : {Networks × Spread Models × Probability × Time} → R^+

    Bl_X(G) = Bl_X(G, M, 𝒳, T) = Spread(G) − Spread(G \ X)               (3)

kBl(·) is the function that finds the maximum possible reduction in spread in a network when a set of individuals of size k is removed from the network. Notice that the value of this function is always at least k. The argmax of this function finds the best blocker(s) in a network.

    kBl(G) = max_{X⊆V, |X|=k} Bl_X(G)                                    (4)

Thus, finding the best blockers in the network is equivalent to finding the (set of) individuals whose removal from the network minimizes the expected extent of spread:

    kBl(G) = Spread(G) − min_{X⊆V, |X|=k} Spread(G \ X)                  (5)


This definition of the individuals' blocking capacity by removal corresponds, in the disease spread context, to the quarantine action. Vaccination or inoculation leave the node in the network but deactivate its ability to propagate the spread. For the Independent Cascade model of spread (Section 3.5) the two actions are equivalent at the abstract level of estimating the spread and identifying blockers in networks.

Since no good analytical approaches are known for identifying blockers in networks, in this paper we focus on examining the possibility of using structural network measures as practical indicators of nodes' blocking ability. We next briefly define the structural measures used in this paper.
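
Because the spread models themselves are introduced only in Section 3.5, the following Monte Carlo sketch is purely illustrative: it estimates Spread(G) by averaging stochastic runs of a simple per-time-step infection process with a single edge probability p, and Bl_X(G) by repeating the estimate with the set X removed. The process, the parameter p, and the number of runs are our assumptions, not the authors' models.

    import random

    def spread_once(snapshots, seed, p):
        """One stochastic run of a simple infection process started at `seed`:
        at every time step, each currently infected node infects each of its
        neighbors in that snapshot independently with probability p.
        (Illustrative only; the paper's spread models are defined in Section 3.5.)"""
        infected = {seed}
        for edges_t in snapshots:
            newly = set()
            for (u, v) in edges_t:
                for a, b in ((u, v), (v, u)):
                    if a in infected and b not in infected and random.random() < p:
                        newly.add(b)
            infected |= newly
        return len(infected)

    def spread(snapshots, nodes, p, runs=100):
        """Monte Carlo estimate of Spread(G): average extent over all starting nodes."""
        total = 0.0
        for v in nodes:
            total += sum(spread_once(snapshots, v, p) for _ in range(runs)) / runs
        return total / len(nodes)

    def blocking_capacity(snapshots, nodes, X, p, runs=100):
        """Bl_X(G) of Eq. (3): reduction in expected spread after removing the set X."""
        X = set(X)
        remaining = [n for n in nodes if n not in X]
        reduced = [{(u, v) for (u, v) in E_t if u not in X and v not in X}
                   for E_t in snapshots]
        return spread(snapshots, nodes, p, runs) - spread(reduced, remaining, p, runs)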

3.4 Network Structural Measures

In network analysis various properties of the graph representing the population are studied as proxies of the properties of the individuals, their interactions, and the population itself. For example, the degree, various centrality measures, clustering coefficients, or the eigenvalues (PageRank) of the nodes have been used to determine the relative importance of the individuals, e.g., [17, 42]. Betweenness centrality has been used to identify cohesive communities [33] and the distributions of shortest path lengths employed to measure the "navigability" of the network [66]. These and many other graph theoretic measures have been translated to many social properties [50, 56, 57].

The ability of an individual to block a process spreading over a network can be seen as another such social property. Graph measures such as clustering and assortative mixing coefficients have been used to design local vaccination strategies [40]. However, it is not clear that those are the best network measures to use as indicators of whether a node is a good blocker. In this paper we evaluate the power of all the standard network measures of a node to indicate the blocking ability of the corresponding individual. Moreover, we extend the standard static measures to reflect the dynamic nature of the underlying network. We examine the following measures: degree, average degree, betweenness and closeness centralities, and clustering coefficient. We modify these to incorporate the time ordering of the interactions.

We use the following terms interchangeably in this paper: individuals or nodes are the vertices of the network, and interactions are edges that can be either directed or undirected. Neighbors of a node, N(·), are the set of nodes adjacent to it. The subscript T with a function name indicates the dynamic variant of the function.

We now state the standard network measures for aggregate networks and define corresponding measures for dynamic networks. We focus first on the global measures that summarize the entire network and then address local measures that characterize a node.

Global Structural Properties

Density is the proportion of the number of edges |E| present in a network relative to the possible number of edges $\binom{|V|}{2}$.


$D(G) = \frac{|E|}{\binom{|V|}{2}}$. (6)

Dynamic Density is the average density over the observed time snapshots.

$D_T(G) = \frac{1}{T} \sum_{1 < t \le T} D(G_t)$. (7)

In the example in Figure 1, the density of the aggregate network in (e) is 2/3. However, the dynamic density of the networks (a), (b), and (c) is 1/3 while the dynamic density of (d) is 2/3.
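In the sketches that follow we represent a dynamic network as a list of per-time-step adjacency dicts (node -> set of neighbors, with isolated nodes mapping to an empty set); this representation is our assumption, not a data structure prescribed by the paper. Density and dynamic density then become:

```python
def density(adj):
    """D(G), Eq. (6): |E| divided by the number of possible edges C(|V|, 2).
    Assumes every node appears as a key, isolated nodes mapping to an empty set."""
    n = len(adj)
    m = sum(len(vs) for vs in adj.values()) // 2   # each undirected edge is stored twice
    return 0.0 if n < 2 else m / (n * (n - 1) / 2)

def dynamic_density(snapshots):
    """D_T(G): average snapshot density over the observation period (verbal definition above)."""
    return sum(density(g) for g in snapshots) / len(snapshots)
```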

Path between a pair of nodes u, v is a sequence of distinct nodes $u = v_1, v_2, \ldots, v_p = v$ with every consecutive pair of nodes connected by an edge $(v_i, v_{i+1}) \in E$.

Temporal Path between u, v is a time respecting path in a dynamic network. It is a sequence of nodes $u = v_1, \ldots, v_p = v$ where each $(v_i, v_{i+1})$ is an edge in $E_t$ for some t. Also, for any i, j such that $i + 1 < j$, if $v_i \in V_t$ and $v_j \in V_s$ then $t < s$. The length of a temporal path is the number of time steps it spans. Note that this definition allows only the immediate neighborhood of a node to be reached within one time step.

In the example in Figure 1, while there is a path from c to a in the aggregate network (e), there is no temporal path from c to a in the dynamic network (b). All the temporal paths from a to c in the dynamic networks (a)–(d) are of length 2.

Diameter is the length of the longest shortest path. In dynamic networks, it is the length of the longest shortest temporal path.
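Time-respecting reachability can be computed with a single forward sweep over the snapshots. The sketch below uses an earliest-arrival reading of temporal paths (at most one hop per time step, starting from the first observed step); the path length used above, the number of time steps a path spans, can differ for paths that start later in the observation period, so this is an approximation of the definitions rather than their exact implementation.

```python
def earliest_arrival(snapshots, source):
    """Earliest time step by which each node can be reached from `source` along a
    time-respecting path, moving at most one hop per time step over the edges
    present in that step. Nodes that are never reached are absent from the result."""
    arrival = {source: 0}                       # the source is available before step 1
    for t, g in enumerate(snapshots, start=1):
        movers = [u for u, a in arrival.items() if a < t]   # nodes reached before step t
        for u in movers:
            for v in g.get(u, set()):
                arrival.setdefault(v, t)        # keep the earliest arrival time
    return arrival
```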

Local Node Properties

Degree of a node is the number of its unique neighbors. It is perhaps the simplest measure of the influence of an individual: the more neighbors one has, the higher the chances of reaching a larger proportion of a population.

Dynamic Degree is the change in the neighborhood of an individual over time, the rate at which new friends are gained. Let $N(u_t)$ be the neighborhood of individual u at time step t. The relative change in the neighborhood is then:²

$\frac{|N(u_{t-1}) \,\triangle\, N(u_t)|}{|N(u_{t-1}) \cup N(u_t)|} \, |N(u_t)|$. (8)

The Dynamic Degree $DEG_T$ of u is the total accumulated rate of friend addition.

$DEG_T(u) = \sum_{1 < t \le T} \frac{|N(u_{t-1}) \,\triangle\, N(u_t)|}{|N(u_{t-1}) \cup N(u_t)|} \, |N(u_t)|$. (9)

Note that here we consider a friend to be “new” if it was not a friend in the previous time step. The definition is easily extended to incorporate a longer term memory of friendship. The dynamic degree captures the gregariousness of an individual, an important quality from a spreading perspective.

² Here $\triangle$ denotes the symmetric difference of the sets.


Dynamic Average Degree is the average over all time steps of the number of interactions of an individual in each time step:

$AVG\text{-}DEG(u) = \frac{1}{T} \sum_{1 \le t \le T} DEG(u_t)$, (10)

where $DEG(u_t)$ is the size of the neighborhood of u at time step t.

The dynamic degree, unlike its standard aggregate version, carries the information of the timing of interactions and is sensitive to the order, concurrency, and delay among the interactions. For example, in Figure 1, the degree of the node b in the aggregate network (e) is 2. However, its dynamic degree in (a) is 3, in (b) is 1, and in (c) and (d) is 0. The dynamic average degree, on the other hand, does not change when the order of interactions in a dynamic network is perturbed. It just tells us the average connectivity of an individual in the observed time period. In all the dynamic networks (a)–(c) the average dynamic degree of b is 1, while in (d) it is 2.
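A sketch of Equations (9) and (10) on the snapshot representation introduced earlier; the multiplication by $|N(u_t)|$ follows Eq. (9) as written above, and the function names are illustrative.

```python
def dynamic_degree(snapshots, u):
    """DEG_T(u), Eq. (9): accumulated relative neighborhood turnover of node u."""
    total = 0.0
    for prev, cur in zip(snapshots, snapshots[1:]):
        n_prev, n_cur = prev.get(u, set()), cur.get(u, set())
        union = n_prev | n_cur
        if union:
            total += len(n_prev ^ n_cur) / len(union) * len(n_cur)
    return total

def dynamic_average_degree(snapshots, u):
    """AVG-DEG(u), Eq. (10): mean number of interactions of u per time step."""
    return sum(len(g.get(u, set())) for g in snapshots) / len(snapshots)
```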

Nodes in Neighborhood (NNk) is the number of nodes in the local k-neighborhood of an individual. The number of nodes in the 1-neighborhood is precisely the degree of an individual. We extend the measure by considering the 2- and 3-neighborhoods of each individual.

Edges in Neighborhood (ENk) is the number of edges in the local k-neighborhood of an individual. We compute the edges in neighborhood for 1-, 2-, and 3-hop neighborhoods of each individual. This measure loosely captures the local density of the neighborhood of an individual.

Betweenness of an individual is the sum of fractions of all shortest paths between all pairs of individuals that pass through this individual. It is a parameter that measures the importance of individuals in a network based on their position on the shortest paths connecting pairs of non-adjacent individuals [3, 31, 32].

Dynamic Betweenness of an individual is the fraction of all shortest temporal paths that pass through it. Intuitively, the edges in a temporal path appear in increasing time order. This concept of betweenness incorporates the measure of a delay between interactions as well as the individual being at the right place at the right time. We present in detail different flavors of the traditional betweenness centrality concept in dynamic networks based on position, time, and duration of interactions among individuals in [39]. In this paper, for technical reasons, we use the concept of temporal betweenness. Let $g_{st}$ be the number of shortest temporal paths between s and t, $g_{st}(u)$ of which pass through u. Then the temporal betweenness centrality, $B_T(u)$, of a node u is the sum of fractions of all s-t shortest temporal paths passing through the node u:

$B_T(u) = \sum_{s \ne t \ne u} \frac{g_{st}(u)}{g_{st}}$. (11)


Closeness of an individual is the average (geodesic) distance of the individual to any other individual in the network [32, 62].

Dynamic Closeness of an individual is the average time it takes from that individual to reach any other individual in the network. Dynamic closeness is based on shortest temporal paths and the geodesic is defined as the time duration of such paths. Let $d_T(u, v)$ be the length of the shortest temporal path from u to v. Following the definition in [62] we define dynamic closeness as follows.

$C_T(u) = \frac{1}{\sum_{v \in V \setminus \{u\}} d_T(u, v)}$. (12)

Clustering Coefficient of an individual is the fraction of its neighbors who are neighbors among themselves [58].

Dynamic Clustering Coefficient is the sum of the fractions of an individual's neighbors who have been friends among themselves in previous time steps. That is, the dynamic clustering coefficient measures how many of your friends are already friends. Let $CF(u_t)$ be the number of friends of u that are already friends among themselves by time step t. Then the dynamic clustering coefficient is defined as follows.

$CC_T(u) = \sum_{0 \le t < T} \frac{CF(u_t)}{|N(u_t)|(|N(u_t)| - 1)}$. (13)

Consider the example in Figure 2. The clustering coefficient of all three nodes in the static network is the same and equals 1. However, the situation in the two dynamic networks is completely different. In network (a) the dynamic clustering coefficient of nodes a and c is 0 while that of the node b is 1. In network (b), on the other hand, the dynamic clustering coefficient of all the nodes is 0 since when b meets a and c they still don't know each other.
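A sketch of Eq. (13), again on the snapshot representation. It interprets $CF(u_t)$ as the number of ordered pairs of u's current neighbors that were already connected to each other in an earlier time step; this reading matches the normalization $|N(u_t)|(|N(u_t)| - 1)$, but other readings of "friends that are already friends" are possible.

```python
def dynamic_clustering(snapshots, u):
    """CC_T(u), Eq. (13), with CF(u_t) counted over ordered pairs of current neighbors."""
    total = 0.0
    past_edges = set()                       # directed pairs seen before the current step
    for g in snapshots:
        nbrs = list(g.get(u, set()))
        if len(nbrs) > 1:
            cf = sum(1 for a in nbrs for b in nbrs
                     if a != b and (a, b) in past_edges)
            total += cf / (len(nbrs) * (len(nbrs) - 1))
        for a, vs in g.items():              # record this step's edges for later steps
            for b in vs:
                past_edges.add((a, b))
    return total
```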

Apart from the measures defined above we also compute the PageRank [14] of nodes.

Fig. 2. Example of two dynamic networks (a) and (b) that have the same aggregate network representation (c)


3.5 Spreading Model

A propagation process in a network can be described formally using many models of transmission over the edges in that network. The fundamental assumption of all such models is that the phenomenon is spreading over and only over the edges in the network and, thus, the topology of the network defines the dynamics of the spread. For this paper we use the Independent Cascade model of diffusion in networks. Independent Cascade is one variant of the conditional decision model [34, 65]. The spreading phenomenon cascades through the network based on the simplifying assumption that each individual bases its decision to adopt or reject the spreading phenomenon on the status of each of its neighbors independently. The independent cascade model was first introduced in [34, 35] in the context of word-of-mouth marketing. This is also the most commonly used simple model to study disease transmission in networks [22, 51, 54, 57, 59] and is closely related to the simplest Susceptible-Infectious-Recovered (SIR) models from epidemiology [2]. In the Independent Cascade model, transmission from one individual to another happens independently of interactions those individuals have with all the other individuals.

The Independent Cascade model describes a spreading process in terms of two types of individuals, active and inactive. The process unfolds in discrete synchronized time steps. In each time step, each active individual attempts to activate each of its inactive neighbors. The activation of each inactive neighbor is determined by a probability of success. If an active individual succeeds in affecting any of its neighbors, those neighbors become active in the next time step. Each attempt of activation is independent of all previous attempts as well as the attempts of any other active individual to activate a common neighbor.

More formally, let $G_D = (V, E)$ be a dynamic network, $A_0 \subseteq V$ be a set of active individuals, and $p_{uv}$ be the probability of influence of u on v. For simplicity, we assume p is uniform for all V and remains fixed for the entire period of simulation. The uniform probability values also ensure that we test how the blocking ability of individuals depends solely on the structure of the network, controlling for other parameters that may affect this ability. An active individual $u_t \in A_0$ at time step t tries to activate each of its currently inactive neighbors $v_t$ with a probability p, independent of all the other neighbors. If $u_t$ succeeds in activating $v_t$ at time step t, then $v_t$ will be active in step t + 1, whether or not $(u_{t+1}, v_{t+1}) \in E_{t+1}$. If $u_t$ fails in activating $v_t$, and at any subsequent time step $u_{t+i}$ gets reconnected to the still inactive $v_{t+i}$, it will again try to activate $v_{t+i}$. The process runs for a finite number of time steps T. We denote by $\sigma(A_0) = A_T$ the correspondence between the initial set $A_0$ and the resulting set of active individuals $A_T$. We call the size of the set $A_T$, $|A_T|$, the extent of spread.

The spreading process in the independent cascade model in a dynamic network is different from the aggregate network in one important aspect. In the aggregate case, each individual u uses all its attempts of activating each of its inactive neighbors v with the same probability p in one time step t. This is the time step right after the individual u itself becomes active. After that single attempt the active individual becomes latent: that is, it is active but unable to activate


others. However, in the dynamic network model as defined above, the active individuals never become latent during the spreading process. For this paper, we only consider the progressive case in which an individual converts from inactive to active but never reverses (no recovery in the epidemiological model). It is a particularly important case in the context of identifying blockers since the blocking action is typically done before any recovery.
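The dynamic-network version of the Independent Cascade process described above can be simulated directly on the snapshot sequence; the sketch below follows that description (activations take effect in the next step, and active nodes never become latent, retrying whenever they are reconnected to a still-inactive node). Function and parameter names are illustrative.

```python
import random

def independent_cascade_dynamic(snapshots, seeds, p, rng=None):
    """One stochastic run of the Independent Cascade process on a dynamic network
    (list of per-step adjacency dicts). Returns the final set of active nodes."""
    rng = rng or random.Random()
    active = set(seeds)
    for g in snapshots:
        newly_active = set()
        for u in active:                          # nodes activated before this step
            for v in g.get(u, set()):
                if v not in active and v not in newly_active and rng.random() < p:
                    newly_active.add(v)
        active |= newly_active                    # these become active in the next step
    return active                                 # |active| is the extent of spread
```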

4 Experimental Setup

We evaluate the effectiveness of each of the network structural measures as indicators of an individual's blocking capacity under the Independent Cascade spreading model.

4.1 The Protocol

For each measure and for each dynamic network dataset, we perform the following steps:

1. Order the individuals 0, . . . , |V| − 1 according to the ranking imposed by the measure.
2. For i = 0 to |V| − 1 do:
(a) Remove node i from G = (V, E).
(b) Estimate the extent of spread in G \ i by averaging over stochastic simulations of the Independent Cascade model initiated at each node in turn, 3000 iterations for each starting node.3
(c) If the extent of spread is less than 10% of the nodes in the original network then STOP.

We compare the power of each measure to serve as a proxy indicator for the blocking ability of an individual based on the number of individuals that had to be removed in the ordering imposed by that measure in order to achieve this reduction to 10%.
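This protocol can be made concrete as follows. The sketch uses the aggregate-network Independent Cascade dynamics described in Section 3.5 (a node attempts to activate each inactive neighbor once, right after becoming active); for dynamic networks, the dynamic simulator sketched earlier would be substituted. `runs`, `stop_fraction`, and the other names are our choices, not the paper's.

```python
import random
from statistics import mean

def independent_cascade(adj, seed, p, rng):
    """Aggregate-network Independent Cascade: each active node gets one attempt
    per inactive neighbor, in the step right after its own activation."""
    active, frontier = {seed}, [seed]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def expected_spread(adj, p, runs=3000, rng=None):
    """Average extent of spread over all starting nodes and stochastic runs."""
    if not adj:
        return 0.0
    rng = rng or random.Random(0)
    return mean(independent_cascade(adj, s, p, rng)
                for s in adj for _ in range(runs))

def removal_curve(adj, ranking, p, stop_fraction=0.10, runs=3000):
    """Remove nodes in the order given by `ranking` and record the expected spread
    after each removal, stopping once it drops below `stop_fraction` of |V|."""
    adj = {u: set(vs) for u, vs in adj.items()}
    n = len(adj)
    curve = []
    for node in ranking:
        adj.pop(node, None)
        for vs in adj.values():
            vs.discard(node)
        s = expected_spread(adj, p, runs)
        curve.append((node, s))
        if s < stop_fraction * n:
            break
    return curve
```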

4.2 Probability of Activation

We conducted the Independent Cascade spreading experiments on a variety of networks with diverse global structural properties such as density, diameter, and average path length. In each network, we assigned a different probability of activation based on the structure of the network. A probability value that for some networks facilitated propagation of the spread to only a small portion of the nodes resulted, for other networks, in immediate spread to the entire network. The following is the procedure we used to find a meaningful probability of activation for a given network.

3 Which is more than sufficient for the convergence.


1. For a given G = (V, E), run the Independent Cascade spreading process with p = 1. Note that this is a deterministic process.
2. Calculate the average extent of spread S in G = (V, E). This is the average size of a connected component in G.
3. Rerun the spreading process while setting p < 1. Calculate the average extent of spread in the network. Repeatedly reduce p until the average extent of spread is half of S.
4. Set the probability of activation for G equal to p.
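A sketch of this calibration, reusing the hypothetical `expected_spread` helper from the protocol sketch above. The paper only says that p is repeatedly reduced; the multiplicative shrink factor below is our assumption about the reduction schedule.

```python
def calibrate_activation_probability(adj, runs=3000, shrink=0.9, min_p=1e-3):
    """Find p such that the average extent of spread is roughly half of the
    deterministic (p = 1) spread."""
    target = expected_spread(adj, p=1.0, runs=1) / 2.0   # p = 1 is deterministic
    p = 1.0
    while p > min_p:
        p *= shrink                                       # assumed reduction schedule
        if expected_spread(adj, p, runs) <= target:
            return p
    return min_p
```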

We use the following measures for comparison: dynamic and aggregate versions of degree, betweenness and closeness centralities, and clustering coefficient, as well as the average dynamic degree (turnover rate). For the global measures of betweenness and closeness we locally approximate them within 1-, 2-, and 3-hop neighborhoods. For the datasets with directed interactions we also use PageRank and approximate it within 1-, 2-, and 3-neighborhoods as well. We also rank individuals based on the number of nodes and edges within their 1-, 2-, and 3-hop neighborhoods. Overall, we experimented with 26 different measures.

We compare the structural measures to a random ordering of nodes as an upper bound and the best blockers identified by an exhaustive search as the lower bound.

4.3 Lower Bound: Best Blockers

We identify the best blockers one at a time using exhaustive search over all the individuals. To find one best blocker, we remove each individual, in turn, from the network and estimate the extent of spread using stochastic simulations of the Independent Cascade model in the remaining network. The best blocker, then, is the individual whose removal results in the minimum extent of spread after removal. We then repeat the process with the remaining individuals. This process imposes another ranking on the nodes.
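The iterative search just described can be written as a greedy loop; the sketch reuses the hypothetical `remove_nodes` helper and a generic `spread` estimator from the earlier sketches.

```python
def iterative_best_blockers(adj, k, spread):
    """Greedily pick k 'best blockers': at each step remove the single node whose
    removal minimizes the estimated spread in the remaining network."""
    remaining = {u: set(vs) for u, vs in adj.items()}
    ranking = []
    for _ in range(k):
        best = min(remaining, key=lambda u: spread(remove_nodes(remaining, {u})))
        ranking.append(best)
        remaining = remove_nodes(remaining, {best})
    return ranking
```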

Ideally, one needs to identify the set of top k blockers. However, this problem is computationally hard and an exhaustive search is infeasible. We have conducted limited experiments on the datasets considered in this paper and in all cases the set of iteratively found best k blockers equals the set of top k blockers. This preliminary result warrants future investigation and rigorous evaluation.

5 Datasets

We now describe the datasets used in the experiments.

Grevy's: Populations of Grevy's zebras (Equus grevyi) were observed by biologists [29, 30, 61, 63] over a period of June–August 2002 in the Laikipia region of Kenya. Predetermined census loops were driven on a regular basis (approximately twice per week) and individuals were identified by unique stripe patterns. Upon sighting, an individual's GPS location was taken. In the resulting dynamic network, each node represents an individual animal and two animals are interacting if their GPS locations are the same. The dataset contains 28 individuals interacting over a period of 44 time steps.


Onagers: Populations of wild asses (Equus hemionus), also known as onagers, were observed by biologists [61, 63] in the Little Rann of Kutch, a desert in Gujarat, India, during January–May 2003. These data were also obtained from visual scans, as in the Grevy's zebra case. The dataset contains 29 individuals over 82 time steps.

DBLP: This data set is a sample of the Digital Bibliography and Library Project [49]. This is a bibliographic dataset of publications in Computer Science. We use a cleaned version of the data from 1967–2005. In the dynamic network each node represents an individual author and two authors are interacting if they are co-authors on a paper. A year is one time step. The sample we used contains 1374 individuals and 38 time steps. We use this dataset to compare the dynamic and the static networks.

The DBLP dataset is sparse, with many small connected components. In fact, the average size of a connected component (using temporal paths) is 0.03 × |V|. Thus, the expected extent of spread in this network cannot exceed 3%. For DBLP we set the stopping criterion for removing blockers from the network at 1% of the population being affected, rather than the 10% used for other datasets.

Reality Mining: The Reality Mining experiment is one of the largest mobile phone projects attempted in academia. These are the data collected by the MIT Media Lab [25]. They have captured communication, proximity, location, and activity information from 100 subjects at MIT over the course of the 2004–2005 academic year. These data represent over 350,000 collective hours (∼40 years) of human behavior.

Reality Mining data are collected by recording the Bluetooth scans of each device every five minutes. We have quantized the data to 4-hour intervals for the dynamic network representation based on the analysis in [20].

Enron: The Enron e-mail corpus is a publicly available database of e-mails sent by and to employees of the now defunct Enron corporation4. Timestamps, senders, and lists of recipients were extracted from message headers for each e-mail on file. We chose a day as the time step, and a directed interaction is present if an e-mail was sent between two individuals.

We used the version of the dataset restricted to the 150 employees of the Enron organization who were actually subpoenaed. The raw Enron corpus contains 619,446 messages belonging to 158 users [1, 45].

UMass: Co-location of individuals in a population of students at the University of Massachusetts Amherst; data collected via portable motes5.

Table 1 provides a summary of the statistics of the networks we use in our experiments.

4 Available with a full description at http://www.cs.cmu.edu/~enron/
5 Available with a full description at http://kdl.cs.umass.edu/data/msn/msn-info.html


Table 1. Dynamic network dataset statistics. Here V is the number of individuals, E is the number of edges, T is the number of time steps, D is the density and DT is the dynamic density, d is the diameter within a connected component and dT is the dynamic diameter, p is the average shortest path length and pT is the average temporal shortest path length, and r is the number of reachable pairs and rT is the number of temporally reachable pairs.

             V      E      T     D      DT    d    dT     p      pT       r      rT
Grevy's      28     779    44   0.30   0.52   4    36    1.84    4.81     518     432
Onagers      29     402    82   0.36   0.24   3    74    1.66    7.51     756     617
DBLP       1374    2262    38   0.002  0.09  15    37    5.54    5.12  900070   58146
Enron       147    7406   701   0.04   0.14   6   618    2.66  461.24   19620   16474
MIT          96   67107  2940   0.68   0.18   2   315    1.32    4.21    9120    9114
UMass        20    2664   693   0.72   0.35   2     8    1.28    3.71     380     374

6 Results and Discussion

For each of the datasets we have evaluated all the structural network measures to determine how effectively they serve to identify good blockers. To recap, we rank nodes by each measure and remove them from the network in that order. After removing each node we measure the expected extent of spread in the network using simulations. We compare the effect of each measure's ordering to that of a random ordering and the brute force best blockers ordering. Figure 3 shows results for two datasets, Onagers and Enron, that are representative of the results on all the datasets. The results for the other datasets are omitted due to space limitations. For all the plots, the x-axis is the number of individuals removed and the y-axis shows the corresponding extent of spread. The lower the extent of spread after removal, the better is the blocking capacity of the individuals removed. Thus, the curves lower on the plot correspond to measures that serve as better indicators of individuals' blocking power.

Fig. 3. [Best viewed in color.] Comparison of the reduction of extent of spread after removal of nodes ranked by various measures in the Onagers and Enron datasets


The comparison of all the measures showed that four measures performed consistently well as blocker indicators: degree in the aggregate network, the number of edges in the immediate aggregate neighborhood (local density), dynamic average degree, and dynamic clustering coefficient. This is good news from the practical point of view of designing epidemic response strategies since all the measures are simple, local, and easily scalable. Figure 4 shows the results of the comparison of those four best measures, as well as the best possible and random orderings, for all the datasets. Surprisingly, while the local density and the dynamic clustering coefficient seem to be good indicators, the aggregate clustering coefficient turned out to be the worst, often performing worse than a random ordering. Betweenness and closeness measures performed inconsistently. PageRank did not perform well in the only dataset with directed interactions (Enron)6.

As seen in Figure 4, the ease of blocking the spread depends very much on the structure of the dynamic network. In the two bluetooth datasets, MIT Reality Mining and UMass, all orderings, including the random, performed similarly. Those are well connected networks, as evident by the large difference between the dynamic diameter and the average shortest temporal path. The only way to reduce the extent of spread to below 10% of the original population seems to be trivially removing nearly 90% of the individual population. On the other hand, Enron and DBLP, the sparsely connected datasets, show the opposite trend of being easily blockable by a good ranking measure.

Fig. 4. [Best viewed in color.] Comparison of the reduction of the extent of spread after removal of nodes ranked by the best 4 measures. The x-axis shows the number of individuals removed and the y-axis shows the average spread size after the removal of individuals.

6 On undirected graphs, PageRank is equivalent to the degree in the aggregate network.


Table 2. Average rank difference between the rankings induced by every two of the best four measures

Dataset    Best vs  Best vs  Best vs  Best vs  AvgDEG vs  AvgDEG vs  AvgDEG vs  DynCC vs  DynCC vs  DEG vs
           AvgDEG   DynCC    DEG      ENN1     DynCC      DEG        ENN1       DEG       ENN1      ENN1
Grevy's     4.5      4.64     4.79     3.86     4.5        2.86       2.64       5.57      5         1.14
Onagers     3.59     4.48     3.31     3.52     4.69       4.14       2.97       6.07      6         2
DBLP        -        -        -        -        430.76     71.3       78.49      434.21    428.25    77.22
Enron       21.95    50.01    27.29    21.02    46.37      22.56      21.93      44.35     44.95     25.32
MIT         -        -        -        -        4.88       14.4       14.48      14.33     14.27     2.25
UMass       4.6      4.6      3        2.7      0          3.3        3.1        3.3       3.1       1

When rankings of different measures result in a similar blocking ability, we ask whether it is because the measures rank individuals in a similar way and, thus, identify the same set of good blockers or, rather, different measures identify different sets of good blockers. To answer this question, we compared the sets of the top ranked blockers identified by the four best measures as well as the best possible ordering. We compute the average rank difference between the sets of individuals ranked top by every two measures. Table 2 shows the pairwise difference in ranks. In general, there is little correspondence between the rankings imposed by various measures. The only strong relationship, as expected, is between the number of edges in the neighborhood of a node and its degree in the aggregate network.

We further explore the difference in the sets of the top ranked individuals by computing the size of the common intersection of all the top sets ranked by the four measures and the best possible ranking. We use the size of the set determined by the best possible ordering as the reference set size for all measures. Table 3 shows the size of the common intersection for all datasets.

Table 3. The size of the common intersection of all the top sets ranked by the four measures and the best ranking. Set size is the size of the sets determined by the best blocking ordering. Intersection size is the number of individuals in the intersection, and Intersection fraction is the size of the intersection as a fraction of the set size.

Dataset          Set size  Inter. size  Inter. frac
Grevy's              5          2           .40
Onagers              9          3           .33
DBLP                16          0           0
Enron               13          4           .31
Reality Mining      60         48           .80
UMass               12         10           .83


Again, we see a strong effect of the structure of the network. The MIT Reality Mining and the UMass datasets have the largest intersection size. On the other hand, in DBLP the four measures produced very different top ranked sets, yet all four measures were extremely good indicators of the blockers. In other networks, while there are some individuals that are clearly good blockers according to all measures, there is a significant difference among the measures. Overall, these results lead to two future directions: 1) investigating the effect of the overall network structure on the “blockability” of the network; and 2) designing consensus techniques that combine rankings by various measures into a possibly better list of blockers.

7 Conclusions and Future Work

In this paper we have investigated the task of preventing a dynamic process, such as disease or information, from spreading through a network of social interactions. We have formulated the problem of identifying good blockers: nodes whose removal results in the maximum reduction in the extent of spread in the network. In the absence of good computational techniques for finding such nodes efficiently, we have focused on identifying structural network measures that are indicative of whether or not a node is a good blocker. Since the timing and order of interactions is critical in propagating many spreading phenomena, we focused on explicitly dynamic networks. We, thus, extended many standard network measures, such as degree, betweenness, closeness, and clustering coefficient, to the dynamic setting. We also approximated global network measures locally within a node's neighborhood. Overall, we considered 26 different measures as candidate proxies for the blocking ability of a node.

We conducted experiments on six dynamic network datasets spanning a range of contexts, sizes, densities, and other parameters. We compared the extent of spread while removing one node at a time according to the ranking of nodes imposed by each measure. Overall, four structural measures performed consistently well in all datasets and were close to identifying the overall best blockers. These four measures were node degree, the number of edges in a node's neighborhood, dynamic average degree, and dynamic clustering coefficient. The traditional aggregate clustering coefficient and dynamic closeness performed the worst, often worse than a random ordering of nodes. All four best measures are local, simple, and scalable, and thus can potentially be used to design good practical epidemic prevention strategies. However, before such policy decisions are made, we need to verify that our results hold true in other, larger and more complete datasets and for realistic disease spread models.

The striking disparity between the performance of the dynamic and aggregate clustering coefficient indicates the necessity of taking the dynamic nature of interactions explicitly into consideration in network analysis. Moreover, this disparity justifies the extension of traditional network measures and methods to the dynamic setting. In future work, we plan to further investigate the informativeness of a range of dynamic network measures in various application contexts.


We have also compared the sets of nodes ranked at the top by various measures. Interestingly, in the networks in which it was difficult to block dynamic spread, all the measures resulted in very similar rankings of individuals. In contrast, in the networks where the removal of a small set of individuals was sufficient to reduce the spread significantly, the best measures gave very different rankings of individuals. Thus, there seems to be a dichotomy in the real-world networks we studied. On one hand, there are dense networks (e.g. the MIT Reality Mining and UMass datasets) in which it is inherently challenging to block a spreading process and all measures perform similarly badly. On the other hand, there are sparse networks where it seems to be easy to stop the spread and there are many ways to do it. In future work, we will investigate the specific global structural attributes of a network that delineate this difference between networks for which it is hard or easy to identify good blockers.

The comparison of the top ranked sets also shows that while there may be some common nodes ranked high by all measures, there is a significant difference among the measures. Yet, all the rankings perform comparably well. Thus, there is a need to test a consensus approach that combines the sets ranked top by various measures into one set of good candidate blockers. This is similar to combining the top k lists returned as a web search result [27].

This paper focused on the practical approaches to identifying good blockers. However, the theoretical structure of the problem is not well understood and so far has defied good approximation algorithms. Recent developments in the analysis of non-monotonic submodular functions [28, 64] may be applicable to variants of the problem and may result in good approximation guarantees.

References

1. Adibi, J.: Enron email dataset, http://www.isi.edu/~adibi/Enron/Enron.htm
2. Anderson, R.M., May, R.M.: Infectious Diseases of Humans: Dynamics and Control. Oxford University Press, Oxford (1992)
3. Anthonisse, J.: The rush in a graph. Mathematische Centrum, Amsterdam (1971)
4. Aspnes, J., Chang, K., Yampolskiy, A.: Inoculation strategies for victims of viruses and the sum-of-squares partition problem. J. Comput. Syst. Sci. 72(6), 1077–1093 (2006)
5. Asur, S., Parthasarathy, S., Ucar, D.: An event-based framework of characterizing the evolutionary behavior of interaction graphs. In: Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2007)
6. Barabasi, A.L., Jeong, H., Neda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution of the social network of scientific collaborations. Physica A: Statistical Mechanics and its Applications 311(3-4), 590–614 (2002)
7. Berger, E.: Dynamic monopolies of constant size. J. Combin. Theory Series B 83, 191–200 (2001)
8. Berger, N., Borgs, C., Chayes, J.T., Saberi, A.: On the spread of viruses on the internet. In: SODA 2005: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA, USA, pp. 301–310. Society for Industrial and Applied Mathematics (2005)


9. Berger-Wolf, T., Hart, W., Saia, J.: Discrete sensor placement problems in distribution networks. Mathematical and Computer Modelling (2005)
10. Berry, J., Fleischer, L., Hart, W., Phillips, C., Watson, J.: Sensor placement in municipal water networks. Journal of Water Resources Planning and Management 131(3) (2005)
11. Berry, J., Hart, W., Phillips, C., Uber, J.G., Watson, J.: Sensor placement in municipal water networks with temporal integer programming models. Journal of Water Resources Planning and Management 132(4), 218–224 (2006)
12. Borner, K., Dall'Asta, L., Ke, W., Vespignani, A.: Studying the emerging global brain: Analyzing and visualizing the impact of co-authorship teams. Complexity, Special Issue on Understanding Complex Systems 10(4), 57–67 (2005)
13. Borner, K., Maru, J., Goldstone, R.: The simultaneous evolution of author and paper networks. PNAS 101(suppl. 1), 5266–5273 (2004)
14. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. In: WWW7: Proceedings of the 7th International Conference on World Wide Web, pp. 107–117. Elsevier Science Publishers B.V., Amsterdam (1998)
15. Broido, A., Claffy, K.: Internet topology: connectivity of IP graphs. In: Proceedings of SPIE ITCom (2001)
16. Carley, K.: Communicating new ideas: The potential impact of information and telecommunication technology. Technology in Society 18(2), 219–230 (1996)
17. Carreras, I., Miorandi, D., Canright, G., Engø-Monsen, K.: Eigenvector centrality in highly partitioned mobile networks: Principles and applications. Studies in Computational Intelligence (SCI) 69, 123–145 (2007)
18. Chen, L., Carley, K.: The impact of social networks in the propagation of computer viruses and countermeasures. IEEE Transactions on Systems, Man and Cybernetics (forthcoming)
19. Chen, N.: On the approximability of influence in social networks. In: ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1029–1037 (2008)
20. Clauset, A., Eagle, N.: Persistence and periodicity in a dynamic proximity network (unpublished manuscript)
21. Cohen, R., Havlin, S., ben Avraham, D.: Efficient immunization strategies for computer networks and populations. Physical Review Letters (2003)
22. Dezso, Z., Barabasi, A.-L.: Halting viruses in scale-free networks. Physical Review E 65(055103(R)) (2002)
23. Domingos, P.: Mining social networks for viral marketing. IEEE Intelligent Systems 20, 80–82 (2005)
24. Domingos, P., Richardson, M.: Mining the network value of customers. In: Seventh International Conference on Knowledge Discovery and Data Mining (2001)
25. Eagle, N., Pentland, A.: Reality mining: Sensing complex social systems. Journal of Personal and Ubiquitous Computing (2006)
26. Eubank, S., Guclu, H., Kumar, V., Marathe, M., Srinivasan, A., Toroczkai, Z., Wang, N.: Modelling disease outbreaks in realistic urban social networks. Nature 429, 180–184 (2004) (supplement material)
27. Fagin, R., Kumar, R., Sivakumar, D.: Comparing top k lists. In: SODA 2003: Proc. 14th ACM-SIAM Symposium on Discrete Algorithms, Philadelphia, PA, USA, pp. 28–36. Society for Industrial and Applied Mathematics (2003)
28. Feige, U., Mirrokni, V., Vondrak, J.: Maximizing non-monotone submodular functions. In: Foundations of Computer Science, FOCS (2007)


29. Fischhoff, I.R., Sundaresan, S.R., Cordingley, J., Larkin, H.M., Sellier, M.-J., Rubenstein, D.I.: Social relationships and reproductive state influence leadership roles in movements of plains zebra (Equus burchellii). Animal Behaviour 73(5), 825–831 (2007)
30. Fischhoff, I.R., Sundaresan, S.R., Cordingley, J., Rubenstein, D.I.: Habitat use and movements of plains zebra (Equus burchelli) in response to predation danger from lions. Behavioral Ecology 18(4), 725–729 (2007)
31. Freeman, L.: A set of measures of centrality based on betweenness. Sociometry 40, 35–41 (1977)
32. Freeman, L.C.: Centrality in social networks: I. Conceptual clarification. Social Networks 1, 215–239 (1979)
33. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. 99, 8271–8276 (2002)
34. Goldenberg, J., Libai, B., Muller, E.: Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters 12(3), 211–223 (2001)
35. Goldenberg, J., Libai, B., Muller, E.: Using complex systems analysis to advance marketing theory development. Academy of Marketing Science Review (2001)
36. Granovetter, M.: The strength of weak ties. American J. Sociology 78(6), 1360–1380 (1973)
37. Granovetter, M.: Threshold models of collective behavior. American J. Sociology 83(6), 1420–1443 (1978)
38. Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A.: Information diffusion through blogspace. In: WWW 2004: Proc. 13th Intl. Conf. on World Wide Web, pp. 491–501. ACM Press, New York (2004)
39. Habiba, C.T., Berger-Wolf, T.Y.: Betweenness centrality in dynamic networks. Technical Report 2007-19, DIMACS (2007)
40. Holme, P.: Efficient local strategies for vaccination and network attack. Europhys. Lett. 68(6), 908–914 (2004)
41. Hopcroft, J., Khan, O., Kulis, B., Selman, B.: Natural communities in large linked networks. In: Proc. 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pp. 541–546 (2003)
42. Jordan, F., Benedek, J., Podani, Z.: Quantifying positional importance in food webs: A comparison of centrality indices. Ecological Modelling 205, 270–275 (2007)
43. Kempe, D., Kleinberg, J., Kumar, A.: Connectivity and inference problems for temporal networks. J. Comput. Syst. Sci. 64(4), 820–842 (2002)
44. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through a social network. In: 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (2003)
45. Klimt, B., Yang, Y.: The Enron corpus: A new dataset for email classification research. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS (LNAI), vol. 3201, pp. 217–226. Springer, Heidelberg (2004)
46. Leskovec, J., Adamic, L.A., Huberman, B.A.: The dynamics of viral marketing. In: EC 2006: Proceedings of the 7th ACM Conference on Electronic Commerce, pp. 228–237. ACM Press, New York (2006)
47. Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J.: Cost-effective outbreak detection in networks. In: Proc. 13th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining (2007)
48. Lewin, K.: Principles of Topological Psychology. McGraw Hill, New York (1936)
49. Ley, M.: Digital Bibliography & Library Project (DBLP) (December 2005); a digital copy of the database has been provided by the author, http://dblp.uni-trier.de/


50. Liljeros, F., Edling, C., Amaral, L.N.: Sexual networks: Implication for the transmission of sexually transmitted infection. Microbes and Infection (2003)
51. May, R.M., Lloyd, A.L.: Infection dynamics on scale-free networks. Physical Review E 64(066112) (2001)
52. Moody, J.: The importance of relationship timing for diffusion. Social Forces (2002)
53. Moreno, Y., Nekovee, M., Pacheco, A.F.: Dynamics of rumor spreading in complex networks. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics) 69(6), 066130 (2004)
54. Morris, M.: Epidemiology and social networks: modeling structured diffusion. Sociological Methods and Research 22(1), 99–126 (1993)
55. Mossel, E., Roch, S.: On the submodularity of influence in social networks. In: The Annual ACM Symposium on Theory of Computing (STOC) (2007)
56. Newman, M.: The structure and function of complex networks. SIAM Review 45, 167–256 (2003)
57. Newman, M.E.: Spread of epidemic disease on networks. Physical Review E 66(016128) (2002)
58. Newman, M.E.J.: Scientific collaboration networks. I. Network construction and fundamental results. Physical Review E 64, 016131 (2001)
59. Pastor-Satorras, R., Vespignani, A.: Epidemic spreading in scale-free networks. Phys. Rev. Lett. 86(14), 3200–3203 (2001)
60. Rogers, E.M.: Diffusion of Innovations, 5th edn. Simon & Schuster, Inc., New York (2003)
61. Rubenstein, D.I., Sundaresan, S., Fischhoff, I., Saltz, D.: Social networks in wild asses: Comparing patterns and processes among populations. In: Stubbe, A., Kaczensky, P., Samjaa, R., Wesche, K., Stubbe, M. (eds.) Exploration into the Biological Resources of Mongolia, vol. 10, pp. 159–176. Martin-Luther-University, Halle-Wittenberg (2007)
62. Sabidussi, G.: The centrality index of a graph. Psychometrika 31, 581–603 (1966)
63. Sundaresan, S.R., Fischhoff, I.R., Dushoff, J., Rubenstein, D.I.: Network metrics reveal differences in social organization between two fission-fusion species, Grevy's zebra and onager. Oecologia 151, 140–149 (2007)
64. Vredeveld, T., Lenstra, J.: On local search for the generalized graph coloring problem. Operations Research Letters 31, 28–34 (2003)
65. Watts, D.: A simple model of global cascades on random networks. PNAS 99, 5766–5771 (2002)
66. Watts, D., Strogatz, S.: Collective dynamics of small-world networks. Nature 393, 440–442 (1998)
67. Young, H.P.: Innovation diffusion and population heterogeneity. Working paper (2006)
68. Zanette, D.H.: Dynamics of rumor propagation on small-world networks. Phys. Rev. E 65(4), 041908 (2002)


Social Network Mining with Nonparametric Relational Models

Zhao Xu1, Volker Tresp2, Achim Rettinger3, and Kristian Kersting1

1 Fraunhofer IAIS
2 Siemens Corporate Technology
3 Technical University of Munich

Abstract. Statistical relational learning (SRL) provides effective techniques to analyze social network data with rich collections of objects and complex networks. Infinite hidden relational models (IHRMs) introduce nonparametric mixture models into relational learning and have been successful in many relational applications. In this paper we explore the modeling and analysis of complex social networks with IHRMs for community detection, link prediction and product recommendation. In an IHRM-based social network model, each edge is associated with a random variable and the probabilistic dependencies between these random variables are specified by the model, based on the relational structure. The hidden variables, one for each object, are able to transport information such that non-local probabilistic dependencies can be obtained. The model can be used to predict entity attributes, to predict relationships between entities and it performs an interpretable cluster analysis. We demonstrate the performance of IHRMs with three social network applications. We perform community analysis on the Sampson's monastery data and perform link analysis on the Bernard & Killworth data. Finally we apply IHRMs to the MovieLens data for prediction of user preference on movies and for an analysis of user clusters and movie clusters.

Keywords: Statistical Relational Learning, Social Network Analysis, Nonparametric Mixture Models, Dirichlet Process, Variational Inference.

1 Introduction

Social network mining has gained in importance due to the growing availability of data on novel social networks, e.g. citation networks (DBLP, Citeseer), SNS websites (Facebook), and social media websites (Last.fm). Social networks usually consist of rich collections of objects, which are linked into complex networks. Generally, social network data can be graphically represented as a sociogram as illustrated in Fig. 1 (left). In this simple social network, there are persons, person profiles (e.g., gender), and these persons are linked together via friendships. Some interesting applications in social network mining include community discovery, relationship prediction, social recommendation, etc.


Statistical relational learning (SRL) [8,17,11] is an emerging area of machine learning research, which attempts to combine expressive knowledge representation formalisms with statistical approaches to perform probabilistic inference and learning on relational networks. Fig. 1 (right) shows a simple SRL model for the above sociogram example. For each potential edge, a random variable (RV) is introduced that describes the state of the edge. For example, there is a RV associated with the edge between the person 1 and the person 2. The binary variable is YES if the two persons are friends and NO otherwise. The edge between an object (e.g., person 1) and an object property (e.g., Male) is also associated with a RV, whose value describes the person's profile. In the running example, all variables are binary. To infer the quantities of interest, e.g., whether the person 1 and the person 2 are friends, we need to learn the probabilistic dependencies between the random variables. Here we assume that friendship is conditioned on the profiles (gender) of the involved persons, as shown in Fig. 1 (right). The directed arcs, for example, the ones between $G_1$ and $R_{1,2}$ and between $G_2$ and $R_{1,2}$, specify that the probability that the person 1 and the person 2 are friends depends on their respective profiles. Given the probabilistic model, we can learn the parameters and predict the relationships of interest.

Fig. 1. Left: A simple sociogram. Right: A probabilistic model for the sociogram. Each edge is associated with a random variable that determines the state of the edge. The directed arcs indicate direct probabilistic dependencies.

In the simple relational model of the social network, the friendship is locally predicted by the profiles of the involved objects: whether a person is a friend of another person is only dependent on the profiles of the two persons. Given that the parameters are fixed, and given the parent attributes, all friendships are independent of each other such that correlations between friendships, i.e., the collaborative effect, cannot be taken into account. To solve this limitation, structural learning might be involved to obtain non-local dependencies, but structural learning in complex relational networks is considered a hard problem [9]. Non-local dependencies can also be achieved by introducing for each person a hidden variable as proposed in [24]. The state of the hidden variable represents unknown attributes of the person, e.g. the particular habit of making friends with certain


persons. The hidden variable of a person is now the only parent of its profiles and is one of the parents of the friendships in which the person potentially participates. Since the hidden variables are of central importance, this model is referred to as the hidden relational model (HRM). In relational domains, different classes of objects generally require a class-specific complexity in the hidden representation. Thus, it is sensible to work with a nonparametric method, the Dirichlet process (DP) mixture model, in which each object class can optimize its own representational complexity in a self-organized way. Conceptually, the number of states in the hidden variables in the HRM model becomes infinite. In practice, the DP mixture sampling process only occupies a finite number of components. The combination of the hidden relational model and the DP mixture model is the infinite hidden relational model (IHRM) [24].

The IHRM model has been first presented in [24]. This paper is an extended version of [25] and we explore social network modeling and analysis with IHRMs for community detection, link prediction, and product recommendation. We present two methods for efficient inference: one is blocked Gibbs sampling with a truncated stick-breaking (TSB) construction, the other is the mean-field approximation with TSB. We perform empirical analysis on three social network datasets: the Sampson's monastery data, the Bernard & Killworth data, and the MovieLens data. The paper is organized as follows. In the next section, we perform analysis of modeling complex social network data with IHRMs. In Sec. 3 we describe a Gibbs sampling method and a mean-field approximation for inference in the IHRM model. Sec. 4 gives the experimental analysis on social network data. We review some related work in Sec. 5. Before concluding, an extension to IHRMs is discussed in Sec. 6.

2 Model Description

Based on the analysis in Sec. 1, we will give a detailed description of the IHRM model for social network data. In this section, we first introduce the finite hidden relational model (HRM), and then extend it to an infinite version (IHRM). In addition, we provide a generative model describing how to generate data from an IHRM model.

2.1 Hidden Relational Model

A hidden relational model (HRM) for a simple sociogram is shown in Fig. 2. The basic innovation of the HRM model is introducing for each object (here: person) a hidden variable, denoted as Z in the figure. They can be thought of as unknown attributes of persons. We then assume that attributes of a person only depend on the hidden variable of the person, and a relationship only depends on the hidden variables of the persons involved in the relationship. It implies that if hidden variables were known, both person attributes and relationships can be well predicted.

Given the HRM model shown in Fig. 2, information can propagate via interconnected hidden variables.


Fig. 2. A hidden relational model (HRM) for a simple sociogram

Let us predict whether the person 2 will be a friend of the person 3, i.e. predict the relationship $R_{2,3}$. The probability is computed on the evidence about: (1) the attributes of the immediately related persons, i.e. $G_2$ and $G_3$; (2) the known relationships associated with the persons of interest, i.e. the friendships $R_{2,1}$ and $R_{2,4}$ about the person 2, and the friendships $R_{1,3}$ and $R_{3,4}$ about the person 3; (3) high-order information transferred via hidden variables, e.g. the information about $G_1$ and $G_4$ propagated via $Z_1$ and $Z_4$. If the attributes of persons are informative, those will determine the hidden states of the persons, and therefore dominate the computation of the predictive probability of the relationship $R_{2,3}$. Conversely, if the attributes of persons are weak, then the hidden state of a person might be determined by his relationships to other persons and the hidden states of those persons. By introducing hidden variables, information can globally distribute in the ground network defined by the relational structure. This reduces the need for extensive structural learning, which is particularly difficult in relational models due to the huge number of potential parents. Note that a similar propagation of information can be observed in hidden Markov models used in speech recognition or in the hidden Markov random fields used in image analysis [26]. In fact, the HRM can be viewed as a directed generalization of both for relational data.

Additionally, the HRM provides a cluster analysis of relational data. The state of the hidden variable of an object corresponds to its cluster assignment. This can be regarded as a generalization of the co-clustering model [13]. The HRM can be applied to domains with multiple classes of objects and multiple classes of relationships. Furthermore, relationships can be of arbitrary order, i.e. the HRM is not constrained to only binary and unary relationships [24]. Also note that the sociogram is quite related to the resource description framework (RDF) graph used as the basic data model in the semantic web [3] and the entity relationship graph from database design.

We now complete the model by introducing the variables and parameters in Fig. 2. There is a hidden variable $Z_i$ for each person. The state of $Z_i$ specifies the cluster of the person i. Let K denote the number of clusters. Z follows a multinomial distribution with parameter vector $\pi = (\pi_1, \ldots, \pi_K)$ ($\pi_k > 0$, $\sum_k \pi_k = 1$), which specifies the probability of a person belonging to a cluster,


i.e. $P(Z_i = k) = \pi_k$. $\pi$ is sometimes referred to as the mixing weights. It is drawn from a conjugate Dirichlet prior with hyperparameters $\alpha_0$.

All person attributes are assumed to be discrete and multinomial variables (resp., binary and Bernoulli). Thus a particular person attribute $G_i$ is a sample drawn from a multinomial (resp., Bernoulli) distribution with parameters $\theta_k$, where k denotes the cluster assignment of the person. $\theta_k$ is sometimes referred to as a mixture component, which is associated with the cluster k. For all persons, there are in total K mixture components $\Theta = (\theta_1, \ldots, \theta_K)$. Each person in the cluster k inherits the mixture component, thus we have: $P(G_i = s \mid Z_i = k, \Theta) = \theta_{k,s}$ ($\theta_{k,s} > 0$, $\sum_s \theta_{k,s} = 1$). These mixture components are independently drawn from a prior $G_0$. For computational efficiency, we assume that $G_0$ is a conjugate Dirichlet prior with hyperparameters $\beta$.

We now consider the variables and parameters concerning the relationships (FriendOf). The relationship R is assumed to be discrete with two states. A particular relationship $R_{i,j}$ between two persons (i and j) is a sample drawn from a binomial distribution with a parameter $\phi_{k,\ell}$, where k and $\ell$ denote the cluster assignments of the person i and the person j, respectively. There are in total $K \times K$ parameters $\phi_{k,\ell}$, and each $\phi_{k,\ell}$ is independently drawn from the prior $G^r_0$. For computational efficiency, we assume that $G^r_0$ is a conjugate Beta distribution with hyperparameters $\beta^r$.

From a mixture model point of view, the most interesting term in the HRM model is $\phi_{k,\ell}$, which can be interpreted as a correlation mixture component. If a person i is assigned to a cluster k, i.e. $Z_i = k$, then the person inherits not only $\theta_k$, but also $\phi_{k,\ell}$, $\ell = 1, \ldots, K$.

2.2 Infinite Hidden Relational Model

Since hidden variables play a key role in the HRM model, we would expect that the HRM model might require a flexible number of states for the hidden variables. Consider again the sociogram example. With little information about past friendships, all persons might look the same; with more information available, one might discover certain clusters in persons (different habits of making friends); but with an increasing number of known friendships, clusters might show increasingly detailed structure ultimately indicating that everyone is an individual. It thus makes sense to permit an arbitrary number of clusters by using a Dirichlet process mixture model. This permits the model to decide itself about the optimal number of clusters and to adopt the optimal number with increasing data. For our discussion it suffices to say that we obtain an infinite HRM by simply letting the number of clusters approach infinity, $K \to \infty$. Although from a theoretical point of view there are indeed an infinite number of components, a sampling procedure would only occupy a finite number of components.

The graphical representations of the IHRM and HRM models are identical, shown as Fig. 2. However, the definitions of variables and parameters are different. For example, hidden variables Z of persons have infinite states, and thus the parameter vector $\pi$ is infinite-dimensional. The parameter is not generated from a Dirichlet prior, but from a stick breaking construction $\mathrm{Stick}(\cdot|\alpha_0)$ with a

Page 92: [Lecture Notes in Computer Science] Advances in Social Network Mining and Analysis Volume 5498 ||

82 Z. Xu et al.

hyperparameter α0 (more details in the next section). Note that α0 is a positivereal-valued scalar and is referred to as concentration parameter in DP mixturemodeling. It determines the tendency of the model to either use a large numberor a small number of states in the hidden variables [2]. If α0 is chosen to besmall, only few clusters are generated. If α0 is chosen to be large, the couplingis loose and more clusters are formed. Since there are an infinite number of clus-ters, there are an infinite number of mixture components θk, each of which isstill independently drawn from G0. G0 is referred to as base distribution in DPmixture modeling.

2.3 Generative Model

Now we describe the generative model for the IHRM model. There are mainly two methods to generate samples from a Dirichlet process (DP) mixture model, i.e. the Chinese restaurant process (CRP) [2] and the stick breaking construction (SBC) [22]. We will discuss how SBC can be applied to the IHRM model (see [24] for the CRP-based generative model). The notation is summarized in Table 1.

Table 1. Notation used in this paper

Symbol       Description
C            number of object classes
B            number of relationship classes
N^c          number of objects in a class c
α^c_0        concentration parameter of an object class c
e^c_i        an object indexed by i in a class c
A^c_i        an attribute of an object e^c_i
θ^c_k        mixture component indexed by a hidden state k in an object class c
G^c_0        base distribution of an object class c
β^c          parameters of a base distribution G^c_0
R^b_{i,j}    relationship of class b between objects i, j
φ^b_{k,ℓ}    correlation mixture component indexed by hidden states k for c_i and ℓ for c_j, where c_i and c_j are the object classes involved in a relationship class b
G^b_0        base distribution of a relationship class b
β^b          parameters of a base distribution G^b_0

The stick breaking construction (SBC) [22] is a representation of DPs, by which we can explicitly sample random distributions of attribute parameters and relationship parameters. In the following we describe the generative model of the IHRM in terms of SBC.

1. For each object class c,
   (a) Draw mixing weights π^c ∼ Stick(·|α^c_0), defined as
       \[ V^c_k \overset{iid}{\sim} \mathrm{Beta}(1, \alpha^c_0); \quad \pi^c_1 = V^c_1, \quad \pi^c_k = V^c_k \prod_{k'=1}^{k-1} (1 - V^c_{k'}), \; k > 1. \qquad (1) \]
   (b) Draw i.i.d. mixture components θ^c_k ∼ G^c_0, k = 1, 2, . . .


2. For each relationship class b between two object classes c_i and c_j, draw φ^b_{k,ℓ} ∼ G^b_0 i.i.d. with component indices k for c_i and ℓ for c_j.
3. For each object e^c_i in a class c,
   (a) Draw the cluster assignment Z^c_i ∼ Mult(·|π^c);
   (b) Draw the object attributes A^c_i ∼ P(·|θ^c, Z^c_i).
4. For e^{c_i}_i and e^{c_j}_j with a relationship of class b, draw R^b_{i,j} ∼ P(·|φ^b, Z^{c_i}_i, Z^{c_j}_j).

The basic property of SBC is that the distributions of the parameters (θ^c_k and φ^b_{k,ℓ}) are sampled; e.g., the distribution of θ^c_k can be represented as G^c = \sum_{k=1}^{\infty} \pi^c_k \delta_{\theta^c_k}, where δ_{θ^c_k} is a distribution with a point mass on θ^c_k. In terms of this property, SBC can sample objects independently; thus it might be efficient when a large domain is involved.
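To make the SBC generative process concrete, the following is a minimal sketch in Python (our own illustration, not the authors' code), assuming a single object class with binary (Bernoulli) attributes, a single binary relationship class within that class, and a finite truncation level K; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha0, K):
    """Truncated stick breaking: pi_k = V_k * prod_{k' < k} (1 - V_k')."""
    V = rng.beta(1.0, alpha0, size=K)
    V[-1] = 1.0                                   # truncation, so the weights sum to 1
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return pi / pi.sum()                          # guard against floating-point drift

def generate(N=50, K=20, alpha0=1.0, beta=(1.0, 1.0), beta_r=(1.0, 1.0)):
    pi = stick_breaking(alpha0, K)                # mixing weights (step 1a)
    theta = rng.beta(*beta, size=K)               # attribute components (step 1b)
    phi = rng.beta(*beta_r, size=(K, K))          # correlation components (step 2)
    Z = rng.choice(K, size=N, p=pi)               # cluster assignments (step 3a)
    A = rng.binomial(1, theta[Z])                 # binary object attributes (step 3b)
    R = rng.binomial(1, phi[np.ix_(Z, Z)])        # binary relationships (step 4)
    return Z, A, R

Z, A, R = generate()
```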

3 Inference

The key inferential problem in the IHRM model is computing the posterior of the unobservable variables given the data, i.e. P({π^c, Θ^c, Z^c}_c, {Φ^b}_b | D, {α^c_0, G^c_0}_c, {G^b_0}_b). Unfortunately, the computation of the joint posterior is analytically intractable, thus we consider approximate inference methods to solve the problem.

3.1 Inference with Gibbs Sampling

Markov chain Monte Carlo (MCMC) sampling has been used to approximate the posterior distribution with a DP mixture prior. In this section, we describe efficient blocked Gibbs sampling (GS) with a truncated stick breaking representation [14] for the IHRM model. The advantage is that, given the posterior distributions, we can independently sample the hidden variables in a block, which highly accelerates the computation. The Markov chain is thus defined not only on the hidden variables, but also on the parameters.

The truncated stick breaking construction (TSB) fixes a value K^c for each class of objects by letting V^c_{K^c} = 1. That means the mixing weights π^c_k are equal to 0 for k > K^c (refer to Equ. 1). The number of clusters is thus reduced to K^c. Note that K^c is an additional parameter in the inference method.

At each iteration, we first update the hidden variables conditioned on the parameters sampled in the last iteration, and then update the parameters conditioned on the hidden variables. In detail:

1. For each class of objects,
   (a) Update each hidden variable Z^{c(t+1)}_i with probability proportional to:
       \[ \pi^{c(t)}_k \, P(A^c_i \mid Z^{c(t+1)}_i = k, \Theta^{c(t)}) \prod_{b'} \prod_{j'} P(R^{b'}_{i,j'} \mid Z^{c(t+1)}_i = k, Z^{c_{j'}(t)}_{j'}, \Phi^{b'(t)}), \qquad (2) \]
       where A^c_i and R^{b'}_{i,j'} denote the known attributes and relationships about i, c_{j'} denotes the class of the object j', and Z^{c_{j'}(t)}_{j'} denotes the hidden variable of j' at the last iteration t. Intuitively, the equation represents to what extent the cluster k agrees with the data D^c_i about the object i.


   (b) Update π^{c(t+1)} as follows:
       i. Sample v^{c(t+1)}_k from Beta(λ^{c(t+1)}_{k,1}, λ^{c(t+1)}_{k,2}) for k = {1, . . . , K^c − 1}, where
          \[ \lambda^{c(t+1)}_{k,1} = 1 + \sum_{i=1}^{N^c} \delta_k(Z^{c(t+1)}_i), \qquad \lambda^{c(t+1)}_{k,2} = \alpha^c_0 + \sum_{k'=k+1}^{K^c} \sum_{i=1}^{N^c} \delta_{k'}(Z^{c(t+1)}_i), \qquad (3) \]
          and set v^{c(t+1)}_{K^c} = 1. δ_k(Z^{c(t+1)}_i) equals 1 if Z^{c(t+1)}_i = k and 0 otherwise.
       ii. Compute π^{c(t+1)} as: π^{c(t+1)}_k = v^{c(t+1)}_k \prod_{k'=1}^{k-1} (1 − v^{c(t+1)}_{k'}) for k > 1, and π^{c(t+1)}_1 = v^{c(t+1)}_1.

2. Update θ^{c(t+1)}_k ∼ P(·|A^c, Z^{c(t+1)}, G^c_0) and φ^{b(t+1)}_{k,ℓ} ∼ P(·|R^b, Z^{(t+1)}, G^b_0). The parameters are drawn from their posterior distributions conditioned on the sampled hidden states. Again, since we assume conjugate priors as the base distributions (G^c_0 and G^b_0), the simulation is tractable.
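The stick-breaking weight update in step 1(b) can be illustrated with the following sketch (our own simplified code, not the authors' implementation); it assumes a single object class with truncation level K, and the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def update_mixing_weights(Z, K, alpha0):
    """Resample pi^c from the current cluster assignments Z according to Equ. 3,
    using a truncated stick breaking representation with K components."""
    counts = np.bincount(Z, minlength=K).astype(float)                 # sum_i delta_k(Z_i)
    tail = np.concatenate((np.cumsum(counts[::-1])[::-1][1:], [0.0]))  # sum over k' > k
    v = rng.beta(1.0 + counts[:-1], alpha0 + tail[:-1])                # v_k for k < K
    v = np.append(v, 1.0)                                              # v_K = 1
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))       # Equ. 1

# Example: 100 objects currently occupying 5 of K = 10 possible clusters.
Z = rng.integers(0, 5, size=100)
pi = update_mixing_weights(Z, K=10, alpha0=1.0)
```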

After convergence, we collect the last W samples to make predictions for the relationships of interest. Note that in blocked Gibbs sampling, the MCMC sequence is defined by hidden variables and parameters, including Z^{c(t)}, π^{c(t)}, Θ^{c(t)}, and Φ^{b(t)}. The predictive distribution of a relationship R^b_{new,j} between a new object e^c_{new} and a known object e^{c_j}_j is approximated as

\[ P(R^b_{new,j} \mid D, \{\alpha^c_0, G^c_0\}_{c=1}^{C}, \{G^b_0\}_{b=1}^{B}) \approx \frac{1}{W} \sum_{t=w+1}^{W+w} P(R^b_{new,j} \mid D, \{Z^{c(t)}, \pi^{c(t)}, \Theta^{c(t)}\}_{c=1}^{C}, \{\Phi^{b(t)}\}_{b=1}^{B}) \]
\[ \propto \frac{1}{W} \sum_{t=w+1}^{W+w} \sum_{k=1}^{K^c} P(R^b_{new,j} \mid \phi^{b(t)}_{k,\ell}) \, \pi^{c(t)}_k \, P(A^c_{new} \mid \theta^{c(t)}_k) \prod_{b'} \prod_{j'} P(R^{b'}_{new,j'} \mid \phi^{b'(t)}_{k,\ell'}), \]

where ℓ and ℓ' denote the cluster assignments of the objects j and j', respectively. The equation is quite intuitive. The prediction is a weighted sum of the predictions P(R^b_{new,j} | φ^{b(t)}_{k,ℓ}) over all clusters. The weight of each cluster is the product of the last three terms, which represents to what extent this cluster agrees with the known data (attributes and relationships) about the new object. Since the blocked method also samples the parameters, the computation is straightforward.

3.2 Inference with Variational Approximation

The IHRM model has multiple DPs which interact through relationships, thus blocked Gibbs sampling is still slow due to the slow exchange of information between DPs. To solve the problem, we outline an alternative solution via a variational inference method. The main strategy is to convert a probabilistic inference problem into an optimization problem, and then to solve that problem with known optimization techniques. In particular, the method assumes a distribution q, referred to as a variational distribution, to approximate the true posterior P as closely as possible. The difference between the variational distribution q and the true posterior P can be measured via the Kullback-Leibler (KL) divergence. Let ξ denote a set of unknown quantities, and D denote the known data. The KL divergence between q(ξ) and P(ξ|D) is defined as:

\[ \mathrm{KL}(q(\xi)\,\|\,P(\xi|D)) = \sum_{\xi} q(\xi) \log q(\xi) - \sum_{\xi} q(\xi) \log P(\xi|D). \qquad (4) \]

The smaller the divergence, the better the fit between the true and the approximate distributions. The probabilistic inference problem (i.e. computing the posterior) now becomes: minimize the KL divergence with respect to the variational distribution. In practice, the minimization of the KL divergence is formulated as the maximization of a lower bound of the log-likelihood:

\[ \log P(D) \ge \sum_{\xi} q(\xi) \log P(D, \xi) - \sum_{\xi} q(\xi) \log q(\xi). \qquad (5) \]

A mean-field method was explored in [6] to approximate the posterior of unobservable quantities in a DP mixture model. The main challenge of using mean-field inference for the IHRM model is that there are multiple DP mixture models coupled together with relationships and correlation mixture components. In the IHRM model, the unobservable quantities include Z^c, π^c, Θ^c and Φ^b. Since the mixing weights π^c are computed from V^c (see Equ. 1), we can replace π^c with V^c in the set of unobservable quantities. To approximate the posterior P({V^c, Θ^c, Z^c}_c, {Φ^b}_b | D, {α^c_0, G^c_0}_c, {G^b_0}_b), we define a variational distribution q({Z^c, V^c, Θ^c}_{c=1}^{C}, {Φ^b}_{b=1}^{B}) as:

\[ \left[ \prod_{c}^{C} \prod_{i}^{N^c} q(Z^c_i|\eta^c_i) \prod_{k}^{K^c} q(V^c_k|\lambda^c_k)\, q(\theta^c_k|\tau^c_k) \right] \left[ \prod_{b}^{B} \prod_{k}^{K^{c_i}} \prod_{\ell}^{K^{c_j}} q(\phi^b_{k,\ell}|\rho^b_{k,\ell}) \right], \qquad (6) \]

where c_i and c_j denote the object classes involved in the relationship class b, and k and ℓ denote the cluster indexes for c_i and c_j. The variational parameters include {η^c_i, λ^c_k, τ^c_k, ρ^b_{k,ℓ}}. q(Z^c_i|η^c_i) is a multinomial distribution with parameters η^c_i. Note that there is one η^c_i for each object e^c_i. q(V^c_k|λ^c_k) is a Beta distribution. q(θ^c_k|τ^c_k) and q(φ^b_{k,ℓ}|ρ^b_{k,ℓ}) have the same forms as G^c_0 and G^b_0, respectively.

We substitute Equ. 6 into Equ. 5 and optimize the lower bound with a coordinate ascent algorithm, which generates the following equations to iteratively update the variational parameters until convergence:

\[ \lambda^c_{k,1} = 1 + \sum_{i=1}^{N^c} \eta^c_{i,k}, \qquad \lambda^c_{k,2} = \alpha^c_0 + \sum_{i=1}^{N^c} \sum_{k'=k+1}^{K^c} \eta^c_{i,k'}, \qquad (7) \]
\[ \tau^c_{k,1} = \beta^c_1 + \sum_{i=1}^{N^c} \eta^c_{i,k} \mathrm{T}(A^c_i), \qquad \tau^c_{k,2} = \beta^c_2 + \sum_{i=1}^{N^c} \eta^c_{i,k}, \qquad (8) \]
\[ \rho^b_{k,\ell,1} = \beta^b_1 + \sum_{i,j} \eta^{c_i}_{i,k} \eta^{c_j}_{j,\ell} \mathrm{T}(R^b_{i,j}), \qquad \rho^b_{k,\ell,2} = \beta^b_2 + \sum_{i,j} \eta^{c_i}_{i,k} \eta^{c_j}_{j,\ell}, \qquad (9) \]
\[ \eta^c_{i,k} \propto \exp\Big( E_q[\log V^c_k] + \sum_{k'=1}^{k-1} E_q[\log(1 - V^c_{k'})] + E_q[\log P(A^c_i|\theta^c_k)] + \sum_{b'} \sum_{j} \sum_{\ell} \eta^{c_j}_{j,\ell} E_q[\log P(R^{b'}_{i,j}|\phi^{b'}_{k,\ell})] \Big), \qquad (10) \]


where λ^c_k denotes the parameters of the Beta distribution q(V^c_k|λ^c_k); λ^c_k is a two-dimensional vector λ^c_k = (λ^c_{k,1}, λ^c_{k,2}). τ^c_k denotes the parameters of the exponential family distribution q(θ^c_k|τ^c_k). We decompose τ^c_k such that τ^c_{k,1} contains the first dim(θ^c_k) components and τ^c_{k,2} is a scalar. Similarly, β^c_1 contains the first dim(θ^c_k) components and β^c_2 is a scalar. ρ^b_{k,ℓ,1}, ρ^b_{k,ℓ,2}, β^b_1 and β^b_2 are defined equivalently. T(A^c_i) and T(R^b_{i,j}) denote the sufficient statistics of the exponential family distributions P(A^c_i|θ^c_k) and P(R^b_{i,j}|φ^b_{k,ℓ}), respectively.

It is clear that Equ. 7 and Equ. 8 correspond to the updates for the variational parameters of an object class c, and they follow the equations in [6]. Equ. 9 represents the updates of the variational parameters for relationships, which are computed on the involved objects. The most interesting updates are Equ. 10, where the posteriors of the object cluster assignments are coupled together. These essentially connect the DPs. Intuitively, in Equ. 10 the posterior update for η^c_{i,k} includes a prior term (the first two expectations), the likelihood term about the object attributes (the third expectation), and the likelihood terms about relationships (the last term). To calculate the last term we need to sum over all the relationships of the object e^c_i, weighted by η^{c_j}_{j,ℓ}, which is the variational expectation about the cluster assignment of the other object involved in the relationship.
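A highly simplified sketch of this coordinate ascent loop is given below, assuming a single object class with fully observed binary attributes and a single fully observed binary relationship class within that class, and using the standard Beta-Bernoulli parameterization in place of the generic exponential-family form. It is our own illustration of Equ. 7-10, not the authors' implementation, and all names are hypothetical.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(2)

def coordinate_ascent(A, R, K=10, alpha0=1.0, beta=(1.0, 1.0), beta_r=(1.0, 1.0), iters=50):
    """Mean-field updates for one object class with binary attributes A (shape N)
    and binary relations R (shape N x N) within the same class."""
    N = A.shape[0]
    eta = rng.dirichlet(np.ones(K), size=N)          # q(Z_i = k)
    for _ in range(iters):
        counts = eta.sum(axis=0)
        # Equ. 7: Beta parameters of q(V_k)
        lam1 = 1.0 + counts
        lam2 = alpha0 + np.concatenate((np.cumsum(counts[::-1])[::-1][1:], [0.0]))
        # Equ. 8 (Beta-Bernoulli form): parameters of q(theta_k)
        tau1 = beta[0] + eta.T @ A
        tau2 = beta[1] + counts - eta.T @ A
        # Equ. 9 (Beta-Bernoulli form): parameters of q(phi_kl)
        rho1 = beta_r[0] + eta.T @ R @ eta
        rho2 = beta_r[1] + eta.T @ (1.0 - R) @ eta
        # Equ. 10: update q(Z_i) from expectations under the current q
        ElogV = digamma(lam1) - digamma(lam1 + lam2)
        Elog1mV = digamma(lam2) - digamma(lam1 + lam2)
        ElogV[-1] = 0.0                              # truncation: V_K = 1
        prior = ElogV + np.concatenate(([0.0], np.cumsum(Elog1mV[:-1])))
        Elogth = digamma(tau1) - digamma(tau1 + tau2)
        Elog1mth = digamma(tau2) - digamma(tau1 + tau2)
        lik_attr = np.outer(A, Elogth) + np.outer(1.0 - A, Elog1mth)
        Elogphi = digamma(rho1) - digamma(rho1 + rho2)
        Elog1mphi = digamma(rho2) - digamma(rho1 + rho2)
        lik_rel = R @ eta @ Elogphi.T + (1.0 - R) @ eta @ Elog1mphi.T
        log_eta = prior + lik_attr + lik_rel
        log_eta -= log_eta.max(axis=1, keepdims=True)
        eta = np.exp(log_eta)
        eta /= eta.sum(axis=1, keepdims=True)
    return eta

# Toy data: 30 objects with one binary attribute and a random binary relation.
A = rng.integers(0, 2, size=30).astype(float)
R = rng.integers(0, 2, size=(30, 30)).astype(float)
eta = coordinate_ascent(A, R)
```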

Once the procedure reaches stationarity, we obtain the optimized variational parameters, by which we can approximate the predictive distribution P(R^b_{new,j} | D, {α^c_0, G^c_0}_{c=1}^{C}, {G^b_0}_{b=1}^{B}) of the relationship R^b_{new,j} between a new object e^c_{new} and a known object e^{c_j}_j with q(R^b_{new,j} | D, λ, η, τ, ρ) proportional to:

\[ \sum_{k}^{K^c} \sum_{\ell}^{K^{c_j}} q(R^b_{new,j}|\rho^b_{k,\ell})\, q(Z^{c_j}_j = \ell \mid \eta^{c_j}_j)\, q(Z^c_{new} = k \mid \lambda^c) \times q(A^c_{new}|\tau^c_k) \prod_{b'} \prod_{j'} \sum_{\ell'} q(Z^{c_{j'}}_{j'} = \ell' \mid \eta^{c_{j'}}_{j'})\, q(R^{b'}_{new,j'}|\rho^{b'}_{k,\ell'}). \qquad (11) \]

The prediction is a weighted sum of the predictions q(R^b_{new,j}|ρ^b_{k,ℓ}) over all clusters. The weight consists of two parts. One is to what extent the cluster ℓ agrees with the object e^{c_j}_j (i.e. the second term); the other is to what extent the cluster k agrees with the new object (i.e. the product of the last three terms). The computations of the two parts differ because e^{c_j}_j is a known object, for which we have already optimized the variational parameters η^{c_j}_j about its cluster assignment.

4 Experimental Analysis

4.1 Monastery Data

The first experiment is performed on Sampson's monastery dataset [19] for community discovery. Sampson surveyed social relationships between 18 monks in an isolated American monastery. The relationships between monks include esteem/disesteem, like/dislike, positive influence/negative influence, praise and blame. Breiger et al. [7] summarized these relationships and yielded a single relationship matrix, which reflects the interactions between monks, as shown in Fig. 3 (left).

Fig. 3. Left: The matrix displaying interactions between monks. Middle: A sociogram for three monks. Right: The IHRM model for the monastery sociogram.

After observing the monks in the monastery for several months, Sampson provided a description of the factions among the monks: the loyal opposition (Peter, Bonaventure, Berthold, Ambrose and Louis), the young turks (John Bosco, Gregory, Mark, Winfrid, Hugh, Boniface and Albert) and the outcasts (Basil, Elias and Simplicius). The other three monks (Victor, Ramuald and Amand) wavered between the loyal opposition and the young turks, and were identified as a fourth group, the waverers. Sampson's observations were confirmed by the event that the young turks group resigned after the leaders of the group were expelled over religious differences. The task of the experiment is to cluster the monks.

Fig. 3 (middle) shows a sociogram with 3 monks. The IHRM model for the monastery network is illustrated as Fig. 3 (right). There is one hidden variable for each monk. The relationships between monks are conditioned on the hidden variables of the involved monks. The mean field method is used for inference. We initially assume that each monk is in his own cluster. After convergence, the cluster number is optimized as 4, which is exactly the same as the number of groups that Sampson identified. The clustering result is shown as Table 2. It is quite close to the real groups. Cluster 1 corresponds to the loyal opposition. Cluster 2 is the young turks, and cluster 3 is the outcasts. The waverers are split: Amand is assigned to cluster 4, while Victor and Ramuald are assigned to the loyal opposition. Actually, previous research has questioned the distinction of the waverers; e.g., [7,12] clustered Victor and Ramuald into the loyal opposition, which coincides with the result of the IHRM model.

Table 2. Clustering result of the IHRM model on Sampson’s monastery data

Cluster  Members
1        Peter, Bonaventure, Berthold, Ambrose, Louis, Victor, Ramuald
2        John, Gregory, Mark, Winfrid, Hugh, Boniface, Albert
3        Basil, Elias, Simplicius
4        Amand


4.2 Bernard & Killworth Data

In the second experiment, we perform link analysis with the IHRM on the Bernard & Killworth data [5]. Bernard and Killworth collected several data sets on human interactions in bounded groups. In each study they obtained measures of social interactions among all actors, and ranking data based on the subjects' memory of those interactions. Our experiments are based on three datasets. The BKFRAT data is about interactions among students living in a fraternity at a West Virginia college. All subjects had been residents in the fraternity from three months to three years. The data consists of rankings made by the subjects of how frequently they interacted with other subjects in the observation week. The BKOFF data concern interactions in a small business office. Observations were made as the observer patrolled a fixed route through the office every fifteen minutes during two four-day periods. The data contains rankings of interaction frequency as recalled by employees over the two-week period. The BKTEC data is about interactions in a technical research group at a West Virginia university. It contains the personal rankings of the remembered frequency of interactions.

In the experiments, we randomly select 50% (60%, 70%, 80%) of the interactions as known and predict the remaining ones. The experiments are repeated 20 times for each setting. The average prediction accuracy is reported in Table 3. We compare our model with the Pearson collaborative filtering method. The results show that the IHRM model provides better performance on all three datasets. Fig. 4 illustrates the link prediction results on the BKOFF dataset with 70% known links. The predicted interaction matrix is quite similar to the real one.

Table 3. Link prediction on the Bernard & Killworth data with the IHRM

                          Prediction Accuracy (%)
             50%              60%              70%              80%
          IHRM   Pearson   IHRM   Pearson   IHRM   Pearson   IHRM   Pearson
BKFRAT    66.50  61.82     67.63  64.56     68.26  66.91     68.69  67.41
BKOFF     66.21  57.32     67.89  59.45     69.20  60.58     69.82  61.54
BKTEC     65.47  58.85     66.79  62.04     68.31  63.61     69.58  64.46

Fig. 4. Left: Interaction matrix on the BKOFF data. Right: The predicted one, which is quite similar to the real situation.


4.3 MovieLens Data

We also evaluate the IHRM model on the MovieLens data [21]. There are in total 943 users and 1680 movies, and we obtain 702 users and 603 movies after removing low-frequency ones. Each user has about 112 ratings on average. The model is shown in Fig. 5. There are two classes of objects (users and movies) and one class of relationships (Like). The task is to predict the preferences of users. The users have the attributes Age, Gender and Occupation, and the movies have attributes such as Published-year and Genres. The relationships have two states, where R = 1 indicates that the user likes the movie and R = 0 otherwise. The user ratings in MovieLens are originally on a five-star scale, so we convert each rating to a binary value with R = 1 if the rating is higher than the user's average rating and R = 0 otherwise. The performance of the IHRM model is analyzed from two perspectives: prediction accuracy and clustering effect. To evaluate the prediction performance, we perform 4 sets of experiments which respectively select 5, 10, 15 and 20 ratings for each test user as the known ratings, and predict the remaining ratings. These experiments are referred to as Given5, Given10, Given15 and Given20 in the following. For testing, a relationship is predicted to exist (i.e., R = 1) if the predictive probability is larger than a threshold ε = 0.5.
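As an illustration of the rating binarization described above, the following sketch (ours, using pandas and toy data, not the authors' code) converts five-star ratings to the binary relationship R relative to each user's average rating.

```python
import pandas as pd

# Toy ratings table: one row per (user, movie) pair with the original 1-5 star rating.
ratings = pd.DataFrame({
    "user":  [1, 1, 1, 2, 2],
    "movie": [10, 11, 12, 10, 13],
    "stars": [5, 3, 2, 4, 4],
})

# R = 1 if the rating is above the user's own average rating, and R = 0 otherwise.
user_mean = ratings.groupby("user")["stars"].transform("mean")
ratings["R"] = (ratings["stars"] > user_mean).astype(int)
print(ratings)
```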

We implement the following three inference methods: Chinese restaurant process Gibbs sampling (CRPGS), truncated stick-breaking Gibbs sampling (TSBGS), and the corresponding mean field method TSBMF. The truncation parameters K^c for TSBGS and TSBMF are initially set to the number of entities. For TSBMF we consider α_0 = {5, 10, 100, 1000}, and obtain the best prediction when α_0 = 100. For CRPGS and TSBGS, α_0 is 100. For the variational inference, the change of the variational parameters between two iterations is monitored to determine convergence.

Fig. 5. Top: A sociogram for the movie recommendation system, illustrated with 2 users and 3 movies. For readability, only two attributes (user's occupation and movie's genre) are shown in the figure. Bottom: The IHRM model for the sociogram.


Fig. 6. Left: The traces of the number of user clusters for the runs of the two Gibbs samplers. Middle: The trace of the change of the variational parameter η^u for the mean field method. Right: The sizes of the largest user clusters for the three inference methods.

For the Gibbs samplers, convergence was analyzed by three measures: the Geweke statistic on the likelihood, the Geweke statistic on the number of components for each class of objects, and autocorrelation. Fig. 6 (left) shows the trace of the number of user clusters in the two Gibbs samplers. Fig. 6 (middle) illustrates the change of the variational parameters η^u in the variational inference. For CRPGS, the first w = 50 iterations (6942 s) are discarded as the burn-in period, and the last W = 1400 iterations are collected to approximate the predictive distributions. For TSBGS, we have w = 300 (5078 s) and W = 1700. Although the number of iterations for the burn-in period is much smaller for CRPGS than for the blocked Gibbs sampler, each iteration is approximately a factor of 5 slower. The reason is that CRPGS samples the hidden variables one by one, which causes two additional time costs. First, the expectations of the attribute parameters and relational parameters have to be updated when sampling each user/movie. Second, the posteriors of the hidden variables have to be computed one by one, thus we cannot use fast matrix multiplication techniques to accelerate the computation. Therefore, if we include the time required to collect a sufficient number of samples for inference, CRPGS is slower by a factor of 5 (the row Time(s) in Table 4) than the blocked sampler. The mean field method is again a factor of around 10 faster than the blocked Gibbs sampler and thus almost two orders of magnitude faster than CRPGS.

Table 4. Performance of the IHRM model on MovieLens data

            CRPGS    TSBGS   TSBMF   Pearson   SVD
Given5      65.13    65.51   65.26   57.81     63.72
Given10     65.71    66.35   65.83   60.04     63.97
Given15     66.73    67.82   66.54   61.25     64.49
Given20     68.53    68.27   67.63   62.41     65.13
Time(s)     164993   33770   2892    -         -
Time/iter.  109      17      19      -         -
#C.u        47       59      9       -         -
#C.m        77       44      6       -         -


The prediction results are shown in Table 4. All IHRM inference methods under consideration achieve comparably good performance; the best results are achieved by the Gibbs samplers. To verify the performance of the IHRM, we also implement the Pearson-coefficient collaborative filtering (CF) method [18] and an SVD-based CF method [20]. It is clear that the IHRM outperforms the traditional CF methods, especially when there are few known ratings for the test users. The main advantage of the IHRM is that it can exploit attribute information. If this information is removed, the performance of the IHRM becomes close to the performance of the SVD approach. For example, after ignoring all attribute information, TSBMF generates the following predictive results: 64.55% for Given5, 65.45% for Given10, 65.90% for Given15, and 66.79% for Given20.

The IHRM provides cluster assignments for all objects involved, in our case for the users and the movies. The rows #C.u and #C.m in Table 4 denote the number of clusters for users and movies, respectively. The Gibbs samplers converge to 46-60 clusters for the users and 44-78 clusters for the movies. The mean field solution has a tendency to converge to a smaller number of clusters, depending on the value of α_0. Further analysis shows that the clustering results of the methods are actually similar. First, the sizes of most clusters generated by the Gibbs samplers are very small; e.g., 72% (75.47%) of the user clusters have fewer than 5 members in CRPGS (TSBGS). Fig. 6 (right) shows the sizes of the 20 largest user clusters of the 3 methods. Intuitively, the Gibbs samplers tend to assign the outliers to new clusters. Second, we compute the Rand index (0-1) of the clustering results of the methods; the values are 0.8071 between CRPGS and TSBMF, and 0.8221 between TSBGS and TSBMF, which demonstrates the similarity of the clustering results.
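The (unadjusted) Rand index used above can be computed as in the following small sketch (our own helper, not the authors' code).

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Unadjusted Rand index in [0, 1] between two clusterings of the same items."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

# Example: two similar partitions of six objects.
print(rand_index([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2]))
```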

Fig. 7 gives the movies with the highest posterior probability in the 4 largest clusters generated from TSBMF.

Fig. 7. The major movie clusters generated by TSBMF on MovieLens data


In cluster 1 most movies are very new and popular (the data set was collected from September 1997 through April 1998), and they tend to be action and thriller movies. Cluster 2 includes many old movies, or movies produced outside the USA; they tend to be drama movies. Cluster 3 contains many comedies. In cluster 4 most movies have relatively serious themes. Overall we were quite surprised by the good interpretability of the clusters. Fig. 8 (top) shows the relative frequency coefficient (RFC) of the attribute Genre in these movie clusters. The RFC of a genre s in a cluster k is calculated as (f_{k,s} − f_s)/σ_s, where f_{k,s} is the frequency of the genre s in the movie cluster k, f_s is its mean frequency, and σ_s is the standard deviation of its frequency. The labels for each cluster specify the dominant genres in the cluster. For example, action and thriller are the two most frequent genres in cluster 1. In general, each cluster involves several genres. It is clear that the movie clusters are related to, but not solely based on, the movie attribute Genre; the clustering effect depends on both movie attributes and user ratings. Fig. 8 (bottom) shows the RFC of the attribute Occupation in the user clusters. Equivalently, the labels for each user cluster specify the dominant occupations in the cluster.
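The relative frequency coefficient can be computed as in the following sketch (ours), where freq is a hypothetical cluster-by-genre frequency matrix.

```python
import numpy as np

def relative_frequency_coefficient(freq):
    """freq[k, s]: frequency of genre s in movie cluster k.
    Returns (freq[k, s] - mean over clusters) / (std over clusters) per genre."""
    freq = np.asarray(freq, dtype=float)
    return (freq - freq.mean(axis=0)) / freq.std(axis=0)

# Toy example: 3 clusters x 2 genres.
print(relative_frequency_coefficient([[10, 2], [3, 8], [5, 5]]))
```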

Note that in the experiments we predicted a relationship attribute R indicating the rating of a user for a movie. The underlying assumption is that in principle anybody can rate any movie, no matter whether that person has watched the movie or not. If the latter is important, we could introduce an additional attribute Exist to specify whether a user actually watched the movie. The relationship R would then only be included in the probabilistic model if the movie was actually watched by the user.

Fig. 8. Top: The relative frequency coefficient of the attribute Genre in different movie clusters. Bottom: The relative frequency coefficient of the attribute Occupation in different user clusters.


5 Related Work

The work on the infinite relational model (IRM) [15] is similar to the IHRM, and has been developed independently. One difference is that the IHRM can specify any reasonable probability distribution for an attribute given its parent, whereas the IRM would model an attribute as a unary predicate, i.e. it would need to transform the conditional distribution into a logical binary representation. Aukia et al. also developed a DP mixture model for large networks [4]. The model associates an infinite-dimensional hidden variable with each link (relationship), and the objects involved in the link are drawn from a multinomial distribution conditioned on the hidden variable of the link. The model is applied to community web data with promising experimental results. The latent mixed-membership model [1] can be viewed as a generalization of the LDA model to relational data. Although it is not nonparametric, the model exploits hidden variables to avoid extensive structure learning and provides a principled way to model relational networks. The model associates each object with a membership probability-like vector. For each relationship, the cluster assignments of the involved objects are generated with respect to their membership vectors, and then the relationship is drawn conditioned on the cluster assignments.

There are some other important SRL research works for complex relational networks. The probabilistic relational model (PRM) with class hierarchies [10] specializes distinct probabilistic dependencies for each subclass, and thus obtains refined probabilistic models for relational data. A group-topic model is proposed in [23]; it jointly discovers latent groups in a network as well as latent topics of events between objects. The latent group model in [16] introduces two latent variables ci and gi for an object, where ci is conditioned on gi. The object attributes depend on ci, and the relations depend on gi of the involved objects. The limitation is that only relations between members of the same group are considered. These models demonstrate good performance in certain applications. However, most are restricted to domains with simple relationships.

6 Extension: Conditional IHRM

We have presented the IHRM model and an empirical analysis of social network data. As a generative model, the IHRM models both object attributes and relational attributes as random variables conditioned on clusters of objects. If the goal is to predict relationship attributes, one might expect to obtain improved prediction performance by training a model conditioned on the attributes. As part of ongoing work, we study the extension of the IHRM model to discriminative learning. A conditional IHRM directly models the posterior probability of relations given features derived from the attributes of the objects. Fig. 9 illustrates the conditional IHRM model with a simple sociogram example.


Fig. 9. A conditional IHRM model for a simple sociogram. The main difference from the IHRM model in Fig. 2 is that the attributes G do not influence the relations R indirectly via the object clusters Z, but directly condition the relations.

The main difference to the IHRM model in Fig. 2 is that the relationship attributes are conditioned on both the states of the latent variables and features derived from the attributes. A simple conditional model is based on logistic regression of the form

\[ \log P(R_{i,j} \mid Z_i = k, Z_j = \ell, F(G_i, G_j)) = \sigma(\langle \omega_{k,\ell}, x_{i,j} \rangle), \]

where x_{i,j} = F(G_i, G_j) denotes a vector of features derived from all attributes of i and j. ω_{k,ℓ} is a weight vector, which determines how much a particular attribute contributes to the choice of relation and can implicitly implement feature selection. Note that there is one weight vector for each cluster pair (k, ℓ). ⟨·, ·⟩ denotes an inner product. σ(·) is a real-valued function of any form σ : R → R. The joint probability of the conditional model is now written as:

\[ P(R, Z \mid G) = \prod_{i} P(Z_i) \prod_{i,j} P(R_{i,j} \mid Z_i, Z_j, F(G_i, G_j)), \qquad (12) \]

where P(Z_i) is still defined via the stick breaking construction (Equ. 1). Preliminary experiments show promising results, and we will report further results in future work.
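As an illustration only, the following sketch instantiates a conditional relation model in the spirit of the above, assuming σ is the logistic sigmoid and reading the model as giving the probability of a relation directly; the weight tensor omega and the pair feature vector x_ij are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relation_probability(z_i, z_j, x_ij, omega):
    """P(R_ij = 1 | Z_i = k, Z_j = l, x_ij), with one weight vector per cluster pair.
    omega has shape (K, K, d); x_ij = F(G_i, G_j) is the d-dimensional pair feature vector."""
    return sigmoid(omega[z_i, z_j] @ x_ij)

# Toy example: K = 3 clusters, d = 4 pair features.
omega = rng.normal(size=(3, 3, 4))
x_ij = np.array([1.0, 0.0, 1.0, 0.5])
p = relation_probability(0, 2, x_ij, omega)
```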

7 Conclusions

This paper presents a nonparametric relational model, the IHRM, for social network modeling and analysis. The IHRM model enables expressive knowledge representation of social networks and allows for flexible probabilistic inference without the need for extensive structural learning. The IHRM model can be applied to community detection, link prediction, and product recommendation. The empirical analysis on social network data showed encouraging results with interpretable clusters and relation prediction. For future work, we will explore discriminative relational models for better performance. It will also be interesting to perform analysis on more complex relational structures in social network systems, such as domains including hierarchical class structures.


Acknowledgments

This research was supported by the German Federal Ministry of Economy and Technology (BMWi) research program THESEUS, the EU FP7 project LarKC, and the Fraunhofer ATTRACT fellowship STREAM.

References

1. Airoldi, E.M., Blei, D.M., Xing, E.P., Fienberg, S.E.: A latent mixed-membership model for relational data. In: Proc. ACM SIGKDD Workshop on Link Discovery (2005)
2. Aldous, D.: Exchangeability and related topics. In: Ecole d'Ete de Probabilites de Saint-Flour XIII 1983, pp. 1-198. Springer, Heidelberg (1985)
3. Antoniou, G., van Harmelen, F.: A Semantic Web Primer. MIT Press, Cambridge (2004)
4. Aukia, J., Kaski, S., Sinkkonen, J.: Inferring vertex properties from topology in large networks. In: NIPS 2007 Workshop on Statistical Models of Networks (2007)
5. Bernard, H., Killworth, P., Sailer, L.: Informant accuracy in social network data IV. Social Networks 2 (1980)
6. Blei, D., Jordan, M.: Variational inference for DP mixtures. Bayesian Analysis 1(1), 121-144 (2005)
7. Breiger, R.L., Boorman, S.A., Arabie, P.: An algorithm for clustering relational data with applications to social network analysis and comparison to multidimensional scaling. Journal of Mathematical Psychology 12 (1975)
8. Dzeroski, S., Lavrac, N. (eds.): Relational Data Mining. Springer, Berlin (2001)
9. Getoor, L., Friedman, N., Koller, D., Pfeffer, A.: Learning probabilistic relational models. In: Dzeroski, S., Lavrac, N. (eds.) Relational Data Mining. Springer, Heidelberg (2001)
10. Getoor, L., Koller, D., Friedman, N.: From instances to classes in probabilistic relational models. In: Proc. ICML 2000 Workshop on Attribute-Value and Relational Learning (2000)
11. Getoor, L., Taskar, B. (eds.): Introduction to Statistical Relational Learning. MIT Press, Cambridge (2007)
12. Handcock, M.S., Raftery, A.E., Tantrum, J.M.: Model-based clustering for social networks. Journal of the Royal Statistical Society 170 (2007)
13. Hofmann, T., Puzicha, J.: Latent class models for collaborative filtering. In: Proc. 16th International Joint Conference on Artificial Intelligence (1999)
14. Ishwaran, H., James, L.: Gibbs sampling methods for stick breaking priors. Journal of the American Statistical Association 96(453), 161-173 (2001)
15. Kemp, C., Tenenbaum, J.B., Griffiths, T.L., Yamada, T., Ueda, N.: Learning systems of concepts with an infinite relational model. In: Proc. 21st Conference on Artificial Intelligence (2006)
16. Neville, J., Jensen, D.: Leveraging relational autocorrelation with latent group models. In: Proc. 4th International Workshop on Multi-Relational Mining, pp. 49-55. ACM Press, New York (2005)
17. Raedt, L.D., Kersting, K.: Probabilistic logic learning. SIGKDD Explor. Newsl. 5(1), 31-48 (2003)
18. Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., Riedl, J.: GroupLens: An open architecture for collaborative filtering of netnews. In: Proc. ACM 1994 Conference on Computer Supported Cooperative Work, pp. 175-186. ACM, New York (1994)
19. Sampson, F.S.: A Novitiate in a Period of Change: An Experimental and Case Study of Social Relationships. PhD thesis (1968)
20. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Application of dimensionality reduction in recommender systems - a case study. In: WebKDD Workshop (2000)
21. Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J.: Analysis of recommender algorithms for e-commerce. In: Proc. ACM E-Commerce Conference, pp. 158-167. ACM, New York (2000)
22. Sethuraman, J.: A constructive definition of Dirichlet priors. Statistica Sinica 4, 639-650 (1994)
23. Wang, X., Mohanty, N., McCallum, A.: Group and topic discovery from relations and text. In: Proc. 3rd International Workshop on Link Discovery, pp. 28-35. ACM, New York (2005)
24. Xu, Z., Tresp, V., Yu, K., Kriegel, H.-P.: Infinite hidden relational models. In: Proc. 22nd UAI (2006)
25. Xu, Z., Tresp, V., Yu, S., Yu, K.: Nonparametric relational learning for social network analysis. In: Proc. 2nd ACM Workshop on Social Network Mining and Analysis, SNA-KDD 2008 (2008)
26. Yedidia, J., Freeman, W., Weiss, Y.: Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory 51(7), 2282-2312 (2005)


Using Friendship Ties and Family Circlesfor Link Prediction

Elena Zheleva1, Lise Getoor1, Jennifer Golbeck2, and Ugur Kuter1

1 Department of Computer Science and Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland 20742, USA
{elena,getoor,ukuter}@cs.umd.edu
2 College of Information Studies, University of Maryland, College Park, Maryland 20742, USA
[email protected]

Abstract. Social networks can capture a variety of relationships among the participants. Both friendship and family ties are commonly studied, but most existing work studies them in isolation. Here, we investigate how these networks can be overlaid, and propose a feature taxonomy for link prediction. We show that when there are tightly-knit family circles in a social network, we can improve the accuracy of link prediction models. This is done by making use of family circle features based on the likely structural equivalence of family members. We investigated the predictive power of overlaying friendship and family ties on three real-world social networks. Our experiments demonstrate significantly higher prediction accuracy (between 15% and 30% more accurate) compared to using more traditional features such as descriptive node attributes and structural features. The experiments also show that a combination of all three types of attributes results in the best precision-recall trade-off.

1 Introduction

There is a growing interest in social media and in data mining methods which can be used to analyze, support and enhance the effectiveness and utility of social media sites. The analysis methods being developed build on traditional methods from the social network analysis community, extend them to deal with the heterogeneity and growing size of the data being generated, and use tools from graph mining, statistical relational learning and methods for information extraction from unstructured and semi-structured text.

Traditionally, social network analysis has focused on actors and ties (or relationships) between them, such as friendships or kinships. The two most common types of networks are (1) unimodal networks, where the nodes are actors and the edges represent ties such as friendships, and (2) affiliation networks, which can be represented as bipartite graphs, where there are two types of nodes, the actors and organizations, and the edges represent the affiliations between actors and organizations. Most of the existing work has focused on networks that exhibit a single relationship type, either friendship or affiliation.


In this paper, we investigate the power of combining friendship and affiliation networks. We use the notion of structural equivalence, where two actors are similar based on participating in equivalent relationships, which is fundamental to finding groups in social networks. Our approach is an attempt to bridge approaches based on structural equivalence and community detection, where densely connected groups of actors are clustered together into communities. We show how predictive models, based on descriptive, structural, and community features, perform surprisingly well on challenging link-prediction tasks.

We validate our results on a trio of social media websites describing friendships and family relationships. We show that our models are able to predict links accurately, in this case friendship relationships, in held-out test data. This is typically a very challenging prediction problem. With our results, we also hope to motivate further research in discovering closely-knit groups in social networks and using them to improve link-prediction performance.

Our link-prediction approach can be applied in a variety of domains. The important properties of the data that we use are that there are actors, links between them, and closely-knit groups such as families, housemates or officemates. In some data, groups are given; in other datasets, it may be necessary to first cluster the nodes in a meaningful manner. For example, in email communication networks, such as Enron [1,2], groups could be cliques of people that email each other frequently. In the widely studied co-authorship networks [3,4,5,6,7,8,9], affiliation groups may be cliques of authors that collaborate on many papers together. In these domains, the link-prediction task translates to finding people who are likely to communicate with each other [1] or authors who are likely to collaborate in the future [5,8].

Our contributions include the following:

– We propose a general framework for combining social and affiliation networks.
– We show how to instantiate it for overlaying friendship and family networks.
– We show how features of the overlaid networks can be used to accurately predict friendship relationships.
– We validate our results on three social media websites.

In Section 2, we describe the link prediction problem that we focus on in this paper, and in Section 3, the social network model. Section 4 addresses the taxonomy of the descriptive, structural, and group features that we used for link prediction in our overlaid networks. We then propose a comparison of our network overlay method with two alternatives in Section 5. We describe experimental results in Section 6, related work in Section 7, the generality of our approach in Section 8, and discuss conclusions and future work in Section 9.

2 Link Prediction Problem

In this paper we study the problem of predicting friendship links in multi-relational social networks. This problem is closely related to the problems of link prediction [4,1,5,8], link completion [10], and anomalous link discovery [1,9], which are covered in more depth in Section 7.

Link prediction in social networks is useful for a variety of tasks. The most straightforward use is for making data entry easier – a link-prediction system can propose links, and users can select the friendship links that they would like to include, rather than having to enter the friendship links manually. Link prediction is also a core component of any system for dynamic network modeling – the dynamic model can predict which actors are likely to gain popularity, and which are likely to become central according to various social network metrics.

Link prediction is challenging for a number of reasons. When it is posed as a pair-wise classification problem, one of the fundamental challenges is dealing with the large outcome space; if there are n actors, there are n^2 possible relations. In addition, because most social networks are sparsely connected, the prior probability of any link is extremely small, thus we have to contend with a large class skew problem. Furthermore, because the number of links is potentially so large, the number of negative instances will be huge, so constructing a representative training set is challenging.

In our approach to link prediction in multi-relational social networks, we explore the use of both attribute and structural features, and, in particular, we study how group membership (in our case, family membership) can significantly aid in accurate link (here, friendship) prediction.

3 Social Network Model

Social networks describe actors and their relationships. The actors can have properties or attributes such as age and income. Relationships can represent dyadic (binary) relationships or group memberships (cliques or hyperedges); in addition, relationships can be directed, undirected and/or weighted. Here we consider both dyadic and group-membership relationships. Specifically, we consider friendship relationships and family group memberships. In our domain, these are undirected, unweighted relationships.

More formally, the networks we consider consist of the following:

actors: a set of actors A = {a_1, . . . , a_n}, and a grouping or partitioning of the actors into non-overlapping groups:
groups: a group of individuals connected through a common affiliation. The affiliations group the actors into sets G = {G_1, . . . , G_m}.

Affiliation groups can be a partitioning of the actors, or overlapping groups of actors. In this work, families of actors are such affiliation groups.

We consider the following relationships:

friends: F{a_i, a_j} denotes that a_i is friends with a_j, and
family: M{a_i, G_k} denotes that a_i is a member of family G_k.


Fig. 1. Actors in the same tightly-knit group often exhibit structural equivalence, i.e., they have the same connections to all other nodes. Using the original network (a) and a structural equivalence assumption, one can construct a network with new predicted links (b).

Actors can have attributes; if b is an attribute, then we use a_i.b to denote the b attribute of actor a_i. We denote the set of friends of actor a_i by a_i.F, and the set of family members of the same actor as a_i.M.

Figure 1(a) shows an example network of eight actors and five groups. Each node represents an actor, and a group is shown as a circle around the actors. The thick lines inside a group mark family relationships, and the thin black lines denote friendship relationships. Every actor belongs to at least one group. There are single-member groups, and there are actors without friends.

4 A Feature Taxonomy for Multi-relational Social Networks

We identified three classes of features that describe characteristics of potential links in a multi-relational social network:

– Descriptive attributes are attributes inherent to the nodes, and they do not consider the structure of the network.
– Structural attributes include characteristics of the networks based on the friendship relationships, such as node degree.
– Group attributes are based on structural properties of the network when both types of relationships, friendship and family, are considered. The groups in this case are the cliques of family members.

Each feature within a class can be assigned to an actor or to a pair of actors (corresponding to a potential edge). The following sections describe our taxonomy of the features in more detail.


4.1 Descriptive Attributes

The descriptive attributes are attributes of nodes in the social network that do not consider the link structure of the network. These features vary across domains. They provide semantic insight into the inherent properties of each node in a social network, or compare the values of the same inherent attributes for a pair of nodes.

We define two classes of descriptive attributes for multi-relational social networks:

1. Actor features. These are inherent characteristics of an actor.
2. Actor-pair features. The actor-pair features compare the values of the same node attribute for a pair of nodes.

4.2 Structural Features

The next set of features that we introduce describes features of the network structure. The first is a structural feature for a single node, a_i, while the remaining ones describe structural attributes of pairs of nodes, a_i and a_j.

1. Actor features. These features describe the link structure around a node.
   Number of friends. The degree, or number of friends, of an actor a_i: |a_i.F|.
2. Actor-pair features. These features describe how interconnected two nodes are. They measure the sets of friends that two actors have, a_i.F and a_j.F.
   Number of common friends. The number of friends that the pair of nodes have in common in the network: |a_i.F ∩ a_j.F|.
   Jaccard coefficient of the friend sets. The Jaccard coefficient over the friend sets of two actors describes the ratio of the number of their common friends to their total number of friends:
   Jaccard(a_i, a_j) = |a_i.F ∩ a_j.F| / |a_i.F ∪ a_j.F|.
   The Jaccard coefficient is a standard metric for measuring the similarity of two sets. Unlike the number of common friends feature, it considers the size of the friendship circle of each actor.
   Density of common friends. For the set of common friends, the density is the number of friendship links between the common friends over the number of all possible friendship links in the set. The density of common friends of two nodes describes the strength of the community of common friends. Density is also known as the clustering coefficient.
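The actor-pair structural features above can be computed directly from the friend sets; the following is a small illustrative sketch (ours, with hypothetical names and toy data), not the code used in the paper.

```python
from itertools import combinations

def structural_pair_features(friends, a_i, a_j):
    """friends: dict mapping each actor to the set of its friends."""
    common = friends[a_i] & friends[a_j]
    union = friends[a_i] | friends[a_j]
    jaccard = len(common) / len(union) if union else 0.0
    # Density of common friends: fraction of possible links among the common friends.
    pairs = list(combinations(common, 2))
    links = sum(1 for u, v in pairs if v in friends[u])
    density = links / len(pairs) if pairs else 0.0
    return {"common_friends": len(common), "jaccard": jaccard, "density": density}

friends = {"a": {"b", "c", "d"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"a", "b"}}
print(structural_pair_features(friends, "a", "b"))
```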

4.3 Group Features

The third category of features that we consider is based on group membership; in the networks we studied, the groups are families. These are the features that overlay the friendship and affiliation networks.


1. Actor features. These are features that describe the groups to which an actor belongs.
   Family size. This is the simplest attribute and describes the size of an actor's family: |a_i.M|.
2. Actor-pair features. There are two types of features for modeling these inter-family relations based on the overlapping friend and family sets of two actors, a_i.F and a_j.M:
   Number of friends in the family. The first feature describes the number of friends a_i has in the family of a_j: |a_i.F ∩ a_j.M|. This feature allows one to reason about the relationship between an actor and a group of other actors, where the latter is semantically defined over the network through the family relations.
   Portion of friends in the family. The second feature on inter-family relations describes the ratio between the number of friends that a_i has in a_j's family (the same as the above feature) and the size of a_j's family. The rationale behind this feature is that the higher this ratio is, the more likely it is that a_j is close to a_i in the network, since more of its family members are friends with a_i.
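Similarly, the group-based actor-pair features can be computed from the friend and family sets, as in the following illustrative sketch (ours, with hypothetical names and toy data).

```python
def group_pair_features(friends, family, a_i, a_j):
    """friends[a]: set of a's friends; family[a]: set of a's family members."""
    friends_in_family = friends[a_i] & family[a_j]
    portion = len(friends_in_family) / len(family[a_j]) if family[a_j] else 0.0
    return {
        "family_size": len(family[a_i]),
        "friends_in_family": len(friends_in_family),
        "portion_friends_in_family": portion,
    }

friends = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
family = {"a": {"d"}, "b": {"c", "e"}, "c": {"b", "e"}}
print(group_pair_features(friends, family, "a", "b"))
```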

The idea behind the group features is based on the notion of structural equivalence of nodes within a group. Two nodes are structurally equivalent if they have the same links to all other actors. If we can detect tightly-knit groups in a social network and we assume that the nodes in each group are likely to behave similarly, then new links can be predicted by projecting links such that the nodes in the group become structurally equivalent. In our networks, such groups are the family cliques. In a weighted graph, a tight group could map to a clique of nodes with highly-weighted edges.

Figure 1 shows an example of how a structural equivalence assumption can help in predicting new links. For example, if one of the actors from Group A is friends with an actor from Group B, as shown in the original network (a), then it may be more likely that there is a link between the other actor from Group A and the actor from Group B, shown as a dashed line in (b).

5 Alternative Network Representations

The traditional approach to studying networks is to treat all relationships as equal. In the previous section, we described overlaying networks with different link types in a way that distinguishes between these types and uses information about affiliation groups. In other words, our link-prediction approach uses information about the actors A, the groups G, the friendship relationships F, and the family relationships M. We call our representation the different-link and affiliation overlay. Therefore, a logical question one may ask is what the benefits of treating links as different are, and whether affiliation groups really make a difference in link prediction. Our claim is that affiliations are important and that they can have predictive value. To illustrate the benefit of our approach as compared to the traditional one, we compare it to two alternative representations of the network.


In the first alternative representation, which we call the same-link and no affiliation overlay, the family and friendship links are treated the same, and affiliation groups are not given. More formally, in this representation, the graph consists of these components: actors A, and a set of edges to which we refer as implied friendships F_implied = F ∪ M. We can compute the descriptive and structural features in this alternative overlay, and use them for link prediction. In our experiments, we investigate whether this alternative overlay can offer the same or better link-prediction accuracy as the different-link and affiliation overlay.

Even if the first alternative overlay does not offer better accuracy, we still need to check whether the predictive value of the different-link and affiliation overlay comes from treating the links as different or from the fact that we are given the affiliation groups. To investigate that, we look at a second alternative overlay, the same-link and affiliation overlay, in which the family and friendship links are treated the same, and affiliation groups are given. In this overlay, the graph consists of these components: actors A, groups G, and implied friendships F_implied. We can compute all classes of features in this alternative overlay, and use them for link prediction.

6 Experimental Evaluation

6.1 Social Media Data Sets

This research is based upon using networks that have two sets of connections: friendship links and family ties. We performed our experiments on three novel datasets describing petworks: Dogster (http://www.dogster.com), Catster (http://www.catster.com), and Hamsterster (http://www.hamsterster.com). On these sites, profiles include photos, personal information, and characteristics, as well as membership in community groups. Members also maintain links to friends and family members. As of February 2007, Dogster has approximately 375,000 members. Catster is based on the same platform as Dogster and contains about 150,000 members. Hamsterster has a different platform, but it contains similar information about its members. It is much smaller than Dogster and Catster - about 2,000 members.

These sites are the only three of the hundreds we visited that publicly share both family and friendship connections (for a full list, see http://trust.mindswap.org/SocialNetworks). However, these are networks where both types of connections are realistic and representative of what we would expect to see in other social networks if they collected this data. The family connections are representative of real life, since family links are only made between profiles of pets created by the same owner. The friendship linking behavior is in line with patterns seen in other social networks [11].

1. Actor features:
   Breed. This is the pet breed, such as golden retriever or chihuahua. A pet can have more than one breed value.


Fig. 2. Sample profile on Dogster which includes family and friends

Breed category. Each breed belongs to a broader category set. For examplein Dogster, the major breed categories we identified are working, herding,terrier, toy, sporting, non-sporting, hound, and other, a catchall for theother breeds that appear in a the site, but not as frequently as theprevious ones. When a dog has multiple breeds, its breed category ismixed.

Single Breed. This boolean feature describes whether a pet has a singlebreed or whether it has multiple breed characteristics.

Purebred. This is a boolean feature which specifies whether a dog ownerconsiders its pet to be purebred or not.

2. Actor-pair features. All of the above features describe characteristics of a single user in the network.
   Same breed. This boolean feature is true if two profiles have at least one common breed.

6.2 Data Description

We have obtained a random sample of 10,000 profiles each from Dogster and Catster, and all 2059 profiles registered with Hamsterster. Each instance in the test data contained the features for a pair of profiles, where some of the features were individual node features. To construct the test data, we chose the pairs of nodes for which there was an existing friendship link, and we sampled from the space of node pairs which did not have a link. We computed the descriptive, structural, and group features for each of the profiles.

For each pair of profiles in the test data, we computed the features from the three classes described in Section 4. A test instance for a pair of profiles ai and aj includes both the individual actor features and the actor-pair features. It has the form

< ai features, aj features, (ai, aj)-pair features, class >

where class is the binary class which denotes whether a friendship link exists between the actors.

For Dogster, the sample of 10,000 dogs had around 17,000 links among themselves, and we sampled from the non-existing links at a 10:1 ratio (i.e., the non-existing links are 10 times more numerous than the existing links). For Catster, the 10,000 cats had 43,000 links, and for the whole Hamsterster dataset, the number of links was around 22,000. We sampled from the non-existing links in these datasets at the same 10:1 ratio.
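
The 10:1 negative sampling described above can be sketched as follows; the function name and the representation of links as unordered node pairs are assumptions made for illustration, not the authors' code.

```python
import random

def build_link_instances(nodes, existing_links, ratio=10, seed=0):
    """Return labeled node pairs for link prediction: every existing link as a
    positive instance and roughly `ratio` times as many randomly sampled
    non-links as negatives (a sketch of the 10:1 sampling described above;
    it assumes the graph is sparse enough for that many non-links to exist)."""
    rng = random.Random(seed)
    nodes = list(nodes)
    positives = {tuple(sorted(p)) for p in existing_links}
    negatives = set()
    while len(negatives) < ratio * len(positives):
        u, v = rng.sample(nodes, 2)
        pair = tuple(sorted((u, v)))
        if pair not in positives:
            negatives.add(pair)
    return [(pair, 1) for pair in positives] + [(pair, 0) for pair in negatives]
```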

6.3 Experimental Setup

We used three well-known classifiers, namely Naïve Bayes, logistic regression, and decision trees, for our experiments. The goal was to perform binary classification on the test instances and predict friendship links. The implementations of these classifiers were from the latest version of Weka (v3.4.12) from http://www.cs.waikato.ac.nz/ml/weka/. We allocated a maximum of 2GB of memory for each classifier we ran. We measured prediction accuracy by computing precision, recall, and their harmonic mean, the F1 score, using 10-fold cross-validation.
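
The experiments above were run in Weka; as a rough illustration of the same evaluation protocol (a decision-tree classifier, 10-fold cross-validation, precision/recall/F1), here is a scikit-learn sketch, which is a stand-in rather than the setup actually used.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def evaluate_link_classifier(X, y):
    """10-fold cross-validated precision, recall, and F1 for a decision tree,
    mirroring the evaluation protocol described above. This is a scikit-learn
    stand-in for the Weka classifiers actually used, not the authors' setup."""
    clf = DecisionTreeClassifier(random_state=0)
    return {metric: cross_val_score(clf, X, y, cv=10, scoring=metric).mean()
            for metric in ("precision", "recall", "f1")}
```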

6.4 Link-Prediction Results

We report only on the results from decision-tree classification because it consistently had the highest accuracy among the three classifiers. Table 1 summarizes our results. Adding group features to the descriptive and structural features increased accuracy by 15% to 30%. We discuss the results in more detail in the subsequent subsections.

Table 1. Comparison of F1 values in the three datasets, with the feature types from our taxonomy

Feature Type                          Dogster   Catster   Hamsterster
Descriptive                           37.6%     0.4%      19.8%
Structural                            76.1%     83.1%     59.9%
Group                                 90.8%     95.2%     89.2%
Descriptive and structural            78.6%     83.0%     60.3%
Descriptive, structural, and group    94.8%     97.9%     90.5%


Fig. 3. a) Recall, precision, and F1 score for Dogster using descriptive and structural attributes; b) F1 score across datasets. Using descriptive attributes together with structural attributes leads to a better F1 score in Dogster but not in Catster and Hamsterster.

Descriptive attributes can be useful in combination with structural attributes. In these experiments, we have investigated the predictive power of the simplest features, i.e., the descriptive attributes, versus the impact of the structural attributes. Figure 3 shows the accuracy results from the decision-tree classifier. When we use only descriptive attributes, the link-prediction accuracy varies across datasets. In Dogster, there is some advantage to using descriptive attributes, yet the accuracy (F1 score) is relatively low (37.6%). In Catster and Hamsterster, building the complete decision trees led to 0.4% and 19.8% accuracy, respectively (using Weka's default pruning parameter, the trees were empty, and the accuracies were 0%). This confirms that, in general, link prediction is a challenging prediction task.

When we used the structural features (such as the number of friends that two profiles share), the link-prediction accuracy increased to 76.1% in Dogster. This suggests that the structural features are much more predictive than simple descriptive attributes. This effect was even more pronounced for Catster and Hamsterster.

In Dogster, combining the node attributes and the structural features leads to further improvement. Using descriptive attributes together with structural attributes leads to a better F1 score (78.6%) as compared to using either category alone (37.6% and 76.1%, respectively) in Dogster, as shown in Figure 3. For Catster and Hamsterster, the difference was less than 0.4%.

Family group features are highly predictive. As the previous experiments showed, structural attributes are stronger predictors than the descriptive attributes alone. Next, we investigate the predictive power of the group features in our taxonomy. In Dogster, Catster and Hamsterster, the group features involve the families and friends of the users. Figure 4 shows our comparisons. Our results suggested that family groups are strong predictors for friendship links (F1 = 90.8% for Dogster). We also ran experiments where we used not only family cliques, but also the structural and descriptive features. In these experiments, the results show that the accuracy (F1) improves by 4% in Dogster, 0.6% in Catster, and 1.3% in Hamsterster.


Fig. 4. Link-prediction accuracy using all feature classes: descriptive, structural and group features. a) Recall, precision, and F1 score for Dogster; b) F1 score across datasets. Group features are highly predictive, yet adding the other features provided benefit too.

Computing more expensive structural attributes is not highly beneficial. Some structural features in our taxonomy were more computationally expensive to construct than others. For example, the feature that described the number of friends is easy to compute, whereas the feature that described the density of common friends for each pair of profiles is the hardest. Using a database, computing the density of common friends for all pairs of profiles requires several joins of large tables. In order to investigate the trade-off between computing expensive features and their predictive impact on our results, we have performed the following experiments.

We have designed experiments in which we add more expensive structural features one by one, and assess the link-prediction accuracy at each step. We used the following combinations of features: (1) using number of friends only, (2) using number of friends and number of common friends, (3) using number of friends, number of common friends, and Jaccard coefficient, and finally (4) using number of friends, number of common friends, Jaccard coefficient, and density of common friends. We report on the results of these four sets of structural features together with the descriptive attributes, since we showed in the previous subsection that using descriptive attributes can sometimes be beneficial. We also report on the setting in which group features were used.
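
The four structural features in this cost ordering can be sketched as below, using networkx for the neighborhood queries. The exact definition of density of common friends is not restated here, so the sketch interprets it as the edge density of the subgraph induced by the common neighbors of the pair; treat that interpretation, and the function name, as assumptions.

```python
import networkx as nx

def structural_features(G, u, v):
    """Structural features for a candidate pair (u, v), listed roughly in
    order of computational cost. 'Density of common friends' is interpreted
    here as the edge density of the subgraph induced by the common neighbors
    of u and v; the exact definition is not restated in the text, so treat
    this as an assumption."""
    common = list(nx.common_neighbors(G, u, v))
    union = set(G[u]) | set(G[v])
    jaccard = len(common) / len(union) if union else 0.0
    if len(common) > 1:
        possible = len(common) * (len(common) - 1) / 2
        density = G.subgraph(common).number_of_edges() / possible
    else:
        density = 0.0
    return {
        "num_friends_u": G.degree(u),          # cheapest feature
        "num_friends_v": G.degree(v),
        "num_common_friends": len(common),
        "jaccard_coefficient": jaccard,
        "density_common_friends": density,     # most expensive feature
    }
```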

Surprisingly, it turned out that computing the more expensive features added very little benefit. Figure 5 shows the results of the experiments. For example, in the Dogster case, adding the number of common friends of two nodes improved accuracy (F1 score) by 2% over the individual number of friends. Computing the most expensive feature, density of common friends, pays off slightly (improves the F1 score by 0.4%) only when there are no group attributes. Computing the more expensive Jaccard coefficient did not pay off over using the simpler feature number of common friends. In the Catster and Hamsterster cases, the improvement was less than 0.5%. Our results also support the claim made in the preferential attachment model [3] that the number of friends of a node (node degree) plays a role in the process of new nodes linking to it. They contradict the link-prediction results in co-authorship networks [5], where the Jaccard coefficient and the number of common friends consistently outperformed the metric based on number of friends. This may be inherent to the types of networks discussed.

Fig. 5. Link-prediction accuracy using structural features of increasing computation cost (number of friends, number of common friends, Jaccard coefficient of common friends, density of common friends). Computing more expensive structural attributes is not highly beneficial, especially in the presence of group information.

Alternative network representations. In the next set of experiments, we used the alternative network overlays to test whether there was an advantage to keeping the different types of links and the affiliation groups. We compare our proposed different-link and affiliation overlay to the alternative representations same-link and no affiliation overlay and same-link and affiliation overlay (see Section 5). We compute only the descriptive and structural features in the overlay with no affiliation information, and compute all classes of features in the overlays where affiliation information was given.

The results in Figure 6 show that when family affiliations were given, it did not matter whether the links were treated as the same type or different types: the link-prediction accuracy was the same. However, in the case when the affiliations were not given, it was better to compute the structural features using both types of relationships but treat them as one type. When family links were treated as friendship links, the accuracy of the predictions made by the structural attributes improved by 6% to 20%. This may be due to the fact that the overlap between friends and family links in the data was very small, and using both types of links when computing the structural features was beneficial. Using the affiliation information and computing all features on the data led to the best accuracy, and the accuracy was the same both in the different-link and same-link cases. These experiments also confirmed the previous results: group affiliation was the main contributor to the high link-prediction accuracy.

Fig. 6. Prediction accuracy when links are treated equal, with and without group affiliations. As the results from the affiliation overlays suggest, group features are the main contributor to the high link-prediction accuracy.

7 Related Work

In general, link-prediction algorithms process a set of features in order to learn and predict whether it is likely that two nodes in the data are linked. Sometimes, these features are hand-constructed by analyzing the problem domain, the attributes of the actors, and the relational structure around those actors [12,4,5,9]. Other times, they are automatically generated, i.e., the prediction algorithm first learns the best features to use and then predicts new links [8]. In this section, we discuss the existing work that is most relevant to the link-prediction problem in multi-relational social networks.

The link-prediction techniques that are based on feature construction are closest to our work [12,4,1,5,9]. As most of the relational domains can be represented as a network model, the constructed features not only include the attributes of the actors, but also the characteristics of the structure. Most of this work examines co-authorship and citation networks [4,5,8,9], whereas we validate our method using online social networks. Some of the approaches use machine learning techniques for classification [4,13,8,14], and others rely on ranking the feature values [12,5,9].


Adamic and Adar [12] use a similarity-based approach to predicting friendships amongst students. They gather data from university student websites and mailing lists, and construct a vector of features for each student, such as website text, in-links, out-links, and mailing lists the students belong to. Their approach uses descriptive features whereas ours also considers structural and group features.

It has been shown that there is "useful information contained in the network topology alone" [5]. Liben-Nowell and Kleinberg use a variety of structural features such as shortest path, (a variant of) number of friends, number of common friends, Jaccard coefficient, and more elaborate structural features based on all paths between two nodes in co-authorship networks. Their experiments compare the link-prediction accuracy of each feature in isolation. They rank the node-pairs by each feature value and pick the top pairs as their predicted links. Their results suggest that simple features such as the number of common friends perform well compared to others. In our work, besides structural features, we also constructed descriptive and group features, and instead of using the features in isolation, we combined them.

Rattigan and Jensen [9] recognize that the extremely large class skew associated with the link-prediction task makes it very challenging. They look at a related problem, anomalous link discovery, in which instead of discovering new links, they are interested in learning properties of the existing links. They use structural features in co-authorship networks and rank the most and least likely collaborations based on an expensive structural feature, the Katz score. Another work that uses link prediction for anomaly discovery is the work of Huang and Zeng [1], in which they rank anomalous emails in the Enron dataset.

The work described so far uses descriptive and structural attributes in isolation. Hasan et al. [4] use both. Their work studies classification for link prediction based on hand-constructed features in co-authorship networks. They report prediction accuracy (F score), precision, and recall results from a range of classifiers such as decision trees, k-nearest neighbor, multilayer perceptron, and support-vector machines. The novelty in our work compared to theirs is that we study link prediction in richer social network settings and we explore the use of group features and alternate representations.

The link-prediction problem has also been studied in the domain of citation networks for scientific publications [8]. The authors posed the link-prediction problem as a binary classification problem, and used logistic regression to solve it. Their features are database queries such as most cited author, and thus they are similar to both the descriptive and structural features we have discussed so far. Their work describes a statistical learning approach for feature generation. In particular, it extends the traditional Inductive Logic Programming (ILP) to reason about probabilities, and uses this extension to learn new features from the problem domain both statistically and inductively. The experiments in this work suggest that the ratio of existing to non-existing links in the test data mattered, and the fewer non-existing link examples were included, the better the precision-recall curve was. However, testing with more non-existing link examples would give a better estimate of the probability of a randomly picked pair of nodes in the network to be classified correctly. Another statistical learning approach to link prediction was presented by Taskar et al. [14]. The authors use relational Markov networks to define a probabilistic model over the entire link graph. Their features are both descriptive and relational. They apply their method to two domains: linked university websites and a student online social network.

Another automated feature-generation method has been presented by Kubica et al. [13], who described a learning method for the task of friend identification, which is similar to anomalous link discovery. Their method, called cGraph, learns an approximate graph model of the actual underlying link data given noisy linked data. Then, this graph model is used to predict pairwise friendship information regarding the nodes in the network. The types of features that they use are descriptive, structural and group. The difference with our work is that we use these features for link prediction rather than ranking the links.

Link completion is a problem related to link prediction. Given the arity of a relationship and all but one entity participating in it, the goal is to predict the missing entity, as opposed to classifying the missing link itself. Goldenberg et al. [10] present a comparison of several classification algorithms such as Naive Bayes, Bayesian networks, cGraph (mentioned above), logistic regression, and nearest neighbor. This study used several real-world datasets from different domains, including co-authorship networks, and data collected from the Internet Movie Database site. It suggested that logistic regression performs well in general in the datasets above; in our study on real-world social networks, logistic regression usually performed worse than the decision-tree classifier in terms of accuracy.

There has also been interest in learning group features in social networks. Kubica et al. [15] describe a group-detection algorithm that uses descriptive features and links. First, they perform clustering based on the descriptive features and find the groups. They allow group overlap and assume that group memberships are conditionally independent of each other given the descriptive features. Then, their algorithm assigns a probability of a link between two actors based on the similarity of their groups, and it can answer ranking queries similar to the ones in the anomalous link discovery work. One of the issues with the proposed algorithm is that it is slow [13].

The work of Friedland and Jensen [16] studies the problem of identifying groups of actors in a social network that exhibit a common behavior over time. The authors focused on networks of employees in large organizations, and investigated the employee histories to distinguish the employees who worked together intentionally from those who simply shared frequently occurring employment patterns in the industry.

8 Discussion

When studying other large social networks, family information is not always relevant or available. However, groups and affiliations are often available, or communities can be discovered.

The networks used here had binary relationships - friend or family - but a similar effect can be achieved in networks where relationships are weighted. For example, co-authorship networks are widely studied as social networks [3,4,5,6,7,8,9], and edges can be weighted by the number of articles a pair of authors have authored together. In email communication networks - the Enron email corpus [1,2], for example - the number of messages between two senders can be used as a weight. To mimic the strong family-type relationship we used in this article, a threshold weight can be set. Any edge with a weight over that threshold can be treated as a "strong" relationship (like our family relationship). Clusters of nodes connected with strong ties would represent the equivalent of a family unit.
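
A minimal sketch of this thresholding idea, assuming a networkx graph with numeric weight attributes; the threshold value and the function name are illustrative.

```python
import networkx as nx

def split_by_strength(G_weighted, threshold):
    """Split a weighted graph into 'strong' and 'weak' edge sets by a weight
    threshold, mimicking the family/friend distinction discussed above.
    The threshold value is domain-dependent; this is a sketch, not a recipe."""
    strong, weak = nx.Graph(), nx.Graph()
    for u, v, data in G_weighted.edges(data=True):
        target = strong if data.get("weight", 1) >= threshold else weak
        target.add_edge(u, v, **data)
    # Connected components of the strong graph play the role of family units.
    family_units = list(nx.connected_components(strong))
    return strong, weak, family_units
```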

9 Conclusions and Future Work

Link prediction is a notoriously difficult problem. In this research, we found that overlaying friendship and affiliation networks was very effective. For the networks used in our study, we found that family relationships were very useful in predicting friendship links. Our experiments show that we can achieve significantly higher prediction accuracy (between 15% and 30% more accurate) as compared to using more traditional features such as descriptive node attributes and structural features. Family groups helped not only because they represent a clique of actors, but because the family relationship itself was indicative of structural equivalence. As future work, we plan to investigate the use of edge weights and thresholds to define strongly connected clusters, and see if it works as well in link prediction as the family groups did here.

Acknowledgments

This work was partially supported by NSF under Grants No. 0746930 and No. 0423845.

References

1. Huang, Z., Zeng, D.: A Link Prediction Approach to Anomalous Email Detection. In: IEEE International Conference on Systems, Man, and Cybernetics (2006)

2. Klimt, B., Yang, Y.: The Enron corpus: A new dataset for email classification research. In: Boulicaut, J.F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS (LNAI), vol. 3201, pp. 217–226. Springer, Heidelberg (2004)

3. Barabasi, A.L., Jeong, H., Neda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution of the social network of scientific collaborations. Physica A 311, 3 (2002)

4. Hasan, M., Chaoji, V., Salem, S., Zaki, M.: Link Prediction using Supervised Learning. In: Proceedings of the Workshop on Link Analysis, Counter-terrorism and Security (with SIAM Data Mining Conference) (2006)

5. Liben-Nowell, D., Kleinberg, J.: The Link Prediction Problem for Social Networks. In: Proceedings of the 12th International Conference on Information and Knowledge Management (CIKM) (2003)

6. Newman, M.: Who is the best connected scientist? A study of scientific coauthorship networks. Working Papers 00-12-064, Santa Fe Institute (December 2000), http://ideas.repec.org/p/wop/safiwp/00-12-064.html

7. Newman, M.: Coauthorship networks and patterns of scientific collaboration. Proceedings of the National Academy of Sciences 101, 5200–5205 (2004)

8. Popescul, A., Ungar, L.H.: Statistical relational learning for link prediction. In: Proceedings of the Workshop on Learning Statistical Models from Relational Data at IJCAI 2003 (2003)

9. Rattigan, M.J., Jensen, D.: The case for anomalous link discovery. SIGKDD Explor. Newsl. 7(2), 41–47 (2005)

10. Goldenberg, A., Kubica, J., Komarek, P., Moore, A., Schneider, J.: A Comparison of Statistical and Machine Learning Algorithms on the Task of Link Completion. In: KDD Workshop on Link Analysis for Detecting Complex Behavior (2003)

11. Golbeck, J.: The dynamics of web-based social networks: Membership, relationships, and change. First Monday 12 (2007)

12. Adamic, L., Adar, E.: Friends and neighbors on the web. Social Networks 25(3), 211–230 (2003)

13. Kubica, J.M., Moore, A., Cohn, D., Schneider, J.: cGraph: A Fast Graph-Based Method for Link Analysis and Queries. In: Proceedings of the 2003 IJCAI Text-Mining & Link-Analysis Workshop (2003)

14. Taskar, B., Wong, M.F., Abbeel, P., Koller, D.: Link Prediction in Relational Data. In: Advances in Neural Information Processing Systems, NIPS 2003 (2003)

15. Kubica, J., Moore, A., Schneider, J., Yang, Y.: Stochastic Link and Group Detection. In: Proceedings of the Eighteenth National Conference on Artificial Intelligence, AAAI 2002 (2002)

16. Friedland, J., Jensen, D.: Finding Tribes: Identifying Close-Knit Individuals from Employment Patterns. In: Proceedings of Knowledge Discovery and Data Mining (KDD 2007) (2007)


Information Theoretic Criteria for Community Detection

L. Karl Branting

The MITRE Corporation, 7525 Colshire Drive, McLean, VA 22102
[email protected]

Abstract. Many algorithms for finding community structure in graphs search for a partition that maximizes modularity. However, recent work has identified two important limitations of modularity as a community quality criterion: a resolution limit; and a bias towards finding equal-sized communities. Information-theoretic approaches that search for partitions that minimize description length are a recent alternative to modularity. This paper shows that two information-theoretic algorithms are themselves subject to a resolution limit, identifies the component of each approach that is responsible for the resolution limit, proposes a variant, SGE (Sparse Graph Encoding), that addresses this limitation, and demonstrates on three artificial data sets that (1) SGE does not exhibit a resolution limit on sparse graphs in which other approaches do, and that (2) modularity and the compression-based algorithms, including SGE, behave similarly on graphs not subject to the resolution limit.

1 Introduction

Many complex networks, such as the Internet, metabolic pathways, and social networks, are characterized by a community structure that groups related vertices together. Traditional clustering techniques group vertices based on some metric for attribute similarity [2]. More recent research has focused on detection of community structure from graph topology. Under this approach, the input to a community-detection algorithm is a graph in which vertices correspond to individuals (e.g., URLs, molecules, or people) and edges correspond to relationships (e.g., hyperlinks, chemical reactions, or marital and business ties). The output consists of a partition of the graph in which subgraphs correspond to meaningful groupings (e.g., web communities, families of molecules, or social clans).¹

Community detection algorithms can be viewed as comprising two components: a utility function that expresses the quality of any given partition of a graph; and a search strategy that specifies a procedure for finding a partition that optimizes the utility function. Table 1 sets forth the utility functions and search strategies of eight recent community-detection algorithms, showing that utility functions have been paired with a variety of different search strategies.

¹ Some communities, such as social clubs and families, can overlap. Membership in such communities is better modeled as attributes of vertices rather than through a partition of the graph [3]. The focus of this paper, however, as in the bulk of community detection research, is on partition-based community structure.

Table 1. Utility functions and search strategies for various community-detection algorithms. DHC represents divisive hierarchical clustering, AHC represents agglomerative hierarchical clustering, and MDL represents "minimum description length."

utility function   search strategy                algorithm
modularity         DHC/betweenness centrality     Newman & Girvan (2004) [11]
modularity         AHC                            Newman (2004) [1]
modularity         Genetic Algorithm              Tasgin & Bingol (2006) [12]
modularity         DHC/network structure index    Rattigan et al. (2007) [13]
modularity         AHC/spectral division          Donetti & Munoz (2004) [14]
log-likelihood     fixed-point iteration          Zhang et al. (2007) [15]
MDL                simulated annealing            Rosvall & Bergstrom (2007) [16]
MDL                iterated hill climbing         Chakrabarti (2004) [8]

The utility function most prevalent in recent community detection research is the modularity function introduced in [1]:

    Q = \sum_{1 \le i \le m} \left( \frac{w(D_{ii})}{l} - \left( \frac{l_i}{l} \right)^2 \right)    (1)

where i is the index of the communities, w(D_ii) is the number of edges in the graph that connect pairs of vertices within community i, l_i = \sum_j w(D_{ij}), i.e., the number of edges in the graph that are incident to at least one vertex in community i, and l is the total number of edges in the entire graph. Modularity formalizes the intuition that communities consist of groups of entities having more links with each other than with members of other groups.
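
For concreteness, the following sketch evaluates Q exactly as Eq. (1) and the accompanying definitions state it, with w(D_ii) counted as within-community edges and l_i as edges incident to community i; some formulations count edge endpoints instead, so read this as an illustration of the equation above rather than a reference implementation.

```python
def modularity_eq1(G, communities):
    """Modularity Q as written in Eq. (1): for each community i, the fraction
    of edges whose endpoints both lie in i, minus the squared fraction of
    edges incident to at least one vertex of i. Follows the definitions given
    in the text; other formulations count edge endpoints instead. G is any
    graph object exposing number_of_edges() and edges() (e.g., a networkx
    Graph) with at least one edge."""
    l = G.number_of_edges()
    Q = 0.0
    for community in communities:
        c = set(community)
        w_ii = sum(1 for u, v in G.edges() if u in c and v in c)
        l_i = sum(1 for u, v in G.edges() if u in c or v in c)
        Q += w_ii / l - (l_i / l) ** 2
    return Q
```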

Because of the shortage of real-world data sets with known community structure, maximum modularity has sometimes even been equated with correct community structure. However, two important weaknesses have been identified in modularity as a community-structure criterion.

First, the group structure that optimizes modularity within a given subgraph can depend on the number of edges in the entire graph in which the subgraph is embedded. Specifically, modularity is characterized by an intrinsic scale under which Q is maximized when pairs of distinct groups having fewer than \sqrt{2l} edges (where l is the total number of edges in the graph) are combined into single groups [4]. This phenomenon is apparent in ring graphs, i.e., connected graphs that consist of identical subgraphs, each connected to exactly two other subgraphs by a single link. For example, in the graph shown in Figure 1, consisting of a ring of 15 squares, modularity is greater when adjacent squares are grouped together than when each square is a separate group.


Fig. 1. Ring graph R15,4 consisting of 15 communities, each containing 4 vertices

A second weakness of modularity is that even when the resolution limit is not exceeded, modularity exhibits a bias towards groups of similar size. Intuitively, the sum of the square terms, (l_i/l)^2, representing the expected number of intra-group edges within community i under the null model, is minimized, and Q therefore maximized, when all l_i are as nearly equal in size as possible.

One approach to the resolution limit of modularity is to apply modularity recursively, so that the coarse structure found at one level is refined at lower levels [5].² An alternative approach is to substitute a different community-quality criterion for modularity.

One such alternative criterion for community quality that has recently been proposed, based on information theory, is minimizing description length [7,8,9]. In this approach, the quality of a given partition of a graph is a function of the complexity of the community structure together with the mutual information between the community structure and the graph as a whole. The best community structure is one that minimizes the sum of (1) the number of bits needed to represent the community structure plus (2) the number of bits needed to represent the entire graph given the community structure. Under this approach, the task of community detection consists of finding the community structure that leads to the minimum description length (MDL) representation of the graph, where description length is measured in number of bits.

² See [6] for a recent approach that addresses resolution limits by using an absolute evaluation of community structure rather than comparison to a null model.

The structure of the paper is as follows: Section 2 of this paper compares the compression approach used in two previous approaches to information-theoretic community detection and identifies a feature common to both that can lead to a bias toward combining distinct communities in large sparse graphs. An alternative encoding, termed SGE (Sparse Graph Encoding), that addresses this bias is proposed in Section 3. Section 4 describes the design of an empirical evaluation comparing the previous information-theoretic utility functions, SGE, and modularity on three classes of artificial data. The results of this experiment are set forth in Section 5.

2 Minimum Description Length Encodings

The intuition behind the minimum description length (MDL) criterion for community structure is that a partition of a graph that permits a more concise description of the graph is more faithful to the actual community structure than a partition leading to a less concise description. The best partition is the one that lends itself to the most concise description, that is, the encoding of the partition and of the graph given the partition in the fewest bits. However, the minimum description length (MDL) criterion does not in itself specify how to encode either the community structure or the graph given the community structure. Indeed, the close connection between MDL and Kolmogorov complexity [10], which is undecidable, suggests that MDL may itself be undecidable.

The encoding algorithms of Rosvall and Bergstrom [7] (hereinafter "RB") and Chakrabarti [8] (hereinafter "AP," standing for "AutoPart") use quite different approaches to measuring the description length of community structures and graphs. However, RB and AP have in common that both are characterized by a resolution limit similar to that observed in modularity.

RB and AP decompose the task of encoding a graph and its community structure into similar steps, but they calculate the bits in each term differently. For the purposes of this comparison, the following notation will be followed:

– n - the number of vertices in the graph
– m - the number of groups
– ai - the number of vertices in group i
– l - the total number of edges in the entire graph
– li - the number of edges incident to group i
– Dij - a binary adjacency matrix between groups i and j
– n(Dij) - the number of elements in adjacency matrix Dij
– w(Dij) - the number of 1's in Dij, i.e., the number of edges between groups i and j
– P(Dij) - the density of 1's in Dij, i.e., w(Dij)/n(Dij)
– P′(Dij) - for a square matrix Dij, the density of 1's ignoring the diagonal
– H(Dij) = −P(Dij) log(P(Dij)) − (1 − P(Dij)) log(1 − P(Dij)), i.e., the mean entropy of Dij
– H′(Dij) = −P′(Dij) log(P′(Dij)) − (1 − P′(Dij)) log(1 − P′(Dij)), i.e., the mean entropy of Dij if values on the diagonal of Dij are ignored
– B - a matrix representing for each pair of groups whether the pair is connected, i.e., Bij = 1 ⟺ w(Dij) > 0

The encoding schemes used in RB and AP are as follows:

1. Bits needed to represent the number of vertices in the graph. Since this term doesn't vary with differing community structure, it is irrelevant to the choice between different community structures and can be ignored.

2. Bits needed to represent the number of groups.
   – RB. Not explicitly represented.
   – AP. log*(m), where log*(x) = log2(x) + log2 log2(x) + ..., with only positive terms included in the sum. This series is apparently intended to represent the mean coding length of integers given that the probability of an integer of a given length is a monotonically decreasing function of the integer's length, i.e., longer integers are less probable, but no maximum length is known [17].

3. Bits needed to represent the association between vertices and groups.
   – RB. n log(m). The rationale appears to be that for each of the n vertices, log(m) bits are needed to identify the group to which the vertex belongs.
   – AP. If the groups are placed in decreasing order of length, i.e., a_1 ≥ a_2 ≥ ... ≥ a_m ≥ 1,

         \sum_{i=1}^{m-1} \lceil \log(a_i) \rceil, where a_i = \Big( \sum_{t=1}^{m} a_t \Big) - m + i.

4. Bits needed for the group adjacency matrix, i.e., the number of edges between pairs of groups.
   – RB. (1/2) m(m+1) log(l). The first term, (1/2) m(m+1), represents the number of pairs of groups, and the second term, log(l), the number of bits needed to specify the number of edges between any pair of groups.
   – AP.

         \sum_{1 \le i,j \le m} \lceil \log(a_i a_j + 1) \rceil

     This expression sums, for every pair of groups, sufficient bits to represent the number of edges between that pair.

5. Bits needed to represent the full adjacency matrix for vertices, given the group structure represented in terms 2-4.
   – RB.

         \log \Bigg( \prod_{i=1}^{m} \binom{a_i (a_i - 1)/2}{w(D_{ii})} \prod_{i<j} \binom{a_i a_j}{w(D_{ij})} \Bigg)

     The expression following the first product sign represents the number of ways to choose the actual pairs that are connected within a single group from the set of all possible pairs. The expression following the second product sign is the number of ways to choose the actual pairs between vertices in two different groups from the set of possible edges between vertices in those groups.
   – AP.

         \sum_{i=1}^{m} \sum_{j=1}^{m} a_i a_j H(D_{ij})

     For each pair of groups, the entropy of the adjacency matrix for that pair, i.e., the size of the matrix times its mean entropy.

RB and AP clearly calculate each term quite differently. In general, RB uses encodings that are much larger than those used in AP. However, a key similarity is in term 4, the bits needed to encode the number of edges between pairs of groups. In both RB and AP, at least one bit is required for each pair of groups regardless of how few groups are actually connected (i.e., how few pairs of groups have at least one edge from a vertex in one to a vertex in the other). The number of bits arising from this term therefore increases with the square of the number of groups, regardless of the sparsity of their interconnections. One would expect that for sufficiently large graphs with sparse community structure the savings in term 4 from combining groups would be greater than the added cost in term 5 of specifying the vertex adjacencies for the resulting relatively sparse group, and that this would lead to conflation of distinct groups similar to that observed when modularity is used as a community quality function. As discussed in the evaluation below, this conflation is in fact observed. For example, the number of bits required to encode the graph shown in Figure 1 is lower under both the RB and AP procedures if some pair of adjacent groups is combined, yielding 14 communities, than if it is divided into 15 equal communities.
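
To make the quadratic growth of term 4 concrete, the following sketch computes that term for both encodings from a list of group sizes; log base 2 and the ceiling are assumptions about rounding, and the function names are illustrative rather than taken from either paper.

```python
import math

def rb_term4_bits(m, l):
    """RB's term 4: a log2(l)-bit edge count for each of the m(m+1)/2 pairs of
    groups, whether or not the pair is actually connected (log base 2 assumed)."""
    return 0.5 * m * (m + 1) * math.log2(l)

def ap_term4_bits(group_sizes):
    """AP's term 4: for every ordered pair of groups i, j, enough bits to
    represent any edge count up to a_i * a_j (ceiling and log base 2 assumed)."""
    return sum(math.ceil(math.log2(ai * aj + 1))
               for ai in group_sizes for aj in group_sizes)

# Both quantities grow with the square of the number of groups even when the
# group adjacency matrix is almost empty. For the ring of 15 squares in
# Figure 1 (l = 15 * 4 + 15 = 75 edges), compare:
#   rb_term4_bits(15, 75)      # 120 pairs, log2(75) bits each
#   ap_term4_bits([4] * 15)    # 225 ordered pairs, 5 bits each
```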

3 Sparse Graph Encoding (SGE)

The observations that RB and AP (1) assign at least one bit per pair of communities, regardless of how few are actually connected, and (2) conflate distinct groups in large sparse graphs (as shown experimentally below) suggest the hypothesis that an encoding in which the bits required to encode the number of edges between pairs of groups grow more slowly than the square of the number of groups would be less prone to the resolution limit. Sparse Graph Encoding (SGE) is an encoding scheme designed to test this hypothesis.

The key idea is to encode the group adjacency matrix using two terms. The first term encodes, for each pair of groups, whether the groups are connected. The number of bits required for this is equal to the entropy of B, the binary matrix representing for each pair of groups whether those groups are connected. The mean entropy of B is at most 1.0, if each group is randomly connected to exactly half the others. If few, or most, groups are connected to one another, the mean entropy is less than 1.0, and the total entropy is therefore less than the square of the number of groups.

Moreover, the number of bits needed to represent B can be further reduced by noting that the value of B's diagonal need not be explicitly represented because it can be determined from the number of nodes in each group. Singleton groups have no within-group edges (assuming that self-loops are prohibited), and groups with more than one element must have at least one within-group edge (if there are no within-group edges, the density of within-group edges cannot be higher than the density of between-group edges, the basic characteristic of a group).

The bits needed to represent B are therefore:

    m(m − 1) H′(B)    (2)

where H′(B) = −P′(B) log(P′(B)) − (1 − P′(B)) log(1 − P′(B)) and P′(B) is the density of 1's in B, ignoring the diagonal.

The second term contains, for each connected pair, the number of bits needed to represent the number of edges between that pair (the second sum is needed if, as we assume, edges from a vertex to itself are forbidden):

    \sum_{i \ne j \wedge w(D_{ij}) > 0} \log(a_i a_j) + \sum_{i = j \wedge w(D_{ij}) > 0} \log(a_i (a_j - 1))    (3)

If the cost of representing the group adjacency matrix is calculated as expression 2 + expression 3, the cost will grow with the number of connected pairs rather than with the total number of pairs.
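
The following sketch computes SGE's group-adjacency cost as expression (2) plus expression (3); the input format (a dict of edge counts keyed by unordered group-index pairs) and the use of log base 2 are assumptions made for illustration.

```python
import math

def sge_term4_bits(group_sizes, inter_group_edges):
    """SGE's cost for the group adjacency matrix, expression (2) plus
    expression (3). `group_sizes` is a list a_0..a_{m-1}; `inter_group_edges`
    maps an unordered group-index pair (i, j) with i <= j to the number of
    edges between groups i and j. Input format and log base 2 are assumptions
    made for illustration."""
    m = len(group_sizes)
    connected_off_diag = sum(1 for (i, j), w in inter_group_edges.items()
                             if w > 0 and i != j)
    # Expression (2): m(m-1) times the mean entropy of B's off-diagonal entries.
    p = 2 * connected_off_diag / (m * (m - 1)) if m > 1 else 0.0
    if p in (0.0, 1.0):
        h = 0.0
    else:
        h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    bits = m * (m - 1) * h
    # Expression (3): edge-count bits only for pairs that actually have edges.
    for (i, j), w in inter_group_edges.items():
        if w > 0:
            ai, aj = group_sizes[i], group_sizes[j]
            bits += math.log2(ai * aj) if i != j else math.log2(ai * (aj - 1))
    return bits
```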

SGE employs several additional minor modifications to further reduce the description length. The entire calculation is as follows:

1. Bits needed to represent the number of vertices in the graph. As with RB and AP, these bits are ignored.

2. Bits needed to represent the number of groups. The log* function of [17] used in AP is predicated on the assumption that no maximum integer size is known a priori. Here, however, the maximum number of groups is bounded by both the machine word size and the virtual memory size of the machine on which the algorithm is executed. Therefore, SGE uses instead RB's calculation:

       \log(m)

3. Bits needed to represent the association between vertices and groups. No group can contain more than n − m + 1 vertices (since each group must have at least one vertex). Accordingly, the following expression contains sufficient bits to represent the number of vertices in all m groups:

       m \log(n - m + 1)

4. Bits needed for the group adjacency matrix, i.e., the number of edges between pairs of groups. As discussed above (expressions 2 and 3), the number of bits is:

       m(m-1) H'(B) + \sum_{i \ne j \wedge w(D_{ij}) > 0} \log(a_i a_j) + \sum_{i = j \wedge w(D_{ij}) > 0} \log(a_i (a_j - 1))

5. Bits needed to represent the full adjacency matrix for vertices given the group structure represented in terms 2-4. This consists, for every pair of groups i and j, of the size of the i, j adjacency matrix, a_i a_j, times the entropy per entry in the corresponding binary matrix, H(D_ij). This is equivalent to the AP calculation, shown above:

       \sum_{i=1}^{m} \sum_{j=1}^{m} a_i a_j H(D_{ij})

In summary, the relationship between SGE, RB, and AP is as follows:

1. Bits needed to represent the number of vertices in the graph. Ignored as in RB and AP.

2. Bits needed to represent the number of groups. Follows RB.

3. Bits needed to represent the association between vertices and groups. Uses an expression with fewer bits than that used in RB, and that is simpler than that used in AP.

4. Bits needed for the group adjacency matrix. The primary novelty of SGE, in that for sparse adjacency matrices this term grows more slowly than the square of the number of groups.

5. Bits needed to represent the full adjacency matrix for vertices. Follows AP.

4 Empirical Evaluation

The previous section suggested that a graph encoding in which the calculation of the number of bits required to represent a group adjacency matrix was reduced from an expression that grows as the square of the number of groups, as in RB and AP, to an expression that grows in proportion to the number of pairs of connected groups, as in SGE, would reduce or eliminate any resolution limit in sparsely connected graphs. This hypothesis was tested by comparing the communities found by optimizing RB, AP, SGE, and modularity on three different artificial data sets.

To avoid conflating the effect of a utility function with the behavior of a search strategy, it was necessary to compare alternative utility functions using a single common search strategy. Accordingly, a single search function was applied to all four utility functions in the experimental evaluation: the greedy divisive hierarchical clustering algorithm of Newman & Girvan (2004) [11]. In the Newman & Girvan procedure, the edge with the highest betweenness centrality is iteratively removed, and the partition in the resulting sequence having the optimal value under the utility function was returned as the community structure. Using a single search strategy removes the potentially confounding disparity of the search algorithms used in published descriptions of RB, AP, and modularity.
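
A sketch of this shared search strategy, using networkx's girvan_newman generator to produce the divisive sequence of partitions and a pluggable utility function to pick the best one; this illustrates the procedure described above, not the implementation used in the paper.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

def best_partition_by_utility(G, utility, maximize=True):
    """Greedy divisive search of Newman & Girvan: repeatedly remove the edge
    with the highest betweenness centrality (girvan_newman yields the
    resulting sequence of partitions), score each partition with
    utility(G, parts), and return the best-scoring one. Illustrative sketch."""
    whole = [set(G.nodes())]
    best, best_score = whole, utility(G, whole)
    for partition in girvan_newman(G):
        parts = [set(c) for c in partition]
        score = utility(G, parts)
        better = score > best_score if maximize else score < best_score
        if better:
            best, best_score = parts, score
    return best, best_score

# Example usage with modularity as the utility (a description-length utility
# such as RB, AP, or SGE would be passed the same way, with maximize=False):
#   from networkx.algorithms.community import modularity
#   communities, q = best_partition_by_utility(G, modularity, maximize=True)
```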

4.1 Evaluation Criteria

Various objective functions have been proposed for evaluating the quality of a proposed community structure given the actual correct community structure, including the Rand index [23], the adjusted Rand index [24], and f-measure. There is no consensus regarding the most informative objective function. In this evaluation, f-measure was selected since its use in information retrieval has made it familiar to a wide range of researchers.

The intuition underlying the use of f-measure is that group structure can be expressed as a relation c(G) = {⟨vi, vj⟩ | ∃g ∈ G ∧ vi, vj ∈ g}, that is, the community structure can be represented by specifying for each pair of vertices whether that pair is in the same group. The similarity between the proposed group structure and the actual group structure can be evaluated by comparing c(proposed) with c(actual). One way to make the comparison is to view each pair in c(proposed) that is also in c(actual) as a true positive, whereas each pair in c(proposed) that is not in c(actual) is a false positive. Under this view, recall and precision can be defined as follows:

– Recall = |c(proposed) ∩ c(actual)| / |c(actual)|
– Precision = |c(proposed) ∩ c(actual)| / |c(proposed)|

F-measure is the harmonic mean of recall and precision:

– F-measure = (2 × recall × precision) / (recall + precision)
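
A small sketch of this pairwise f-measure, with the proposed and actual partitions given as lists of vertex collections (the names are illustrative only):

```python
from itertools import combinations

def pairwise_f_measure(proposed, actual):
    """F-measure over same-group vertex pairs: c(G) is the set of unordered
    pairs of vertices that share a group, and recall/precision compare
    c(proposed) against c(actual) as defined above (illustrative sketch)."""
    def same_group_pairs(groups):
        return {frozenset(pair) for g in groups for pair in combinations(g, 2)}
    cp, ca = same_group_pairs(proposed), same_group_pairs(actual)
    if not cp or not ca:
        return 0.0
    recall = len(cp & ca) / len(ca)
    precision = len(cp & ca) / len(cp)
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)
```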

4.2 Experimental Procedure

Three experiments were performed, each with a different type of artificial graph. The first, ring graphs, are characterized by the sparsity of connections between groups observed in many large-scale real-world graphs [20]. The second, uniform random graphs, has been used in a number of evaluations of community-detection algorithms. The third, embedded Barabasi-Albert graphs, consists of communities generated by preferential attachment [20] embedded in a random graph. Fifty trials were performed under each experimental condition for uniform random and EBA graphs. There is no randomness in the construction of ring graphs, so a single trial was sufficient.

Experiment 1: Ring graphs. Ring graph Rm,c comprises m communities, each consisting of a ring of c vertices, connected to two other communities, each by a single link, such that all communities are connected. Ring graphs are similar to the clique rings of [4] but differ in that the individual communities are themselves rings rather than cliques. For example, Figure 1 depicts ring graph R15,4.
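
A ring graph R_{m,c} can be generated with a few lines of networkx; which vertex in each community carries the inter-community link is not specified above, so that choice in the sketch is arbitrary.

```python
import networkx as nx

def ring_graph(m, c):
    """Ring graph R_{m,c}: m communities, each a cycle of c vertices, with
    community k joined to community (k + 1) mod m by a single edge. Which
    vertices carry the inter-community links is an arbitrary choice made for
    this illustrative sketch."""
    G = nx.Graph()
    for k in range(m):
        cycle = [(k, i) for i in range(c)]
        G.add_edges_from(zip(cycle, cycle[1:] + cycle[:1]))   # the c-vertex ring
        G.add_edge((k, 0), ((k + 1) % m, c // 2))             # link to next community
    return G

# Figure 1 corresponds to ring_graph(15, 4): a ring of 15 squares.
```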

The evaluation compared RB, AP, SGE, and modularity on 91 ring graphs for which ⟨m, c⟩ ∈ {4 . . . 16} × {3 . . . 9}.³ Strikingly different behavior was observed among the four community-structure utility functions. Optimizing SGE led to the correct partitions in all but two ring graphs, but RB and AP found no correct partitions. Optimizing modularity led to correct partitions only for those graphs below the resolution threshold identified by [4].

³ Note that for m, c > 3 ring graphs contain no triangles. Therefore, community detection techniques based on clustering coefficient, e.g., [18], are ineffective for finding communities in such ring graphs.

– SGE. The partition having the optimal (lowest) SGE had the correct partition (i.e., no separate communities were conflated) in every graph except for R4,3 and R13,3. In other words, the correct community structure was found in 89 out of 91 ring graphs.

– RB and AP. No correct community structure was found by optimizing either RB or AP. The partition having the optimal (lowest) value for RB and AP contained at least one pair of communities that were grouped together in every ring graph tested.

– Modularity. Optimizing modularity led to incorrect community structure for rings of more than 8 triangles, more than 10 squares, more than 11 pentagons, or more than 13 hexagons or heptagons. In other words, the correct partitions were obtained with modularity only for rings and communities of the following sizes:

Fig. 2. A uniform random graph with 32 vertices, 4 groups, size ratio 1.25, and in/out ratio 0.67


• R4,3 − R8,3

• R4,4 − R10,4

• R4,5 − R11,5

• R4,6 − R13,6

• R4,7 − R13,7

• R4,8 − R16,8

• R4,9 − R16,9

This evaluation confirmed empirically the existence of the resolution limit for modularity derived formally in [4]. The evaluation also showed the surprising result that optimizing RB and AP leads to even more conflation of distinct communities than does modularity. The observation that optimizing SGE led to the correct community structure provides confirmation for the hypothesis that the conflation of communities in RB and AP arises from term 4, which uses more bits than necessary to represent the number of edges connecting groups in sparse graphs. Substituting rings of cliques for rings of graphs that are themselves rings leads to almost identical results to those described here.

Fig. 3. An Embedded Barabasi-Albert (EBA) graph with 4 communities, each with 5 initial vertices per community, 3 new edges per time step, 10 time steps, and 25 singleton-group edges


Experiment 2: Uniform random graphs. A common data set for testing community-extraction algorithms consists of random networks of 128 vertices divided into 4 communities with average vertex degree of 16 [11,16,19]. In this experiment, the relative size of the communities was controlled by a size ratio parameter s such that if the communities were placed in ascending order, |a_{i+1}| / |a_i| = s, where a_i is the ith community. The connections among the vertices were determined by the average vertex degree d and the in/out ratio i such that the average number of within-community edges incident to each vertex was i ∗ d and the average number of cross-community edges incident to each vertex was (1 − i) ∗ d. For example, Figure 2 shows a uniform random graph with s = 1.25 and i = 0.6. Tests were performed for each combination of n = 32, m = 4, d = 6, s ∈ {1.0, 1.25, 1.5, 1.75, 2.0}, and i ∈ {0.6, 0.75, 0.9}.

Figures 4, 5, and 6 show the results of the 4 algorithms on uniform graphs for i ∈ {0.6, 0.75, 0.9}, respectively. For i ∈ {0.75, 0.9}, in which the community structure is relatively distinct, all four algorithms led to similar results except when the size ratio s was equal to 2.0 (i.e., the sizes of the groups were highly skewed). Under these circumstances, modularity led to much lower f-measure than the other algorithms. When i was equal to 0.6 (i.e., the community structure was relatively unclear), modularity was best and AP worst for low size ratio, and RB and AP were best for high size ratio. These results are consistent with [7], which showed better performance for RB than modularity for skewed community sizes, but comparable performance when community sizes were equal.

Fig. 4. F-measure for uniform random graphs with i=0.6 (weak community structure)


Fig. 5. F-measure for uniform random graphs with i=0.75 (moderate community structure)

Fig. 6. F-measure for uniform random graphs with i=0.9 (strong community structure)


Experiment 3: Embedded Barabasi-Albert Graphs. A wide range of naturally occurring graphs, including those mentioned in the introduction (the Internet, biochemical pathways, social networks), exhibit a power-law degree distribution that is not present in uniform random graphs [20,21,22]. However, few such "scale-free" graphs are annotated with correct community structure. The third data set consisted of communities with scale-free structure embedded in a sparse random graph. Each graph consists of m communities generated by the Jung 1.74 implementation of the Barabasi-Albert preferential attachment algorithm, each starting with i initial vertices in each community, with e new edges per time step following the preferential attachment rule of [20] for each of t time steps, together with c singleton-group vertices. The singleton-group vertices were connected to 1...e vertices randomly selected from the entire graph, i.e., including both community and singleton-group vertices. The graphs used for testing had 4 communities, 4 initial vertices per community, 2–4 edges added per time step, 20 time steps, and 25 singleton-group vertices. For example, Figure 3 depicts an EBA graph with 3 edges added per time step. In evaluating EBA graphs, singleton-group vertices were ignored, regardless of whether they were grouped into new communities or added to existing communities.

As shown in Figure 7, the behavior of all four algorithms was quite similar when the number of edges added per time step was 3 or 4, which leads to relatively densely connected graphs. When only 2 edges were added per time step (i.e., the communities were quite sparse), AP's performance was much worse, and SGE's somewhat worse, than that of the other two algorithms.

Fig. 7. F-measure for embedded Barabasi-Albert graphs with 2–4 edges added per time step


5 Conclusion

The empirical evaluation demonstrated that RB and AP conflate distinct communities in ring graphs, and that changing the calculation of the number of bits needed to represent the group adjacency matrix eliminated this conflation over the range of ring graphs tested. Ring graphs are artifacts not likely to occur in many real-world graphs of interest, but many real-world graphs are like ring graphs in having very sparse group adjacency matrices (i.e., communities with links to few other communities). The ring-graph experiment suggests that RB and AP may perform even more poorly than modularity in such graphs.

SGE's description length calculation did not entirely eliminate resolution limits in clustering. For example, SGE combines adjacent communities in extremely large rings, such as R100,4. Moreover, SGE combines adjacent communities in R4,3 and R13,3. Thus, it appears that SGE's bit encoding is not optimal even in sparse graphs.

No one algorithm consistently outperformed the others in EBA or uniform random graphs, but modularity was consistently worse than the MDL algorithms on highly skewed uniform random graphs, and AP and SGE had lower performance than the others on sparse EBA graphs. Neither uniform random graphs nor EBA graphs have the sparse group adjacency matrices that characterize ring graphs, so most errors consist of assigning a vertex to the wrong community rather than combining two communities that should remain distinct. Under these circumstances, SGE's representation of the group adjacency matrix confers no particular benefit.

While MDL is clearly a powerful tool for identifying community structure, there are many options for MDL encodings, and the consequences of each choice can be difficult to anticipate. SGE demonstrates that the resolution limits of RB and AP in graphs with sparse group adjacency matrices can be easily addressed, but the fact that SGE did not outperform RB or AP on other types of graphs suggests that considerable subtlety is required to identify the MDL encoding most effective over a wide range of graph and community types.

Acknowledgments

This work was funded under contract number CECOM Wl5P7T-08-C-F600. The MITRE Corporation is a nonprofit Federally Funded Research and Development Center chartered in the public interest.

References

1. Newman, M.E.J.: Fast algorithm for detecting community structure in networks.Physical Review E 69, 066133 (2004)

2. Ganti, V., Ramakrishnan, R., Gehrke, J., Powell, A.L., French, J.C.: Clusteringlarge datasets in arbitrary metric spaces. In: Proceedings of the 15th IEEE Inter-national Conference on Data Engineering, Sydney, pp. 502–511 (1999)


3. Koutsourelakis, P., Eliassi-Rad, T.: Finding mixed-memberships in social networks. In: Papers from the 2008 AAAI Spring Symposium on Social Information Processing, Technical Report WW-08-06, pp. 48–53. AAAI Press, Menlo Park (2008)

4. Fortunato, S., Barthelemy, M.: Resolution limit in community detection. Proc. Natl. Acad. Sci. USA 104(1), 36–41 (2007)

5. Ruan, J., Zhang, W.: Identifying network communities with a high resolution. Physical Review E (2007)

6. Ronhovde, P., Nussinov, Z.: An improved Potts model applied to community detection. arXiv:physics.soc-ph (2008)

7. Rosvall, M., Bergstrom, C.: An information-theoretic framework for resolving community structure in complex networks. Proc. Natl. Acad. Sci. USA 104(18), 7327–7331 (2007)

8. Chakrabarti, D.: AutoPart: Parameter-free graph partitioning and outlier detection. In: Proceedings of the European Conference on Machine Learning and Practice of Knowledge Discovery in Databases, pp. 112–124 (2004)

9. Sun, J., Faloutsos, C., Papadimitriou, S., Yu, P.: GraphScope: parameter-free mining of large time-evolving graphs. In: KDD 2007: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 687–696. ACM, New York (2007)

10. Wallace, C.S., Dowe, D.L.: Minimum message length and Kolmogorov complexity. The Computer Journal 42(4), 270–283 (1999)

11. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Physical Review E 69, 026113 (2004)

12. Tasgin, M., Bingol, H.: Community detection in complex networks using genetic algorithm. In: ECCS 2006: Proceedings of the European Conference on Complex Systems (2006)

13. Rattigan, M.J., Maier, M., Jensen, D.: Graph clustering with network structure indices. In: ICML 2007: Proceedings of the 24th International Conference on Machine Learning, pp. 783–790. ACM, New York (2007)

14. Donetti, L., Muñoz, M.: Detecting network communities: a new systematic and efficient algorithm. Journal of Statistical Mechanics: Theory and Experiment 10012, 1–15 (2004)

15. Zhang, H., Giles, C.L., Foley, H.C., Yen, J.: Probabilistic community discovery using hierarchical latent Gaussian mixture model. In: AAAI 2007: Proceedings of the 22nd National Conference on Artificial Intelligence, pp. 663–668. AAAI Press, Menlo Park (2007)

16. Rosvall, M., Bergstrom, C.T.: An information-theoretic framework for resolving community structure in complex networks. Proc. Natl. Acad. Sci. USA 104(18), 7327–7331 (2007)

17. Rissanen, J.: A universal prior for integers and estimation by minimum description length. The Annals of Statistics 11(2), 416–431 (1983)

18. Du, N., Wu, B., Pei, X., Wang, B., Xu, L.: Community detection in large-scale social networks. In: WebKDD/SNA-KDD 2007: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, pp. 16–25. ACM, New York (2007)

19. Bagrow, J.: Evaluating local community methods in networks. J. Stat. Mech. 2008(05), P05001 (2008)

20. Barabasi, A., Albert, R.: Emergence of scaling in random networks. Science 286, 509–512 (1999)

21. Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical data. arXiv:0706.1062 (2007), http://www.santafe.edu/~aaronc/powerlaws/


22. Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical data. SIAM Review 51(4), 661–703 (2009)

23. Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66(336), 846–850 (1971)

24. Hubert, L., Arabie, P.: Comparing partitions. Journal of Classification 2, 193–218 (1985)


Author Index

Berger-Wolf, Tanya Y. 55
Branting, L. Karl 114

Eliassi-Rad, Tina 1

Gallagher, Brian 1
Getoor, Lise 97
Ghosh, Rumi 20
Golbeck, Jennifer 97
Goldberg, Mark 36

Habiba, 55

Kelley, Stephen 36
Kersting, Kristian 77
Kuter, Ugur 97

Lerman, Kristina 20

Magdon-Ismail, Malik 36
Mertsalov, Konstantin 36

Rettinger, Achim 77

Saia, Jared 55

Tresp, Volker 77

Wallace, William (Al) 36

Xu, Zhao 77

Yu, Yintao 55

Zheleva, Elena 97