[ieee 2014 ieee/acm international conference on advances in social networks analysis and mining...

8
Community Evolution Prediction in Dynamic Social Networks Mansoureh Takaffoli, Reihaneh Rabbany, Osmar R. Za¨ ıane Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8 Email:{takaffol, rabbanyk, zaiane}@ualberta.ca Abstract— Finding patterns of interaction and predicting the future structure of networks has many important applica- tions, such as recommendation systems and customer targeting. Community structure of social networks may undergo different temporal events and transitions. In this paper, we propose a framework to predict the occurrence of different events and transition for communities in dynamic social networks. Our framework incorporates key features related to a community – its structure, history, and influential members, and automatically detects the most predictive features for each event and transition. Our experiments on real world datasets confirms that the evolu- tion of communities can be predicted with a very high accuracy, while we further observe that the most significant features vary for the predictability of each event and transition. I. I NTRODUCTION AND BACKGROUND A social network shows the structure of relationships be- tween individuals, where the relationships are usually defined based on some type of interaction, hence are intrinsically temporal and changing over time; examples are friendships between people, co-authorships between scholars, email in- teractions between employees within an organization, etc. Modeling a dynamic network enables the study of its structure over time, the detection of how the network evolves, and ultimately the prediction of the future structure of the network. Finding patterns of interaction and predicting the future structure of networks has many applications, such as in viral marketing [1], revenue maximization [2], and social influence [3]. It can help decision makers setup profitable marketing strategies in advance. However, very little work has been done on why dynamic networks experience specific evolution transitions. Most of the previous research in this area focus on either predicting the macroscopic graph structure, or the microscopic properties from the point of view of a single node or edge. In this paper, we consider predicting the trend of one mesoscopic structure, called community. The community con- sists of a group of nodes that are relatively densely connected to each other but sparsely connected to other dense groups in the network. Analysis of the evolution of communities is related to important social phenomena such as homophily [4] and influence [5]. In social networks, communities can be either explicit or implicit. Explicit communities are built independently from their members and are based on a set of rules. In this case, people mostly join communities after the formation of the communities. Employees of a company or students participat- ing in a course are examples of two explicit communities. On the other hand, the formation of implicit communities heavily depends on their members and connections. In this paper, we mainly focus on implicit communities and the setting where at any given time, an individual can belong to only one community. In this setting, which is called non-overlapping communities or hard-partitioning, a community serves as the main engagement platform for the individual. An individual can move from one community and join another one, while the amount of interactions between members of a community also changes over time. Thus, a community experiences different events and transitions during its life. In this paper, we propose a machine learning model to accurately predict the next event and transition of a community, based on the relevant structural and temporal properties. The first contribution of our model is leveraging the relation between the behavior of individuals and the future of their communities. Members of a community play an important role in attracting new members and generally shaping the future of their community. This fact is however overlooked by all previous works. Our models further assume that individuals who are more likely to undertake actions in their communities, are more influential in the future trend of their community, and therefore are principal factors in the predictive process. For instance, in marketing strategy, considering the impact of individuals on their communities is necessary for targeting the right consumers to direct advertisement, and to maximize the expectation of the total profit [6]. Moreover, unlike previous works that only consider one aspect of the communities (i.e. size, age, or event), our models provide a complete predictive process for any transition and event that a community may undergo, and at the same time, identify the most prominent features for each community transition and event. The last important distinction of our model is that our events and transitions do not have to taken place in consecutive snapshots. A community may not necessarily be observed at consecutive snapshots, while it may be missing from one or more inter- mediate steps. Hence, our model predicts the next stage of a community either in the exact next snapshot or any later snapshot. Before describing our method, we need to first review few notations from [7]. A dynamic social network is modelled as a sequence of graphs {G 1 ,G 2 , ..., G n }, where G i =(V i ,E i ) denotes the graph at snapshot i, which contains V i individuals and E i interactions. The set C i = {C 1 i ,C 2 i , ..., C ni i } represents n i communities detected at snapshot i, where each community C p i C i is also represented by a graph (V p i ,E p i ). We consider two communities from different snapshots similar if the ratio of their mutual members exceeds a threshold, more formally: 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014) 9 ASONAM 2014, August 17-20, 2014, Beijing, China 978-1-4799-5877-1/14/$31.00 ©2014 IEEE

Upload: osmar-r

Post on 17-Mar-2017

221 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: [IEEE 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) - China (2014.8.17-2014.8.20)] 2014 IEEE/ACM International Conference on Advances

Community Evolution Predictionin Dynamic Social Networks

Mansoureh Takaffoli, Reihaneh Rabbany, Osmar R. ZaıaneDepartment of Computing Science, University of Alberta,

Edmonton, Alberta, Canada T6G 2E8

Email:{takaffol, rabbanyk, zaiane}@ualberta.ca

Abstract— Finding patterns of interaction and predictingthe future structure of networks has many important applica-tions, such as recommendation systems and customer targeting.Community structure of social networks may undergo differenttemporal events and transitions. In this paper, we propose aframework to predict the occurrence of different events andtransition for communities in dynamic social networks. Ourframework incorporates key features related to a community –its structure, history, and influential members, and automaticallydetects the most predictive features for each event and transition.Our experiments on real world datasets confirms that the evolu-tion of communities can be predicted with a very high accuracy,while we further observe that the most significant features varyfor the predictability of each event and transition.

I. INTRODUCTION AND BACKGROUND

A social network shows the structure of relationships be-tween individuals, where the relationships are usually definedbased on some type of interaction, hence are intrinsicallytemporal and changing over time; examples are friendshipsbetween people, co-authorships between scholars, email in-teractions between employees within an organization, etc.Modeling a dynamic network enables the study of its structureover time, the detection of how the network evolves, andultimately the prediction of the future structure of the network.

Finding patterns of interaction and predicting the futurestructure of networks has many applications, such as in viralmarketing [1], revenue maximization [2], and social influence[3]. It can help decision makers setup profitable marketingstrategies in advance. However, very little work has beendone on why dynamic networks experience specific evolutiontransitions. Most of the previous research in this area focuson either predicting the macroscopic graph structure, or themicroscopic properties from the point of view of a single nodeor edge. In this paper, we consider predicting the trend of onemesoscopic structure, called community. The community con-sists of a group of nodes that are relatively densely connectedto each other but sparsely connected to other dense groupsin the network. Analysis of the evolution of communities isrelated to important social phenomena such as homophily [4]and influence [5].

In social networks, communities can be either explicit orimplicit. Explicit communities are built independently fromtheir members and are based on a set of rules. In this case,people mostly join communities after the formation of thecommunities. Employees of a company or students participat-ing in a course are examples of two explicit communities.On the other hand, the formation of implicit communities

heavily depends on their members and connections. In thispaper, we mainly focus on implicit communities and the settingwhere at any given time, an individual can belong to only onecommunity. In this setting, which is called non-overlappingcommunities or hard-partitioning, a community serves as themain engagement platform for the individual. An individualcan move from one community and join another one, while theamount of interactions between members of a community alsochanges over time. Thus, a community experiences differentevents and transitions during its life.

In this paper, we propose a machine learning model toaccurately predict the next event and transition of a community,based on the relevant structural and temporal properties. Thefirst contribution of our model is leveraging the relationbetween the behavior of individuals and the future of theircommunities. Members of a community play an important rolein attracting new members and generally shaping the futureof their community. This fact is however overlooked by allprevious works. Our models further assume that individualswho are more likely to undertake actions in their communities,are more influential in the future trend of their community,and therefore are principal factors in the predictive process.For instance, in marketing strategy, considering the impact ofindividuals on their communities is necessary for targeting theright consumers to direct advertisement, and to maximize theexpectation of the total profit [6]. Moreover, unlike previousworks that only consider one aspect of the communities (i.e.size, age, or event), our models provide a complete predictiveprocess for any transition and event that a community mayundergo, and at the same time, identify the most prominentfeatures for each community transition and event. The lastimportant distinction of our model is that our events andtransitions do not have to taken place in consecutive snapshots.A community may not necessarily be observed at consecutivesnapshots, while it may be missing from one or more inter-mediate steps. Hence, our model predicts the next stage ofa community either in the exact next snapshot or any latersnapshot.

Before describing our method, we need to first review fewnotations from [7]. A dynamic social network is modelled asa sequence of graphs {G1, G2, ..., Gn}, where Gi = (Vi, Ei)denotes the graph at snapshot i, which contains Vi individualsand Ei interactions. The set Ci = {C1

i , C2i , ..., C

nii } represents

ni communities detected at snapshot i, where each communityCp

i ∈ Ci is also represented by a graph (V pi , E

pi ). We consider

two communities from different snapshots similar if the ratioof their mutual members exceeds a threshold, more formally:

2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014)

9ASONAM 2014, August 17-20, 2014, Beijing, China

978-1-4799-5877-1/14/$31.00 ©2014 IEEE

Page 2: [IEEE 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) - China (2014.8.17-2014.8.20)] 2014 IEEE/ACM International Conference on Advances

Community Similarity: Similarity of communities Cpi and

Cqj , detected at snapshot i and j �= i, is measured as:

sim(Cpi , C

qj ) =

{ |V pi∩V q

j|

max(|V pi|,|V q

j|) if

|V pi∩V q

j|

max(|V pi|,|V q

j|) ≥ k

0 otherwise(1)

Based on this notion of similarity, changes that occur for acommunity are defined as: form, dissolve, split, merge, orsurvive. Note that, the events and transitions formulation pro-posed in [7], track the changes of communities over the entireobservation time, rather than only between two consecutivesnapshots. Moreover, the instances of the same community atdifferent time-frames are considered as one meta community,formally defined as:

Meta Community: Given a set of snapshots 1, 2 . . . n, ameta community is a sequence of communities M ={Cp1

t1 , Cp2

t2 , ..., Cpm

tm } such that

(a) 1 ≤ t1 < t2 < ... < tm ≤ n,(b) ∀ ti, t1 < ti ≤ tm ∃ tj < ti : sim(C

pj

tj , Cpi

ti ) > 0

II. PROBLEM FORMULATION

In predicting the trend of a community using predictivemodels, a response variable is a property related to communitywhich can quantify a particular change in a community overtime. A feature is any property that can influence one of theresponse variables. Thus, the first step is to select appropriatefeatures from the properties related to entities and communi-ties, as well as deciding on the response variables. Then, wecan model the relationship between each response variable andone or more features, which can be used later to predict themost probable changes that may occur for a community.

A. Community Transitions and Events as Response Variables

In the case of implicit communities, where the formationof communities heavily depends on their members and con-nections, an entity may leave its current community and joinanother community, due to the shifts of their interests or due tocertain external events. Thus, when a community survives intonext snapshot, it may also experience different transitions. Itmay expand if the number of its members increases, or shrinksif the number of its members decreases. Moreover, membersof a survived community may change their engagement level,making the community more cohesive, or loose.

Based on these transitions and the events [7], we quantifythe changes that may occur for a community as follows:survive{true, false}, merge{true, false}, split{true, false},size{expand, shrink}, and cohesion{cohesive, loose}. Allthese events and transitions are binary which constitute theresponse variables in our predictive model. Since size andcohesion transitions only defined for a survival community, wepropose a multistage cascading technique to detect these twotransitions. First, we predict the survival, then the detection ofthese transitions is followed. These response variables are notmutually exclusive and may occur together at the same time,where different features may trigger them. Hence, we learnseparate models to predict each of them.

Here, we consider the cohesion of a community as howclosely its members interact with each other relative to outsideof the community. More formally:

Cohesion: Cohesion of a community Cpi at snapshot i is:

cohesion(Cpi ) =

2|Epi|

|V pi|(|V p

i|−1)

|OEpi|

|V pi|(|Vi|−|V p

i|)

=2|Ep

i |(|Vi| − |V pi |)

|OEpi |(|V p

i | − 1)(2)

where |OEpi | is the set of outer edges of community Cp

i .

B. Properties of Community as Features

To predict the next stage of a community, we consider threemain classes of features: properties of its influential members,properties of the community itself, and temporal changes ofthese properties. These features are summarized in Table I, andare explained in detail in the following.

1) Properties of its influential members: The evolution ofnetwork is usually analyzed by considering all members andtheir properties. However, communities are often led by asmaller set of individuals, who have considerable influenceover other members, and shape the fate of their community.To identify the influential nodes in a community we use therole leader defined in the role mining framework proposedby Abnar [8]. The leaders of a community are defined as theoutstanding individuals in terms of centrality or importancein that community. To detect the leaders of a community, theprobability distribution function (pdf) of closeness centralityscores for all the nodes in each community is computed. AsAbnar [8] found out, this pdf is close to a normal distribution,and therefore, the upper threshold of μ+ 2σ 1 can be used todistinguish the leaders. For these detected leaders, we considertwo structural features, i.e. their degree and closeness centralityscores. Since a community may have more than one leader, wetake the average degree and closeness centrality scores of thedetected leaders. We also consider the ratio of the leaders tothe community size as a separate feature for the community.Similarly, we consider the ratio of outermosts in a communityas another feature. Where Outermosts are defined as the smallset of least significant individuals in the community [8] 2.

2) Structural properties of the community itself: To quan-tify the structural properties of a community we consider itssize (number of nodes), cohesion (Equation 2), density, andclustering coefficient. The density, and clustering coefficientof a community Cp

i are defined as:

Density: The ratio of edges to the maximum possible edges:

density(Cpi ) =

2|Epi |

|V pi |(|V p

i | − 1)(3)

Clustering coefficient: The ratio of edges between neigh-bours of a nodes to the maximum possible edges between

1according to the properties of a normal distribution, almost 95% of thepopulation lies in the interval [μ− 2σ, μ+2σ], where μ and σ are mean andstandard deviation of this distribution.

2Similar to leaders, from the pdf of the closeness centrality scores, we useμ− 2σ as the lower threshold to identify outermosts

2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014)

10

Page 3: [IEEE 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) - China (2014.8.17-2014.8.20)] 2014 IEEE/ACM International Conference on Advances

them. More formally, clustering coefficient of node v is:

clustering-coeff(v) =|{(u,w)|(v, u), (v, w), (u,w) ∈ E}|

|{(u,w)|(v, u), (v, w) ∈ E}|The clustering coefficient of a community is then defined asthe mean of clustering coefficient of all its members:

clustering-coeff(Cpi ) =

∑v∈Cp

iclustering-coeff(v)

|Cpi |

(4)

Similar to the clustering coefficient, we also consider theaverage and variance of the centrality scores of all membersas separate features.

3) Temporal changes of features: We consider the currentrate of change in each property of a community as an additionalfeature. More specifically, the difference between properties ofcommunity Cp

i and properties of its previous instance, i.e. Cqj ,

are considered as features. We also consider the events andtransitions that occurred for community Cq

j , since there couldbe an auto-correlation.

4) Contextual attributes as features: For the datasets con-taining text, we consider two more features: stable topics,and stable topics of leaders. Here, we represent topics withthe most frequent keywords3. We expect that the changes inthe topics discussed within a community or by its influentialmembers affect its future.

III. COMMUNITY EVOLUTION PREDICTION

Based on our problem formulation, the prediction of ourevents and transitions becomes a typical machine learning task,for which we use logistic regression and different classificationmethods. The only exception is that the size and cohesion tran-sitions only occur for a community that survives. Thus, in orderto predict these two response variables (size{expand, shrink},and cohesion{cohesive, loose}), we propose a two-stage cas-cade predictive model, where the information collected fromthe output of a first stage is used as additional information forthe second stage in the cascade. The first stage predicts thesurvive response variable (survive{true, false}), then only ina case of true predicted value, the size and cohesion transitionsshould be predicted. The procedure to predict the cohesion, andsize transition can be summed up as follows:

Two-stage cascade predictive model:

1: Predict the survival of a community using the survivepredictive model. If the predicted value for survive istrue, go to Stage 2, otherwise, community does nothave any transitions.

2: Predict the size and cohesion transitions using theirrespective predictive models.

Each of the five different response variables is defined asa binary categorical variable. Thus, we adopt a logistic regres-sion for each response variable using the features in Table I

3KEA [9] is applied to produce a list of the keywords discussed by eachentity. Then, the topics for each node is defined as its 10 most frequentkeywords extracted by KEA. The topics of a community is then 10 mostfrequent keywords discussed by its members.

as the predictors. Then, in order to select the most significantfeature set, we apply forward stepwise additive regression [10],where LogitBoost with simple regression functions is used forfitting the logistic models, and attribute selection.

Beside the logistic regression, we also adopt the mostwell-known binary classifier methods to predict each responsevariable: Naıve Bayes classifier, Bagging classifier, DecisionTable classifier, Decision Stump classifier, J48 Decision tree,Bayesian Networks classifier, Simple CART classifier, SupportVector Machine (SVM) classifier, and Neural network classi-fier4. Using all the features provided in Table I may not leadto the highest accuracy due to over-fitting, and redundant orirrelevant features. Therefore, we apply a wrapper method toselect the appropriate feature sets for each binary classifier.The wrapper method uses a classifier to estimate the scoreof different features based on the error rate of that classifier.The wrapper method is computationally intensive and has tobe applied for each binary classifier separately, however wedecided to use it since it provides the best performing featureset for the chosen classifier.

IV. CASE STUDIES

In this section, we present the performance analysis of ourpredictive models on two real world dynamic social networks:the Enron email dataset and the DBLP dataset. The Enronemail dataset contains emails exchanged between employeesof the Enron Corporation. We study the year 2001, the yearthe company declared bankruptcy, and consider a total of210 nodes, with each month being one snapshot. For theDBLP dataset, the co-authorship network related to the field ofdatabase and data mining from year 2001 to 2010 is extracted.This dataset contains a total of 19461 authors, where thesnapshot is defined to be one year.

In these experiments, we apply the computationally effec-tive local community mining algorithm [12] to produce setsof disjoint communities for each snapshot. Furthermore, weincorporate the extraction of the topics for the entities andthe discovered communities. KEA [9] is applied to producea list of the keywords discussed in the email messages orused in the paper titles. Then, the topic for each entity andcommunity is defined as its 10 most frequent keywords. Todetect events and transitions, our previously proposed MODECframework [7] is applied on the set of communities minedduring the observation time. Note that, MODEC not onlydetect events and transitions between consecutive snapshots,but also between any arbitrary snapshots.

Given the feature set, and the response variables of TableI, we develop a 10-fold cross-validation framework in whichthe communities with their response variables and featuresare randomly partitioned into 10 equal size subsamples. Then,9 subsamples are used as training data, while the remainingsubsample is retained as the validation data for testing thepredictive model. We repeat the cross-validation process 10times and average the 10 results from the folds to produce asingle estimation.

In most of the experiments presented in the following,the two labels of the underlying response variable are not

4The WEKA Data Mining implementation of the classifiers is used [11].

2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014)

11

Page 4: [IEEE 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) - China (2014.8.17-2014.8.20)] 2014 IEEE/ACM International Conference on Advances

TABLE I. PROBLEM FORMULATION: FEATURES AND RESPONSE VARIABLES RELATED TO A COMMUNITY

Category Feature Description Domain

Influential MemberClosenessLeaders average of closeness centrality of leaders (0, 1]DegreeLeaders average of degree centrality of leaders (0, 1]LeadersRatio ratio of leaders (0, 1]OutermostsRatio ratio of outermosts [0, 1]

Community

Density ratio of edges to maximum possible edges (Equation3) (0, 1]ClusteringCoefficient ratio of edges between neighbours of a nodes to maximum possible edges (Equation4) (0, 1]NodesNumber number of nodes [2,∞)Cohesion ratio of members interact with each other to outside of the community (Equation 2) (0,∞)AverageCloseness average of closeness centrality scores (0, 1]VarianceCloseness variance of closeness centrality scores [0, 1]AverageDegree average of degree centrality scores (0, 1]VarianceDegree variance of degree centrality scores [0, 1]

Temporal

ΔClosenessLeaders difference between average of closeness centrality of leaders (0, 1]ΔDegreeLeaders difference between average of degree centrality of leaders (0, 1]ΔLeadersRatio difference between ratio of leaders [0, 1]ΔOutermostsRatio difference between ratio of outermosts [0, 1]ΔDensity difference between density [0, 1]ΔClusteringCoefficient difference between clustering coefficient [0, 1]ΔAverageCloseness difference between average of closeness centrality scores [0, 1]ΔVarianceCloseness difference between variance of closeness centrality scores [0, 1]ΔAverageDegree difference between average of degree centrality scores [0, 1]ΔVarianceDegree difference between variance of degree centrality scores [0, 1]JoinNodesRatio percentage of nodes joining to this community [0, 1]LeftNodesRatio percentage of nodes leaving this community [0, 1]Similarity similarity between community and its previous instance (Equation 1) [k, 1]LifeSpan number of snapshots between this community and the first instance of the same community [1, n]

Previous Events

PreviousSurvive survive event occurred for previous instance of the community {true, false}PreviousMerge merge event occurred for previous instance of the community {true, false}PreviousSplit split event occurred for previous instance of the community {true, false}PreviousSizeTransition size transition occurred for previous instance of the community {expand, shrink}PreviousCohesionTransition cohesion transition occurred for previous instance of the community {cohesive, loose}

ContextualStableTopics stable topics between community and its previous instance {true, false}StableLeaderTopics stable topics between leaders of community and leaders of its previous instance {true, false}

Response variable

survive survive event occurred for the community {true, false}merge merge event occurred for the community {true, false}split split event occurred for the community {true, false}size size transition occurred for the community {expand, shrink}cohesion cohesion transition occurred for the community {cohesive, loose}

balanced. Thus, to prevent over fitting and balance the twoclass labels, we use SMOTE (synthetic minority oversam-pling technique) [13] when the number of instances are low.Whereas, in the case of having high number of instancesor having a huge difference between the number of the twolabels, the undersampling technique5, is applied to prevent theoverfitting.

A. Results on Enron Email Dataset

To predict any of the three events, all the communities de-tected at the twelve snapshots with their features and responsevariables are used to build predictive models for each event.In total we have 114 community instances, where | survive =true | = 60, | split = true | = 24, and |merge = true | = 59.We first select the influential features for each binary classifierusing the wrapper method. Then, each binary classifier istrained with its selected features. Table II shows the top fiveaccurate predictive models for the survive event. As shown inTable II, the accuracy of all models is about 78%. However,a closer look at the falsely classified instances reveals thatthey are mostly communities of very small size, i.e. less than3 members, while their meta community has the length ofonly one snapshot i.e. the community forms at a snapshot anddissolves immediately. Therefore, we remove these communityinstances (of a size less than three, and a meta community of

5The spreadsubsample undersampling technique available in WEKA is used.6RSurvive represents survive prediction on the reduced community in-

stances (communities with more than 3 members, where their meta communitylast more than one snapshot).

TABLE II. ENRON: SURVIVE EVENT PREDICTION

Event Predictive Model Accuracy Precision Recall F-measure

Survive

Neural Network 78.1513 0.782 0.782 0.782Naıve Bayes 77.3109 0.783 0.773 0.771J48 Decision tree 72.2689 0.723 0.723 0.723Logistic Regression 71.4286 0.715 0.714 0.714SimpleCART 71.4286 0.721 0.714 0.712

RSurvive6

Naıve Bayes 91.5888 0.92 0.916 0.916BayesNet 90.6542 0.907 0.907 0.907Neural Network 89.7196 0.897 0.897 0.897Logistic Regression 85.9813 0.863 0.86 0.86Decision Table 85.0467 0.884 0.85 0.846

length one), and retrain the model. This reduction is intuitive,since a community that consists of only two members doesnot really represent a group of nodes, and hence is not a realcommunity. The reduction procedure results in 76 communityinstances with | survive = true | = 55, where we see at least20% increase in the accuracy of models, with accuracy as highas 92%. Our results indicate that the survival of a communitycan be accurately predicted based on the features we definedand extracted, while using a typical general purpose classifier.

We observe similar performance in predicting the other twoevents. The top five predicted models for split, and mergeare shown in Table III. We can see that our models predictthe split of a community (into other communities in a nextsnapshot) with about 91% accuracy, regardless of the classifierused. Where, the merge event (of a community with anothercommunities in a next snapshot) can be predicted with anaccuracy as high as 72%.

The size, and cohesion transitions are preceded by the

2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014)

12

Page 5: [IEEE 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) - China (2014.8.17-2014.8.20)] 2014 IEEE/ACM International Conference on Advances

TABLE III. ENRON: MERGE AND SPLIT EVENTS PREDICTION

Event Predictive Model Accuracy Precision Recall F-measure

Split

Naıve Bayes 90.8046 0.908 0.908 0.908Bagging 89.6552 0.9 0.897 0.896J48 Decision tree 88.5057 0.89 0.885 0.884Neural Network 88.5057 0.885 0.885 0.885Decision Table 87.931 0.88 0.879 0.879

Merge

Naıve Bayes 71.9298 0.723 0.719 0.719Neural Network 71.0526 0.711 0.711 0.71SimpleCART 69.2982 0.693 0.693 0.693SVM 66.6667 0.686 0.667 0.654J48 Decision tree 65.7895 0.664 0.658 0.652

prediction of a survive event. Based on our results in TableII, we choose the Naıve Bayes classifier to detect the surviveevents. Then, communities with predicted survive = true us-ing this classifier are used to build the models for the size, andcohesion transitions. The Naıve Bayes classifier predicts 50community instances with survive = true, for which we have| size = expand | = 29, and | cohesion = cohesive | = 23. Thetop five predictive models for the size, and cohesion transitionsare shown in Table IV. We can see that these size and cohesiontransitions of a community7, can be predicted with a highaccuracy of 79%, and 77% respectively.

TABLE IV. ENRON: COMMUNITY TRANSITIONS PREDICTION

Transition Predictive Model Accuracy Precision Recall F-measure

Size

Naıve Bayes 79.3103 0.795 0.793 0.793J48 Decision tree 72.4138 0.724 0.724 0.724Neural Network 72.4138 0.728 0.724 0.723SVM 68.9655 0.715 0.69 0.68Decision Stump 68.9655 0.691 0.69 0.689

Cohesion

SimpleCART 76.9231 0.769 0.769 0.769Bagging 75 0.761 0.75 0.749Neural Network 75 0.754 0.75 0.75J48 Decision tree 73.0769 0.741 0.731 0.726Naıve Bayes 67.3077 0.764 0.673 0.648

B. Results on DBLP Database

We perform similar analysis on the ten snapshots of theDBLP dataset. In total, there are 7668 community instances,where | survive = true | = 1814, | split = true | = 159,and |merge = true | = 315. As shown in Table V, the bestaccuracy for the survive predictive models is 62%. Similar toEnron, the false predicted instances are all small size commu-nities with less than 3 members, where their meta communityalso has a length one. With the same reasoning as before, weremove these instances. Here, the reduction procedure resultsin 1984 community instances with | survive = true | = 1140.The accuracy of the top five predictive models on these reducedinstances is reported in Table V. Our results confirm the trendwe have observed on the Enron dataset, i.e. the survival of acommunity can be accurately predicted based on our set offeatures (with a 84% accuracy).

The results for split, and merge are shown in TableVI. Our results indicate that, with 81% accuracy, we canpredict the split of a community into other communities ina next snapshot. However, the best prediction accuracy formerge of a community with another community is 62%. Thefalse predicted instances on merge do not have any clearcharacteristics to explain how we can get better accuracy. Thus,on DBLP, unlike survival and split, merging of communities

7with at least three members and its meta community lasts more than onesnapshot

TABLE V. DBLP: SURVIVE EVENT PREDICTION

Event Predictive Model Accuracy Precision Recall F-measure

Survive

J48 Decision tree 61.742 0.618 0.617 0.617Bagging 61.6869 0.618 0.617 0.616Decision Table 61.301 0.613 0.613 0.613Neural Network 61.1356 0.614 0.611 0.609Naıve Bayes 61.0805 0.611 0.611 0.61

RSurvive

Decision Table 84.1232 0.879 0.841 0.837Decision Stump 84.1232 0.879 0.841 0.837Neural Network 84.0047 0.873 0.84 0.836BayesNet 82.4645 0.847 0.825 0.822SimpleCART 81.6351 0.859 0.816 0.811

TABLE VI. DBLP: MERGE AND SPLIT EVENTS PREDICTION

Event Predictive Model Accuracy Precision Recall F-measure

Split

Naıve Bayes 80.5031 0.805 0.805 0.805J48 Decision tree 80.1887 0.805 0.802 0.801SVM 79.8742 0.799 0.799 0.799Neural Network 79.8742 0.801 0.799 0.798SimpleCART 79.5597 0.798 0.796 0.795

Merge

J48 Decision tree 61.5873 0.644 0.616 0.596Bagging 60.1587 0.606 0.602 0.597SimpleCART 59.8413 0.601 0.598 0.596Decision Table 59.8413 0.609 0.598 0.588SVM 59.5238 0.598 0.595 0.593

with each other can not be accurately predicted based on thepresent set of features, where the best prediction accuracyis only 62%. This could be partly explained by a variety ofexternal factors that can affect such event, for example meetingat a conference, moving between institutions, etc.

As shown in Table V, applying the Decision Table classifierproduces the highest accuracy for the survive event on reducedcommunity instances. Thus, only the communities with pre-dicted survive = true using the Decision Table classifier areused to build the predictive models for the size, and cohesion.In this case, we have 576 community instances, where | size =expand | = 383, and | cohesion = cohesive | = 440. The topfive predictive models on the size, and cohesion transitionsare shown in Table VII respectively. We see that the size andcohesion transitions of a community can be predicted with a80%, and 92% accuracy respectively.

TABLE VII. DBLP: COMMUNITY TRANSITIONS PREDICTION

Transition Predictive Model Accuracy Precision Recall F-measure

Size

BayesNet 79.622 0.797 0.796 0.796Naıve Bayes 79.217 0.792 0.792 0.792Decision Table 78.812 0.789 0.788 0.788Bagging 78.543 0.785 0.785 0.785Neural Network 78.003 0.786 0.78 0.778

Cohesion

Naıve Bayes 92.089 0.921 0.921 0.921Neural Network 91.499 0.918 0.915 0.915Bagging 91.499 0.916 0.915 0.915Decision Stump 91.145 0.925 0.911 0.911SVM 91.145 0.915 0.911 0.911

C. Correlation between Features

Figure 1 shows the correlation between different featuresof Table I. The correlation is measured as the absolute valueof Spearman’s rank correlation coefficient between differentfeatures. In order to better visualize the correlations, the rowsand coloumns of the heat-map are clustered to create blocksof highly correlated features. For instance, Density, Clustering-Coefficient, AverageDegree, and AverageCloseness features arecorrelated as expected. Note that, their corresponding temporalfeatures are also correlated with each other. However, as wecan see in these heat-maps, most of the defined features are not

2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014)

13

Page 6: [IEEE 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) - China (2014.8.17-2014.8.20)] 2014 IEEE/ACM International Conference on Advances

Fig. 1. Absolute value of Spearman’s rank correlation coefficient betweendifferent features. Top: Enron, Bottom: DBLP. These correlation matricesdepict that the overlap between features used in our predictive models is low.

highly correlated in neither Enron nor DBLP. This behaviouris desirable, since we define features to capture differentproperties of a community and its temporal changes. In otherwords, the low correlation/ overlap between features confirmsthat the features used in our predictive models are distinctive.

D. Ensemble Analysis

Any of the predictive models we introduced, selects adifferent set of features. We consider a feature is more promi-nent for a specific event or transition, if it is selected by themajority of the models trained for predicting that event ortransition. Figure 2 shows the number of times that each featureis selected by our 10 predictive models for predicting eachevent. Here, to better visualize the selection of the features,only the rows of the heat-map is clustered to create blocksof similarly colored cells. The Pearson correlation betweendifferent features and the response variables is also depicted

Fig. 2. Enron: The number of times a feature is selected by the 10 predictivemodels (left), and the correlation between each feature and response variable(right).

in Figure 2. Here, to calculate correlation between featuresand cohesion, and size transitions, we consider their cohesive,and expand values respectively. Furthermore, to simplify thecomparison between this heat-map and the one showing theselection of the features, the rows are ordered correspondingly.Similar to the number of times that a feature is selected, thecorrelation between each feature and response variables differsfor different response variables. Moreover, the correlation of afeature is positively correlated with the number of times it isselected.

We can infer interesting patterns form this ensemble anal-ysis. For example ClusteringCoefficient and Cohesion areprominent positive factors on the survival of Enron com-munities, while StableLeaderTopics and LeftNodesRatio areimportant negative factors on survival. The importance of thesefactors on survive is intuitive, for instance, a community withhigh clustering coefficient has strong relationship between itsmembers and will not dissolve easily. On the other hand,loosing members (i.e. high LeftNodesRatio) is a good signof an unstable community which is not going to survive. Incase of split, LeadersRatio and NodesNumber are positivelyimportant, i.e. a community with more leaders or a biggersize community is more probable to split. Whereas, the neg-ative effect of VarianceCloseness and Cohesion shows that acommunity with high variance of closeness scores and highcohesion is immune to split. The merge of a communityis positively influenced by StableTopics, talking about thesame topics over the time leads the community to mergewith another communities. On the other hand, Similarity hasnegative influence on merge, i.e. a community with almoststable members is not probable to merge with others.

Similarly, the ensemble analysis for DBLP is depicted inFigure 3. Again, interesting patterns can be inferred from thesetwo heat-maps. For instance, Density is a positive factor insize transition, whereas, NodesNumber is negatively important,i.e. a dense community attract new members and expands,while a bigger size community has less chance to attractnew members. We also observe that Cohesion is a prominentnegative feature on the cohesion transition of a community ina later snapshot. This importance indicates that on DBLP, a

2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014)

14

Page 7: [IEEE 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) - China (2014.8.17-2014.8.20)] 2014 IEEE/ACM International Conference on Advances

Fig. 3. DBLP: The number of times a feature is selected by the 10 predictivemodels (left), and the correlation between each feature and response variable(right).

Fig. 4. Comparison of prominent features on the ENRON (left) and DBLP(right) dataset. Only features that are selected more than five times by at leastone event or transition are included.

less cohesive community has a better potential to form newconnections and become more cohesive.

Comparing the features importance between these twodatasets, we see that these patterns although similar, depend onthe underlying dynamic social network. This finding demon-strates the importance of the feature selection step for theprediction task.

Figure 4 provides the comparison between prominent fea-tures selected for the two dataset. Here, we only include thefeatures that are selected more than five times by at leastone (event or transition) predictive models. The diagramsshow that, for instance on DBLP, JoinNodesRatio is influentialfor all the five events and transitions. On the other hand,StableLeaderTopics is influential in only size, and cohesionprediction.

V. RELATED WORK

The works on evolution of dynamic networks can beclassified into three categories: microscopic, macroscopic andmesoscopic approaches. The microscopic approaches focus onthe evolution at the level of nodes and edges, such as the studyof preferential attachment phenomenon in [14], [15] or the

modelling of the node arrival and edge creations in [16]. Onthe other end, the macroscopic perspectives study the evolutionof the high level properties of networks, for instance, the studyof the evolution of degree distribution, clustering coefficient,and degree correlation of online social networks in [17], or thestudy of shrinking diameter of social networks in [18].

Different models are proposed to predict microscopic ormacroscopic trends of dynamic networks. Notably, Yang et al.[19] develop a prediction model to analyze the loss of a user inan online social network, by extracting a set of attributes andusing a decision tree classifier. Huang and Lee [20], propose amodel to select the most influential activity features, and thenincorporate these features to predict the growth or shrinkageof the network. Based on their findings, on the Facebook data,the number of active members and the number of edges isthe most informative factors to predict the network evolution.Whereas, on the Citeseer data, it is observed that the numberof collaborations between members is the main indicator toexplain the evolving patterns of this co-authorship network.

A less explored perspective is provided by the mesoscopicapproaches, which predict the trend of networks based onan intermediate structure of the networks, i.e. communitystructure. The evolution of communities from the standpointof growth is modeled in [21]–[23], where an individual ina community never leaves the community, i.e. a communityin these studies always grows. For instance, Backstrom et al.[21] apply a decision-tree approach by incorporating a widerange of structural features to predict whether and entity willjoin a community. Given a community, they also predict itsgrowth over a fixed time period. Patil et al. [24] build aclassifier to predict if a community is going to grow or is likelyto remain stable over a period of time. However, they onlyconsider explicit communities, for instance, conferences areconsidered as communities for the DBLP dataset. Kairam et al.[25] identify two types of growth for a community. Diffusiongrowth is when a community attracts new members throughties to existing members; whereas, in non-diffusion growth,individuals with no prior ties become members themselves.Their analysis is then focused on the differences in theprocesses which govern diffusion and non-diffusion growth.Their finding shows that if a community is highly clustered,it is more likely to experience diffusion growth. However,communities that grow more from diffusion tend to reachsmaller final sizes. They also generated a set of modelswhich use a community’s structural features and past growthexperience to predict its eventual size and lifespan.

The works mentioned above consider explicit communities,and can only be applied in the settings where users join mul-tiple communities and probably never quit these communities.Thus, the size of a community will monotonically increase overthe time. However, in most networks, an individual may quithis/her current community and join another one, in case he/sheis not satisfied with that community. Hence, the communitiesin these dynamic networks usually have fluctuating membersand could grow and shrink over the time.

In the case of implicit communities, Goldberg et al. [26],[27] develop a linear regression system to predict the lifespanof a community based on structural features extracted fromthe early stage of the community. They find that community’sproperties such as size, intensity and stability are the most

2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014)

15

Page 8: [IEEE 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) - China (2014.8.17-2014.8.20)] 2014 IEEE/ACM International Conference on Advances

important features to predict its lifespan. The most relevantwork to ours is of Brodka et al. [28], [29], where they developdifferent classifiers to predict the events that may occur fora community (similarly defined as continue, merge, split, anddissolve). Their model is trained mainly based on the historyof events happened to the community in preceding snapshots.Therefore, events can only be predicted for communities thattheir past three instances is available. While these instancesalso have to be in consecutive snapshots. Another drawbackof their approach is that they consider events to be mutuallyexclusive and only predict the dominating event. In our model,however, the future of a community is predicted based on anextensive set of features on its current members, their roles andtheir relations, where we also leverage temporal information(up to one time-frame backward), if the previous instances ofthe community are available. Also using the notion of meta-community, we are able to track multiple events and transitionsthat a community undergoes in non-consecutive snapshots.

VI. CONCLUSION

We investigated the evolution of dynamic networks, at thelevel of their community structure. We defined and extractedan extensive set of relevant role-based, structural, contextualand temporal features, to represent the the structural and non-structural properties of communities and the behaviour of their(influential) members. Our experimental results on real-worlddatasets (Enron and DBLP) shows that the defined featuresare mainly non-overlapping, and distinctive. Based on which,the events and transitions of communities can be accuratelypredicted. Our predictive process also identifies the mostprominent features for each community transition and event.We confirm the relation between the behavior of individuals,specially the influential members of a community, and thefuture of the community they belong to, and also observe manyinteresting, yet expected, evolution patterns, e.g. recruiting newmembers by a community is a good indicator of its survival.In future work, we would like to extend our prediction forthe networks with overlapping community structure. Then, thenext step is to present a Bayesian method for constructinga probabilistic network for dynamic communities. Once con-structed, the probabilistic network can indicate possible casualrelationships between different features of a community andits events and transitions.

REFERENCES

[1] J. Leskovec, L. A. Adamic, and B. A. Huberman, “The dynamics ofviral marketing,” ACM Transactions on the Web, vol. 1, p. 5, 2007.

[2] H. Akhlaghpour, M. Ghodsi, N. Haghpanah, V. S. Mirrokni, H. Mahini,and A. Nikzad, “Optimal iterative pricing over social networks,” inInternet and Network Economics, 2010, pp. 415–423.

[3] N. Agarwal, H. Liu, L. Tang, and P. S. Yu, “Identifying the influentialbloggers in a community,” in International Conference on Web Searchand Data Mining, 2008, pp. 207–218.

[4] M. McPherson, L. Smith-Lovin, and J. M. Cook, “Birds of a feather:Homophily in social networks,” Annual Review of Sociology, vol. 27,no. 1, pp. 415–444, 2001.

[5] A. Anagnostopoulos, R. Kumar, and M. Mahdian, “Influence and corre-lation in social networks,” in ACM SIGKDD International Conferenceon Knowledge Discovery and Data Mining, 2008, pp. 7–15.

[6] S. Goel and D. G. Goldstein, “Predicting individual behavior with socialnetworks,” Marketing Science, vol. 33, no. 1, 2014.

[7] M. Takaffoli, F. Sangi, J. Fagnan, and O. R. Zaıane, “Tracking changesin dynamic information networks,” in International Conference onComputational Aspects of Social Networks, 2011.

[8] A. Abnar, “Structural social role mining,” Master’s thesis, Universityof Alberta, 2014.

[9] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning, “KEA: Practical automatic keyphrase extraction,” in ACMConference on Digital Libraries, 1999, pp. 254–255.

[10] N. Landwehr, M. Hall, and E. Frank, “Logistic model trees,” MachineLearning, vol. 59, no. 1-2, pp. 161–205, 2005.

[11] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H.Witten, “The weka data mining software: an update,” ACM SIGKDDexplorations newsletter, vol. 11, no. 1, pp. 10–18, 2009.

[12] J. Chen, O. R. Zaıane, and R. Goebel, “Detecting communities in largenetworks by iterative local expansion,” in International Conference onComputational Aspects of Social Networks, 2009, pp. 105–112.

[13] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer,“SMOTE: Synthetic minority over-sampling technique,” Artificial In-telligence Research, vol. 16, pp. 321–357, 2002.

[14] A.-L. Barabasi and R. Albert, “Emergence of scaling in randomnetworks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.

[15] E. Elmacioglu and D. Lee, “Modeling idiosyncratic properties ofcollaboration networks revisited.” Scientometrics, vol. 80, no. 1, pp.195–216, 2009.

[16] J. Leskovec, L. Backstrom, R. Kumar, and A. Tomkins, “Microscopicevolution of social networks,” in ACM SIGKDD International Confer-ence on Knowledge Discovery and Data Mining, 2008, pp. 462–470.

[17] Y.-Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong, “Analysis oftopological characteristics of huge online social networking services,”in International Conference on World Wide Web, 2007, pp. 835–844.

[18] J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graphs over time: Densi-fication laws, shrinking diameters and possible explanations,” in ACMSIGKDD International Conference on Knowledge Discovery in DataMining, 2005, pp. 177–187.

[19] T. Yang, Y. Chi, S. Zhu, Y. Gong, and R. Jin, “A bayesian approachtoward finding communities and their evolutions in dynamic socialnetworks,” in SIAM International Conference on Data Mining, 2009.

[20] S. Huang and D. Lee, “Exploring activity features in predicting socialnetwork evolution,” in IEEE International Conference on MachineLearning and Applications, 2011, pp. 7–15.

[21] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan, “Groupformation in large social networks: membership, growth, and evolution,”in ACM SIGKDD international conference on Knowledge discovery anddata mining, 2006, pp. 44–54.

[22] E. Zheleva, H. Sharara, and L. Getoor, “Co-evolution of social andaffiliation networks,” in ACM SIGKDD international conference onKnowledge discovery and data mining, 2009, pp. 1007–1016.

[23] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhat-tacharjee, “Measurement and analysis of online social networks,” inACM SIGCOMM conference on Internet measurement, 2007.

[24] A. Patil, J. Liu, and J. Gao, “Predicting group stability in online socialnetworks,” in International Conference on World Wide Web, 2013, pp.1021–1030.

[25] S. R. Kairam, D. J. Wang, and J. Leskovec, “The life and death of onlinegroups: Predicting group growth and longevity,” in ACM InternationalConference on Web Search and Data Mining, 2012, pp. 673–682.

[26] M. Goldberg, M. Magdon ismail, S. Nambirajan, and J. Thompson,“Tracking and predicting evolution of social communities,” in Interna-tional Conference on Social Computing, 2011, pp. 7–15.

[27] M. K. Goldberg, M. Magdon-Ismail, and J. Thompson, “Identifyinglong lived social communities using structural properties,” in Interna-tional Conference on Advances in Social Networks Analysis and Mining,2012, pp. 647–653.

[28] P. Brodka, P. Kazienko, and B. Koloszczyk, “Predicting group evolutionin the social network,” in International Conference on Social Informat-ics, 2012, pp. 54–67.

[29] B. Gliwa, P. Brodka, A. Zygmunt, S. Saganowski, P. Kazienko, andJ. Kozlak, “Different approaches to community evolution predictionin blogosphere,” in International Conference on Advances in SocialNetworks Analysis and Mining, 2013, pp. 1291–1298.

2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014)

16