The Semantic Evolution of Online Communities
THE SEMANTIC EVOLUTION OF ONLINE COMMUNITIES MATTHEW ROWE 1 AND MARKUS STROHMAIER 2 1. LANCASTER UNIVERSITY, LANCASTER, UK @MROWEBOT | [email protected] 2. UNIVERSITY OF KOBLENZ AND GESIS, COLOGNE, GERMANY @MSTROHM | [email protected] World Wide Web Conference 2014 Seoul, South Korea

Upload: matthew-rowe

Post on 30-Nov-2014

Category:

Social Media


DESCRIPTION

World Wide Web Conference 2014

TRANSCRIPT

Page 1: The Semantic Evolution of Online Communities

THE SEMANTIC EVOLUTION OF ONLINE COMMUNITIES
MATTHEW ROWE (1) AND MARKUS STROHMAIER (2)

1. LANCASTER UNIVERSITY, LANCASTER, UK @MROWEBOT | [email protected]
2. UNIVERSITY OF KOBLENZ AND GESIS, COLOGNE, GERMANY @MSTROHM | [email protected]

World Wide Web Conference 2014, Seoul, South Korea

Page 2: The Semantic Evolution of Online Communities

Studies of Online Community Evolution

Prior work has examined online community development based on:
- Social network properties (Gong et al., 2012), (Mislove et al., 2007)
- Social group formation (Backstrom et al., 2006)
- Lexical term usage and uptake (Danescu et al., 2013)

subsequent interpretation, of churners in online communities. Churners present a serious issue for community managers and hosts, as the departure of certain users can have a detrimental effect on the community (e.g. experts leaving a question-answering community can cause an increase in unanswered queries).

In this section we define churn prediction as a binary classification task and use the previously examined indicators of lifecycle trajectories to predict whether a user is a churner or not. As we confine user lifecycle periods from the start of their lifecycle to the end, we use the trajectories mined from this period to characterise how users develop. We define churners as any user who posts for the last time before the final 10% of the time window of our datasets; the cutoff points are: 2012-07-09 for Facebook, 2010-05-11 for SAP, and 2010-12-23 for ServerFault. Our dataset is of the following form: $D = \{(x_i, y_i)\}$, where $y_i$ denotes the class label of the user from one of two values, $y_i \in \{0, 1\}$ (see footnote 4), while $x_i$ denotes an 11-element real-valued feature vector for either a Facebook or SAP user, and a 10-element feature vector for a Server Fault user, given that we use a linear regression model for each such user's lexical community cross-entropy development. We model the feature vector of each user using the trajectory indicators from the previous section; in short, Table II defines our set of features, where we place each within a set depending on the dynamics it captures.
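The churner definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names, the 100-day window and the two users are hypothetical, and rounding the cutoff to whole days is an assumption.

```python
from datetime import date, timedelta


def churn_cutoff(window_start: date, window_end: date,
                 final_fraction: float = 0.10) -> date:
    """Return the date at which the final `final_fraction` of the window begins."""
    span_days = (window_end - window_start).days
    # Assumption: the cutoff is rounded to a whole number of days.
    return window_end - timedelta(days=round(span_days * final_fraction))


def label_churners(last_post: dict, cutoff: date) -> dict:
    """Label a user 1 (churner) if their last post precedes the cutoff, else 0."""
    return {user: 1 if last < cutoff else 0 for user, last in last_post.items()}


# Hypothetical 100-day window and users, for illustration only.
cutoff = churn_cutoff(date(2010, 1, 1), date(2010, 4, 11))
labels = label_churners(
    {"alice": date(2010, 3, 15), "bob": date(2010, 4, 5)},
    cutoff,
)
```

Here a user whose last post falls exactly on the cutoff date is treated as a non-churner, since the text says "before" the final 10%.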

Table II. Features used for the churn prediction experiments. The indicators of lifecycle trajectories are used to characterise user evolution along the different user properties.

| Property   | Indicator               | Model                | Feature(s) | Platform |
| In-degree  | Period Entropy          | Linear Regression    | slope      | All      |
| In-degree  | Period Cross-Entropy    | Exponential Decay    | decay rate | All      |
| In-degree  | Community Cross-Entropy | Quadratic Regression | a1, a2     | All      |
| Out-degree | Period Entropy          | Linear Regression    | slope      | All      |
| Out-degree | Period Cross-Entropy    | Exponential Decay    | decay rate | All      |
| Out-degree | Community Cross-Entropy | Linear Regression    | slope      | All      |
| Lexical    | Period Entropy          | Linear Regression    | slope      | All      |
| Lexical    | Period Cross-Entropy    | Exponential Decay    | decay rate | All      |
| Lexical    | Community Cross-Entropy | Quadratic Regression | a1, a2     | Fb, SAP  |
| Lexical    | Community Cross-Entropy | Linear Regression    | slope      | SF       |

A. Prediction Model Definition

The observed feature vector of user $u_i$ ($x_i$) contains the indicator trajectories of the user along the different properties. We use the logistic regression model to predict the conditional probability of user $u_i$ churning as follows:

$$\Pr(Y = 1 \mid x_i) = \frac{1}{1 + e^{-\beta^{\top} x_i}} \qquad (9)$$

The model's coefficients ($\beta$) define the weight attached to each identity trajectory feature within the linear model ($f(i) = \beta^{\top} x_i$). In order to derive the model's coefficients we use maximum likelihood fitting through the R statistical software package (see footnote 5) to select the maximum likelihood estimate $\hat{\beta}$ of the model's coefficients. Following fitting, the derived model is used to predict the churn probability of each user within the test dataset.

Footnote 4: 1 indicating churner, 0 not.
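To make the fitting step concrete, the sketch below maximises the logistic log-likelihood with plain gradient ascent in Python. It is not the R routine used in the excerpt: the optimiser, learning rate, epoch count and toy data are all assumptions chosen for illustration, and each feature row is assumed to carry a leading 1.0 so that the first coefficient acts as an intercept.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(beta, x) -> float:
    """Pr(Y = 1 | x) under the logistic model with coefficients beta."""
    return sigmoid(sum(b * v for b, v in zip(beta, x)))


def fit_logistic(X, y, lr: float = 0.1, epochs: int = 2000):
    """Gradient ascent on the mean log-likelihood (illustrative optimiser)."""
    beta = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(beta, xi)  # gradient term for one instance
            for j, v in enumerate(xi):
                grad[j] += err * v
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta


# Hypothetical, non-separable toy data: [intercept term, one feature].
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1, 1]
beta_hat = fit_logistic(X, y)
```

The toy labels are deliberately not perfectly separable, because the maximum likelihood estimate does not exist (the coefficients diverge) on separable data.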

B. Experimental Setup

For our experiments we first standardised the datasets by combining the test (20%) and training (80%) datasets together and setting each indicator feature to have a mean of 0 and a standard deviation of 1; we then divided the dataset again into the respective test and training splits, maintaining the same instances as before. We wanted to test the effects of observing different user properties and development dynamics on churn prediction. We therefore tested each user property in isolation, for instance using the in-degree property and the entropy, period cross-entropy and community cross-entropy trajectory indicators; then each development model in isolation, for instance using the entropy model and examining in-degree, out-degree, and term distributions; finally we combined all features together within a single model. In doing so we could isolate any effects of key features on prediction performance, and thus inform model selection for specific platforms (i.e. identifying the best performing model for Facebook, SAP and Server Fault).
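The standardisation step described above can be sketched as follows. This is a hedged illustration: the function name and the toy rows are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def standardise(train, test):
    """Z-score every feature over train+test combined, then split again."""
    combined = train + test
    n_features = len(combined[0])
    mus = [mean([row[j] for row in combined]) for j in range(n_features)]
    # Assumption: population standard deviation.
    sigmas = [pstdev([row[j] for row in combined]) for j in range(n_features)]

    def scale(row):
        # Constant features (sigma == 0) are mapped to 0.0 to avoid dividing by zero.
        return [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, mus, sigmas)]

    # The same instances stay in the same split, as in the text.
    return [scale(r) for r in train], [scale(r) for r in test]


# Hypothetical single-feature example.
train_z, test_z = standardise([[1.0], [2.0], [3.0]], [[4.0], [5.0]])
```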

As we used the logistic regression model for our prediction model, we are provided with a function whose co-domain is a churn probability value for a given user within the closed interval [0, 1]. Therefore we evaluated the performance of each induced model using two evaluation measures: (i) precision@k (P), and (ii) area under the receiver operating characteristic curve (AUC). To derive precision@k we ranked the users by their churn probability according to the induced model and then assessed the precision of the top-k ranks, setting $k \in \{1, 5, 10, 20, 50, 100\}$ and taking the mean of these precision values. This assesses the extent to which the upper portion of the predicted churners is correct. We used the baseline measure of the probability of a randomly selected user being a churner, thus corresponding to the probability of success in a single Bernoulli trial (setting $p = |churners| / |D_{test}|$). To derive the area under the receiver operating characteristic curve we varied the confidence of an indicator function ($f(x)$) through discrete settings of confidence bounds $\epsilon \in \{0, 0.05, \ldots, 0.95, 1\}$, thereby setting the class label for a given instance ($x$) as follows:

$$f(x) = \begin{cases} 1, & \text{if } \Pr(Y = 1 \mid x) > \epsilon & (10a) \\ 0, & \text{otherwise} & (10b) \end{cases}$$
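The precision@k measure and the random baseline described above can be sketched as below. The function names and toy scores are hypothetical, and dividing by the number of users actually ranked when k exceeds the dataset size is an assumption the text does not address.

```python
def precision_at_k(scores, labels, k: int) -> float:
    """Fraction of true churners among the k users with highest churn probability."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    # Assumption: if fewer than k users exist, divide by the number ranked.
    return sum(label for _, label in top) / len(top)


def mean_precision_at_ks(scores, labels, ks=(1, 5, 10, 20, 50, 100)) -> float:
    """Mean of precision@k over the k settings listed in the text."""
    return sum(precision_at_k(scores, labels, k) for k in ks) / len(ks)


def random_baseline(labels) -> float:
    """p = |churners| / |D_test|, the chance a random user is a churner."""
    return sum(labels) / len(labels)


# Hypothetical predicted probabilities and true labels.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
```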

For each different setting of the confidence bound ($\epsilon$) we measured the true positive rate (TPR/recall) and the false positive rate (FPR), and from these measures plotted the receiver operating characteristic (ROC) curve. A model which maximises the area

Footnote 5: http://www.r-project.org/
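The threshold sweep behind the ROC curve can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: the trapezoidal rule for the area and the explicit addition of the (0, 0) and (1, 1) endpoints are assumptions, and the test set is assumed to contain at least one churner and one non-churner.

```python
def roc_points(scores, labels, thresholds):
    """Return (FPR, TPR) pairs, one per confidence bound."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for eps in thresholds:
        predicted = [1 if s > eps else 0 for s in scores]  # indicator f(x)
        tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points) -> float:
    """Area under the ROC curve via the trapezoidal rule (assumed here)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Confidence bounds 0, 0.05, ..., 0.95, 1.
thresholds = [i / 20 for i in range(21)]

# Hypothetical scores with a perfect ranking of churners over non-churners.
auc = auc_trapezoid(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0], thresholds))
```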

Page 3: The Semantic Evolution of Online Communities

Work has yet to examine the semantic evolution of online communities

1.  Understand how semantic concepts emerge

2. Model the development of semantic structures over time

3.  Examine how communities differ from one another in their evolution

Page 4: The Semantic Evolution of Online Communities

Assessing Semantic Evolution

Define: Community Semantics
Entities, and their classes, discussed within a community over a given time-period, and the structure connecting those concepts.

Time

Page 5: The Semantic Evolution of Online Communities

Assessing Semantic Evolution

Assessing community semantics enables:
1. Characterisation of semantic evolution dynamics
2. Comparison based on community evolution
3. Forecasting churn rates from evolution signals

Our Contributions

Define: Community Semantics
Entities, and their classes, discussed within a community over a given time-period, and the structure connecting those concepts.

Page 6: The Semantic Evolution of Online Communities

Approach: Examining Semantic Evolution

1. Retrieve Posts and Extract Entities
2. Construct Time-delimited Semantic Graphs
   We have two flavours: (i) concept graphs (ii) entity graphs
3. Inspect Macro Evolution
   Applying graph measures over cumulative time-sensitive graphs
4. Induce Community-Specific Evolution Models
   Inducing logistic population models for each community and graph measure
5. Apply Mined Semantic Evolution Dynamics
   Using model dynamics as community motifs to: (i) inspect communities (ii) predict churn rate

Experiment Dataset: Boards.ie
- All posts during 2005-2008
- 93 communities (online forums)
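As a rough illustration of what a logistic population model looks like, the sketch below uses the textbook logistic growth curve $N(t) = K / (1 + \frac{K - N_0}{N_0} e^{-rt})$ and picks a growth rate by a naive grid search. The slides do not give the exact parameterisation or fitting method, so the function names, the parameters K, r and N_0, the grid search and the toy series are all assumptions.

```python
import math


def logistic_growth(t: float, capacity: float, rate: float, initial: float) -> float:
    """Textbook logistic population curve with carrying capacity K and rate r."""
    return capacity / (1.0 + ((capacity - initial) / initial) * math.exp(-rate * t))


def sse(series, capacity: float, rate: float, initial: float) -> float:
    """Sum of squared errors between an observed series and the curve."""
    return sum((obs - logistic_growth(t, capacity, rate, initial)) ** 2
               for t, obs in enumerate(series))


def fit_rate(series, capacity: float, initial: float, candidate_rates):
    """Naive grid search for the rate (a stand-in for a proper fit)."""
    return min(candidate_rates, key=lambda r: sse(series, capacity, r, initial))


# Hypothetical graph-measure series generated with r = 0.3.
observed = [logistic_growth(t, 100.0, 0.3, 10.0) for t in range(20)]
best_rate = fit_rate(observed, 100.0, 10.0, [0.1, 0.2, 0.3, 0.4])
```

In this reading, the fitted parameters for one community and one graph measure would be the "model dynamics" that later steps compare across communities.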

Page 7: The Semantic Evolution of Online Communities

Extracting Entities from Online Community Posts

- We are provided with a set of post quadruples: <u, s, t, f> ∈ P
- We constrain content to within a given interval [t', t'')
- Extract each post's entities within the time interval using TextRazor
  - No quota limit, and good prior performance (Derczynski et al., 2013)
- Result: time-sensitive community-discussed entities (DBPedia URIs)

Retrieve Posts and Extract Entities

clustering of users) and found high degrees of local clustering on the different platforms which contained densely populated subgroups of similar users. Recent work by Gong et al. [5] inspected the evolution of social networks on Google+ as the platform was growing in membership; in particular they focused on social-attribute networks (i.e. bipartite graphs containing people and their attributes as nodes), finding that the platform exhibited unique growth and characteristics of the networks as more people joined Google+. Leskovec et al. [7] modelled the development of social networks across four platforms (Flickr, Delicious, Yahoo! Answers and LinkedIn) by modelling the process of node arrival (users joining), edges being created and waiting times between edge creation. Previous work mostly ignored term-based and semantic information and concentrated on how the social networks evolved, not the communities. Although such works afford insights into the evolution of social networks, they do not consider how a community of users evolves semantically.

3. CHARACTERISING ONLINE COMMUNITIES WITH SEMANTIC GRAPHS

For our experiments we used data from the Irish community message board Boards.ie (see footnote 2). This is a general-discussion community message board that includes a set of hierarchically nested forums ($F$) in which posts are made, i.e. forum A can be a parent of forum B, and thus B contains specialised topics of discussion over A. Posts are provided as a set of quadruples $\langle u, s, t, f \rangle \in P$, where user $u$ posted message $s$ at time $t$ in forum $f$. A message ($s$) is composed of terms that we use to build the semantic models for individual communities. The information discussed in a community, and thus its semantics, can change over time; therefore we constrain a community's model to specific time snapshots, e.g. $t' \rightarrow t''$ where $t' < t''$. For this we use the following construct that filters through all relevant posts' contents within the allotted time window:

$$S_{t' t''} = \{ s : \langle u, s, t, f \rangle \in P,\ t' \le t < t'' \} \qquad (1)$$
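Equation (1) amounts to a half-open time-window filter over the post quadruples. A minimal Python sketch follows; the `Post` type, its field names and the sample posts are hypothetical, and a list is returned instead of a set so that repeated contents are preserved.

```python
from datetime import datetime
from typing import NamedTuple


class Post(NamedTuple):
    """One quadruple <u, s, t, f>: user, content, time, forum."""
    user: str
    content: str
    time: datetime
    forum: str


def contents_in_window(posts, t_start: datetime, t_end: datetime):
    """Return contents s of posts with t_start <= t < t_end, as in equation (1)."""
    return [p.content for p in posts if t_start <= p.time < t_end]


# Hypothetical posts, for illustration only.
posts = [
    Post("u1", "first post", datetime(2005, 1, 10), "f1"),
    Post("u2", "second post", datetime(2005, 2, 1), "f1"),
    Post("u3", "third post", datetime(2005, 3, 5), "f2"),
]
window = contents_in_window(posts, datetime(2005, 1, 1), datetime(2005, 2, 1))
```

As printed, equation (1) does not restrict the forum; a per-forum model would additionally filter on `p.forum`.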

Information discussed within online communities can be represented in terms of its semantics, using information from either the schema level (i.e. ontological classes and relations between them) or the data level (i.e. using entities and how they are related to one another). For the former we consider concepts to be classes found within the DBPedia Ontology, that is, the types of entities that users are discussing (e.g. people, locations, etc.), while for the latter case we consider DBPedia resources, i.e. entities themselves (e.g. dbpedia:Barrack_Obama). Given our set of post contents, $S_{t' t''}$, we derive concepts and entities from a forum over a time period as follows: we process each post content $s \in S_{t' t''}$ using an entity extraction tool to return the set of entities related to the content of $s$. Given the entities ($R_E$) returned for a given community forum over an allotted time period we then construct two types of semantic resource graphs: concept graphs, which function at the schema level and contain class information; and entity graphs, which function at the data level and contain information that relates entities to one another.

Footnote 2: http://www.boards.ie

3.1 Concept Graphs

A concept graph ($G_C$) is a type of semantic resource graph that contains the types of entities found within a given forum as vertices ($V$) and the relations between these classes as edges ($E$). For a given community forum $f$ we have a set of entities $R_E$ that were extracted over some period of time $t' \rightarrow t''$. Each entity, given that we are using DBPedia resource URIs, is typed according to one or more classes from the DBPedia ontology (see footnote 3). Therefore, to construct the set of concepts that are cited within a given forum $f$ over the allotted time period, we retrieve the classes that each entity is a type of and store these in the set $R_C$. From this set we generate a time-dependent concept graph $G_C[f, t', t''] = \langle V_f, E_f \rangle$, such that $G_C[f, t', t''] \subset G_{type}$, where $G_{type}$ denotes the DBPedia type graph formed from the class structure of the DBPedia ontology. In this context the set of concepts denotes the seed set and is used to populate the vertices in the graph and then construct edges between the vertices based on existing links between the concepts in the DBPedia type graph ($G_{type}$):

$$E_f = \{ (c_i, c_j) : c_i, c_j \in V_f,\ (c_i, c_j) \in E_{type} \} \qquad (2)$$

In order to derive the set of vertices we must consider how the seed set can be used for this process, as it is often the case that the set comprises concepts which are not directly connected to one another in the concept graph. To connect such concepts, and derive a fully connected concept graph (i.e. with no disconnected components), we extract the Root Path Graph as follows: from each concept ($c \in R_C$) we identify the parent concept (<c rdfs:subClassOf p>) and iteratively move up the concept graph until the root node is reached (owl:Thing), thereby returning the set of nodes that form the path from $c$ to the root node: $rootpath(c) = \{c, p, \ldots, \text{owl:Thing}\}$. The graph's vertices are therefore derived by taking the union of all paths to the root returned from each concept within the seed set:

$$V_f = \bigcup_{c \in R_C} rootpath(c) \qquad (3)$$
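The root-path construction of the vertex set, together with the induced edge set, can be sketched in Python as below. This is an illustration under simplifying assumptions: the class hierarchy is a small hand-written dictionary rather than the DBPedia ontology, each class is assumed to have exactly one parent, and the `dbo:` class names and function names are hypothetical.

```python
def rootpath(concept: str, parent_of: dict, root: str = "owl:Thing"):
    """Follow subClassOf links from `concept` up to the root node."""
    path = [concept]
    while path[-1] != root:
        path.append(parent_of[path[-1]])
    return path


def build_concept_graph(seed_concepts, parent_of: dict, type_edges):
    """Vertices: union of root paths. Edges: type-graph edges between vertices."""
    vertices = set()
    for c in seed_concepts:
        vertices.update(rootpath(c, parent_of))
    edges = {(ci, cj) for (ci, cj) in type_edges if ci in vertices and cj in vertices}
    return vertices, edges


# Hypothetical toy hierarchy (child -> parent), not the real ontology.
parent_of = {
    "dbo:Politician": "dbo:Person",
    "dbo:Athlete": "dbo:Person",
    "dbo:Person": "dbo:Agent",
    "dbo:Agent": "owl:Thing",
    "dbo:City": "dbo:Place",
    "dbo:Place": "owl:Thing",
}
type_edges = set(parent_of.items())
V_f, E_f = build_concept_graph({"dbo:Politician", "dbo:City"}, parent_of, type_edges)
```

Because every root path ends at owl:Thing, the resulting graph has a single connected component even when the seed concepts are unrelated.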

3.2 Entity Graphs

An entity graph ($G_E$) is a type of semantic resource graph where the vertices ($V$) are entities and the set of edges ($E$) connecting these entities are relations between them derived from the Web of Linked Data. We define the entity graph as $G_E[f, t', t''] = \langle V_f, E_f \rangle$, such that $G_E[f, t', t''] \subset G_{entity}$, where $G_{entity}$ denotes the DBPedia entity graph containing relations between entities at the data level. As we are provided with a collection of time-delimited entities $R_E$ for a given forum, we query DBPedia for links between entity pairs and add such edges to the graph where such a link exists:

$$E_f = \{ (r_i, r_j) : r_i, r_j \in V_f,\ (r_i, r_j) \in E_{entity} \} \qquad (4)$$

Given this edge construction mechanism we only look for relations one hop away in the entity graph, that is: given $R_E$ we only look for relations between elements in the set. This could be extended to include 2-hop relations; however, we are interested in how entities in the communities are connected to one another directly. In this work, the vertices in the entity graph are thus those entities which are found to be connected to one another directly via 1-hop distances.

Footnote 3: http://dbpedia.org/Ontology
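The 1-hop entity graph construction can be sketched as follows. The paper queries DBPedia for links; here a small in-memory set of (subject, object) pairs stands in for that lookup, and the entity names and function name are hypothetical.

```python
def build_entity_graph(entities, entity_links):
    """Keep only links whose two endpoints were both extracted for the forum.

    Vertices are the entities that take part in at least one such link,
    matching the text's statement that vertices are directly connected entities.
    """
    edges = {(ri, rj) for (ri, rj) in entity_links
             if ri in entities and rj in entities}
    vertices = {r for edge in edges for r in edge}
    return vertices, edges


# Hypothetical extracted entities R_E and a stand-in for DBPedia links.
R_E = {"dbpedia:Dublin", "dbpedia:Ireland", "dbpedia:Guinness", "dbpedia:Jazz"}
links = {
    ("dbpedia:Dublin", "dbpedia:Ireland"),
    ("dbpedia:Guinness", "dbpedia:Dublin"),
    ("dbpedia:Dublin", "dbpedia:Europe"),  # dropped: endpoint not in R_E
}
V_f, E_f = build_entity_graph(R_E, links)
```

In this toy run `dbpedia:Jazz` is extracted but has no direct link to another extracted entity, so it does not appear as a vertex.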

User u posted content s at time t in forum f

Page 8: The Semantic Evolution of Online Communities

Building Semantic Resource Graphs

1. Concept graphs
   - Nodes: Union of rootpaths from each entity's class to the owl:Thing class
   - Edges: Between nodes within the DBPedia Ontology
2. Entity graphs
   - Nodes: Entities provided by time-specific post extractions
   - Edges: Relations between entity pairs within the DBPedia linked data graph

Construct Time-delimited Semantic Graphs


3.2 Entity GraphsAn entity graph (GE) is a type of semantic resource graph

where the vertices (V ) are entities and the set of edges (E)connecting these entities are relations between them derivedfrom the Web of Linked Data. We define the entity graphas GE [f, t

0, t00] = hVf , Ef i, such that GE [f, t0, t00] ⇢ Gentity

- where Gentity denotes the DBPedia entity graph contain-ing relations between entities at the data level. As we areprovided with a collection of time-delimited entities RE fora given forum, we query DBPedia for links between entitypairs and add such edges to the graph where such a linkexists:

Ef = {(ri, rj) : ri, rj 2 Vf , (ri, rj) 2 Eentity} (4)

Given this edge construction mechanism we only look forrelations one-hop away in the entity graph, that is: givenRE we only look for relations between elements in the set.This could be extended to include 2-hop relations, howeverwe are interested in how entities in the communities areconnected to one another directly. In this work, the verticesin the entity graph are thus those entities which are foundto be connected to one another directly via 1-hop distances.3http://dbpedia.org/Ontology

clustering of users) and found high degrees of local clustering on the different platforms which contained densely populated subgroups of similar users. Recent work by Gong et al. [5] inspected the evolution of social networks on Google+ as the platform was growing in membership; in particular they focused on social-attribute networks (i.e. bipartite graphs containing people and their attributes as nodes), finding that the platform exhibited unique growth and characteristics of the networks as more people joined Google+. Leskovec et al. [7] modelled the development of social networks across four platforms (Flickr, Delicious, Yahoo! Answers and LinkedIn) by modelling the process of node arrival (users joining), edges being created and waiting times between edge creation. Previous work mostly ignored term-based and semantic information and concentrated on how the social networks evolved, not the communities. Although such works afford insights into the evolution of social networks, they do not consider how a community of users evolves semantically.

3. CHARACTERISING ONLINE COMMUNITIES WITH SEMANTIC GRAPHS

For our experiments we used data from the Irish community message board Boards.ie (see footnote 2). This is a general-discussion community message board that includes a set of hierarchically nested forums ($F$) in which posts are made: forum A can be a parent of forum B, in which case B contains topics of discussion that specialise those of A. Posts are provided as a set of quadruples $\langle u, s, t, f \rangle \in P$, where user $u$ posted message $s$ at time $t$ in forum $f$. A message ($s$) is composed of terms that we use to build the semantic models for individual communities. The information discussed in a community, and thus its semantics, can change over time; we therefore constrain a community's model to specific time snapshots, e.g. $t' \to t''$ where $t' < t''$. For this we use the following construct, which collects the contents of all relevant posts within the allotted time window:

$$S_{t't''} = \{ s : \langle u, s, t, f \rangle \in P,\ t' \le t < t'' \} \tag{1}$$
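As a minimal sketch of the time-window construct in Eq. (1), the snippet below filters a list of post quadruples by timestamp. The `Post` type, the `window_contents` helper and the sample posts are hypothetical and exist only for illustration; the window is assumed to be half-open, i.e. it includes the start time and excludes the end time.

```python
from datetime import datetime
from typing import NamedTuple


class Post(NamedTuple):
    """One quadruple <u, s, t, f> (hypothetical container, not from the paper)."""
    user: str       # u
    content: str    # s
    time: datetime  # t
    forum: str      # f


def window_contents(posts, start, end):
    """Return the contents s of all posts with start <= t < end."""
    return [p.content for p in posts if start <= p.time < end]


# Invented sample data, for illustration only.
posts = [
    Post("alice", "Dublin is lovely in spring", datetime(2005, 1, 3), "travel"),
    Post("bob", "Anyone watching the match?", datetime(2005, 1, 10), "sport"),
    Post("carol", "New phone recommendations", datetime(2005, 1, 17), "tech"),
]

selected = window_contents(posts, datetime(2005, 1, 3), datetime(2005, 1, 17))
print(selected)
```

The sketch returns a list rather than a set so that duplicate messages are preserved; whether duplicates should be collapsed is a design choice the text does not settle.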

Information discussed within online communities can be represented in terms of its semantics, using information from either the schema level (i.e. ontological classes and relations between them) or the data level (i.e. entities and how they are related to one another). For the former we consider concepts to be classes found within the DBPedia Ontology, that is: the types of entities that users are discussing (e.g. people, locations, etc.), while for the latter we consider DBPedia resources, i.e. entities themselves (e.g. dbpedia:Barack_Obama). Given our set of post contents, $S_{t't''}$, we derive concepts and entities from a forum over a time period as follows: we process each post content $s \in S_{t't''}$ using an entity extraction tool to return the set of entities related to the content of $s$. Given the entities ($R_E$) returned for a given community forum over an allotted time period we then construct two types of semantic resource graphs: concept graphs, which function at the schema level and contain class information; and entity graphs, which function at the data level and contain information that relates entities to one another.

Footnote 2: http://www.boards.ie

3.1 Concept Graphs

A concept graph ($G_C$) is a type of semantic resource graph that contains the types of entities found within a given forum as vertices ($V$) and the relations between these classes as edges ($E$). For a given community forum $f$ we have a set of entities $R_E$ that were extracted over some period of time $t' \to t''$. Each entity, given that we are using DBPedia resource URIs, is typed according to one or more classes from the DBPedia ontology (see footnote 3). Therefore, to construct the set of concepts that are cited within a given forum $f$ over the allotted time period, we retrieve the classes that each entity is a type of and store these in the set $R_C$. From this set we generate a time-dependent concept graph $G_C[f, t', t''] = \langle V_f, E_f \rangle$, such that $G_C[f, t', t''] \subset G_{type}$, where $G_{type}$ denotes the DBPedia type graph formed from the class structure of the DBPedia ontology. In this context the set of concepts is the seed set: it is used to populate the vertices in the graph, and edges are then constructed between the vertices based on existing links between the concepts in the DBPedia type graph ($G_{type}$):

$$E_f = \{ (c_i, c_j) : c_i, c_j \in V_f,\ (c_i, c_j) \in E_{type} \} \tag{2}$$

In order to derive the set of vertices we must consider how the seed set can be used, as the set often comprises concepts which are not directly connected to one another in the concept graph. To connect such concepts, and derive a connected concept graph (i.e. with no disconnected components), we extract the Root Path Graph as follows: from each concept ($c \in R_C$) we identify the parent concept (<c rdfs:subClassOf p>) and iteratively move up the concept graph until the root node (owl:Thing) is reached, thereby returning the set of nodes that form the path from $c$ to the root node: $rootpath(c) = \{c, p, \ldots, \text{owl:Thing}\}$. The graph's vertices are therefore derived by taking the union of all paths to the root returned from each concept within the seed set:

$$V_f = \bigcup_{c \in R_C} rootpath(c) \tag{3}$$
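The root-path construction of Eqs. (2) and (3) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the class hierarchy is a small toy stand-in for the DBPedia type graph, each class is assumed to have exactly one parent, and the helper names `rootpath` and `concept_graph` are hypothetical.

```python
def rootpath(concept, parent_of, root="owl:Thing"):
    """Follow subclass links upwards from `concept` until the root is reached."""
    path = [concept]
    while path[-1] != root:
        path.append(parent_of[path[-1]])
    return path


def concept_graph(seed, parent_of, type_edges):
    """Build (V_f, E_f): union of root paths, plus type-graph edges among them."""
    vertices = set()
    for concept in seed:
        vertices.update(rootpath(concept, parent_of))
    edges = {(a, b) for (a, b) in type_edges if a in vertices and b in vertices}
    return vertices, edges


# Toy hierarchy (child -> parent), simplified for illustration only.
parent_of = {
    "dbo:Politician": "dbo:Person",
    "dbo:Person": "dbo:Agent",
    "dbo:Agent": "owl:Thing",
    "dbo:City": "dbo:Place",
    "dbo:Place": "owl:Thing",
}
type_edges = set(parent_of.items())  # stands in for the edge set of the type graph

vertices, edges = concept_graph({"dbo:Politician", "dbo:City"}, parent_of, type_edges)
print(sorted(vertices))
```

Because every root path ends at owl:Thing, the two seed concepts end up in a single connected component even though they share no direct link.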

3.2 Entity Graphs

An entity graph ($G_E$) is a type of semantic resource graph where the vertices ($V$) are entities and the edges ($E$) connecting them are relations derived from the Web of Linked Data. We define the entity graph as $G_E[f, t', t''] = \langle V_f, E_f \rangle$, such that $G_E[f, t', t''] \subset G_{entity}$, where $G_{entity}$ denotes the DBPedia entity graph containing relations between entities at the data level. As we are provided with a collection of time-delimited entities $R_E$ for a given forum, we query DBPedia for links between entity pairs and add an edge to the graph wherever such a link exists:

$$E_f = \{ (r_i, r_j) : r_i, r_j \in V_f,\ (r_i, r_j) \in E_{entity} \} \tag{4}$$

Given this edge construction mechanism we only look for relations one hop away in the entity graph, that is: given $R_E$ we only look for relations between elements in the set. This could be extended to include 2-hop relations; however, we are interested in how entities in the communities are connected to one another directly. In this work, the vertices in the entity graph are thus those entities which are found to be connected to one another directly via 1-hop distances.

Footnote 3: http://dbpedia.org/Ontology
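A minimal sketch of the 1-hop edge construction in Eq. (4) is given below. The set `linked_pairs` is a hypothetical stand-in for the result of querying DBPedia for links between entity pairs, and the helper name `entity_graph` is likewise an assumption made for illustration.

```python
def entity_graph(entities, linked_pairs):
    """Keep only links whose two endpoints were both extracted for the forum.

    Vertices are the entities that take part in at least one such link,
    matching the 1-hop restriction described in the text.
    """
    edges = {
        (a, b)
        for (a, b) in linked_pairs
        if a in entities and b in entities and a != b
    }
    vertices = {entity for edge in edges for entity in edge}
    return vertices, edges


# Invented sample data, for illustration only.
extracted = {"dbpedia:Dublin", "dbpedia:Ireland", "dbpedia:Guinness"}
linked_pairs = {
    ("dbpedia:Dublin", "dbpedia:Ireland"),
    ("dbpedia:Ireland", "dbpedia:Europe"),  # dropped: Europe was not extracted
}

vertices, edges = entity_graph(extracted, linked_pairs)
print(sorted(vertices), sorted(edges))
```

In this toy example dbpedia:Guinness is extracted but has no direct link to another extracted entity, so it does not become a vertex.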

Page 9: The Semantic Evolution of Online Communities

Measuring Graph Dynamics


1. Node Count: size of the graph
2. Diameter: breadth of the graph
3. Specialisation Count: class specialisations
4. Graph Entropy: density of the graph
5. Clustering Coefficient: cliquishness of the graph

Inspect Macro Evolution

Computation of the measures is described within the paper

Page 10: The Semantic Evolution of Online Communities


Macro Evolution

Cumulative Time Interval

Entropy of the Concept Graph

Showing the mean graph entropy across all communities and the 95% Confidence Interval

Figure 1: Concept graphs' evolution based on node counts, graph entropy and specialisations. Panels: (a) Node Count, |V| against timestep; (b) Graph Entropy, H(G) against timestep; (c) Specialisations, specialisation count against timestep. Timesteps run from 0 to 120.

ing that the density of the graph grows as more entities are added and thus more connections are possible between them.

Summary: We can summarise the following salient findings: (i) for concept graphs, node count, specialisation count and density (graph entropy) tend to converge to a limit; (ii) for entity graphs, the diameter, graph entropy and clustering coefficient tend to converge to a limit, while the node count (number of entities) increases linearly; (iii) despite new entities arriving at a constant linear rate, on average, the number of concepts tends to converge on a maximum.

5. MODELLING SEMANTIC EVOLUTION

In the previous section, we found that the concept graphs and entity graphs tend to evolve in a convergent manner: that is, for different measures they tend to evolve towards a limit. Such limiting evolution has been found previously in population models, where a given population has a carrying capacity that it evolves towards at differing proportionate growth rates. These proportionate growth rates slow down over time as the population tends towards a limit (the carrying capacity); we see this in the tapering curves in the aforementioned graphs. An immediate question that arises from this effect is: how do the communities differ in terms of evolution rates? To answer this question we used logistic population models that contain: (i) the growth rate of the graphs ($r$), and (ii) the carrying capacity of the graphs ($E$). Each of these variables can be used to characterise the community forums (93 forums in total) in terms of their semantic evolution given a measure (e.g. node count in the concept graph). To derive the variables $r$ and $E$ for a given community forum and graph measure ($m$) we derive a set of time steps ($T$) at which the graph measure increases:

$$T = \{ a : a \in [1, 119],\ m(G_{1,a+1}) > m(G_{1,a}) \} \tag{7}$$

Deriving the set of change time steps for a given community allows the proportionate growth rate for a given time step ($t \in T$) to be derived: $R_t = (P_{t+1} - P_t)/P_t$. This value is equivalent to the following equation, which defines the proportionate growth rate $R_t$ in terms of the community's growth rate ($r$) and carrying capacity ($E$), our unknown variables: $R_t = r(1 - P_t/E)$. Therefore, if we measure the proportionate growth rate over the $|T|$ distinct time steps then we can derive, via simultaneous equations, the growth rate of the graph and its carrying capacity, the very measures that we can use to characterise the semantic evolution of a given online community based on a single graph measure. We exclude the derivation of the equations from the paper, but it suffices to note that given $|T|$ time steps we have a single equation for each time step ($t \in T$): $R_t r^{-1} + P_t E^{-1} = 1$. We can then solve for the unknown variables $r$ and $E$ using the QR-decomposition of a matrix, expressing the left-hand side of the simultaneous equations as a $|T| \times 2$ matrix and the right-hand side as a $|T|$-element vector where each element is 1. We induced logistic population models for each of the graph measures (aside from entity graph node count) and examined how the growth rates and carrying capacities were distributed; we found all models to be suitable fits at the 1% significance level using the chi-squared goodness-of-fit test.

We omitted the plots showing the distribution of growth rates and carrying capacities for communities' concept and entity graphs' measures; however, these distributions demonstrated the following:

• All communities evolve to cover roughly 80% of the ontological classes (total number is 359); however, some communities converge on the maximum quicker than others, as demonstrated by a high growth rate for a small number of communities.

• The majority of communities' concept graphs become denser at a slow rate, while a few communities' become denser quickly, thus suggesting that users tend to discuss concepts that are orthogonal and not related to one another in the majority of communities.

• Communities' entity graphs show variance in the rate by which entities are discussed for the first time within the communities; however, such rates are linear: fitting a per-community linear regression model regressing the week count on the node count yielded a lowest coefficient of determination of 0.3 for one community.
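The fitting procedure described in this section (one equation $R_t r^{-1} + P_t E^{-1} = 1$ per growth step, solved through a QR-decomposition) can be sketched in plain Python as below. This is a hedged illustration rather than the authors' code: the two-column QR factorisation is done with Gram-Schmidt, time steps are 0-based, the measure is assumed to be strictly positive, and the helper names and synthetic series are hypothetical.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def growth_steps(measure):
    """Indices a at which the measure increases from step a to step a + 1 (the set T)."""
    return [a for a in range(len(measure) - 1) if measure[a + 1] > measure[a]]


def fit_logistic(measure):
    """Estimate (r, E) by least squares on R_t * (1/r) + P_t * (1/E) = 1."""
    steps = growth_steps(measure)
    col_r = [(measure[a + 1] - measure[a]) / measure[a] for a in steps]  # R_t
    col_p = [float(measure[a]) for a in steps]                           # P_t
    ones = [1.0] * len(steps)

    # Gram-Schmidt QR of the |T| x 2 matrix [col_r, col_p].
    r11 = math.sqrt(dot(col_r, col_r))
    q1 = [x / r11 for x in col_r]
    r12 = dot(q1, col_p)
    resid = [p - r12 * q for p, q in zip(col_p, q1)]
    r22 = math.sqrt(dot(resid, resid))
    q2 = [x / r22 for x in resid]

    # Back-substitution on the upper-triangular system R z = Q^T 1.
    inv_e = dot(q2, ones) / r22
    inv_r = (dot(q1, ones) - r12 * inv_e) / r11
    return 1.0 / inv_r, 1.0 / inv_e


# Synthetic logistic series with known parameters r = 0.5 and E = 300.
series = [150.0]
for _ in range(10):
    p = series[-1]
    series.append(p + 0.5 * p * (1 - p / 300.0))

r_hat, e_hat = fit_logistic(series)
print(r_hat, e_hat)
```

On noise-free synthetic data the estimates should match the generating parameters up to floating-point error; on real graph measures the system is over-determined and the result is a least-squares fit.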

6. APPLICATIONS

We now demonstrate the utility of graph-based approaches via two applications: community analysis, and churn rate prediction.

6.1 Community Analysis

To more closely examine the differences between community forums in terms of their semantic evolution we characterised each community in our dataset by its semantic dynamics motif over the 120-week analysis period: that is, the observed evolution measures derived from the concept


Page 11: The Semantic Evolution of Online Communities


Macro Evolution: Concept Graphs


Node count, graph entropy, and specialisations converge to a limit

Page 12: The Semantic Evolution of Online Communities


Macro Evolution: Entity Graphs

Figure 2: Entity graphs' evolution based on node counts, diameter, graph entropy, and clustering coefficient. Panels: (a) Node Count, |V| against timestep, with a fitted linear model (β=9.932); (b) Diameter against timestep; (c) Graph Entropy, H(G) against timestep; (d) Clustering Coefficient, CC against timestep. Timesteps run from 0 to 120.

and entity graphs defined within a single vector for a given community ($f$): $m_f = \{m_1, m_2, \ldots, m_{13}\}$. We began by deriving a single matrix $M \in \mathbb{R}^{93 \times 13}$ containing the 93 communities (rows) under analysis together with their 13 evolution measures (columns), and performing principal component analysis over this matrix. The result of this analysis is shown in Fig. 3, where we have colour-coded the different community forums by their hierarchical level in the platform (level 2 = most general, level 4 = most specific). We found that several of the more general forums appeared as outliers in the plot and thus exhibited unique evolution dynamics, while the two level 4 forums (forums 556 and 554) were bunched together, suggesting that they follow similar trends. We further examined the semantic motifs of three outlier communities from differing levels:

• Level 2: Forum 7 - After Hours. A random discussion forum.

• Level 3: Forum 227 - Television. Discussions about television.

• Level 4: Forum 554 - Wanted Motors. Discussions about cars and car parts.

Fig. 3 presents clear differences between the forums. For the concept graph dynamics (CG), the growth rate of the node count is lowest for the After Hours forum but its node count equilibrium is highest, indicating that the more general the forum, the slower the growth of the concept graph but the higher the maximum graph size. The same holds for the specialisation count in the concept graphs: the After Hours forum exhibits a slower rate of growth but a greater carrying capacity. In terms of the entity graph, the slope of node count growth described by the linear model is highest for After Hours, indicating that, compared to the other two forums, the rate at which new entities are cited by the community of users is much greater, while for the more topically specific Wanted Motors forum it is much lower. The entity graph equilibrium is also highest for After Hours and lowest for Wanted Motors, indicating that the more general a forum is, the greater the carrying capacity of its entity graph and the greater the number of entities that will be discussed.
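To illustrate the principal component analysis step, the sketch below centres a small forums-by-measures matrix and extracts its first principal component by power iteration on the covariance matrix. It is a simplified stand-in, not the authors' procedure: the text does not say which PCA implementation or scaling was used, only the first component is computed here (the plot uses two), and the matrix values and the helper name `first_principal_component` are invented for illustration.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component.

    rows: one list of measures per community (a toy version of the matrix M).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix of the centred measures.
    cov = [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Power iteration: repeatedly apply the covariance matrix and renormalise.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, scores


# Toy matrix: 4 communities x 3 measures, values invented for illustration.
motifs = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [3.0, 6.0, 0.0],
    [4.0, 8.0, 0.0],
]

direction, scores = first_principal_component(motifs)
print(direction, scores)
```

Communities with similar motifs receive similar scores along the component, which is the property the PCA plot relies on when forums appear bunched together or as outliers.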

6.2 Churn Rate Prediction

To examine the link between the semantic evolution of online communities and their social properties, we defined a prediction task in which we used the semantic evolution dynamics of a given community at time step $t$ to predict the churn rate of community members at time $t + 1$. We defined the churn rate of a community as the proportion of active users during a given time period (i.e. week segment) that post for the last time. We used the semantic dynamics motifs from the prior experiment (as listed within Fig. 3) and also included graph measures at a given time period: i.e. graph entropy at time $t$, specialisation count at time $t$, etc. We derived these features for every time step for each community and derived the response variable as the churn rate at the following time step. We then compiled a training dataset (up to week 120) and a test dataset (from week 120). Each dataset had the following form: $D = \{(x_i, y_i)\}$, where $x_i$ contained a 21-element time-delimited feature vector for a given community and $y_i$ was the churn rate of the community at the following time step. We trained a ridge regression model using $D_{train}$ and applied it to $D_{test}$, testing the performance of: a) just concept graph features, b) just entity graph features, and c) all features. An autoregressive model was used as the baseline, using the churn rate at time $t$ as a single predictor variable for the churn rate at time $t + 1$. Performance was evaluated using the Root Mean Square Error (RMSE).

Table 1: Root Mean Square Error when predicting churn rates using: an Autoregressive model ($R^2 = 0.341$) and a Ridge Regression model using Concept Graph, Entity Graph, and all features.

Baseline                 Concept Graph            Entity Graph             All Features
$7.310 \times 10^{-3}$   $5.315 \times 10^{-3}$   $5.301 \times 10^{-3}$   $4.941 \times 10^{-3}$

Table 1 presents the results from our prediction experiment. For all tested models (concept graph, entity graph, all features) we significantly outperformed the baseline, as tested using the sign test ($\alpha = 0.001$). Entity graph features outperform concept graph features, but not significantly, while our best model uses all features together in a single model. These results empirically demonstrate the utility of semantic evolution dynamics in predicting community churn rates, and suggest a link between how the communities develop semantically and the likelihood of users leaving the communities.
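The churn-rate definition, the autoregressive baseline and the RMSE evaluation from this section can be sketched as follows. This covers the baseline only, not the ridge regression models; the helper names and the sample numbers are hypothetical, and the baseline is assumed to be an ordinary least-squares fit of the churn rate at $t + 1$ on the churn rate at $t$.

```python
import math


def churn_rate(active_users, last_post_users):
    """Proportion of users active in a period who post for the last time in it."""
    return len(active_users & last_post_users) / len(active_users)


def fit_ar1(series):
    """Least-squares fit of y[t + 1] = a + b * y[t]; returns (a, b)."""
    x, y = series[:-1], series[1:]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def rmse(predicted, actual):
    """Root Mean Square Error between two equally long sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Invented weekly churn rates, for illustration only.
train = [0.010, 0.012, 0.014, 0.016]
test = [0.016, 0.018, 0.020]

a, b = fit_ar1(train)
predictions = [a + b * rate for rate in test[:-1]]
error = rmse(predictions, test[1:])
print(churn_rate({"u1", "u2", "u3", "u4"}, {"u2"}), a, b, error)
```

A feature-based model would be evaluated in the same way, by replacing `predictions` with its own outputs for the test weeks and comparing the resulting RMSE against the baseline.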

Diameter, graph entropy, and clustering coefficient converge to a limit

Page 13: The Semantic Evolution of Online Communities

Modelling the Semantic Evolution of Individual Communities


• For each community and measure, induce a logistic population model:

1. Get the set of measure increase time steps (T)
2. Derive the proportionate growth rate for each step
3. Use the set of proportionate growth rates to solve for unknown community-specific evolution variables (r & E)


Induce Community-Specific Evolution Models

$R_t = r(1 - P_t/E)$, where $R_t$ is the proportionate growth rate at $t$, $r$ is the population growth rate, $P_t$ is the population (measure) size at $t$, and $E$ is the carrying capacity (equilibrium).

All communities' models fit with χ²-test at α < 0.01

Page 14: The Semantic Evolution of Online Communities


Semantic Evolution… what can we actually use it for?

Page 15: The Semantic Evolution of Online Communities


Application: Community Evolution Analysis


Apply Mined Semantic Evolution Dynamics

Figure 3: PCA plot of the communities based on their semantic motifs (left), where level 4 forums are clustered together, and model values for the concept graphs (CG) and entity graphs (EG) for the three outlier forums from the three levels (right). Left panel: PC1 against PC2, with each forum labelled by its identifier and coloured by level (Level 2, Level 3, Level 4). Right panel: values on a logarithmic axis ($10^{-1}$ to $10^{3}$) for f7 (After Hours), f227 (Television) and f554 (Wanted Motors) across 13 measures: CG node count rate and equilibrium; CG graph entropy rate and equilibrium; CG specialisation count rate and equilibrium; EG node count slope; EG diameter rate and equilibrium; EG entropy rate and equilibrium; EG clustering coefficient rate and equilibrium.


[Figure 2, four panels plotted against timestep (0 to 120): (a) Node Count, |V|, with fitted linear model (β=9.932); (b) Diameter; (c) Graph Entropy, H(G); (d) Clustering Coefficient, CC.]

Figure 2: Entity graphs’ evolution based on node counts, diameter, graph entropy, and clustering coefficient.

and entity graphs defined within a single vector for a given community ($f$): $m_f = \{m_1, m_2, \ldots, m_{13}\}$. We began by deriving a single matrix $M \in \mathbb{R}^{93 \times 13}$ containing the 93 communities (rows) under analysis together with their 13 evolution measures (columns), and performing principal component analysis over this matrix. The resulting projection is shown in Fig. 3, where we have colour-coded the different community forums by their hierarchical level in the platform (level 2 = most general, level 4 = most specific). We found that several of the more general forums appeared as outliers in the plot and thus exhibited unique evolution dynamics, while the two level 4 forums (forums 556 and 554) were bunched together, suggesting that they follow similar trends. We further examined the semantic motifs of three outlier communities from differing levels:

• Level 2: Forum 7 - After Hours. A random discussion forum.

• Level 3: Forum 227 - Television. Discussions about television.

• Level 4: Forum 554 - Wanted Motors. Discussions about cars and car parts.

Fig. 3 presents clear differences between the forums. For the concept graph (CG) dynamics, the rate of the node count is lowest for the After Hours forum but the node count equilibrium is highest, indicating that the more general the forum, the slower the growth of the concept graph but the higher the maximum graph size. The same holds for the specialisation count in the concept graphs: the After Hours forum exhibits a slower rate of growth but a greater carrying capacity towards which the graph is tending. For the entity graph, the slope of node count growth described by the linear model is highest for After Hours, indicating that the rate at which new entities are cited by the community's users is much greater than in the other two forums, while for the more topically specific Wanted Motors forum it is much lower. The entity graph equilibrium is also highest for After Hours and lowest for Wanted Motors, indicating that the more general a forum is, the greater the carrying capacity of its entity graph and the greater the number of entities that will be discussed.
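The principal component analysis described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it uses NumPy, fills the 93 × 13 matrix with random placeholder values, and centres but does not rescale the columns (whether the original analysis standardised the measures is not stated). The name `pca_project` is hypothetical.

```python
import numpy as np


def pca_project(M, n_components=2):
    """Project the rows of M onto the top principal components.

    Columns are mean-centred but not rescaled (an assumption).
    Returns (scores, explained_variance_ratio).
    """
    centred = M - M.mean(axis=0)
    # SVD of the centred matrix: the rows of Vt are the principal axes,
    # ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]


# Placeholder motif matrix: 93 communities (rows) x 13 evolution measures (columns).
rng = np.random.default_rng(0)
M = rng.normal(size=(93, 13))

scores, explained = pca_project(M)
print(scores.shape)   # one (PC1, PC2) coordinate pair per community
print(explained)      # share of variance captured by PC1 and PC2
```

Plotting the two columns of `scores` against each other, coloured by forum level, would give a scatter of the kind shown in the left panel of Fig. 3.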

6.2 Churn Rate Prediction
To examine the link between the semantic evolution of online communities and their social properties, we defined a prediction task in which we used the semantic evolution dynamics of a given community at time step $t$ to predict the churn rate of community members at time $t + 1$. We defined the churn rate of a community as the proportion of active users during a given time period (i.e. week segment) that post for the last time. We used the semantic dynamics motifs from the prior experiment (as listed within Fig. 3) and also included graph measures at a given time period: e.g. graph entropy at time $t$ and specialisation count at time $t$. We derived these features for every time step for each community and derived the response variable as the churn rate at the following time step. We then compiled a training dataset (up to week 120) and a test dataset (from week 120). Each dataset had the following form: $D = \{(x_i, y_i)\}$, where $x_i$ contained a 21-element time-delimited feature vector for a given community and $y_i$ was the churn rate of the community at the following time step. We trained a ridge regression model ( ) using $D_{train}$ and applied it to $D_{test}$, testing the performance of: a) just concept graph features, b) just entity graph features, and c) all features. An autoregressive model was used as the baseline, using the churn rate at time $t$ as a single predictor variable for the churn rate at time $t + 1$. Performance was evaluated using the Root Mean Square Error (RMSE).
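The prediction setup above can be sketched with NumPy as follows. This is a hedged illustration, not the authors' implementation: the data are synthetic and cover a single hypothetical community, the regularisation strength `lam` is an assumed value (the paper's setting did not survive extraction), and the names `fit_ridge`, `fit_ar1` and `rmse` are ours.

```python
import numpy as np


def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + lam * I) w = Xc^T yc for the weight vector w.
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


def fit_ar1(churn):
    """Baseline: fit churn[t+1] = a + c * churn[t] by least squares."""
    c, a = np.polyfit(churn[:-1], churn[1:], 1)
    return a, c


def rmse(y_true, y_pred):
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic stand-in: 150 weekly steps, 21 features per step.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 21))
churn_next = (0.05 + X @ rng.normal(scale=0.005, size=21)
              + rng.normal(scale=0.002, size=150))

X_train, y_train = X[:120], churn_next[:120]   # up to week 120
X_test, y_test = X[120:], churn_next[120:]     # from week 120

w, b = fit_ridge(X_train, y_train, lam=1.0)
print("ridge RMSE:", rmse(y_test, X_test @ w + b))

# The baseline uses only the previous churn value as its predictor.
a, c = fit_ar1(y_train)
print("AR(1) RMSE:", rmse(y_test[1:], a + c * y_test[:-1]))
```

Restricting the columns of `X` to the concept graph or entity graph features before fitting would give the per-feature-set variants compared in the experiment.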

Table 1: Root Mean Square Error when predicting churn rates using: an Autoregressive model ($R^2 = 0.341$) and a Ridge Regression model using Concept Graph, Entity Graph, and all features.

Baseline                 Concept Graph            Entity Graph             All Features
$7.310 \times 10^{-3}$   $5.315 \times 10^{-3}$   $5.301 \times 10^{-3}$   $4.941 \times 10^{-3}$

Table 1 presents the results of our prediction experiment. All tested models (concept graph, entity graph, all features) significantly outperformed the baseline, tested using the sign test ($\alpha = 0.001$). Entity graph features outperform concept graph features, but not significantly, while our best model uses all features together in a single model. These results empirically demonstrate the utility of semantic evolution dynamics in predicting community churn rates, and suggest a link between how the communities develop semantically and the likelihood of users leaving the communities.
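For readers unfamiliar with the sign test, a minimal sketch in plain Python follows. It assumes a two-sided exact test on paired per-instance absolute errors with ties discarded; the paper does not state which variant was used, and the function name `sign_test_p_value` is hypothetical.

```python
from math import comb


def sign_test_p_value(errors_model, errors_baseline):
    """Two-sided exact sign test on paired errors; ties are dropped."""
    wins = sum(1 for m, b in zip(errors_model, errors_baseline) if m < b)
    losses = sum(1 for m, b in zip(errors_model, errors_baseline) if m > b)
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative errors only: the model beats the baseline on all 12 pairs.
model_errors = [0.004] * 12
baseline_errors = [0.007] * 12
p = sign_test_p_value(model_errors, baseline_errors)
print(p, p < 0.001)
```

With 12 untied pairs all favouring one model, the two-sided p-value is 2 / 2^12, which falls below the 0.001 threshold used above.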



Page 16: The Semantic Evolution of Online Communities

Application: Community Evolution Analysis

Apply Mined Semantic Evolution Dynamics

7. DISCUSSION & CONCLUSIONS
In this work, we found that concept and entity graph density in boards.ie does not grow linearly (unlike in social networks [7]) but instead converges on a limit, which we characterised as the carrying capacity ($E$) of a given community’s concept and entity graph entropy. We also discovered that the diameter of the entity graph in our online community converged on a limit over time as the rate at which concepts arrived slowed down, again in contrast to findings from the social networking domain where diameters were found to shrink as more nodes joined the network [8]. Indeed, this notion of convergence to a limit is common across all but one of the graph measures that we examined and suggests that online communities have a finite number of topics that can be discussed and that semantics will converge on a maximum over time.

Our contributions can be summarized as follows: (i) we used semantic graphs to examine how concepts discussed by communities changed over time at a macro-level, (ii) we used logistic population models to inspect how individual communities evolved over time, and (iii) we deployed logistic population models to capture semantic graph changes along different measures and applied our results to community analysis and churn rate prediction. Thereby, our work forms a basis for combining studies of social and semantic network evolution in future work.

8. REFERENCES
[1] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: membership, growth, and evolution. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 44–54. ACM, 2006.

[2] Václav Belák, Marcel Karnstedt, and Conor Hayes. Life-cycles and mutual effects of scientific communities. Procedia - Social and Behavioral Sciences, 22(0):37–48, 2011.

[3] Cristian Danescu-Niculescu-Mizil, Robert West, Dan Jurafsky, Jure Leskovec, and Christopher Potts. No country for old members: User lifecycle and linguistic change in online communities. In Proceedings of the World Wide Web Conference, 2013.

[4] Leon Derczynski, Diana Maynard, Niraj Aswani, and Kalina Bontcheva. Microblog-genre noise and impact on semantic annotation accuracy. In Proceedings of the 24th ACM Conference on Hypertext and Social Media (HT 2013), 2013.

[5] Neil Zhenqiang Gong, Wenchang Xu, Ling Huang, Prateek Mittal, Emil Stefanov, Vyas Sekar, and Dawn Song. Evolution of social-attribute networks: Measurements, modeling, and implications using Google+. CoRR, abs/1209.0835, 2012.

[6] Alicia Iriberri and Gondy Leroy. A life-cycle perspective on online community success. ACM Comput. Surv., 41(2):11:1–11:29, February 2009.

[7] Jure Leskovec, Lars Backstrom, Ravi Kumar, and Andrew Tomkins. Microscopic evolution of social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 462–470. ACM, 2008.

[8] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD ’05, pages 177–187, New York, NY, USA, 2005. ACM.

[9] Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. Measurement and analysis of online social networks. In SIGCOMM conference on Internet measurement, IMC ’07, pages 29–42, 2007.

[10] Roberto Navigli and Mirella Lapata. An experimental study of graph connectivity for unsupervised word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32(4):678–692, 2010.

Entity graph: fastest growth in the random forum
Concept graph: slowest growth in the random forum, largest equilibrium

Page 17: The Semantic Evolution of Online Communities

Application: Churn Rate Prediction

¨  Task: forecast community churn rate at t+1
    ¤ Based on semantic evolution up to t
¨  Trained a ridge regression model
¨  Root Mean Square Error: Predicted vs. Actual churn rate

Apply Mined Semantic Evolution Dynamics

Training: weeks 0 to 120; Test: weeks 120 to 150
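The target variable of this task, the churn rate, is defined in Section 6.2 as the proportion of users active in a given week who post for the last time in that week. A minimal sketch of that definition in plain Python follows; the `(user_id, week)` input layout and the function name `churn_rates` are assumptions made for illustration.

```python
def churn_rates(posts):
    """Compute the weekly churn rate from (user_id, week) post records.

    Returns {week: share of that week's active users whose final post
    falls in that week}.
    """
    last_week = {}
    active = {}
    for user, week in posts:
        last_week[user] = max(week, last_week.get(user, week))
        active.setdefault(week, set()).add(user)
    rates = {}
    for week, users in sorted(active.items()):
        leavers = sum(1 for u in users if last_week[u] == week)
        rates[week] = leavers / len(users)
    return rates


# Toy post log for three hypothetical users.
posts = [("a", 1), ("b", 1), ("a", 2), ("c", 2), ("c", 3)]
print(churn_rates(posts))
```

Note that every user active in the final observed week counts as a churner under this definition, so the last weeks of a dataset are best excluded or treated with care.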


Page 18: The Semantic Evolution of Online Communities

Application: Churn Rate Prediction

0 20 40 60 80 100 120

050

010

0015

00

Timestep

|V|

Linear Model (β=9.932)

(a) Node Count

0 20 40 60 80 100 120

12

34

56

7

Timestep

Diameter

(b) Diameter

0 20 40 60 80 100 120

12

34

Timestep

H(G)

(c) Graph Entropy

0 20 40 60 80 100 120

0.02

0.06

0.10

0.14

Timestep

CC

(d) Clustering Coe�cient

Figure 2: Entity graphs’ evolution based on node counts, diameter, graph entropy, and clustering coe�cient.

and entity graphs defined within a single vector for a givencommunity (f): mf = {m1,m2, . . . ,m13}. We began by de-riving a single matrixM 2 R93⇥13 containing the 93 commu-nities (rows) under analysis together with their 13 evolutionmeasures (columns), and performing principal componentanalysis over this matrix. The result of this clustering isshown in Fig. 3 where we have colour-coded the di↵erentcommunity forums by their hierarchical level in the plat-form (level 2 = most general, level 4 = most specific). Wefound that several of the more general forums appeared asoutliers in the plot and thus exhibited unique evolution dy-namics, while the two level 4 forums (forum 556 and 554)were bunched together suggesting that they follow similartrends. We further examined the semantic motifs of threeoutlier communities from di↵ering levels:

• Level 2: Forum 7 - After Hours. A random discussion forum.

• Level 3: Forum 227 - Television. Discussions about television.

• Level 4: Forum 554 - Wanted Motors. Discussions about cars and car parts.
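The principal component analysis over the 93 × 13 matrix of evolution measures described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the matrix is filled with random placeholder values, the function name `pca_project` is hypothetical, and standardising the columns before decomposition is an assumption (the text does not say whether the measures were scaled).

```python
import numpy as np

def pca_project(M, n_components=2):
    """Project the rows of M onto its top principal components."""
    # Standardise each column so that measures on different scales are
    # comparable (an assumption; the source does not specify scaling).
    std = M.std(axis=0)
    std[std == 0] = 1.0
    Z = (M - M.mean(axis=0)) / std
    # SVD of the standardised matrix: the rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    # Share of total variance captured by each component.
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

# Placeholder data: 93 communities (rows) x 13 evolution measures (columns).
rng = np.random.default_rng(0)
M = rng.normal(size=(93, 13))
scores, explained = pca_project(M)
```

Each row of `scores` gives one community's coordinates in the two-dimensional plot; communities far from the bulk of the points would be the candidates for outliers.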

Fig. 3 presents clear differences between the forums. For the concept graph dynamics (CG), the growth rate of the node count is lowest for the After Hours forum but its node count equilibrium is highest, indicating that the more general the forum, the slower the growth of the concept graph but the higher the maximum graph size. The same holds for the specialisation count in the concept graphs: the After Hours forum exhibits a slower rate of growth but a greater carrying capacity towards which the graph is tending. For the entity graph, the slope of node count growth described by the linear model is highest for After Hours, indicating that, compared to the other two forums, the rate at which new entities are cited by the community of users is much greater, while for the more topically specific Wanted Motors forum it is much lower. The entity graph equilibrium is also highest for After Hours and lowest for Wanted Motors, indicating that the more general a forum is, the greater the carrying capacity of its entity graph and the greater the number of entities that will be discussed.
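The trade-off between growth rate and carrying capacity noted for the concept graphs can be illustrated with a standard logistic growth curve. This is a sketch under stated assumptions: the closed-form logistic function is the textbook one, and the parameter values for the two forums are invented for illustration, not fitted values from the paper.

```python
import math

def logistic_size(t, rate, capacity, initial):
    """Graph size at time t under logistic growth.

    rate     -- growth rate
    capacity -- carrying capacity (the equilibrium the curve tends towards)
    initial  -- size at t = 0
    """
    ratio = (capacity - initial) / initial
    return capacity / (1.0 + ratio * math.exp(-rate * t))

# Hypothetical parameters: a general forum grows slowly towards a high limit,
# a topically specific forum grows quickly towards a low limit.
general = dict(rate=0.05, capacity=1000.0, initial=10.0)
specific = dict(rate=0.15, capacity=300.0, initial=10.0)

early_general = logistic_size(20, **general)
early_specific = logistic_size(20, **specific)
late_general = logistic_size(200, **general)
late_specific = logistic_size(200, **specific)
```

With these illustrative values the specific forum is larger early on, while the general forum overtakes it later and settles near its higher carrying capacity.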

6.2 Churn Rate Prediction

To examine the link between the semantic evolution of online communities and their social properties, we defined a prediction task in which we used the semantic evolution dynamics of a given community at time step t to predict the churn rate of community members at time t + 1. We defined the churn rate of a community as the proportion of active users during a given time period (i.e. week segment) that post for the last time. We used the semantic dynamics motifs from the prior experiment (as listed within Fig. 3) and also included graph measures at a given time period: i.e. graph entropy at time t, specialisation count at time t, etc. We derived these features for every time step for each community and derived the response variable as the churn rate at the following time step. We then compiled a training dataset (up to week 120) and a test dataset (from week 120). Each dataset had the following form: $D = \{(x_i, y_i)\}$, where $x_i$ contained a 21-element time-delimited feature vector for a given community and $y_i$ was the churn rate of the community at the following time step. We trained a ridge regression model ( ) using $D_{train}$ and applied it to $D_{test}$, testing the performance of: a) just concept graph features, b) just entity graph features, and c) all features. An autoregressive model was used as the baseline, using the churn rate at time t as a single predictor variable for the churn rate at time t + 1. Performance was evaluated using the Root Mean Square Error (RMSE).
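The experimental setup above (ridge regression on 21 features, an autoregressive baseline, RMSE on a later test split) can be sketched as follows. Everything concrete here is an assumption for illustration: the data are synthetic, the function names are hypothetical, and the regularisation strength `lam=1.0` is an arbitrary choice since the value used in the experiment is not given.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between two sequences."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def predict(X, w, b):
    return X @ w + b

# Synthetic stand-in for the real data: one row per (community, time step).
rng = np.random.default_rng(0)
n_train, n_test, n_features = 200, 50, 21
n = n_train + n_test
X = rng.normal(size=(n, n_features))            # features at time t
churn_next = (0.05 + X @ rng.normal(scale=0.01, size=n_features)
              + rng.normal(scale=0.002, size=n))  # churn rate at t + 1
churn_now = np.concatenate(([0.05], churn_next[:-1]))  # churn rate at t

# Chronological split: earlier rows for training, later rows for testing.
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = churn_next[:n_train], churn_next[n_train:]

# Ridge regression on all features.
w, b = fit_ridge(X_train, y_train, lam=1.0)
ridge_rmse = rmse(y_test, predict(X_test, w, b))

# Autoregressive baseline: churn at t is the only predictor of churn at t + 1.
a_w, a_b = fit_ridge(churn_now[:n_train, None], y_train, lam=0.0)
baseline_rmse = rmse(y_test, predict(churn_now[n_train:, None], a_w, a_b))
```

Restricting the columns of `X` to the concept graph or entity graph features before calling `fit_ridge` would give the other two model variants.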

Table 1: Root Mean Square Error when predicting churn rates using: an Autoregressive model ($R^2 = 0.341$) and a Ridge Regression model using Concept Graph, Entity Graph, and all features.

Baseline          Concept Graph     Entity Graph      All Features
7.310 × 10^-3     5.315 × 10^-3     5.301 × 10^-3     4.941 × 10^-3

Table 1 presents the results from our prediction experiment. All tested models (concept graph, entity graph, all features) significantly outperformed the baseline, as tested using the sign test (α = 0.001). Entity graph features outperform concept graph features, but not significantly, while our best model uses all features together in a single model. These results empirically demonstrate the utility of semantic evolution dynamics in predicting community churn rates, and suggest a link between how communities develop semantically and the likelihood of users leaving them.
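A sign test of the kind mentioned above can be sketched as follows. This is an illustration only: it assumes the test is applied to paired per-instance absolute errors of a model and the baseline (the text does not state what was paired), and the function name `sign_test` is hypothetical.

```python
from math import comb

def sign_test(errors_model, errors_baseline):
    """Two-sided sign test p-value on paired errors; ties are discarded."""
    wins = sum(m < b for m, b in zip(errors_model, errors_baseline))
    losses = sum(m > b for m, b in zip(errors_model, errors_baseline))
    n = wins + losses
    if n == 0:
        return 1.0  # every pair tied: no evidence either way
    # Under the null hypothesis, wins ~ Binomial(n, 0.5).
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Illustrative use: the model has the smaller error on all 20 paired instances.
p_value = sign_test([0.004] * 20, [0.007] * 20)
significant = p_value < 0.001
```

The test uses only the direction of each paired difference, not its magnitude, so it makes no assumption about how the errors are distributed.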

Significant reduction in error (Sign test with α=0.001)

Baseline: Autoregressive model with churn rate @ t as a predictor for t+1

Apply Mined Semantic Evolution Dynamics

Page 19: The Semantic Evolution of Online Communities

Findings and Conclusions


• Semantic graphs of online communities do not grow linearly: instead, they evolve to a limit
  - Unlike in social networks [Leskovec et al., 2008]
  - Exception: entity graph size

• A finite number of topics are discussed within communities
  - Variation between communities

• Our use of logistic population models has enabled:
  1. Characterisation of community-specific evolution dynamics
  2. Community analysis to inspect how communities evolved differently
  3. Churn prediction based on semantic evolution

Page 20: The Semantic Evolution of Online Communities

Future Work


1. Expand to cover other online communities
  - Question-answering, mined communities

2. User profiling
  - Capturing user-specific semantic evolution

[Figure: In-edge weight distribution (ServerFault). Three panels plot H against Lifecycle Stage for k = 5, k = 10, and k = 20 lifecycle stages, comparing Churners and Non-churners.]

Page 21: The Semantic Evolution of Online Communities

Matthew Rowe @mrowebot [email protected] http://www.lancaster.ac.uk/staff/rowem/ Markus Strohmaier @mstrohm [email protected] http://markusstrohmaier.info/

Questions?