The Semantic Evolution of Online Communities
DESCRIPTION
World Wide Web Conference 2014
TRANSCRIPT
THE SEMANTIC EVOLUTION OF ONLINE COMMUNITIES
MATTHEW ROWE (1) AND MARKUS STROHMAIER (2)
1. LANCASTER UNIVERSITY, LANCASTER, UK @MROWEBOT | [email protected]
2. UNIVERSITY OF KOBLENZ AND GESIS, COLOGNE, GERMANY @MSTROHM | [email protected]
World Wide Web Conference 2014
Seoul, South Korea
Studies of Online Community Evolution
- Prior work has examined online community development based on:
  - Social network properties
    - (Gong et al., 2012), (Mislove et al., 2007)
  - Social group formation
    - (Backstrom et al., 2006)
  - Lexical term usage and uptake
    - (Danescu et al., 2013)
subsequent interpretation, of churners in online communities. Churners present a serious issue for community managers and hosts, as the departure of certain users can have a detrimental effect on the community (e.g. experts leaving a question-answering community can cause an increase in unanswered queries).
In this section we define churn prediction as a binary classification task and use the previously examined indicators of lifecycle trajectories to predict whether a user is a churner or not. As we confine user lifecycle periods from the start of their lifecycle to the end, we use the trajectories mined from this period to characterise how users develop. We define churners as any user who posts for the last time before the final 10% of the time window of our datasets; the cutoff points are 2012-07-09 for Facebook, 2010-05-11 for SAP, and 2010-12-23 for Server Fault. Our dataset is of the following form: $D = \{(x_i, y_i)\}$, where $y_i$ denotes the class label of the user from one of two values, $y_i \in \{0, 1\}$ (footnote 4), while $x_i$ denotes an 11-element real-valued feature vector for either a Facebook or SAP user, and a 10-element feature vector for a Server Fault user, given that we use a linear regression model for each user's lexical community cross-entropy development. We model the feature vector of each user using the trajectory indicators from the previous section; in short, Table II defines our set of features, where we place each within a set depending on the dynamics it captures.
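The churner definition above can be sketched in a few lines. This is a minimal illustration only: the function names, the 100-day window, and the two users are hypothetical, and the cutoff is computed as the start of the final 10% of the observation window.

```python
from datetime import datetime, timedelta

def churn_cutoff(window_start, window_end, tail_fraction=0.10):
    """Return the instant at which the final `tail_fraction` of the window begins."""
    return window_end - (window_end - window_start) * tail_fraction

def label_churners(last_post_times, window_start, window_end):
    """Map each user to 1 (churner) or 0 (not a churner).

    A user is a churner when their last post precedes the cutoff.
    """
    cutoff = churn_cutoff(window_start, window_end)
    return {user: int(last < cutoff) for user, last in last_post_times.items()}

# Hypothetical 100-day window, so the cutoff falls on day 90.
start = datetime(2020, 1, 1)
end = start + timedelta(days=100)
last_posts = {
    "alice": start + timedelta(days=40),  # last post before the cutoff
    "bob": start + timedelta(days=95),    # still posting in the final 10%
}
labels = label_churners(last_posts, start, end)
print(labels)
```

The resulting 0/1 values play the role of the class labels $y_i$ in the dataset $D$.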
Table II
FEATURES USED FOR THE CHURN PREDICTION EXPERIMENTS. THE INDICATORS OF LIFECYCLE TRAJECTORIES ARE USED TO CHARACTERISE USER EVOLUTION ALONG THE DIFFERENT USER PROPERTIES.

Property    Indicator         Model              Feature(s)  Platform
In-degree   Period Entropy    Linear Regression  β           All
In-degree   Period Cross-Ent  Exponential Decay  λ           All
In-degree   Comm' Cross-Ent   Quad' Regress'     a1, a2      All
Out-degree  Period Entropy    Linear Regression  β           All
Out-degree  Period Cross-Ent  Exponential Decay  λ           All
Out-degree  Comm' Cross-Ent   Linear Regression  β           All
Lexical     Period Entropy    Linear Regression  β           All
Lexical     Period Cross-Ent  Exponential Decay  λ           All
Lexical     Comm' Cross-Ent   Quad' Regress'     a1, a2      Fb, SAP
Lexical     Comm' Cross-Ent   Linear Regression  β           SF
A. Prediction Model Definition
The observed feature vector of user $u_i$ ($x_i$) contains the indicator trajectories of the user along the different properties. We use the logistic regression model to predict the conditional probability of user $u_i$ churning as follows:

$$\Pr(Y = 1 \mid x_i) = \frac{1}{1 + e^{-\beta^{\top} x_i}} \qquad (9)$$

The model's coefficients ($\beta$) define the weight attached to each identity trajectory feature within the linear model ($f(i) = \beta^{\top} x_i$). In order to derive the model's coefficients we use maximum likelihood fitting through the R statistical software package (footnote 5) to select the maximum likelihood estimate $\hat{\beta}$ of the model's coefficients. Following fitting, the derived model is used to predict the churn probability of each user within the test dataset.

4. 1 indicating churner, 0 not.
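As a rough illustration of Equation 9 and of maximum likelihood fitting, the sketch below fits the coefficients by plain batch gradient ascent on the log-likelihood. This is an assumption made for illustration: the text only states that fitting was done in R, not which optimiser was used. The function names, learning rate, epoch count, and toy data are all hypothetical, and no intercept term is included, so that the model matches Equation 9 as written.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def churn_probability(beta, x):
    """Pr(Y = 1 | x) = 1 / (1 + exp(-beta^T x)), as in Equation 9."""
    return sigmoid(sum(b * v for b, v in zip(beta, x)))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Approximate the maximum likelihood estimate by batch gradient ascent."""
    beta = [0.0] * len(xs[0])
    n = len(xs)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for x, y in zip(xs, ys):
            err = y - churn_probability(beta, x)  # gradient of the log-likelihood
            for j, v in enumerate(x):
                grad[j] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Toy, non-separable data with a single standardised feature (hypothetical).
xs = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
ys = [0, 0, 1, 0, 1, 1]
beta_hat = fit_logistic(xs, ys)
print(beta_hat, churn_probability(beta_hat, [2.0]))
```

In practice a library routine (for example a generalised linear model fitter) would replace the hand-written loop; the loop is kept here only to make the likelihood gradient visible.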
B. Experimental Setup
For our experiments we first standardised the datasets by combining the test (20%) and training (80%) datasets together and setting each indicator feature to have a mean of 0 and a standard deviation of 1; we then divided the dataset again into the respective test and training splits, maintaining the same instances as before. We wanted to test the effects of observing different user properties and development dynamics on churn prediction. We therefore tested each user property in isolation, for instance using the in-degree property and the entropy, period cross-entropy and community cross-entropy trajectory indicators; and then each development model in isolation, for instance using the entropy model and examining in-degree, out-degree, and term distributions; finally we combined all features together within a single model. In doing so we could isolate any effects of key features on prediction performance, and thus inform model selection for specific platforms (i.e. identifying the best performing model for Facebook, SAP and Server Fault).
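The standardisation step described above (pool the splits, rescale each feature, then restore the original splits) might look as follows. This is a sketch under assumptions: the helper name `standardise` is hypothetical, and the population standard deviation is used because the text does not say whether the population or sample form was applied.

```python
from statistics import mean, pstdev

def standardise(train, test):
    """Z-score every feature using statistics of the pooled train + test rows,
    then return the rows in their original train/test order."""
    combined = train + test
    n_features = len(combined[0])
    mus = [mean([row[j] for row in combined]) for j in range(n_features)]
    # Guard against constant features, whose standard deviation is 0.
    sds = [pstdev([row[j] for row in combined]) or 1.0 for j in range(n_features)]

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

    return scale(train), scale(test)

# Hypothetical single-feature example.
train_z, test_z = standardise([[1.0], [2.0], [3.0]], [[4.0], [5.0]])
print(train_z, test_z)
```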
As we used the logistic regression model for our prediction model, we are provided with a function whose codomain is a churn probability value for a given user within the closed interval [0, 1]. Therefore we evaluated the performance of each induced model using two evaluation measures: (i) precision@k ($P$), and (ii) area under the receiver operating characteristic curve (AUC). To derive precision@k we ranked the users by their churn probability according to the induced model and then assessed the precision of the top-k ranks, setting $k = \{1, 5, 10, 20, 50, 100\}$, and taking the mean of these precision values. This assesses the extent to which the upper portion of the predicted churners are correct. We used the baseline measure of the probability of a randomly selected user being a churner, thus corresponding to the probability of success in a single Bernoulli trial (setting $p = |churners| / |D_{test}|$). To derive the area under the receiver operating characteristic curve we varied the confidence of an indicator function ($f(x)$) through discrete settings of confidence bounds $\epsilon = \{0, 0.05, \ldots, 0.95, 1\}$, thereby setting the class label for a given instance ($x$) as follows:

$$f(x) = \begin{cases} 1, & \text{if } \Pr(Y = 1 \mid x) > \epsilon \qquad (10a) \\ 0, & \text{otherwise} \qquad (10b) \end{cases}$$

For each different setting of $\epsilon$ we measured the true positive rate (TPR/recall) and the false positive rate (FPR), and from these measures plotted the receiver operating characteristic (ROC) curve. A model which maximises the area

5. http://www.r-project.org/
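The two evaluation measures described in this subsection, mean precision@k and the area under a ROC curve traced by sweeping a confidence threshold, can be sketched as below. The function names and toy scores are hypothetical. Two details are assumptions not stated in the text: values of k larger than the number of ranked users are skipped, and the area is computed with the trapezoidal rule after anchoring the curve at (0, 0) and (1, 1).

```python
def mean_precision_at_k(probs, labels, ks=(1, 5, 10, 20, 50, 100)):
    """Rank users by predicted churn probability and average precision@k."""
    ranked = [y for _, y in sorted(zip(probs, labels), key=lambda p: p[0], reverse=True)]
    precisions = [sum(ranked[:k]) / k for k in ks if k <= len(ranked)]
    return sum(precisions) / len(precisions)

def roc_points(probs, labels, thresholds):
    """Return one (FPR, TPR) point per threshold, predicting 1 when prob > threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for eps in thresholds:
        preds = [int(p > eps) for p in probs]
        tp = sum(1 for p, y in zip(preds, labels) if p and y)
        fp = sum(1 for p, y in zip(preds, labels) if p and not y)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve, anchored at (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Toy predictions (hypothetical): 1 marks a true churner.
probs = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
thresholds = [i / 20 for i in range(21)]  # 0, 0.05, ..., 0.95, 1
print(mean_precision_at_k(probs, labels))
print(auc(roc_points(probs, labels, thresholds)))
```

A coarse threshold grid can merge users whose scores fall between two neighbouring thresholds, so the swept AUC is an approximation of the rank-based AUC in general.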
= {w1,w2,…,wn}
Work has yet to examine the semantic evolution of online communities
1. Understand how semantic concepts emerge
2. Model the development of semantic structures over time
3. Examine how communities differ from one another in their evolution
Assessing Semantic Evolution
Define: Community Semantics
Entities, and their classes, discussed within a community over a given time period, and the structure connecting those concepts
Time
…
Assessing Semantic Evolution
- Assessing community semantics enables:
  1. Characterisation of semantic evolution dynamics
  2. Comparison based on community evolution
  3. Forecasting churn rates from evolution signals
Our Contributions
Approach: Examining Semantic Evolution
1. Retrieve Posts and Extract Entities
2. Construct Time-delimited Semantic Graphs
   We have two flavours: (i) concept graphs (ii) entity graphs
3. Inspect Macro Evolution
   Applying graph measures over cumulative time-sensitive graphs
4. Induce Community-Specific Evolution Models
   Inducing logistic population models for each community and graph measure
5. Apply Mined Semantic Evolution Dynamics
   Using model dynamics as community motifs to: (i) inspect communities (ii) predict churn rate
Experiment Dataset: Boards.ie
- All posts during 2005-2008
- 93 communities: online forums
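For readers unfamiliar with logistic population models, the sketch below evaluates the textbook logistic growth curve. It is an assumption that this standard form resembles the models induced in step 4; the slides do not give the exact parameterisation. The parameter names `capacity`, `rate`, and `initial`, and the example values, are hypothetical.

```python
import math

def logistic_population(t, capacity, rate, initial):
    """Textbook logistic growth: starts at `initial` and saturates at `capacity`."""
    a = (capacity - initial) / initial
    return capacity / (1.0 + a * math.exp(-rate * t))

# Hypothetical curve for a graph measure (e.g. a vertex count) over time steps.
curve = [logistic_population(t, capacity=500.0, rate=0.4, initial=10.0) for t in range(0, 40, 5)]
print([round(v, 1) for v in curve])
```

Under this form, the fitted capacity and growth rate would be the per-community, per-measure quantities that can be compared across communities.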
Extracting Entities from Online Community Posts
- We are provided with a set of post quadruples: <u, s, t, f> ∈ P
- We constrain content to within a given interval [t', t'')
- Extract each post's entities within the time interval using TextRazor
  - No quota limit, and good prior performance (Derczynski et al., 2013)
- Result: time-sensitive community-discussed entities (DBPedia URIs)
Retrieve Posts and Extract Entities
clustering of users) and found high degrees of local clustering on the different platforms, which contained densely populated subgroups of similar users. Recent work by Gong et al. [5] inspected the evolution of social networks on Google+ as the platform was growing in membership; in particular they focused on social-attribute networks (i.e. bipartite graphs containing people and their attributes as nodes), finding that the platform exhibited unique growth and characteristics of the networks as more people joined Google+. Leskovec et al. [7] modelled the development of social networks across four platforms (Flickr, Delicious, Yahoo! Answers and LinkedIn) by modelling the process of node arrival (users joining), edges being created and waiting times between edge creation. Previous work mostly ignored term-based and semantic information and concentrated on how the social networks evolved, not the communities. Although such works afford insights into the evolution of social networks, they do not consider how a community of users evolves semantically.
3. CHARACTERISING ONLINE COMMUNITIES WITH SEMANTIC GRAPHS
For our experiments we used data from the Irish community message board Boards.ie (footnote 2). This is a general-discussion community message board that includes a set of hierarchically nested forums ($F$) in which posts are made, i.e. forum A can be a parent of forum B, and thus B contains more specialised topics of discussion than A. Posts are provided as a set of quadruples $\langle u, s, t, f \rangle \in P$, where user $u$ posted message $s$ at time $t$ in forum $f$. A message ($s$) is composed of terms that we use to build the semantic models for individual communities. The information discussed in a community, and thus its semantics, can change over time; we therefore constrain a community's model to specific time snapshots, e.g. $t' \to t''$ where $t' < t''$. For this we use the following construct, which filters all relevant posts' contents within the allotted time window:

$$S_{t'}^{t''} = \{ s : \langle u, s, t, f \rangle \in P,\ t' \le t < t'' \} \qquad (1)$$
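The time-window filter of Equation 1 can be illustrated with a short sketch. The `Post` tuple, the function name, and the sample posts are hypothetical. The forum is passed explicitly as an assumption, since the semantic models are built per community, and a list is returned instead of a set so that repeated messages are preserved.

```python
from datetime import datetime
from typing import NamedTuple

class Post(NamedTuple):
    """One quadruple <u, s, t, f>: user, message content, time, forum."""
    user: str
    content: str
    time: datetime
    forum: str

def window_contents(posts, forum, t_start, t_end):
    """Contents of posts made in `forum` with t_start <= t < t_end (half-open)."""
    return [p.content for p in posts if p.forum == forum and t_start <= p.time < t_end]

# Hypothetical posts.
posts = [
    Post("u1", "Match report from Croke Park", datetime(2005, 1, 10), "sport"),
    Post("u2", "Budget phones thread", datetime(2005, 2, 1), "tech"),
    Post("u3", "Six Nations preview", datetime(2005, 2, 1), "sport"),
]
january_sport = window_contents(posts, "sport", datetime(2005, 1, 1), datetime(2005, 2, 1))
print(january_sport)
```

The post stamped exactly at the end of the window is excluded, matching the strict inequality $t < t''$.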
Information discussed within online communities can be represented in terms of its semantics, using information from either the schema level (i.e. ontological classes and relations between them) or the data level (i.e. using entities and how they are related to one another). For the former we consider concepts to be classes found within the DBPedia Ontology, that is, the types of entities that users are discussing (e.g. people, locations, etc.), while for the latter we consider DBPedia resources, i.e. the entities themselves (e.g. dbpedia:Barack_Obama). Given our set of post contents, $S_{t'}^{t''}$, we derive concepts and entities from a forum over a time period as follows: we pass each post content $s \in S_{t'}^{t''}$ to an entity extraction tool, which returns the set of entities related to the content of $s$. Given the entities ($R_E$) returned for a given community forum over an allotted time period, we then construct two types of semantic resource graphs: concept graphs, which function at the schema level and contain class information; and entity graphs, which function at the data level and contain information that relates entities to one another.

2. http://www.boards.ie
3.1 Concept Graphs
A concept graph ($G_C$) is a type of semantic resource graph that contains the types of entities found within a given forum as vertices ($V$) and the relations between these classes as edges ($E$). For a given community forum $f$ we have a set of entities $R_E$ that were extracted over some period of time $t' \to t''$. Each entity, given that we are using DBPedia resource URIs, is typed according to one or more classes from the DBPedia ontology (footnote 3). Therefore, to construct the set of concepts that are cited within a given forum $f$ over the allotted time period, we retrieve the classes that each entity is a type of and store these in the set $R_C$. From this set we generate a time-dependent concept graph $G_C[f, t', t''] = \langle V_f, E_f \rangle$, such that $G_C[f, t', t''] \subset G_{type}$, where $G_{type}$ denotes the DBPedia type graph formed from the class structure of the DBPedia ontology. In this context the set of concepts denotes the seed set and is used to populate the vertices in the graph and then construct edges between the vertices based on existing links between the concepts in the DBPedia type graph ($G_{type}$):

$$E_f = \{ (c_i, c_j) : c_i, c_j \in V_f,\ (c_i, c_j) \in E_{type} \} \qquad (2)$$
In order to derive the set of vertices we must consider how the seed set can be used, as it is often the case that the set comprises concepts which are not directly connected to one another in the concept graph. To connect such concepts, and derive a fully connected concept graph (i.e. with no disconnected components), we extract the Root Path Graph as follows: from each concept ($c \in R_C$) we identify the parent concept (<c rdfs:subClassOf p>) and iteratively move up the concept graph until the root node (owl:Thing) is reached, thereby returning the set of nodes that form the path from $c$ to the root node: $rootpath(c) = \{c, p, \ldots, \text{owl:Thing}\}$. The graph's vertices are therefore derived by taking the union of all paths to the root returned from each concept within the seed set:

$$V_f = \bigcup_{c \in R_C} rootpath(c) \qquad (3)$$
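Equations 2 and 3 can be sketched together: walk each seed concept up to the root, take the union of the paths as vertices, then keep the type-graph edges whose endpoints are both present. The `parent_of` dictionary below is a toy, hypothetical fragment of a class hierarchy, not the real DBPedia ontology, and the sketch assumes each class has exactly one parent and that the hierarchy has no cycles.

```python
ROOT = "owl:Thing"

def rootpath(concept, parent_of):
    """Follow subclass links from `concept` up to the root, returning the path."""
    path = [concept]
    while path[-1] != ROOT:
        path.append(parent_of[path[-1]])
    return path

def concept_graph(seed_concepts, parent_of):
    """Build (V_f, E_f): vertices per Equation 3, edges per Equation 2."""
    vertices = set()
    for c in seed_concepts:
        vertices.update(rootpath(c, parent_of))
    type_edges = set(parent_of.items())  # (child, parent) pairs of the toy type graph
    edges = {(ci, cj) for ci, cj in type_edges if ci in vertices and cj in vertices}
    return vertices, edges

# Toy class hierarchy (hypothetical fragment, single parent per class).
parent_of = {
    "Politician": "Person",
    "Person": "Agent",
    "Agent": ROOT,
    "City": "Place",
    "Place": ROOT,
    "Band": "Organisation",
    "Organisation": "Agent",
}
vertices, edges = concept_graph({"Politician", "City"}, parent_of)
print(sorted(vertices))
print(sorted(edges))
```

Because every root path ends at owl:Thing, the two seed concepts end up in a single connected component even though they share no direct link.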
3.2 Entity Graphs
An entity graph ($G_E$) is a type of semantic resource graph where the vertices ($V$) are entities and the set of edges ($E$) connecting these entities are relations between them derived from the Web of Linked Data. We define the entity graph as $G_E[f, t', t''] = \langle V_f, E_f \rangle$, such that $G_E[f, t', t''] \subset G_{entity}$, where $G_{entity}$ denotes the DBPedia entity graph containing relations between entities at the data level. As we are provided with a collection of time-delimited entities $R_E$ for a given forum, we query DBPedia for links between entity pairs and add such edges to the graph where such a link exists:

$$E_f = \{ (r_i, r_j) : r_i, r_j \in V_f,\ (r_i, r_j) \in E_{entity} \} \qquad (4)$$
Given this edge construction mechanism we only look for relations one hop away in the entity graph, that is: given $R_E$ we only look for relations between elements in the set. This could be extended to include 2-hop relations; however, we are interested in how entities in the communities are connected to one another directly. In this work, the vertices in the entity graph are thus those entities which are found to be connected to one another directly via 1-hop distances.

3. http://dbpedia.org/Ontology
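The 1-hop edge construction of Equation 4 can be sketched as a filter over known links. Here `linked_pairs` is a small in-memory stand-in for the links that would be obtained by querying DBPedia; the function name and the example resources and links are hypothetical. Following the last sentence above, only entities that take part in at least one edge are kept as vertices.

```python
def entity_graph(entities, linked_pairs):
    """Build (V_f, E_f) from extracted entities and known direct links.

    An edge is kept only when both endpoints were extracted from the forum,
    so no intermediate (2-hop) entities are introduced.
    """
    entities = set(entities)
    edges = {(a, b) for a, b in linked_pairs if a in entities and b in entities and a != b}
    vertices = {v for edge in edges for v in edge}
    return vertices, edges

# Toy stand-in for links in the linked data graph (hypothetical).
linked_pairs = {
    ("dbpedia:Dublin", "dbpedia:Ireland"),
    ("dbpedia:Ireland", "dbpedia:Europe"),
}
extracted = {"dbpedia:Dublin", "dbpedia:Ireland", "dbpedia:Guinness"}
v_f, e_f = entity_graph(extracted, linked_pairs)
print(sorted(v_f), sorted(e_f))
```

In this toy case dbpedia:Guinness is dropped because it has no direct link to another extracted entity, and the link to dbpedia:Europe is ignored because that entity was not extracted.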
User u posted content s at time t in forum f
Building Semantic Resource Graphs
1. Concept graphs
   - Nodes: union of rootpaths from each entity's class to the owl:Thing class
   - Edges: between nodes within the DBPedia Ontology
2. Entity graphs
   - Nodes: entities provided by time-specific post extractions
   - Edges: relations between entity pairs within the DBPedia linked data graph
Construct Time-delimited Semantic Graphs
clustering of users) and found high degrees of local clustering on the different platforms, which contained densely populated subgroups of similar users. Recent work by Gong et al. [5] inspected the evolution of social networks on Google+ as the platform was growing in membership; in particular they focused on social-attribute networks (i.e. bipartite graphs containing people and their attributes as nodes), finding that the platform exhibited unique growth and network characteristics as more people joined Google+. Leskovec et al. [7] modelled the development of social networks across four platforms (Flickr, Delicious, Yahoo! Answers and LinkedIn) by modelling the process of node arrival (users joining), edge creation, and waiting times between edge creation. Previous work mostly ignored term-based and semantic information and concentrated on how the social networks evolved, not the communities. Although such works afford insights into the evolution of social networks, they do not consider how a community of users evolves semantically.
3. CHARACTERISING ONLINE COMMUNITIES WITH SEMANTIC GRAPHS
For our experiments we used data from the Irish community message board Boards.ie (footnote 2). This is a general-discussion community message board that includes a set of hierarchically nested forums ($F$) in which posts are made, i.e. forum A can be a parent of forum B, and thus B contains specialised topics of discussion over A. Posts are provided as a set of quadruples $\langle u, s, t, f \rangle \in P$, where user $u$ posted message $s$ at time $t$ in forum $f$. A message ($s$) is composed of terms that we use to build the semantic models for individual communities. The information discussed in a community, and thus its semantics, can change and alter over time; therefore we constrain a community's model to specific time snapshots, e.g. $t' \to t''$ where $t' < t''$. For this we use the following construct, which filters through all relevant posts' contents within the allotted time window:
$$S_{t'}^{t''} = \{ s : \langle u, s, t, f \rangle \in P,\ t' \le t < t'' \} \tag{1}$$
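The time-window filter of Eq. (1) can be sketched in a few lines of Python. This is only an illustration: the `Post` tuple, the function name `contents_in_window` and the optional per-forum filter are assumptions introduced here, not part of the original work. The window is treated as closed at the start and open at the end, matching $t' \le t < t''$.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class Post(NamedTuple):
    """Hypothetical container for one quadruple <u, s, t, f>."""
    user: str
    content: str
    time: datetime
    forum: str


def contents_in_window(posts, start, end, forum: Optional[str] = None):
    """Return the contents s of posts with start <= t < end (Eq. 1).

    The optional forum argument is an assumption added for convenience;
    Eq. (1) itself only constrains the time stamp.
    """
    return [
        p.content
        for p in posts
        if start <= p.time < end and (forum is None or p.forum == forum)
    ]


# Invented toy posts, one per week.
posts = [
    Post("u1", "first", datetime(2010, 1, 1), "f7"),
    Post("u2", "second", datetime(2010, 1, 8), "f7"),
    Post("u3", "third", datetime(2010, 1, 15), "f227"),
]
week_one = contents_in_window(posts, datetime(2010, 1, 1), datetime(2010, 1, 8))
```

Using a half-open interval means consecutive weekly windows never count the same post twice.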
Information discussed within online communities can be represented in terms of its semantics, using information from either the schema level (i.e. ontological classes and relations between them) or the data level (i.e. entities and how they are related to one another). For the former we consider concepts to be classes found within the DBPedia Ontology, that is, the types of entities that users are discussing (e.g. people, locations, etc.), while for the latter we consider DBPedia resources, i.e. the entities themselves (e.g. dbpedia:Barack_Obama). Given our set of post contents, $S_{t'}^{t''}$, we derive concepts and entities from a forum over a time period as follows: we process each post content $s \in S_{t'}^{t''}$ using an entity extraction tool to return the set of entities related to the content of $s$. Given the entities ($R_E$) returned for a given community forum over an allotted time period, we then construct two types of semantic resource graphs: concept graphs, which function at the schema level and contain class information; and entity graphs, which function at the data level and contain information that relates entities to one another.
Footnote 2: http://www.boards.ie
3.1 Concept Graphs
A concept graph ($G_C$) is a type of semantic resource graph that contains the types of entities found within a given forum as vertices ($V$) and the relations between these classes as edges ($E$). For a given community forum $f$ we have a set of entities $R_E$ that were extracted over some time period $t' \to t''$. Each entity, given that we are using DBPedia resource URIs, is typed according to one or more classes from the DBPedia ontology (footnote 3). Therefore, to construct the set of concepts that are cited within a given forum $f$ over the allotted time period, we retrieve the classes that each entity is a type of and store these in the set $R_C$. From this set we generate a time-dependent concept graph $G_C[f, t', t''] = \langle V_f, E_f \rangle$, such that $G_C[f, t', t''] \subset G_{type}$, where $G_{type}$ denotes the DBPedia type graph formed from the class structure of the DBPedia ontology. In this context the set of concepts denotes the seed set and is used to populate the vertices in the graph and then construct edges between the vertices based on existing links between the concepts in the DBPedia type graph ($G_{type}$):
$$E_f = \{(c_i, c_j) : c_i, c_j \in V_f,\ (c_i, c_j) \in E_{type}\} \tag{2}$$
In order to derive the set of vertices we must consider how the seed set can be used for this process, as it is often the case that the set is comprised of concepts which are not directly connected to one another in the concept graph. To connect such concepts, and derive a connected concept graph (i.e. with no disconnected components), we extract the Root Path Graph as follows: from each concept ($c \in R_C$) we identify the parent concept (<c rdfs:subClassOf p>) and iteratively move up the concept graph until the root node is reached (owl:Thing), thereby returning the set of nodes that form the path from $c$ to the root node: $rootpath(c) = \{c, p, \ldots, \text{owl:Thing}\}$. The graph's vertices are therefore derived by taking the union of all paths to the root returned from each concept within the seed set:
$$V_f = \bigcup_{c \in R_C} rootpath(c) \tag{3}$$
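The Root Path Graph construction of Eqs. (2) and (3) can be sketched as follows. This is a minimal illustration under stated assumptions: the `parent_of` dictionary is a hand-written toy stand-in for the rdfs:subClassOf links of the type graph (in practice these would be retrieved from DBPedia), each class is assumed to have a single parent, and the function names are introduced here for illustration only.

```python
def rootpath(concept, parent_of, root="owl:Thing"):
    """Walk rdfs:subClassOf links from a concept up to the root node."""
    path = [concept]
    while path[-1] != root:
        path.append(parent_of[path[-1]])
    return path


def concept_graph(seed, parent_of, root="owl:Thing"):
    """Build (V_f, E_f) from the seed set R_C.

    V_f is the union of all root paths (Eq. 3); E_f keeps the subclass
    links of the type graph whose two ends are both in V_f (Eq. 2).
    """
    vertices = set()
    for concept in seed:
        vertices.update(rootpath(concept, parent_of, root))
    edges = {
        (c, parent_of[c])
        for c in vertices
        if c != root and parent_of[c] in vertices
    }
    return vertices, edges


# Toy class hierarchy (assumption: illustrative only, single parent per class).
parent_of = {
    "dbo:Politician": "dbo:Person",
    "dbo:Person": "dbo:Agent",
    "dbo:Agent": "owl:Thing",
    "dbo:City": "dbo:Settlement",
    "dbo:Settlement": "dbo:PopulatedPlace",
    "dbo:PopulatedPlace": "dbo:Place",
    "dbo:Place": "owl:Thing",
}
vertices, edges = concept_graph({"dbo:Politician", "dbo:City"}, parent_of)
```

Because every root path ends at owl:Thing, the two otherwise unrelated seed concepts end up in a single connected graph.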
3.2 Entity Graphs
An entity graph ($G_E$) is a type of semantic resource graph where the vertices ($V$) are entities and the set of edges ($E$) connecting these entities are relations between them derived from the Web of Linked Data. We define the entity graph as $G_E[f, t', t''] = \langle V_f, E_f \rangle$, such that $G_E[f, t', t''] \subset G_{entity}$, where $G_{entity}$ denotes the DBPedia entity graph containing relations between entities at the data level. As we are provided with a collection of time-delimited entities $R_E$ for a given forum, we query DBPedia for links between entity pairs and add such edges to the graph where such a link exists:
$$E_f = \{(r_i, r_j) : r_i, r_j \in V_f,\ (r_i, r_j) \in E_{entity}\} \tag{4}$$
Given this edge construction mechanism we only look for relations one hop away in the entity graph, that is: given $R_E$ we only look for relations between elements in the set. This could be extended to include 2-hop relations; however, we are interested in how entities in the communities are connected to one another directly. In this work, the vertices in the entity graph are thus those entities which are found to be connected to one another directly via 1-hop distances.
Footnote 3: http://dbpedia.org/Ontology
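The 1-hop edge construction of Eq. (4) can be sketched as below. The `linked_pairs` argument is an assumption: it stands in for whatever the DBPedia link query returns, and the entity names and links in the toy data are invented for illustration.

```python
def entity_graph(entities, linked_pairs):
    """Build (V_f, E_f) for the extracted entity set R_E.

    An edge is kept only when both ends were extracted from the forum
    (1-hop relations, Eq. 4); the vertices are the entities that take
    part in at least one such edge.
    """
    edges = {
        (a, b)
        for (a, b) in linked_pairs
        if a in entities and b in entities and a != b
    }
    vertices = {node for edge in edges for node in edge}
    return vertices, edges


# Invented toy data: extracted entities and a stand-in for DBPedia links.
extracted = {"dbpedia:Dublin", "dbpedia:Ireland", "dbpedia:Guinness", "dbpedia:Toyota"}
links = [
    ("dbpedia:Dublin", "dbpedia:Ireland"),
    ("dbpedia:Guinness", "dbpedia:Dublin"),
    ("dbpedia:Toyota", "dbpedia:Japan"),  # dropped: Japan was not extracted
]
vertices, edges = entity_graph(extracted, links)
```

In this toy case the Toyota entity is left out of the vertex set because its only link leads outside the extracted set, which mirrors the statement that vertices are the directly connected entities.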
Measuring Graph Dynamics
1. Node Count: size of the graph
2. Diameter: breadth of the graph
3. Specialisation Count: class specialisations
4. Graph Entropy: density of the graph
5. Clustering Coefficient: cliquishness of the graph
Inspect Macro Evolution
Computation of the measures is described within the paper
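As a rough illustration of the measures listed on this slide, the sketch below computes node count, diameter and the average local clustering coefficient on a small undirected graph. The exact definitions used in the paper are not reproduced in this transcript, so everything here is an assumption: in particular, `degree_entropy` uses the Shannon entropy of the degree distribution as one common stand-in for graph entropy, and the specialisation count is omitted because its definition is not given here.

```python
from collections import deque
from math import log2


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def diameter(adj):
    """Longest shortest path between any two mutually reachable nodes."""
    return max(max(bfs_distances(adj, v).values()) for v in adj)


def clustering_coefficient(adj):
    """Average local clustering; nodes with fewer than 2 neighbours count as 0."""
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue
        links = sum(1 for a in neighbours for b in neighbours if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def degree_entropy(adj):
    """Assumed stand-in for graph entropy: entropy of the degree distribution."""
    counts = {}
    for neighbours in adj.values():
        counts[len(neighbours)] = counts.get(len(neighbours), 0) + 1
    n = len(adj)
    return -sum((c / n) * log2(c / n) for c in counts.values())


# Toy undirected graph: a triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
node_count = len(adj)
```

Recomputing these values on the cumulative graph at each time step would give the kind of per-measure series that the macro evolution slides plot.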
Macro Evolution
Cumulative Time Interval
Entropy of the Concept Graph
Showing the mean graph entropy across all communities and the 95% Confidence Interval
Figure 1: Concept graphs' evolution based on node counts, graph entropy and specialisations.
[Panels, each plotted against Timestep (0 to 120): (a) Node Count (|V|); (b) Graph Entropy (H(G)); (c) Specialisations.]
ing that the density of the graph grows as more entities are added and thus more connections are possible between them.
Summary: We can summarise the following salient findings: (i) for concept graphs, node count, specialisation count and density (graph entropy) tend to converge to a limit; (ii) for entity graphs, the diameter, graph entropy and clustering coefficient tend to converge to a limit, while the node count (number of entities) increases linearly; (iii) despite new entities arriving at a constant linear rate, on average, the number of concepts tends to converge on a maximum.
5. MODELLING SEMANTIC EVOLUTION
In the previous section, we found that the concept graphs and entity graphs tend to evolve in a convergent manner: that is, for different measures they tend to evolve towards a limit. Such limiting evolution has been found previously in population models where a given population has a carrying capacity that the population evolves towards at differing proportionate growth rates. These proportionate growth rates slow down over time as the population tends towards a limit (the carrying capacity); we see this in the tapering curves in the aforementioned graphs. An immediate question that arises from this effect is: how do the communities differ in terms of evolution rates? To answer this question we used logistic population models that contain: (i) the growth rate of the graphs ($r$), and (ii) the carrying capacity of the graphs ($E$). Each of these variables can be used to characterise the community forums (93 forums in total) in terms of their semantic evolution given a measure (e.g. node count in the concept graph). To derive the variables $r$ and $E$ for a given community forum and graph measure ($m$) we derive a set of time steps ($T$) which depict a change in a graph measure:
$$T = \{a : a \in [1, 119],\ m(G_{1,a+1}) > m(G_{1,a})\} \tag{7}$$
Deriving the set of change time steps for a given community allows the proportionate growth rate for a given time step ($t \in T$) to be derived: $R_t = (P_{t+1} - P_t)/P_t$, where $P_t$ is the size of the population (i.e. the value of the measure) at time step $t$. This value is equivalent to the following equation, which defines the proportionate growth rate $R_t$ in terms of the community's growth rate ($r$) and carrying capacity ($E$), our unknown variables: $R_t = r(1 - P_t/E)$. Therefore, if we measure the proportionate growth rate over the $|T|$ distinct time steps then we can derive, via simultaneous equations, the growth rate of the graph and its carrying capacity, the very measures that we can use to characterise the semantic evolution of a given online community based on a single graph measure. We exclude the derivation of the equations from the paper, but it is sufficient to conclude that given $|T|$ time steps we would have a single equation for each time step ($t \in T$): $R_t r^{-1} + P_t E^{-1} = 1$. We can then solve for the unknown variables $r$ and $E$ using the QR-decomposition of a matrix: expressing the lefthand side of the simultaneous equations as a $|T| \times 2$ matrix and the righthand side as a $|T|$-element vector where each element is 1. We induced logistic population models for each of the graph measures (aside from entity graph node count) and examined how the growth rate and carrying capacities were distributed; we found all models to be suitable fits at the 1% significance level using the chi-squared goodness-of-fit test.
We omitted the plots showing the distribution of growth rates and carrying capacities for communities' concept and entity graphs' measures; however, these distributions demonstrated the following:
• All communities evolve to cover roughly 80% of the ontological classes (total number is 359); however, some communities converge on the maximum quicker than others, as demonstrated by a high growth rate for a small number of communities.
• The majority of communities' concept graphs become denser at a slow rate, while a few communities' graphs become denser quickly, thus suggesting that users tend to discuss concepts that are orthogonal and not related to one another in the majority of communities.
• Communities' entity graphs show variance in the rate by which entities are discussed for the first time within the communities; however, such rates are linear: fitting a per-community linear regression model regressing the node count on the week count yielded a lowest coefficient of determination of 0.3 for one community.
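The logistic fit described in Section 5 can be sketched as follows: collect the time steps at which the measure increases (Eq. 7), form one equation $R_t r^{-1} + P_t E^{-1} = 1$ per step, and solve the resulting $|T| \times 2$ system in the least-squares sense through a QR decomposition. This is a hedged illustration rather than the authors' implementation: the hand-rolled two-column Gram-Schmidt QR, the function names and the synthetic series are assumptions, and the sketch needs at least two increasing steps with differing values.

```python
from math import sqrt


def solve_two_column_qr(rows, rhs):
    """Least-squares solution of A x = rhs for an n x 2 matrix A.

    Uses a Gram-Schmidt QR factorisation: A = Q R, then R x = Q^T rhs.
    """
    col1 = [row[0] for row in rows]
    col2 = [row[1] for row in rows]
    r11 = sqrt(sum(v * v for v in col1))
    q1 = [v / r11 for v in col1]
    r12 = sum(q * v for q, v in zip(q1, col2))
    residual = [v - r12 * q for v, q in zip(col2, q1)]
    r22 = sqrt(sum(v * v for v in residual))
    q2 = [v / r22 for v in residual]
    c1 = sum(q * b for q, b in zip(q1, rhs))
    c2 = sum(q * b for q, b in zip(q2, rhs))
    x2 = c2 / r22                  # back substitution, second unknown first
    x1 = (c1 - r12 * x2) / r11
    return x1, x2


def fit_logistic(series):
    """Return (r, E) for one community and one graph measure.

    series[t] is the measure value P_t at time step t (0-indexed here).
    """
    steps = [t for t in range(len(series) - 1) if series[t + 1] > series[t]]
    rows = [((series[t + 1] - series[t]) / series[t], series[t]) for t in steps]
    inv_r, inv_e = solve_two_column_qr(rows, [1.0] * len(rows))
    return 1.0 / inv_r, 1.0 / inv_e


# Synthetic series generated from a discrete logistic model (r = 0.3, E = 300).
series = [50.0]
for _ in range(20):
    p = series[-1]
    series.append(p + 0.3 * p * (1.0 - p / 300.0))
growth_rate, capacity = fit_logistic(series)
```

On noise-free synthetic data the system is consistent, so the fit should recover the generating values up to floating-point error; on real measure series the same code returns the least-squares estimate.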
6. APPLICATIONS
We now demonstrate the utility of graph-based approaches via two applications: community analysis, and churn rate prediction.
6.1 Community Analysis
To more closely examine the differences between community forums in terms of their semantic evolution we characterised each community in our dataset by its semantic dynamics motif over the 120-week analysis period: that is, the observed evolution measures derived from the concept
Macro Evolution: Concept Graphs
Node count, graph entropy, and specialisations converge to a limit
Macro Evolution: Entity Graphs
Figure 2: Entity graphs' evolution based on node counts, diameter, graph entropy, and clustering coefficient.
[Panels, each plotted against Timestep (0 to 120): (a) Node Count (|V|), with a fitted Linear Model (β=9.932); (b) Diameter; (c) Graph Entropy (H(G)); (d) Clustering Coefficient (CC).]
and entity graphs defined within a single vector for a given community ($f$): $m_f = \{m_1, m_2, \ldots, m_{13}\}$. We began by deriving a single matrix $M \in \mathbb{R}^{93 \times 13}$ containing the 93 communities (rows) under analysis together with their 13 evolution measures (columns), and performing principal component analysis over this matrix. The result of this analysis is shown in Fig. 3, where we have colour-coded the different community forums by their hierarchical level in the platform (level 2 = most general, level 4 = most specific). We found that several of the more general forums appeared as outliers in the plot and thus exhibited unique evolution dynamics, while the two level 4 forums (forums 556 and 554) were bunched together, suggesting that they follow similar trends. We further examined the semantic motifs of three outlier communities from differing levels:
• Level 2: Forum 7 - After Hours. A random discussion forum.
• Level 3: Forum 227 - Television. Discussions about television.
• Level 4: Forum 554 - Wanted Motors. Discussions about cars and car parts.
Fig. 3 presents clear differences between the forums: we note that for the concept graph dynamics (CG) the rate of the node count is lowest for the After Hours forum but that the node count equilibrium is highest, indicating that the more general the forum, the slower the growth of the concept graph but the higher the maximum graph size. Likewise for the specialisation count in the concept graphs: the After Hours forum exhibits a slower rate of growth but with a greater carrying capacity that the graph is tending towards. In terms of the entity graph, the slope of node count growth described by the linear model is highest for After Hours, indicating that, compared to the other two forums, the rate at which new entities are cited by the community of users is much greater, while for the more topically specific Wanted Motors forum it is a lot lower. The entity graph equilibrium is also highest for After Hours and lowest for Wanted Motors, indicating that the more general a forum is, the greater the carrying capacity of its entity graph and the greater the number of entities that will be discussed.
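The PCA step of Section 6.1 can be illustrated with a small standard-library sketch that projects a communities-by-measures matrix onto its first two principal components. Several choices here are assumptions not stated in the text: columns are centred but not rescaled, the components are found by power iteration with deflation, and the toy matrix has 5 rows and 3 columns instead of 93 and 13. Power iteration can fail when the starting vector is orthogonal to the sought component, and the sign of each component is arbitrary.

```python
from math import sqrt


def centre(matrix):
    """Subtract the column mean from every entry."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def covariance(centred):
    """Sample covariance matrix of already-centred rows."""
    n, p = len(centred), len(centred[0])
    return [[sum(row[i] * row[j] for row in centred) / (n - 1)
             for j in range(p)] for i in range(p)]


def mat_vec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def top_eigenpair(matrix, iterations=500):
    """Dominant eigenvalue and eigenvector by power iteration."""
    p = len(matrix)
    vec = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    value = sum(a * b for a, b in zip(vec, mat_vec(matrix, vec)))
    return value, vec


def pca_scores(matrix):
    """Return (PC1, PC2) coordinates for every row of the matrix."""
    centred = centre(matrix)
    cov = covariance(centred)
    val1, pc1 = top_eigenpair(cov)
    p = len(cov)
    # Deflation: remove the first component before extracting the second.
    deflated = [[cov[i][j] - val1 * pc1[i] * pc1[j] for j in range(p)]
                for i in range(p)]
    _, pc2 = top_eigenpair(deflated)
    return [(sum(a * b for a, b in zip(row, pc1)),
             sum(a * b for a, b in zip(row, pc2))) for row in centred]


# Invented toy matrix: 5 "communities" described by 3 "measures".
toy = [[-20, 2, 5], [-10, -1, 5], [0, -2, 5], [10, -1, 5], [20, 2, 5]]
scores = pca_scores(toy)
```

Plotting the returned pairs gives a two-dimensional view of the kind shown in the left panel of Fig. 3, where communities with similar motifs land close together.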
6.2 Churn Rate Prediction
To examine the link between the semantic evolution of online communities and their social properties, we defined a prediction task in which we used the semantic evolution dynamics of a given community at time step $t$ to predict the churn rate of community members at time $t + 1$. We defined the churn rate of a community as the proportion of active users during a given time period (i.e. week segment) that post for the last time. We used the semantic dynamics motifs from the prior experiment (as listed within Fig. 3) and also included graph measures at a given time period, i.e. graph entropy at time $t$, specialisation count at time $t$, etc. We derived these features for every time step for each community and derived the response variable as the churn rate at the following time step. We then compiled a training dataset (up to week 120) and a test dataset (from week 120). Each dataset had the following form: $D = \{(x_i, y_i)\}$, where $x_i$ contained a 21-element time-delimited feature vector for a given community and $y_i$ was the churn rate of the community at the following time step. We trained a ridge regression model using $D_{train}$ and applied it to $D_{test}$, testing the performance of: a) just concept graph features, b) just entity graph features, and c) all features. An autoregressive model was used as the baseline, using the churn rate at time $t$ as a single predictor variable for the churn rate at time $t + 1$. Performance was evaluated using the Root Mean Square Error (RMSE).
Table 1: Root Mean Square Error when predicting churn rates using: an Autoregressive model (R2 = 0.341) and a Ridge Regression model using Concept Graph, Entity Graph, and all features.
Baseline         Concept Graph    Entity Graph     All Features
7.310 × 10^-3    5.315 × 10^-3    5.301 × 10^-3    4.941 × 10^-3
Table 1 presents the results from our prediction experiment. We found that all tested models (concept graph, entity graph, all features) significantly outperformed the baseline, tested using the sign test (α = 0.001). Entity graph features outperform concept graph features but not significantly, while our best model is the use of all features together in a single model. These results empirically demonstrate the utility of semantic evolution dynamics in predicting community churn rates, and suggest a link between how the communities develop semantically and the likelihood of users leaving the communities.
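The pieces of the churn experiment can be sketched with the standard library: the churn rate per week (the share of that week's active users who never post again), a ridge regression fitted in closed form, and the RMSE used for evaluation. This is an illustration under assumptions, not the authors' code: the function names, the unpenalised intercept, the penalty value and the toy inputs are all introduced here. The autoregressive baseline corresponds to calling `ridge_fit` with the previous churn rate as the only feature and a penalty of 0.

```python
from math import sqrt


def churn_rates(active_by_week):
    """Proportion of each week's active users who post for the last time."""
    last_week = {}
    for week, users in enumerate(active_by_week):
        for user in users:
            last_week[user] = week
    return [
        sum(1 for u in users if last_week[u] == week) / len(users) if users else 0.0
        for week, users in enumerate(active_by_week)
    ]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ridge_fit(xs, ys, penalty):
    """Weights [intercept, w1, ...] minimising squared error + penalty * |w|^2."""
    rows = [[1.0] + list(x) for x in xs]
    p = len(rows[0])
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    for i in range(1, p):          # assumption: the intercept is not penalised
        gram[i][i] += penalty
    rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(gram, rhs)


def predict(weights, x):
    return weights[0] + sum(w * v for w, v in zip(weights[1:], x))


def rmse(predicted, actual):
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Invented toy activity: the sets of users posting in three consecutive weeks.
rates = churn_rates([{"a", "b"}, {"a", "c"}, {"a"}])
```

In the toy activity every user of the final week counts as a churner because no later weeks are observed, which is one reason a held-out period after the training window is useful.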
Diameter, graph entropy, and clustering coefficient converge to a limit
Modelling the Semantic Evolution of Individual Communities
For each community and measure, induce a logistic population model:
1. Get the set of measure increase time steps (T)
2. Derive the proportionate growth rate for each step
3. Use the set of proportionate growth rates to solve for unknown community-specific evolution variables (r & E)
Induce Community-Specific Evolution Models
Proportionate Growth Rate @ t
Population Growth Rate
Population (measure) size @ t
Carrying Capacity (Equilibrium)
All communities' models fit with χ2-test at α<0.01
Semantic Evolution… what can we actually use it for?
Application: Community Evolution Analysis
Apply Mined Semantic Evolution Dynamics
Figure 3: PCA plot of the communities based on their semantic motifs (left) where level 4 forums are clustered together, and model values for the concept graphs (CG) and entity graphs (EG) for the three outlier forums from the three levels (right).
[Left panel: forums plotted by forum ID on axes PC1 and PC2, colour-coded as Level 2, Level 3 or Level 4. Right panel: values on a log scale from 10^-1 to 10^3 for f7 - After Hours, f227 - Television and f554 - Wanted Motors across 13 measures: CG Node Count (Rate, Equilibrium); CG Graph Entropy (Rate, Equilibrium); CG Specialisation Count (Rate, Equilibrium); EG Node Count (Slope); EG Diameter (Rate, Equilibrium); EG Entropy (Rate, Equilibrium); EG Clustering Coefficient (Rate, Equilibrium).]
In this work, we found that concept and entity graph density in boards.ie does not grow linearly (unlike in social networks [8]) but instead converges on a limit, which we characterised as the carrying capacity (E) of a given community's concept and entity graph entropy. We also discovered that the diameter of the entity graph in our online community converged on a limit over time as the rate at which concepts arrived slowed down, again in contrast to findings from the social networking domain, where diameters were found to shrink as more nodes joined the network [9]. Indeed, this notion of convergence to a limit is common across all but one of the graph measures that we examined, and suggests that online communities have a finite number of topics that can be discussed and that semantics will converge on a maximum over time.
Our contributions can be summarized as follows: (i) we used semantic graphs to examine how the concepts discussed by communities changed over time at a macro level; (ii) we used logistic population models to inspect how individual communities evolved over time; and (iii) we deployed logistic population models to capture semantic graph changes along different measures and applied our results to community analysis and churn rate prediction. Our work thereby forms a basis for combining studies of social and semantic network evolution in future work.
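As a rough illustration of what convergence on a carrying capacity looks like, the sketch below evaluates a standard logistic growth curve. The closed form used here, and the names `rate`, `capacity` and `initial`, are assumptions chosen for illustration; the exact parameterisation fitted in this work is not reproduced in this section.

```python
import math


def logistic_population(t, rate, capacity, initial):
    """Standard logistic growth curve (assumed form, not taken from the paper).

    t        -- time step
    rate     -- growth rate of the measure
    capacity -- carrying capacity, the limit the measure converges on
    initial  -- value of the measure at t = 0 (must be > 0)
    """
    ratio = (capacity - initial) / initial
    return capacity / (1.0 + ratio * math.exp(-rate * t))


if __name__ == "__main__":
    # Hypothetical parameters: a measure starting at 1.0 and levelling off at 4.0.
    for week in (0, 20, 40, 80, 120):
        print(week, round(logistic_population(week, 0.08, 4.0, 1.0), 3))
```

Under this form a larger `rate` reaches the limit sooner, while `capacity` alone fixes where the curve flattens, which is why the two can be read as separate descriptors of a community.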
8. REFERENCES
[1] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: membership, growth, and evolution. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 44–54. ACM, 2006.
[2] Václav Belak, Marcel Karnstedt, and Conor Hayes. Life-cycles and mutual effects of scientific communities. Procedia - Social and Behavioral Sciences, 22(0):37–48, 2011.
[3] Kon Shing Kenneth Chung, Mahendra Piraveenan, and Shahadat Uddin. Community evolution and engagement through assortative mixing in online social networks. 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 0:724–725, 2012.
[4] Cristian Danescu-Niculescu-Mizil, Robert West, Dan Jurafsky, Jure Leskovec, and Christopher Potts. No country for old members: User lifecycle and linguistic change in online communities. In Proceedings of the World Wide Web Conference, 2013.
[5] Leon Derczynski, Diana Maynard, Niraj Aswani, and Kalina Bontcheva. Microblog-genre noise and impact on semantic annotation accuracy. In Proceedings of the 24th ACM Conference on Hypertext and Social Media (HT 2013), 2013.
[6] Neil Zhenqiang Gong, Wenchang Xu, Ling Huang, Prateek Mittal, Emil Stefanov, Vyas Sekar, and Dawn Song. Evolution of social-attribute networks: Measurements, modeling, and implications using Google+. CoRR, abs/1209.0835, 2012.
[7] Alicia Iriberri and Gondy Leroy. A life-cycle perspective on online community success. ACM Comput. Surv., 41(2):11:1–11:29, February 2009.
[8] Jure Leskovec, Lars Backstrom, Ravi Kumar, and Andrew Tomkins. Microscopic evolution of social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 462–470. ACM, 2008.
[9] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 177–187, New York, NY, USA, 2005. ACM.
[10] Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. Measurement and analysis of online social networks. In SIGCOMM conference on Internet measurement, IMC '07, pages 29–42, 2007.
[11] Roberto Navigli and Mirella Lapata. An experimental study of graph connectivity for unsupervised word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32(4):678–692, 2010.
Semantic Evolution Motif:
[Figure 2 panels, each plotted against Timestep (0 to 120): (a) Node Count |V|, with a fitted linear model (β = 9.932); (b) Diameter; (c) Graph Entropy H(G); (d) Clustering Coefficient CC.]
Figure 2: Entity graphs' evolution based on node counts, diameter, graph entropy, and clustering coefficient.
and entity graphs defined within a single vector for a given community (f): m_f = {m_1, m_2, ..., m_13}. We began by deriving a single matrix M ∈ R^(93×13) containing the 93 communities (rows) under analysis together with their 13 evolution measures (columns), and performing principal component analysis over this matrix. The resulting projection is shown in Fig. 3, where we have colour-coded the different community forums by their hierarchical level in the platform (level 2 = most general, level 4 = most specific). We found that several of the more general forums appeared as outliers in the plot and thus exhibited unique evolution dynamics, while the two level 4 forums (forums 556 and 554) were bunched together, suggesting that they follow similar trends. We further examined the semantic motifs of three outlier communities from differing levels:
• Level 2: Forum 7 - After Hours. A random discussion forum.
• Level 3: Forum 227 - Television. Discussions about television.
• Level 4: Forum 554 - Wanted Motors. Discussions about cars and car parts.
Fig. 3 presents clear differences between the forums. For the concept graph (CG) dynamics, the rate of the node count is lowest for the After Hours forum but its node count equilibrium is highest, indicating that the more general the forum, the slower the growth of the concept graph but the higher the maximum graph size. The same holds for the specialisation count in the concept graphs: the After Hours forum exhibits a slower rate of growth but a greater carrying capacity towards which the graph is tending. For the entity graph, the slope of node count growth described by the linear model is highest for After Hours, indicating that new entities are cited by its community of users at a much greater rate than in the other two forums, while for the more topically specific Wanted Motors forum the slope is much lower. The entity graph equilibrium is also highest for After Hours and lowest for Wanted Motors, indicating that the more general a forum is, the greater the carrying capacity of its entity graph and the greater the number of entities that will be discussed.
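The projection step described above (a communities-by-measures matrix reduced to two principal components) can be sketched as follows. This is a minimal, dependency-free illustration: power iteration with deflation is a stand-in chosen to keep the example self-contained, not the procedure reported by the authors, and the function names and toy matrix are hypothetical.

```python
import math
import random


def center_columns(rows):
    """Subtract each column's mean so every measure has zero mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvector(matrix, iterations=500, seed=0):
    """Dominant eigenpair of a symmetric matrix via power iteration."""
    rng = random.Random(seed)
    d = len(matrix)
    v = [rng.random() + 0.1 for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in this direction
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v


def pca_project(rows, n_components=2):
    """Return (scores, components) for the leading principal components."""
    centered = center_columns(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        eigenvalue, vec = top_eigenvector(cov)
        components.append(vec)
        # Deflation: remove the variance explained by this component.
        cov = [[cov[i][j] - eigenvalue * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(r[i] * c[i] for i in range(d)) for c in components]
              for r in centered]
    return scores, components


if __name__ == "__main__":
    # Hypothetical 4 x 3 matrix: one row per forum, one column per measure.
    toy = [[0.2, 5.0, 1.1], [0.4, 3.0, 0.9], [0.9, 1.5, 0.4], [1.0, 1.0, 0.3]]
    scores, _ = pca_project(toy)
    for row in scores:
        print([round(v, 3) for v in row])
```

Each row of `scores` gives the (PC1, PC2) coordinates of one forum. The sign of a component is arbitrary, so only relative positions in the plot are meaningful.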
6.2 Churn Rate Prediction
To examine the link between the semantic evolution of online communities and their social properties, we defined a prediction task in which we used the semantic evolution dynamics of a given community at time step t to predict the churn rate of community members at time t + 1. We defined the churn rate of a community as the proportion of active users during a given time period (i.e. week segment) who post for the last time. We used the semantic dynamics motifs from the prior experiment (as listed within Fig. 3) and also included graph measures at a given time period, e.g. graph entropy at time t and specialisation count at time t. We derived these features for every time step for each community and derived the response variable as the churn rate at the following time step. We then compiled a training dataset (up to week 120) and a test dataset (from week 120). Each dataset had the following form: D = {(x_i, y_i)}, where x_i contained a 21-element time-delimited feature vector for a given community and y_i was the churn rate of the community at the following time step. We trained a ridge regression model ( ) using D_train and applied it to D_test, testing the performance of: a) just concept graph features, b) just entity graph features, and c) all features. An autoregressive model was used as the baseline, using the churn rate at time t as a single predictor variable for the churn rate at time t + 1. Performance was evaluated using the Root Mean Square Error (RMSE).
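The setup above (features at time t, churn rate at t + 1, ridge regression against an autoregressive baseline, scored by RMSE) can be sketched in a few lines. This is an illustrative sketch under assumptions: the closed-form ridge solution, the regularisation weight `lam`, the helper names and the toy series are all hypothetical and are not taken from the paper.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def fit_ridge(xs, ys, lam):
    """Closed-form ridge fit with an unpenalised intercept."""
    n, d = len(xs), len(xs[0])
    x_mean = [sum(col) / n for col in zip(*xs)]
    y_mean = sum(ys) / n
    xc = [[v - m for v, m in zip(row, x_mean)] for row in xs]
    yc = [y - y_mean for y in ys]
    gram = [[sum(r[i] * r[j] for r in xc) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * y for r, y in zip(xc, yc)) for i in range(d)]
    weights = solve_linear_system(gram, rhs)
    intercept = y_mean - sum(w * m for w, m in zip(weights, x_mean))
    return weights, intercept


def predict(model, xs):
    weights, intercept = model
    return [intercept + sum(w * v for w, v in zip(weights, row)) for row in xs]


def rmse(predicted, actual):
    """Root Mean Square Error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))


def lagged_pairs(features_by_week, churn_by_week):
    """Pair the feature vector at week t with the churn rate at week t + 1."""
    return features_by_week[:-1], churn_by_week[1:]


if __name__ == "__main__":
    # Hypothetical toy series for one community (not data from the paper).
    features = [[1.0, 0.2], [1.4, 0.3], [1.7, 0.3],
                [1.9, 0.5], [2.0, 0.4], [2.1, 0.6]]
    churn = [0.050, 0.046, 0.041, 0.040, 0.036, 0.035]
    xs, ys = lagged_pairs(features, churn)
    ridge = fit_ridge(xs[:4], ys[:4], lam=0.1)
    # Autoregressive baseline: churn at t is the only predictor (lam = 0).
    ar_x = [[c] for c in churn[:-1]]
    baseline = fit_ridge(ar_x[:4], ys[:4], lam=0.0)
    print("ridge RMSE:", rmse(predict(ridge, xs[4:]), ys[4:]))
    print("baseline RMSE:", rmse(predict(baseline, ar_x[4:]), ys[4:]))
```

Splitting by time rather than at random, as in the training and test periods described above, keeps future weeks out of the fitted model.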
Table 1: Root Mean Square Error when predicting churn rates using: an Autoregressive model (R² = 0.341) and a Ridge Regression model using Concept Graph, Entity Graph, and all features.
Baseline        Concept Graph   Entity Graph    All Features
7.310 × 10^-3   5.315 × 10^-3   5.301 × 10^-3   4.941 × 10^-3
Table 1 presents the results from our prediction experiment. We found that all tested models (concept graph, entity graph, all features) significantly outperformed the baseline, tested using the sign test (α = 0.001). Entity graph features outperform concept graph features, but not significantly, while our best model uses all features together in a single model. These results empirically demonstrate the utility of semantic evolution dynamics in predicting community churn rates, and suggest a link between how the communities develop semantically and the likelihood of users leaving the communities.
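For readers unfamiliar with the sign test, the sketch below shows one common formulation: an exact two-sided binomial test on which model has the smaller error for each paired prediction. Pairing on per-prediction absolute errors, dropping ties, and using a two-sided alternative are assumptions made here; the section does not specify these details.

```python
from math import comb


def sign_test(errors_a, errors_b):
    """Exact two-sided sign test p-value for paired error lists.

    Counts how often model A beats model B (smaller error) and vice versa,
    drops ties, and tests against a fair-coin null hypothesis.
    """
    wins_a = sum(1 for a, b in zip(errors_a, errors_b) if a < b)
    wins_b = sum(1 for a, b in zip(errors_a, errors_b) if a > b)
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Hypothetical absolute errors for ten paired predictions.
    model = [0.004, 0.005, 0.003, 0.006, 0.004, 0.005, 0.004, 0.003, 0.005, 0.004]
    base = [0.007, 0.008, 0.006, 0.007, 0.009, 0.006, 0.007, 0.008, 0.006, 0.007]
    print("p-value:", sign_test(model, base))
```

A difference would be reported as significant at α = 0.001 only when the returned p-value falls below that threshold.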
7. DISCUSSION & CONCLUSIONS
Application: Community Evolution Analysis
Apply Mined Semantic Evolution Dynamics
Entity graph: fastest growth in the random forum
Concept graph: slowest growth in the random forum, largest equilibrium
Application: Churn Rate Prediction
¨ Task: forecast community churn rate at t+1
  ¤ Based on semantic evolution up to t
¨ Trained a ridge regression model
¨ Root Mean Square Error: Predicted vs. Actual churn rate
Timeline (weeks): Training 0 to 120, Test 120 to 150
Application: Churn Rate Prediction
0 20 40 60 80 100 120
050
010
0015
00
Timestep
|V|
Linear Model (β=9.932)
(a) Node Count
0 20 40 60 80 100 120
12
34
56
7
Timestep
Diameter
(b) Diameter
0 20 40 60 80 100 120
12
34
Timestep
H(G)
(c) Graph Entropy
0 20 40 60 80 100 120
0.02
0.06
0.10
0.14
Timestep
CC
(d) Clustering Coe�cient
Figure 2: Entity graphs’ evolution based on node counts, diameter, graph entropy, and clustering coe�cient.
and entity graphs defined within a single vector for a givencommunity (f): mf = {m1,m2, . . . ,m13}. We began by de-riving a single matrixM 2 R93⇥13 containing the 93 commu-nities (rows) under analysis together with their 13 evolutionmeasures (columns), and performing principal componentanalysis over this matrix. The result of this clustering isshown in Fig. 3 where we have colour-coded the di↵erentcommunity forums by their hierarchical level in the plat-form (level 2 = most general, level 4 = most specific). Wefound that several of the more general forums appeared asoutliers in the plot and thus exhibited unique evolution dy-namics, while the two level 4 forums (forum 556 and 554)were bunched together suggesting that they follow similartrends. We further examined the semantic motifs of threeoutlier communities from di↵ering levels:
• Level 2: Forum 7 - After Hours. A random discussionforum.
• Level 3: Forum 227 - Television. Discussions abouttelevision.
• Level 4: Forum 554 - Wanted Motors. Discussionsabout cars and car parts.
Fig. 3 presents clear di↵erences between the forums: wenote that for the concept graph dynamics (CG) the rate ofthe node count is lowest for the After Hours forum but thatthe node count equilibrium is highest, indicating that themore general the forum the slower is the growth of the con-cept graph, but the higher the maxima of the graph size.Likewise for the specialisation count in the concept graphs:the After Hours forum exhibits a slower rate of growth butwith a greater carrying capacity that the graph is tending to-wards. In terms of the entity graph: the slope of node countgrowth described by the linear model is highest for AfterHours, indicating that compared to the other two forums,the rate at which new entities are cited by the community ofusers is much greater, while for the more topically-specificforum of the Wanted Motors forum this is a lot lower. Theentity graph equilibrium is also highest for After Hours andlowest for Wanted Motors, indicating that the more generala forum is the greater the carrying capacity of its entitygraph and the greater the number of entities that will bediscussed.
6.2 Churn Rate PredictionTo examine the link between the semantic evolution of
online communities and their social properties, we defineda prediction task in which we used the semantic evolution
dynamics of a given community at time step t to predictthe churn rate of community members at time t + 1. Wedefined the churn rate of a community as the proportion ofactive users during a given time period (i.e. week segment)that post for the last time. We used the semantic dynamicsmotifs from the prior experiment (as listed within Fig. 3)and also included graph measures at a given time period:i.e. graph entropy at time t, specialisation count at time t,etc. We derived these features for every time step for eachcommunity and derived the response variable as the churnrate at the following time step. We then compiled a train-ing dataset (up to week 120) and a test dataset (from week120). Each dataset had the following form: D = {(xi, yi)},where xi contained a 21-element time-delimited feature vec-tor for a given community and yi was the churn rate of thecommunity at the following time step. We trained a ridgeregression model ( ) using Dtrain and applied it to Dtest,testing the performance of: a) just concept graph features,b) just entity graph features, and c) all features. An autore-gressive model was used as the baseline - using the churnrate at time t as a single predictor variable for the churnrate at time t + 1. Performance was evaluated using theRoot Mean Square Error (RMSE).
Table 1: Root Mean Square Error when predicting churn rates using an Autoregressive model (R^2 = 0.341) and a Ridge Regression model using Concept Graph, Entity Graph, and all features.

Model            RMSE
Baseline         7.310 × 10^-3
Concept Graph    5.315 × 10^-3
Entity Graph     5.301 × 10^-3
All Features     4.941 × 10^-3
Table 1 presents the results from our prediction experiment. All tested models (concept graph, entity graph, all features) significantly outperformed the baseline, as tested using the sign test (α = 0.001). Entity graph features outperform concept graph features, but not significantly, while our best model uses all features together in a single model. These results empirically demonstrate the utility of semantic evolution dynamics in predicting community churn rates, and suggest a link between how the communities develop semantically and the likelihood of users leaving the communities.
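A sign test of this kind can be sketched as follows. The source does not say how the paired observations were formed or whether the test was one- or two-sided, so this sketch assumes paired per-instance absolute errors and a two-sided test; the function name `sign_test_p_value` and the example errors are hypothetical.

```python
from math import comb

def sign_test_p_value(errors_a, errors_b):
    """Two-sided sign test on paired errors. Ties are dropped; under the null
    hypothesis each model is equally likely to have the smaller error."""
    wins_a = sum(1 for a, b in zip(errors_a, errors_b) if a < b)
    wins_b = sum(1 for a, b in zip(errors_a, errors_b) if a > b)
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Binomial(n, 0.5) probability of k or fewer wins, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical paired absolute errors: the model beats the baseline in all 12 pairs.
model_errors = [0.1] * 12
baseline_errors = [0.2] * 12
p = sign_test_p_value(model_errors, baseline_errors)
print(p, p < 0.001)
```

Because the test only uses the direction of each paired difference, it makes no assumption about the distribution of the errors themselves.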
Significant reduction in error (Sign test with α=0.001)
Baseline: Autoregressive model with churn rate @ t as a predictor for t+1
Apply Mined Semantic Evolution Dynamics
Findings and Conclusions
¨ Semantic graphs of online communities do not grow linearly: instead, they evolve to a limit
¤ Unlike in social networks [Leskovec et al., 2008]
¤ Exception: entity graph size
¨ A finite number of topics are discussed within communities
¤ Variation between communities
¨ Our use of logistic population models has enabled:
1. Characterisation of community-specific evolution dynamics
2. Community analysis to inspect how communities evolved differently
3. Churn prediction based on semantic evolution
Future Work
1. Expand to cover other online communities
¤ Question-answering, mined communities
2. User profiling
¤ Capturing user-specific semantic evolution
[Figure: In-edge weight distribution (ServerFault). Panels: k = 5, k = 10, k = 20; x-axis: Lifecycle Stage; y-axis: H; series: Non-churners and Churners.]
Matthew Rowe @mrowebot [email protected] http://www.lancaster.ac.uk/staff/rowem/ Markus Strohmaier @mstrohm [email protected] http://markusstrohmaier.info/
Questions?