monitoring and analysis of online communities
DESCRIPTION
Invited talk at the Web Science Doctoral Summer School, Galway, 2011
TRANSCRIPT
1
Harith Alani Knowledge Media institute, The Open University, UK
Web Science Summer School Galway, 2011
http://twitter.com/halani http://delicious.com/halani http://www.linkedin.com/pub/harith-alani/9/739/534
Monitoring and Analysis of Online
Communities
Market value of Web Analytics
2
Agenda
• Community monitoring
• Offline and online social networking
• Modeling and tracking behaviour
• Analysing community features
• Predicting discussion activity
3
Online community monitoring
• Analysing and understanding activities and dynamics
• Studying impact of social and technical features
• Forecast future growth and evolution
• Tracking behaviour and influence
• Tracking reputation and buzz
• Listening to customer opinion
• Profiling the user base
• Gauging customer sentiment
4
5
Measuring social media
Deloitte, Beeline Labs, & Society for New Communication Research surveyed 140 companies with online communities, 2008
6
Measuring social media
Deloitte, Beeline Labs, & Society for New Communication Research surveyed 140 companies with online communities, 2008
7
“B2B Marketing Goes Social: A White Horse Survey Report” – March 2010 – study of 104 companies
Measuring social media
8
Measuring social media
“Social media usage, attitudes and measurability: What do marketers think?” – KingFishMedia, 2010
9
Tools for monitoring social media
http://www.ubervu.com/ 10
• Analytics:
  – Mention volume
  – Sentiment
  – Discussion clouds
  – Activity graphs and metrics
  – Language and geolocation filtering
  – Filter by social platform
  – Comparisons
http://www.viralheat.com/home 11
• Analytics:
  – Influencing users
  – Sentiment and opinion analysis
  – Viral content analysis
  – Detecting sales leads
  – Filter by geo-location
Monitoring and Analysis of Online Communities With a Web Science flavour
12
Online vs. Offline social networking
13
• Digital social networking increases physical social isolation
• Causes
  – Genetic alterations
  – Weakened immune system
  – Less resistant to cancer
  – Higher risk of heart disease
  – Higher blood pressure
  – Faster dementia
  – Narrower arteries
Aric Sigman, “Well Connected? The Biological Implications of 'Social Networking’”, Biologist, 56(1), 2009
14
Online vs. offline social networking: The Bad News!
• Digital networking increases social interaction
  – Transforms little boxed societies to networked and networking societies
  – Creates more opportunities to network
  – New methods to communicate, easily and widely
  – Supports and increases F2F contact!
  – The stronger the offline social tie, the more intense the online communication
  – The stronger the offline social tie, the more diverse the online communications
  – F2F is medium of choice in weaker social ties
Keith Hampton and Barry Wellman, Long Distance Community in the Network Society: Contact and Support Beyond Netville, American Behavioral Scientist 45 (3), November, 2001.
Barry Wellman, The Glocal Village: Internet and Community, Idea’s - The Arts & Science Review, University of Toronto, 1(1),2004
15
Online vs. offline social networking: The Good News!
Physical online & digital offline
16
Sensor & Social Networks
17
Sensor & Social Networks
18
www.nabaztag.com
The Canine Twitterer
“Having my daily workout. Already did 15 leg lifts!”
Tag-Along Marketing The New York Times, November 6, 2010
“Everything is in place for location-based social networking to be the next big thing. Tech companies are building the platforms, venture capitalists are providing the cash and marketers are eager to develop advertising.”
Location Sensors & Social Networking
19
Monitoring online/offline social activity
20
Where is everybody?
Monitoring online/offline social activity
• Generating opportunities for F2F networking
21
Monitoring online/offline social activity
22
“There are more than 250 million active users currently accessing Facebook through their mobile devices”
“People that use Facebook on their mobile devices are twice as active on Facebook than non-mobile users”
http://www.facebook.com/press/info.php?statistics
Tracking of F2F contact networks
23
TraceEncounters - 2004
Sociometer, MIT, 2002 - F2F and productivity
- F2F dynamics
- Who are key players?
- F2F and office distance
24
SocioPatterns platform
24 http://www.sociopatterns.org/
Offline social networks
25 by Ciro Cattuto
From a small conference at ISI, Turin
26
• Similarity features
  – Country of origin
  – Seniority
  – .. Age? Role? Projects? Interests?
• What other info can we get to help us understand these network dynamics?
[Figure labels: SR, JR, students]
Offline social networks
Offline + online social networking
27 ESWC2010
Where should I go?
Where have I met this guy?
Anyone I know here?
Who should I talk to?
<?xml version="1.0"?>
<rdf:RDF
    xmlns="http://tagora.ecs.soton.ac.uk/schemas/tagging#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
  xml:base="http://tagora.ecs.soton.ac.uk/schemas/tagging">
  <owl:Ontology rdf:about=""/>
  <owl:Class rdf:ID="Post"/>
  <owl:Class rdf:ID="TagInfo"/>
  <owl:Class rdf:ID="GlobalCooccurrenceInfo"/>
  <owl:Class rdf:ID="DomainCooccurrenceInfo"/>
  <owl:Class rdf:ID="UserTag"/>
  <owl:Class rdf:ID="UserCooccurrenceInfo"/>
  <owl:Class rdf:ID="Resource"/>
  <owl:Class rdf:ID="GlobalTag"/>
  <owl:Class rdf:ID="Tagger"/>
  <owl:Class rdf:ID="DomainTag"/>
  <owl:ObjectProperty rdf:ID="hasPostTag">
    <rdfs:domain rdf:resource="#TagInfo"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="hasDomainTag">
    <rdfs:domain rdf:resource="#UserTag"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="isFilteredTo">
    <rdfs:range rdf:resource="#GlobalTag"/>
    <rdfs:domain rdf:resource="#GlobalTag"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="hasResource">
    <rdfs:domain rdf:resource="#Post"/>
    <rdfs:range =…
Live Social Semantics (LSS): RFIDs + Social Web + Semantic Web
• Integration of physical presence and online information
• Semantic user profile generation
• Logging of face-to-face contact
• Social network browsing
• Analysis of online vs offline social networks
SW sources
29
proceedings chair
chair author
CoP
conference
30
Social and information networks
30
31
Merging social networks
31 FOAF
32
Tag Filtering Service
Semantic modeling Semantic analysis Collective intelligence Statistical analysis Syntactical analysis
33
Tag Filtering Service
34
From Tags to Semantics
34
35
Tags to User Interests
35
36
From raw tags and social relations to Structured Data
User raw data
Structured data
Collective intelligence
ontologies
Semantic data
37
RFIDs for tracking social contact
37
38
Convergence with online social networks
38
People contact → RFID → RDF Triples
39
F2FContact
hasContact
contactWith
contactDate contactDuration
XMLSchema#date XMLSchema#time
contactPlace
Place
foaf#Person1
foaf#Person2
40
41
42
Real-time F2F networks with SNS links
http://www.vimeo.com/6590604
43
Deployed at:
Live Social Semantics
Data analysis
• Face-to-face interactions across scientific conferences
• Networking behaviour of frequent users
• Correlations between scientific seniority and social networking
• Comparison of F2F contact network with Twitter and Facebook
• Social networking with online and offline friends
Analysis of LSS Results
The New Yorker 2/11/2008
44
Characteristics of F2F contact network
• Degree is the number of people with whom the person had at least one F2F contact
• Strength is the total time a person spent in F2F contact
• Edge weight is the total time spent by a pair of users in F2F contact
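The three measures above can be derived from a raw log of contact events. The following is a minimal sketch only: the `(person_a, person_b, duration_seconds)` event format, the `network_stats` helper and the toy data are assumptions made for illustration, not the format used by the deployed platform.

```python
from collections import defaultdict

def network_stats(events):
    """events: iterable of (person_a, person_b, duration_seconds) contact events."""
    weight = defaultdict(float)      # pair -> total time in contact (edge weight)
    for a, b, seconds in events:
        weight[frozenset((a, b))] += seconds
    degree = defaultdict(int)        # person -> number of distinct contacts
    strength = defaultdict(float)    # person -> total time in contact
    for pair, total in weight.items():
        for person in pair:
            degree[person] += 1
            strength[person] += total
    return dict(degree), dict(strength), dict(weight)

# Made-up contact log, durations in seconds.
events = [("ann", "bob", 40), ("ann", "bob", 20), ("ann", "cat", 60)]
degree, strength, weight = network_stats(events)
print(degree["ann"], strength["ann"], weight[frozenset(("ann", "bob"))])
```

Repeated contacts between the same pair add to one edge weight, so degree counts distinct people rather than contact events.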
45
Network characteristics   ESWC 2009   HT 2009   ESWC 2010
Number of users           175         113       158
Average degree            54          39        55
Avg. strength (mn)        143         123       130
Avg. weight (mn)          2.65        3.15      2.35
Weights ≤ 1 mn            70%         67%       74%
Weights ≤ 5 mn            90%         89%       93%
Weights ≤ 10 mn           95%         94%       96%
Characteristics of F2F contact events
Contact characteristics      ESWC 2009   HT 2009   ESWC 2010
Number of contact events     16258       9875      14671
Average contact length (s)   46          42        42
Contacts ≤ 1mn               87%         89%       88%
Contacts ≤ 2mn               94%         96%       95%
Contacts ≤ 5mn               99%         99%       99%
Contacts ≤ 10mn              99.8%       99.8%     99.8%
F2F contact pattern is very similar for all three conferences
F2F contacts of returning users
[Figure: ESWC2010 versus ESWC2009 values for returning users (logarithmic axes); panels: Degree, Total interaction time, Links’ weights]
47
• Degree: number of other participants with whom an attendee has interacted
• Total time: total time spent in interaction by an attendee
• Link weight: total time spent in F2F interaction by a pair of returning attendees in 2010, versus the same quantity measured in 2009
Time spent on F2F networking by frequent users is stable, even when the list of people they networked with changed
ESWC 2009 & ESWC 2010        Pearson Correlation
Degree                       0.37
Total F2F interaction time   0.76
Link weight                  0.75
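For reference, a Pearson correlation between the 2009 and 2010 values of returning attendees can be computed as below. This is a sketch under stated assumptions: the `pearson` helper and the toy interaction times are invented for illustration, and the slides do not say whether raw or log-transformed values were correlated.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Made-up total interaction times (seconds) for four returning attendees.
time_2009 = [100, 200, 300, 400]
time_2010 = [110, 190, 320, 390]
r = pearson(time_2009, time_2010)
print(round(r, 2))
```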
Average seniority of neighbours in F2F networks
[Figure: average seniority of neighbours versus seniority (number of papers); series: sen_nn, sen_nn,w, sen_nn,max]
48
• No clear pattern is observed if the unweighted average over all neighbours in the aggregated network is considered
• A correlation is observed when each neighbour is weighted by the time spent with the main person
• The correlation becomes much stronger when considering for each individual only the neighbour with whom the most time was spent
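The three ways of summarising neighbour seniority described above can be sketched for a single attendee as follows. The `neighbour_seniority` helper and the toy numbers are assumptions for illustration; seniority is taken as the number of papers, as on the figure axis.

```python
def neighbour_seniority(neighbours):
    """neighbours: list of (seniority, seconds_spent_together) for one person."""
    total_time = sum(seconds for _, seconds in neighbours)
    # 1. Plain average over all neighbours.
    unweighted = sum(s for s, _ in neighbours) / len(neighbours)
    # 2. Average weighted by the time spent with each neighbour.
    weighted = sum(s * seconds for s, seconds in neighbours) / total_time
    # 3. Seniority of the neighbour with whom the most time was spent.
    strongest = max(neighbours, key=lambda pair: pair[1])[0]
    return unweighted, weighted, strongest

# Made-up neighbours: (number of papers, contact time in seconds).
unweighted, weighted, strongest = neighbour_seniority([(1, 600), (5, 60), (9, 40)])
print(unweighted, weighted, strongest)
```

In this toy case a junior attendee briefly meets two senior people, so the plain average is pulled up while the weighted and strongest-link values stay close to the attendee's main contact.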
Avg seniority of the neighbours
with weighted averages
Seniority of user with strongest link
Conference attendees tend to network with others of similar levels of scientific seniority
Presence of Attendees HT2009
Importance of the bar? Popularity of sessions? particular talks?
Number of cliques HT2009
Offline networking vs online networking
51
• People who have a large number of friends on Twitter and/or Facebook don’t seem to be the most socially active in the offline world in comparison to other SNS users
Users with Facebook and Twitter accounts in ESWC 2010
Twitterers                       Spearman Correlation (ρ)
Tweets – F2F Degree              -0.15
Tweets – F2F Strength            -0.15
Twitter Following – F2F Degree   -0.21
No strong correlation between amount of F2F contact activity and size of online social networks
[Figure: per-user plot (x-axis: users); legend includes Twitter Followers and Tweets]
Scientific seniority vs Twitter followers
52
• Comparison between people’s scientific seniority and the number of people following them on Twitter
People who have the highest number of Twitter followers are not necessarily the most scientifically senior, although they do have high visibility and experience
Twitter users                 Correlation
H-index – Twitter Followers   0.32
H-index – Tweets              -0.13
Conference Chairs
                                    all participants 2009   chairs 2009   all participants 2010   chairs 2010
average degree                      55                      77.7          54                      77.6
average strength                    8590                    19590         7807                    22520
average weight                      159                     500           141                     674
average number of events per edge   3.44                    8             3.37                    12
• Conf chairs interact with more distinct people (larger average degree)
• Conf chairs spend more time in F2F interaction (almost three times as much as a random participant)
Networking with online and offline ‘friends’
Characteristics                     all users   coauthors   Facebook friends   Twitter followers
average contact duration (s)        42          75          63                 72
average edge weight (s)             141         4470        830                1010
average number of events per edge   3.37        60          13                 14
• Individuals sharing an online or professional social link meet much more often than other individuals
• Average number of encounters, and total time spent in interaction, is highest for co-authors
F2F contacts with Facebook & Twitter friends were respectively 50% and 71% longer, and 286% and 315% more frequent, than with others. Attendees spent 79% more time per F2F contact with their co-authors, and they met them 1680% more often than they met non co-authors.
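These percentages follow from the table values (average contact duration and average number of events per edge) when each group is compared with the "all users" column. A small worked check, where the `pct_more` helper is just an illustrative name:

```python
def pct_more(value, baseline):
    """Relative increase of value over baseline, in percent."""
    return (value / baseline - 1) * 100

# Values copied from the table: contact duration (s) and events per edge.
all_users = {"duration": 42, "events": 3.37}
groups = {
    "Facebook friends": {"duration": 63, "events": 13},
    "Twitter followers": {"duration": 72, "events": 14},
    "coauthors": {"duration": 75, "events": 60},
}
for name, g in groups.items():
    longer = round(pct_more(g["duration"], all_users["duration"]))
    more_often = round(pct_more(g["events"], all_users["events"]))
    print(f"{name}: {longer}% longer, {more_often}% more frequent")
```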
Twitterers vs Non-Twitterers
• Time spent in conference rooms
  – Twitter users spent on average 11.4% more time in the conf rooms than non-twitter users (mean is 26% higher)
• Number of people met F2F during the conference
  – Twitter users met on average 9% more people F2F (mean 8% higher)
• Duration of F2F contacts
  – Twitter users spent on average 63% more time in F2F contact than non twitter users (mean is 20% higher)
55
56
Analysis of behaviour in online communities
Behaviour of individuals – micro level analysis
57
[Figure: H-index, F2F degree and F2F strength per user, with annotated example profiles]
Why monitor behaviour?
• Understand impact of behaviour on community evolution
• Forecast community future
• Learn when intervention might be needed
• Learn which behaviour should be encouraged or discouraged
• Find what could trigger certain behaviours
• What is the best mix of behaviour to increase engagement in the community
• To see which users need more support, which ones should be confined, and which ones should be promoted
58
Behaviour analysis
• Behaviour compositions in Boards.ie:
Jeffrey Chan, Conor Hayes, and Elizabeth Daly. Decomposing discussion forums using common user roles. In Proc. Web Science Conf. (WebSci10), Raleigh, NC: US, 2010
Ontology
Encoding Rules in Ontologies with SPIN
Approach for inferring User Roles
62
Structural, social network, reciprocity, persistence, participation
Feature levels change with the dynamics of the community
Associate Roles with a collection of feature-to-level Mappings e.g. in-degree -> high, out-degree -> high
Run our rules over each user’s features and derive the role composition
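The approach above can be illustrated without an ontology. The sketch below swaps the SPIN rules for plain Python dictionaries, so it is not the talk's implementation: the rule table, the cut points and the feature names are hypothetical, and only the idea (map features to levels, match roles against feature-to-level rules, then aggregate into a role composition) comes from the slide.

```python
from collections import Counter

# Hypothetical role rules: role -> required level per feature.
ROLE_RULES = {
    "popular participant": {"in_degree": "high", "out_degree": "high"},
    "ignored": {"in_degree": "low", "posts_replied": "low"},
}

def to_level(value, low_cut, high_cut):
    """Map a raw feature value to a level using per-window cut points."""
    if value < low_cut:
        return "low"
    if value > high_cut:
        return "high"
    return "mid"

def infer_roles(features, cuts):
    """Return every role whose feature-to-level rule matches this user."""
    levels = {name: to_level(value, *cuts[name]) for name, value in features.items()}
    return [role for role, rule in ROLE_RULES.items()
            if all(levels.get(feature) == level for feature, level in rule.items())]

def role_composition(users, cuts):
    """Share of users matching each role in one time window."""
    counts = Counter(role for feats in users.values() for role in infer_roles(feats, cuts))
    return {role: n / len(users) for role, n in counts.items()}

# Made-up feature values; cut points would be recomputed for each time window.
users = {
    "u1": {"in_degree": 0.9, "out_degree": 0.8, "posts_replied": 0.7},
    "u2": {"in_degree": 0.05, "out_degree": 0.5, "posts_replied": 0.1},
}
cuts = {name: (0.2, 0.6) for name in ("in_degree", "out_degree", "posts_replied")}
composition = role_composition(users, cuts)
print(composition)
```

Recomputing `cuts` per window is one way to reflect the point that feature levels change with the dynamics of the community.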
Data from Boards.ie
• Forum 246 (Commuting and Transport): Demonstrates a clear increase in activity over time.
• Forum 388 (Rugby): Exhibits periodic increase and decrease in activity and hence it provides good examples of healthy/unhealthy evolutions.
• Forum 411 (Mobile Phones and PDAs): Increase in activity over time with some fluctuation - i.e. reduction and increase over various time windows.
• Time period: 2004-01 to 2006-12
Features
• In-degree Ratio: The proportion of users U that reply to user υi, thus indicating the concentration of users that reply to υi
• Posts Replied Ratio: Proportion of posts by user υi that yield a reply, used to gauge the popularity of the user’s content based on replies
• Thread Initiation Ratio: Proportion of threads that have been started by υi.
• Bi-directional Threads Ratio: Proportion of threads where user υi replies to a user and receives a reply, thus forming a reciprocal communication
• Bi-directional Neighbours Ratio: The proportion of neighbours where a reciprocal interaction has taken place - e.g. υi replied to υj and υj replied to υi.
• Average Posts per Thread: The average number of posts made in every thread that user υi has participated in
• Standard Deviation of Posts per Thread: The standard deviation of the number of posts in every thread that user υi has participated in. This gauges the distribution of the discussion lengths.
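Two of the features above can be sketched from a list of reply edges. This is an illustration under assumptions: replies are given as `(replier, replied_to)` pairs, the in-degree ratio is normalised by the number of other users, and the helper names and toy data are invented for the example.

```python
def in_degree_ratio(user, replies, all_users):
    """Share of the other users who replied to `user` at least once."""
    repliers = {src for src, dst in replies if dst == user and src != user}
    return len(repliers) / (len(all_users) - 1)

def bidirectional_neighbours_ratio(user, replies):
    """Share of a user's neighbours with whom replies went both ways."""
    replied_to = {dst for src, dst in replies if src == user and dst != user}
    replied_by = {src for src, dst in replies if dst == user and src != user}
    neighbours = replied_to | replied_by
    if not neighbours:
        return 0.0
    return len(replied_to & replied_by) / len(neighbours)

# Made-up reply edges: (replier, replied_to).
replies = [("a", "b"), ("b", "a"), ("c", "a"), ("a", "d")]
all_users = {"a", "b", "c", "d"}
print(in_degree_ratio("a", replies, all_users))
print(bidirectional_neighbours_ratio("a", replies))
```

Here user "a" has three neighbours (b, c, d) but a reciprocal exchange only with "b".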
Role Skeleton
Results
• Correlation of individual features in each of the three forums
Commuting and Transport Rugby Mobile Phones and PDAs
Results
[Figure panels: (a) Forum 246: Commuting and Transport; (b) Forum 388: Rugby; (c) Forum 411: Mobile Phones and PDAs]
• Variation in behaviour composition & activity
• Behaviour composition in/stability influences forum activity
Prediction analysis – preliminary results!
• Predicting rise/fall in post submission numbers
• Binary classification
• Features : Community composition, roles and percentages of users associated with each
• Cross-community predictions are less reliable than individual community analysis due to the idiosyncratic behaviour observed in each individual community
Forum P R F1 ROC
246 0.799 0.769 0.780 0.800
388 0.603 0.615 0.605 0.775
411 0.765 0.692 0.714 0.617
All 0.583 0.667 0.607 0.466
Observations so far
• Growing communities contain more elitists and popular participants
• Shrinking communities contain many taciturns and ignored users
• A stable composition, with a mix of roles, is associated with increased community activity
• Different communities may require different behaviour compositions to increase activity/health
What features make online communities tick
71
• How many do you recognise? Use?
• Which ones still exist?
• Which are strong and healthy?
• Which are aging and withering?
• What health signs should we look for?
• How can we predict their future evolution?
Rise and fall of social networks
72
Predicting engagement
• Which posts will receive a reply? – What are the most influential features here?
• How much discussion will it generate? – What are the key factors of lengthy discussions?
73
Common online community features
74
initial tweet that generates a reply. Features which describe seed posts can be divided into two sets: user features - attributes that define the user making the post; and, content features - attributes that are based solely on the post itself. We wish to explore the application of such features in identifying seed posts, to do this we train several machine learning classifiers and report on our findings. However we first describe the features used.
4.1 Feature Extraction
The likelihood of posts eliciting replies depends upon popularity, a highly subjective term influenced by external factors. Properties influencing popularity include user attributes - describing the reputation of the user - and attributes of a post’s content - generally referred to as content features. In Table 1 we define user and content features and study their influence on the discussion “continuation”.
Table 1. User and Content Features
User Features
In Degree: Number of followers of U
Out Degree: Number of users U follows
List Degree: Number of lists U appears on. Lists group users by topic
Post Count: Total number of posts the user has ever posted
User Age: Number of minutes from user join date
Post Rate: Posting frequency of the user: $\frac{PostCount}{UserAge}$
Content Features
Post length: Length of the post in characters
Complexity: Cumulative entropy of the unique words in post p, $\lambda$ of total word length n and $p_i$ the frequency of each word: $\frac{\sum_{i \in [1,n]} p_i(\log \lambda - \log p_i)}{\lambda}$
Uppercase count: Number of uppercase words
Readability: Gunning fog index using average sentence length (ASL) [7] and the percentage of complex words (PCW): $0.4(ASL + PCW)$
Verb Count: Number of verbs
Noun Count: Number of nouns
Adjective Count: Number of adjectives
Referral Count: Number of @user
Time in the day: Normalised time in the day measured in minutes
Informativeness: Terminological novelty of the post wrt other posts. The cumulative tfIdf value of each term t in post p: $\sum_{t \in p} tfidf(t, p)$
Polarity: Cumulation of polar term weights in p (using Sentiwordnet³ lexicon) normalised by polar terms count: $\frac{Po + Ne}{|terms|}$
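A few of the simpler Table 1 features can be computed directly from the tweet text and the user record. The sketch below is illustrative only: whitespace tokenisation, the `@\w+` pattern for referrals and the helper names are assumptions, not the extraction code used in the paper.

```python
import re

def content_features(text):
    """Compute three of the simpler content features for one post."""
    words = text.split()
    return {
        "post_length": len(text),                        # length in characters
        "uppercase_count": sum(1 for w in words
                               if len(w) > 1 and w.isupper()),
        "referral_count": len(re.findall(r"@\w+", text)),  # number of @user
    }

def post_rate(post_count, user_age_minutes):
    """Posting frequency: PostCount / UserAge."""
    return post_count / user_age_minutes

# Made-up tweet and user values.
text = "HELP needed in Haiti @redcross @un"
features = content_features(text)
print(features, post_rate(120, 60))
```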
4.2 Experiments
Experiments are intended to test the performance of different classification models in identifying seed posts. Therefore we used four classifiers: discriminative classifiers Perceptron and SVM, the generative classifier Naive Bayes and the decision-tree classifier J48. For each classifier we used three feature settings: user features, content features and user+content features.
Datasets For our experiments we used two datasets of tweets available on the Web: Haiti earthquake tweets⁴ and the State of the Union Address tweets.⁵ The
⁴ http://infochimps.com/datasets/twitter-haiti-earthquake-data
⁵ http://infochimps.com/datasets/tweets-during-state-of-the-union-address
• How do all these features influence activity generation in an online community? – Such knowledge leads to better use and management of the community
Experiment for identifying seed posts
• Twitter data on the Haiti earthquake, and the Union Address
• Evaluated a binary classification task – Is this post a seed post or not?
75
Dataset Users Tweets Seeds Non-seeds Replies
Haiti 44,497 65,022 1,405 60,686 2,931
Union Address 66,300 80,272 7,228 55,169 17,875
Identifying seeds with different types of features
76
use f-measure, as defined in Equation 1 as the harmonic mean between precision and recall, setting β = 1 to weight precision and recall equally. We also plot the Receiver Operator Curve of our trained models to show graphical comparisons of performance.
$$F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R} \qquad (1)$$
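Equation 1 translates into a few lines of Python. As a sanity check, the sketch below recomputes F1 from precision and recall values reported in Table 3 (J48 with user features on Haiti, and J48 with all features on Union Address); the `f_beta` helper name is an assumption.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure from Equation 1; beta = 1 weights P and R equally."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator

print(round(f_beta(0.906, 0.679), 3))  # Haiti, user features, J48
print(round(f_beta(0.890, 0.810), 3))  # Union Address, all features, J48
```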
For our experiments we divided each dataset up into 3 sets: a training set, a validation set and a testing set using a 70/20/10 split. We trained our classification models using the training split and then applied them to the validation set, labelling the posts within this split. From these initial results we performed model selection by choosing the best performing model - based on maximising the F1 score - and used this model together with the best performing features, using a ranking heuristic, to classify posts contained within our test split. We first report on the results obtained from our model selection phase, before moving onto our results from using the best model with the top-k features.
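A 70/20/10 split of this kind can be sketched as follows. The excerpt does not say how the Twitter datasets were partitioned, so the random shuffle, the fixed seed and the `split_dataset` helper are assumptions made for illustration.

```python
import random

def split_dataset(items, seed=0):
    """Shuffle and split into 70% training, 20% validation, 10% test."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded for repeatability
    n_train = len(items) * 70 // 100
    n_val = len(items) * 20 // 100
    train = items[:n_train]
    validation = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, validation, test

train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Model selection would then use `train` and `validation` only, leaving `test` untouched until the best model and features have been chosen.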
Table 3. Results from the classification of seed posts using varying feature sets and classification models
(a) Haiti Dataset
Features   Classifier   P       R       F1      ROC
User       Perc         0.794   0.528   0.634   0.727
User       SVM          0.843   0.159   0.267   0.566
User       NB           0.948   0.269   0.420   0.785
User       J48          0.906   0.679   0.776   0.822
Content    Perc         0.875   0.077   0.142   0.606
Content    SVM          0.552   0.727   0.627   0.589
Content    NB           0.721   0.638   0.677   0.769
Content    J48          0.685   0.705   0.695   0.711
All        Perc         0.794   0.528   0.634   0.726
All        SVM          0.483   0.996   0.651   0.502
All        NB           0.962   0.280   0.434   0.852
All        J48          0.824   0.775   0.798   0.836
(b) Union Address Dataset
Features   Classifier   P       R       F1      ROC
User       Perc         0.658   0.697   0.677   0.673
User       SVM          0.510   0.946   0.663   0.512
User       NB           0.844   0.086   0.157   0.707
User       J48          0.851   0.722   0.782   0.830
Content    Perc         0.467   0.698   0.560   0.457
Content    SVM          0.650   0.589   0.618   0.638
Content    NB           0.762   0.212   0.332   0.649
Content    J48          0.740   0.533   0.619   0.736
All        Perc         0.630   0.762   0.690   0.672
All        SVM          0.499   0.990   0.664   0.506
All        NB           0.874   0.212   0.341   0.737
All        J48          0.890   0.810   0.848   0.877
4.3 Results
Our findings from Table 3 demonstrate the effectiveness of using solely user features for identifying seed posts. In both the Haiti and Union Address datasets training a classification model using user features shows improved performance over the same models trained using content features. In the case of the Union dataset we are able to achieve an F1 score of 0.782, coupled with high precision, when using the J48 decision-tree classifier - where the latter figure (precision) indicates conservative estimates using only user features. We also achieve similar high-levels of precision when using the same classifier on the Haiti dataset. The plots of the Receiver Operator Characteristic (ROC) curves in Figure 2 show similar levels of performance for each classifier over the two corpora. When using solely user features J48 is shown to dominate the ROC space, subsuming the plots from the other models. A similar behaviour is exhibited for the Naive Bayes classifier where SVM and Perceptron are each outperformed. The plots also demonstrate the poor recall levels when using only content features, where each model fails to yield the same performance as the use of only user features.
• User features are most important in Twitter
• But combining user & content features gives best results
Impact of different features
• What features have the highest impact on identification of seed posts?
• Rank features by information gain ratio wrt seed post class label
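Ranking by information gain ratio can be sketched for a single discrete feature as below. This is a generic textbook formulation, not the tooling used in the study: numeric features such as list-degree would first need to be discretised (an assumption here), and the helper names and toy values are invented.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values."""
    total = len(values)
    return sum((n / total) * math.log2(total / n) for n in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Information gain of the feature about the label, divided by split info."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0

# Toy data: 1 = seed post, 0 = non-seed; feature already binned into levels.
labels = [1, 1, 0, 0]
features = {
    "user-list-degree": ["hi", "hi", "lo", "lo"],   # separates the classes
    "content-length": ["hi", "lo", "hi", "lo"],     # carries no information
}
ranking = sorted(features, key=lambda f: gain_ratio(features[f], labels), reverse=True)
print(ranking)
```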
77
which we found to be 0.674 indicating a good correlation between the two lists and their respective ranks.
Table 4. Features ranked by Information Gain Ratio wrt Seed Post class label. The feature name is paired with its IG in brackets.
Rank   Haiti                             Union Address
1      user-list-degree (0.275)          user-list-degree (0.319)
2      user-in-degree (0.221)            content-time-in-day (0.152)
3      content-informativeness (0.154)   user-in-degree (0.133)
4      user-num-posts (0.111)            user-num-posts (0.104)
5      content-time-in-day (0.089)       user-post-rate (0.075)
6      user-post-rate (0.075)            user-out-degree (0.056)
7      content-polarity (0.064)          content-referral-count (0.030)
8      user-out-degree (0.040)           user-age (0.015)
9      content-referral-count (0.038)    content-polarity (0.015)
10     content-length (0.020)            content-length (0.010)
11     content-readability (0.018)       content-complexity (0.004)
12     user-age (0.015)                  content-noun-count (0.002)
13     content-uppercase-count (0.012)   content-readability (0.001)
14     content-noun-count (0.010)        content-verb-count (0.001)
15     content-adj-count (0.005)         content-adj-count (0.0)
16     content-complexity (0.0)          content-informativeness (0.0)
17     content-verb-count (0.0)          content-uppercase-count (0.0)
Fig. 3. Contributions of top-5 features to identifying Non-seeds (N) and Seeds (S). Upper plots are for the Haiti dataset and the lower plots are for the Union Address dataset.
The top-most ranks from each dataset are dominated by user features including the list-degree, in-degree, num-of-posts and post-rate. Such features describe a user’s reputation, where higher values are associated with seed posts. Figure 3 shows the contributions of each of the top-5 features to class decisions in the training set, where the list-degree and in-degree of the user are seen to correlate heavily with seed posts. Using these rankings our next experiment explored the effects of training a classification model using only the top-k features, observing
Positive/negative impact of features
• What is the correlation between seed posts and features?
78
[Figure panels: Haiti; Union Address]
Identifying Seed Posts
79
• Can we identify seed posts using the top-k features?
– Stability is reached with 5 features
– Classification with 5 features is sufficient for identifying posts that generate responses
Predicting Discussion Activity
• Reply rates:
  – Haiti 1-74 responses, Union Address 1-75 responses
• Compare rankings
  – Ground truth vs predicted
• Experiments
  – Using Haiti and Union Address datasets
  – Evaluate predicted rank k where k = {1, 5, 10, 20, 50, 100}
  – Support Vector Regression with user, content, user+content features
80
Dataset         Training size   Test size   Test Vol Mean   Test Vol SD
Haiti           980             210         1.664           3.017
Union Address   5,067           1,161       1.761           2.342
Predicting Discussion Activity
81
Haiti dataset Union Address dataset
• Content features are key for top ranks
• User features are more important for higher ranks
82
Identifying Seed Posts in Boards.ie
• Used the same features as before
  – User features
    • In-degree, out-degree, post count, user age, post rate
  – Content features
    • Post Length, complexity, readability, referral count, time in day, informativeness, polarity
• New features designed to capture user affinity
  – Forum Entropy
    • Concentration of forum activity
    • Higher entropy = large forum spread
  – Forum Likelihood
    • Likelihood of forum post given user history
    • Combines post history with incoming data
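Forum entropy can be sketched as the Shannon entropy of a user's posts over forums. The slide gives no formula, so the base-2 logarithm, the `forum_entropy` helper and the toy post histories are assumptions; the forum ids reuse the Boards.ie numbers mentioned earlier purely as labels.

```python
import math
from collections import Counter

def forum_entropy(post_forums):
    """Entropy of a user's posts across forums (one forum id per post)."""
    total = len(post_forums)
    return sum((n / total) * math.log2(total / n)
               for n in Counter(post_forums).values())

focused = [246, 246, 246, 246]   # every post in a single forum
spread = [246, 246, 388, 388]    # posts split evenly over two forums
print(forum_entropy(focused), forum_entropy(spread))
```

A user who only posts in one forum gets entropy 0, and the value grows as activity spreads over more forums, matching "higher entropy = large forum spread".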
83
Experiment for identifying seed posts
• Used all posts from Boards.ie in 2006
• Built features using a 6-month window prior to seed post date
• Evaluated a binary classification task
  – Is this post a seed post or not?
  – Precision, Recall, F1 and Accuracy
  – Tested: user, content, focus features, and their combinations
Posts Seeds Non-Seeds Replies Users
1,942,030 90,765 21,800 1,829,465 29,908
84
Identifying seeds with different types of features
activity levels, and because it has already been used in other investigations (e.g., [14]).
Boards.ie does not provide explicit social relations between community members, unlike for example Facebook and Twitter. We followed the same strategy proposed in [3] for extracting social networks from Digg, and built the Boards.ie social network for users, weighting edges cumulatively by the number of replies between any two users.
TABLE I
DESCRIPTION OF THE BOARDS.IE DATASET
Posts Seeds Non-Seeds Replies Users
1,942,030 90,765 21,800 1,829,465 29,908
In order to derive our features we required a window of n-days from which the social graph can be compiled and relevant measurements taken. Based on previous work over the same dataset in [14], we used a similar window of 188 days (roughly 6-months) prior to the post date of a given seed or non-seed post. For instance, if a seed post p is made at time t, then our window from which the features (i.e., user and focus features) are derived is from t − 188 to t − 1. In using this heuristic we ensure that the features compiled for each post are independent of future outcomes and will not bias our predictions - for example a user may increase their activity following the seed post which would not be a true indicator of their behaviour at the time the post was made. Table I summarises the dataset and the number of posts (seeds, non-seeds and replies) and users contained within.
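The t − 188 to t − 1 window can be sketched with day-level dates. This is an illustration only: the `(date, replier, replied_to)` event tuples and the helper names are assumptions, and the real pipeline may work at a finer time resolution.

```python
from datetime import date, timedelta

def feature_window(post_date, days=188):
    """Inclusive window [t - days, t - 1] used to build features for a post at t."""
    return post_date - timedelta(days=days), post_date - timedelta(days=1)

def events_in_window(events, post_date, days=188):
    """Keep only events dated inside the feature window of the post."""
    start, end = feature_window(post_date, days)
    return [event for event in events if start <= event[0] <= end]

# Made-up reply events: (date, replier, replied_to).
replies = [
    (date(2006, 1, 1), "a", "b"),
    (date(2006, 6, 30), "b", "a"),
    (date(2006, 7, 1), "c", "a"),   # same day as the post, so excluded
]
usable = events_in_window(replies, date(2006, 7, 1))
print(feature_window(date(2006, 7, 1)), len(usable))
```

Excluding the post day itself is what keeps the features independent of anything that happens after the post is made.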
V. CLASSIFICATION: DETECTING SEED POSTS
Predicting discussion activity levels is often hindered by including posts that yield no replies. We alleviate this problem by differentiating between seed posts and non-seeds through a binary classification task. Once seed posts have been identified we then attempt to predict the level of discussion that such posts will generate. To this end, we look for the best classifier for identifying seed and non-seed posts and then search for the features that played key roles in distinguishing seed posts from non-seeds, thereby observing key features that are associated with discussions.
A. Experimental Setup
For our experiments we are using the previously described dataset collected from Boards.ie containing both seeds and non-seeds throughout 2006. For our collection of posts we built the content, user, and focus features listed in section III from the past 6 months of data leading up to the date on which the post was published - thereby ensuring no bias from future events in our dataset. We split the dataset into 3 sets using a 70/20/10% random split, providing a training set, a validation set and a test set.
Our first task was to perform model selection by testing four different classifiers (SVM, Naive Bayes, Maximum Entropy and J48 decision tree), each trained on individual feature sets and their combinations: user features, content features and focus features. This model selection phase was performed by training each classifier, together with the combination of features, on the 70% training split and labelling instances in the held-out 20% validation split.
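The 70/20/10% random split described above might be produced along these lines. This is a sketch under assumptions: the fixed seed and the function name `split_dataset` are illustrative, and the source does not say how the split was actually implemented.

```python
import random


def split_dataset(items, seed=0):
    """Shuffle a copy of items and split it 70% / 20% / 10%."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_val = n * 20 // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, about 10%
    return train, validation, test


posts = list(range(1000))  # stand-in for post identifiers
train, validation, test = split_dataset(posts)
print(len(train), len(validation), len(test))
```

The validation split is used only for choosing among classifiers and feature sets, which leaves the test split untouched until the feature assessment step.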
Once we had identified the best performing model (i.e., the classifier and combination of feature sets that produces the highest F1 value), our second task was to perform feature assessment, thereby identifying key features that contribute significantly to seed post prediction accuracy. For this we trained the best performing model from the model selection phase over the training split and tested its classification accuracy over the 10% test split, dropping individual features from the model and recording the reduction in accuracy following the omission of a given feature. Given that we are performing a binary classification task, we use the standard performance measures for such a scenario: precision, recall and F-measure, setting β = 1 for an equal weighting of precision and recall. We also measure the area under the Receiver Operating Characteristic (ROC) curve to gauge the relationship between recall and fallout, i.e., the false positive rate.
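For reference, the precision, recall and F-measure named above can be computed as in the sketch below. It uses the standard definitions with seed posts as the positive class (label 1). The labels are made up for illustration, and the source does not state whether its reported figures are per-class or averaged across classes.

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Precision, recall and F-beta for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical labels: 1 = seed post, 0 = non-seed
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_fbeta(y_true, y_pred, beta=1.0)
print(precision, recall, f1)
```

With β = 1 the F-measure reduces to the harmonic mean of precision and recall.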
TABLE II
RESULTS FROM THE CLASSIFICATION OF SEED POSTS USING VARYING FEATURE SETS AND CLASSIFICATION MODELS

Features         Classifier    P      R      F1     ROC
User             SVM           0.775  0.810  0.774  0.581
User             Naive Bayes   0.691  0.767  0.719  0.540
User             Max Ent       0.776  0.806  0.722  0.556
User             J48           0.778  0.809  0.734  0.582
Content          SVM           0.739  0.804  0.729  0.511
Content          Naive Bayes   0.730  0.794  0.740  0.616
Content          Max Ent       0.758  0.806  0.730  0.678
Content          J48           0.795  0.822  0.783  0.617
Focus            SVM           0.649  0.805  0.719  0.500
Focus            Naive Bayes   0.710  0.737  0.722  0.588
Focus            Max Ent       0.649  0.805  0.719  0.586
Focus            J48           0.649  0.805  0.719  0.500
User + Content   SVM           0.790  0.808  0.727  0.509
User + Content   Naive Bayes   0.712  0.772  0.732  0.593
User + Content   Max Ent       0.767  0.807  0.734  0.671
User + Content   J48           0.795  0.821  0.779  0.675
User + Focus     SVM           0.776  0.810  0.776  0.583
User + Focus     Naive Bayes   0.699  0.778  0.724  0.585
User + Focus     Max Ent       0.771  0.806  0.722  0.607
User + Focus     J48           0.777  0.810  0.742  0.617
Content + Focus  SVM           0.750  0.805  0.729  0.511
Content + Focus  Naive Bayes   0.732  0.787  0.746  0.658
Content + Focus  Max Ent       0.762  0.807  0.731  0.692
Content + Focus  J48           0.798  0.823  0.787  0.662
All              SVM           0.791  0.808  0.727  0.510
Content + Focus  is followed by the full feature set:
All              Naive Bayes   0.724  0.780  0.740  0.637
All              Max Ent       0.768  0.808  0.733  0.688
All              J48           0.798  0.824  0.792  0.692
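As a reading aid for Table II, the model selection criterion (highest F1 on the validation split) can be applied mechanically to the reported F1 column. The dictionary below simply transcribes those values. The code is an illustrative sketch, not the authors' tooling.

```python
# F1 values transcribed from Table II, keyed by (feature set, classifier)
f1_scores = {
    ("User", "SVM"): 0.774, ("User", "Naive Bayes"): 0.719,
    ("User", "Max Ent"): 0.722, ("User", "J48"): 0.734,
    ("Content", "SVM"): 0.729, ("Content", "Naive Bayes"): 0.740,
    ("Content", "Max Ent"): 0.730, ("Content", "J48"): 0.783,
    ("Focus", "SVM"): 0.719, ("Focus", "Naive Bayes"): 0.722,
    ("Focus", "Max Ent"): 0.719, ("Focus", "J48"): 0.719,
    ("User + Content", "SVM"): 0.727, ("User + Content", "Naive Bayes"): 0.732,
    ("User + Content", "Max Ent"): 0.734, ("User + Content", "J48"): 0.779,
    ("User + Focus", "SVM"): 0.776, ("User + Focus", "Naive Bayes"): 0.724,
    ("User + Focus", "Max Ent"): 0.722, ("User + Focus", "J48"): 0.742,
    ("Content + Focus", "SVM"): 0.729, ("Content + Focus", "Naive Bayes"): 0.746,
    ("Content + Focus", "Max Ent"): 0.731, ("Content + Focus", "J48"): 0.787,
    ("All", "SVM"): 0.727, ("All", "Naive Bayes"): 0.740,
    ("All", "Max Ent"): 0.733, ("All", "J48"): 0.792,
}

# Select the (feature set, classifier) pair with the highest F1
best = max(f1_scores, key=f1_scores.get)
print(best, f1_scores[best])
```

On these figures the selection falls on J48 trained with all feature sets, the classifier that Table III then uses for feature assessment.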
B. Results: Model Selection
1) Model Selection with Individual Features: The results from our first experiments are shown in Table II. Looking first at individual feature sets (e.g., SVM together with user features), we see that content features yield improved predictive performance over user and focus features. On discussion forums content appears to play a more central role
Positive/negative impact of features on Boards.ie
• What are the most important features for predicting seed posts?
• Correlations:
– Referral counts (non-seeds)
– Forum likelihood (seeds)
– Informativeness (non-seeds)
– Readability (seeds)
– User age (non-seeds)
85
TABLE III
REDUCTION IN F1 LEVELS AS INDIVIDUAL FEATURES ARE DROPPED FROM THE J48 CLASSIFIER

Feature Dropped    F1
-                  0.815
Post Count         0.815
In-Degree          0.811*
Out-Degree         0.811*
User Age           0.807***
Post Rate          0.815
Forum Entropy      0.815
Forum Likelihood   0.798***
Post Length        0.810**
Complexity         0.811**
Readability        0.802***
Referral Count     0.793***
Time in Day        0.810**
Informativeness    0.801***
Polarity           0.808***

Signif. codes: p-value < 0.001 *** 0.01 ** 0.05 * 0.1 .
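To make Table III easier to read, the sketch below converts each reported F1 into a reduction relative to the 0.815 baseline (the row where no feature is dropped) and ranks features by that reduction. The values are transcribed from the table. The ranking code itself is illustrative and not part of the original study.

```python
baseline_f1 = 0.815  # F1 with no feature dropped (first row of Table III)

# F1 after dropping each feature, transcribed from Table III
f1_without = {
    "Post Count": 0.815, "In-Degree": 0.811, "Out-Degree": 0.811,
    "User Age": 0.807, "Post Rate": 0.815, "Forum Entropy": 0.815,
    "Forum Likelihood": 0.798, "Post Length": 0.810, "Complexity": 0.811,
    "Readability": 0.802, "Referral Count": 0.793, "Time in Day": 0.810,
    "Informativeness": 0.801, "Polarity": 0.808,
}

# Reduction in F1 caused by omitting each feature, rounded to 3 decimals
drops = {feat: round(baseline_f1 - f1, 3) for feat, f1 in f1_without.items()}

# Largest reduction first: these features matter most to the classifier
ranked = sorted(drops.items(), key=lambda kv: kv[1], reverse=True)
for feature, drop in ranked[:4]:
    print(f"{feature}: -{drop:.3f}")
```

The four largest reductions come from Referral Count, Forum Likelihood, Informativeness and Readability.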
hyperlinks (e.g., ads and spam). This contrasts with work on Twitter, which found that tweets containing many links were more likely to get ‘retweeted’ [11].
The boxplot for Forum Likelihood shows a correlation between seed posts and higher values of the likelihood measure, suggesting that users who frequently post in the same forums are more likely to start a discussion. Also, if a user often posts in discussion forums while concentrating on only a few select forums, then the likelihood that a new post is within one of those forums is high.
Fig. 3. Boxplots showing the correlation of feature values with seed and non-seed posts within the training split
VI. REGRESSION: PREDICTING DISCUSSION ACTIVITY
Early detection of lengthy discussions helps analysts and managers to focus attention on where activity and topical debates are about to occur. In this section we predict the level of discussion activity that seed posts will generate and identify which features are key indicators of lengthy discussions. We use regression models that induce a function describing the relationship between the level of discussion activity and our user, content and focus features. By learning such a function we can identify patterns in the data and correlations between our dependent variable and the range of predictor variables that we have.
Fig. 4. Discussion Activity Length Distribution
A. Experimental Setup
Forecasting the exact number of replies (discussion activity) is of limited use if the distribution of known reply lengths is heavily skewed towards either the minimum or maximum. For predicting popular tweets, Hang et al. [12] adopted a multiclass classification setting to deal with the large skew in the dataset by predicting retweet count ranges. We have a similar scenario in our Boards.ie dataset, where a large number of seed posts yield fewer than 20 replies (Fig. 4). In such cases, standard regression error measures such as Relative Absolute Error produce inaccurate assessments of the predictions, because they compare against a simple predictor based on the mean of the target variable.
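One way to read the concern about Relative Absolute Error is sketched below. It uses the usual definition, in which the model's total absolute error is divided by that of a predictor that always outputs the mean of the actual values. The reply counts are invented to mimic a heavy skew and are not taken from Boards.ie.

```python
def relative_absolute_error(actual, predicted):
    """Total absolute error relative to a mean-of-targets predictor."""
    mean_actual = sum(actual) / len(actual)
    model_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    baseline_err = sum(abs(mean_actual - a) for a in actual)
    return model_err / baseline_err


# Hypothetical reply counts: most threads are short, one is very long
actual = [1, 2, 1, 3, 2, 1, 2, 200]
# A trivial model that always predicts 2 replies
predicted = [2] * len(actual)

rae = relative_absolute_error(actual, predicted)
print(round(rae, 3))
```

The constant predictor scores well below 1 here even though it misses the only lengthy discussion entirely, because the single long thread drags the mean baseline far from the typical value. A rank-based measure avoids that distortion.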
In our experiments we instead use the Normalised Discounted Cumulative Gain (nDCG) at varying rank positions, looking at the performance of our predictions over the top-k documents where k = {1, 5, 10, 20, 50, 100}. nDCG is derived by dividing the Discounted Cumulative Gain (DCG) of the predicted ranking by the DCG of the actual ranking, known as the ideal DCG (iDCG). DCG is well suited to our setting, given that we wish to predict the most popular posts and then expand that selection to assess growing ranks, as the measure penalises elements in the ranking that appear lower down when in fact they should be higher up. We define DCG formally, based on the definition from [9], as:
\[
\mathrm{DCG}_k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(1 + i)} \tag{5}
\]
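Equation 5 and the normalisation by iDCG can be sketched as follows. The sketch assumes that the relevance rel_i of the post at predicted rank i is its actual number of replies. That choice, like the function names, is an assumption made for illustration and is not confirmed by the text.

```python
import math


def dcg_at_k(relevances, k):
    """DCG over the top-k items: sum of rel_i / log2(1 + i), i from 1."""
    return sum(rel / math.log2(1 + i)
               for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances_in_predicted_order, k):
    """DCG of the predicted ranking divided by the DCG of the ideal one."""
    ideal_order = sorted(relevances_in_predicted_order, reverse=True)
    idcg = dcg_at_k(ideal_order, k)
    if idcg == 0:
        return 0.0
    return dcg_at_k(relevances_in_predicted_order, k) / idcg


# Hypothetical actual reply counts, listed in the order the model ranked them
replies_by_predicted_rank = [10, 30, 5, 0, 20]

for k in (1, 5):
    print(k, round(ndcg_at_k(replies_by_predicted_rank, k), 3))
```

A ranking that already lists posts in descending order of replies scores 1.0 at every k, and placing a popular post too low reduces the score through the logarithmic discount.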
For our experiments we first identify the best performing regression model before moving on to analysing the coefficients of that model and the patterns in the data that lead to increased discussion activity. For our model selection phase we test three regression models: Linear regression, Isotonic
• Can we predict the level of discussion activity?
Predicting Discussion Activity in Boards.ie
86
87
• What impact do features have on discussion length?
– Assessed Linear Regression model with focus and content features
– Forum Likelihood (pos)
– Content Length (+/neutral)
– Complexity (pos)
– Readability (+/neutral)
– Referral Count (neg)
– Time in Day (+/neutral)
– Informativeness (-/neutral)
– Polarity (neg)
Predicting Discussion Activity in Boards.ie
Stay tuned
• More communities
– SAP, IBM, StackOverflow, Reddit
– Compare impact of features on their dynamics
• Better behaviour analysis
– Fewer features, more forums/communities, more graphs!
– Healthy? posts, reciprocation, discussions, sentiment mixture
• Churn analysis
– Correlation of features/behaviour to ‘bounce rate’
• Intervention!
– Opportunities and mechanisms to influence behaviour
88
Upcoming events
89
Intelligent Web Services Meet Social Computing
AAAI Spring Symposium 2012, March 26-28, Stanford, California
http://vitvar.com/events/aaai-ss12
Deadline: October 7, 2011
Social Object Networks
IEEE Social Computing, 2011, October 9-10, Boston, USA
http://ir.ii.uam.es/socialobjects2011/
Deadline: August 5, 2011
Questionnaire on user needs
http://socsem.open.ac.uk/limesurvey/index.php?sid=55487
The questionnaire aims to identify the needs that community users have within online communities and to learn the factors and issues that influence those needs.
90
My social semantics team
Thanks to
91
Sofia Angeletou Research Associate
Matthew Rowe Research Associate
Alain Barrat CPT Marseille & ISI
Martin Szomszor CeRC, City University, UK
Wouter van Den Broeck ISI, Turin
Ciro Cattuto ISI, Turin
Live Social Semantics team
Gianluca Correndo, Uni Southampton Ivan Cantador, UAM, Madrid
STI International ESWC09/10 & HT09 chairs and organisers
All LSS participants
Acknowledgements