Monitoring and Analysis of Online Communities


Harith Alani Knowledge Media institute, The Open University, UK

Web Science Summer School Galway, 2011

http://twitter.com/halani http://delicious.com/halani http://www.linkedin.com/pub/harith-alani/9/739/534


Market value of Web Analytics


Agenda

•  Community monitoring

•  Offline and online social networking

•  Modeling and tracking behaviour

•  Analysing community features

•  Predicting discussion activity


Online community monitoring

•  Analysing and understanding activities and dynamics

•  Studying impact of social and technical features

•  Forecast future growth and evolution

•  Tracking behaviour and influence

•  Tracking reputation and buzz

•  Listening to customer opinion

•  Profiling the user base

•  Gauging customer sentiment


Measuring social media

Deloitte, Beeline Labs, & Society for New Communication Research surveyed 140 companies with online communities, 2008


“B2B Marketing Goes Social: A White Horse Survey Report” – March 2010 – study of 104 companies




“Social media usage, attitudes and measurability: What do marketers think?” – KingFishMedia, 2010


Tools for monitoring social media

http://www.ubervu.com/

•  Analytics:
– Mention volume
– Sentiment
– Discussion clouds
– Activity graphs and metrics
– Language and geolocation filtering
– Filter by social platform
– Comparisons

http://www.viralheat.com/home

•  Analytics:
– Influencing users
– Sentiment and opinion analysis
– Viral content analysis
– Detecting sales leads
– Filter by geo-location

Monitoring and Analysis of Online Communities With a Web Science flavour


Online vs. Offline social networking


•  Digital social networking increases physical social isolation

•  Causes
– Genetic alterations
– Weakened immune system
– Less resistant to cancer
– Higher risk of heart disease
– Higher blood pressure
– Faster dementia
– Narrower arteries

Aric Sigman, “Well Connected? The Biological Implications of 'Social Networking’”, Biologist, 56(1), 2009


Online vs. offline social networking: The Bad News!

•  Digital networking increases social interaction
– Transforms little boxed societies to networked and networking societies
– Creates more opportunities to network
– New methods to communicate, easily, and widely
– Supports and increases F2F contact!
– The stronger the offline social tie, the more intense the online communication
– The stronger the offline social tie, the more diverse online communications
– F2F is medium of choice in weaker social ties

Keith Hampton and Barry Wellman, Long Distance Community in the Network Society: Contact and Support Beyond Netville, American Behavioral Scientist 45 (3), November, 2001.

Barry Wellman, The Glocal Village: Internet and Community, Idea’s - The Arts & Science Review, University of Toronto, 1(1),2004


Online vs. offline social networking: The Good News!

Physical online & digital offline


Sensor & Social Networks


Sensor & Social Networks


www.nabaztag.com

The Canine Twitterer

“Having my daily workout. Already did 15 leg lifts!”

Tag-Along Marketing The New York Times, November 6, 2010

“Everything is in place for location-based social networking to be the next big thing. Tech companies are building the platforms, venture capitalists are providing the cash and marketers are eager to develop advertising. “

Location Sensors & Social Networking


Monitoring online/offline social activity


Where is everybody?


•  Generating opportunities for F2F networking


“There are more than 250 million active users currently accessing Facebook through their mobile devices”

“People that use Facebook on their mobile devices are twice as active on Facebook than non-mobile users”

http://www.facebook.com/press/info.php?statistics

Tracking of F2F contact networks


TraceEncounters - 2004

Sociometer, MIT, 2002 -  F2F and productivity

-  F2F dynamics

-  Who are key players?

-  F2F and office distance


SocioPatterns platform

http://www.sociopatterns.org/

Offline social networks

by Ciro Cattuto

From a small conference at ISI, Turin


•  Similarity features
– Country of origin
– Seniority
– .. Age? Role? Projects? Interests?

•  What other info can we get to help us understand these network dynamics?

[Figure: F2F contact network with node groups labelled SR, JR and students]

Offline social networks

Offline + online social networking

ESWC2010

Where should I go?

Where have I met this guy?

Anyone I know here?

Who should I talk to?

<?xml version="1.0"?>
<rdf:RDF
    xmlns="http://tagora.ecs.soton.ac.uk/schemas/tagging#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xml:base="http://tagora.ecs.soton.ac.uk/schemas/tagging">
  <owl:Ontology rdf:about=""/>
  <owl:Class rdf:ID="Post"/>
  <owl:Class rdf:ID="TagInfo"/>
  <owl:Class rdf:ID="GlobalCooccurrenceInfo"/>
  <owl:Class rdf:ID="DomainCooccurrenceInfo"/>
  <owl:Class rdf:ID="UserTag"/>
  <owl:Class rdf:ID="UserCooccurrenceInfo"/>
  <owl:Class rdf:ID="Resource"/>
  <owl:Class rdf:ID="GlobalTag"/>
  <owl:Class rdf:ID="Tagger"/>
  <owl:Class rdf:ID="DomainTag"/>
  <owl:ObjectProperty rdf:ID="hasPostTag">
    <rdfs:domain rdf:resource="#TagInfo"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="hasDomainTag">
    <rdfs:domain rdf:resource="#UserTag"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="isFilteredTo">
    <rdfs:range rdf:resource="#GlobalTag"/>
    <rdfs:domain rdf:resource="#GlobalTag"/>
  </owl:ObjectProperty>
  <owl:ObjectProperty rdf:ID="hasResource">
    <rdfs:domain rdf:resource="#Post"/>
    <rdfs:range =…

Live Social Semantics (LSS): RFIDs + Social Web + Semantic Web

•  Integration of physical presence and online information

•  Semantic user profile generation

•  Logging of face-to-face contact

•  Social network browsing

•  Analysis of online vs offline social networks

SW sources


proceedings chair

chair author

CoP

conference


Social and information networks


Merging social networks

FOAF


Tag Filtering Service

Semantic modeling Semantic analysis Collective intelligence Statistical analysis Syntactical analysis


Tag Filtering Service


From Tags to Semantics


Tags to User Interests


From raw tags and social relations to Structured Data

User raw data

Structured data

Collective intelligence

ontologies

Semantic data


RFIDs for tracking social contact


Convergence with online social networks


People contact → RFID → RDF Triples


F2FContact

hasContact

contactWith

contactDate contactDuration

XMLSchema#date XMLSchema#time

contactPlace

Place

foaf#Person1

foaf#Person2
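The diagram labels above can be read as a small set of RDF triples per contact event. The following is a minimal sketch in plain Python tuples, not the LSS implementation: the `http://example.org/lss#` namespace, the person URIs, and the direction of `hasContact` and `contactWith` are assumptions made for illustration, since the slide only shows the labels.

```python
from datetime import date

# Hypothetical namespace: the slides do not give the ontology URI.
LSS = "http://example.org/lss#"
XSD = "http://www.w3.org/2001/XMLSchema#"


def as_xsd_time(seconds):
    """Format a duration in seconds as an hh:mm:ss lexical value."""
    return f"{seconds // 3600:02d}:{seconds % 3600 // 60:02d}:{seconds % 60:02d}"


def contact_to_triples(contact_id, person_a, person_b, day, seconds, place):
    """Return (subject, predicate, object) triples for one F2F contact.

    Assumed reading of the diagram: Person1 hasContact F2FContact,
    and F2FContact contactWith Person2.
    """
    contact = LSS + contact_id
    return [
        (contact, "rdf:type", LSS + "F2FContact"),
        (person_a, LSS + "hasContact", contact),
        (contact, LSS + "contactWith", person_b),
        (contact, LSS + "contactDate", (day.isoformat(), XSD + "date")),
        (contact, LSS + "contactDuration", (as_xsd_time(seconds), XSD + "time")),
        (contact, LSS + "contactPlace", LSS + place),
    ]


triples = contact_to_triples(
    "contact1",
    "http://example.org/people/ann",  # hypothetical foaf:Person URIs
    "http://example.org/people/bob",
    date(2010, 6, 1),
    42,
    "CoffeeArea",
)
```

A real deployment would more likely use an RDF library and the project's own ontology; tuples are used here only to keep the sketch dependency-free.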


Real-time F2F networks with SNS links

http://www.vimeo.com/6590604


Deployed at:

Live Social Semantics

Data analysis

•  Face-to-face interactions across scientific conferences

•  Networking behaviour of frequent users

•  Correlations between scientific seniority and social networking

•  Comparison of F2F contact network with Twitter and Facebook

•  Social networking with online and offline friends

Analysis of LSS Results

The New Yorker 2/11/2008


Characteristics of F2F contact network

•  Degree is the number of people with whom the person had at least one F2F contact

•  Strength is the total time the person spent in F2F contact

•  Edge weight is the total time spent by a pair of users in F2F contact
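The three measures above can be derived from a log of contact events. The sketch below assumes a hypothetical event format of `(person_a, person_b, duration_seconds)`; the names and sample values are illustrative, not taken from the LSS data.

```python
from collections import defaultdict


def contact_network(events):
    """Aggregate contact events into edge weights, degrees and strengths.

    events: iterable of (person_a, person_b, duration_seconds).
    """
    # Edge weight: total time a pair spent in F2F contact.
    weight = defaultdict(int)
    for a, b, seconds in events:
        weight[frozenset((a, b))] += seconds

    # Degree: number of distinct contacts; strength: total contact time.
    degree = defaultdict(int)
    strength = defaultdict(int)
    for pair, total in weight.items():
        for person in pair:
            degree[person] += 1
            strength[person] += total
    return dict(degree), dict(strength), dict(weight)


events = [("ann", "bob", 40), ("bob", "ann", 20), ("ann", "cat", 60)]
degree, strength, weight = contact_network(events)
```

Using an unordered pair as the edge key treats a contact between ann and bob the same way regardless of which badge reported it.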


Network characteristics    ESWC 2009    HT 2009    ESWC 2010
Number of users            175          113        158
Average degree             54           39         55
Avg. strength (mn)         143          123        130
Avg. weight (mn)           2.65         3.15       2.35
Weights ≤ 1 mn             70%          67%        74%
Weights ≤ 5 mn             90%          89%        93%
Weights ≤ 10 mn            95%          94%        96%

Characteristics of F2F contact events

Contact characteristics       ESWC 2009    HT 2009    ESWC 2010
Number of contact events      16258        9875       14671
Average contact length (s)    46           42         42
Contacts ≤ 1 mn               87%          89%        88%
Contacts ≤ 2 mn               94%          96%        95%
Contacts ≤ 5 mn               99%          99%        99%
Contacts ≤ 10 mn              99.8%        99.8%      99.8%

F2F contact pattern is very similar for all three conferences

F2F contacts of returning users

[Figure: scatter plots of ESWC2010 values against ESWC2009 values for returning users; panels: Degree, Total interaction time, Links’ weights]

•  Degree: number of other participants with whom an attendee has interacted

•  Total time: total time spent in interaction by an attendee

•  Link weight: total time spent in F2F interaction by a pair of returning attendees in 2010, versus the same quantity measured in 2009

Time spent on F2F networking by frequent users is stable, even when the list of people they networked with changed

ESWC 2009 & ESWC 2010         Pearson Correlation
Degree                        0.37
Total F2F interaction time    0.76
Link weight                   0.75
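For reference, a Pearson correlation between a returning attendee's value in one year and the next can be computed as sketched below. The two lists are made-up placeholders, not the ESWC measurements.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical total interaction times (minutes) for the same five
# attendees at two consecutive conferences.
time_first_year = [120, 90, 200, 60, 150]
time_second_year = [110, 100, 180, 75, 160]
r = pearson(time_first_year, time_second_year)
```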

Average seniority of neighbours in F2F networks

[Figure: average seniority of neighbours plotted against seniority (number of papers); series: sen_nn, sen_nn,w, sen_nn,max]

•  No clear pattern is observed if the unweighted average over all neighbours in the aggregated network is considered

•  A correlation is observed when each neighbour is weighted by the time spent with the main person

•  The correlation becomes much stronger when considering for each individual only the neighbour with whom the most time was spent

Avg seniority of the neighbours

with weighted averages

Seniority of user with strongest link

Conference attendees tend to network with others of similar levels of scientific seniority
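The three ways of summarising a person's neighbours described above (unweighted average, time-weighted average, and the single strongest link) can be sketched as follows. The input dictionaries and their values are hypothetical.

```python
def neighbour_seniority(contacts, seniority):
    """Summarise the seniority of one person's F2F neighbours.

    contacts:  {neighbour: seconds spent with that neighbour}
    seniority: {person: number of papers}
    Returns (unweighted mean, time-weighted mean, seniority of the
    neighbour with whom the most time was spent).
    """
    total_time = sum(contacts.values())
    unweighted = sum(seniority[n] for n in contacts) / len(contacts)
    weighted = sum(seniority[n] * t for n, t in contacts.items()) / total_time
    strongest = max(contacts, key=contacts.get)
    return unweighted, weighted, seniority[strongest]


seniority = {"bob": 1, "cat": 4, "dan": 10}
contacts = {"bob": 100, "cat": 800, "dan": 100}
sen_nn, sen_nn_w, sen_nn_max = neighbour_seniority(contacts, seniority)
```

In this toy case the unweighted mean is pulled up by a brief contact with a very senior neighbour, while the weighted and strongest-link values stay close to the neighbour the person actually spent time with.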

Presence of Attendees HT2009

Importance of the bar? Popularity of sessions? particular talks?

Number of cliques HT2009

Offline networking vs online networking


•  people who have a large number of friends on Twitter and/or Facebook don’t seem to be the most socially active in the offline world in comparison to other SNS users

Users with Facebook and Twitter accounts in ESWC 2010

Twitterers                        Spearman Correlation (ρ)
Tweets – F2F Degree               -0.15
Tweets – F2F Strength             -0.15
Twitter Following – F2F Degree    -0.21

No strong correlation between amount of F2F contact activity and size of online social networks
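Spearman's ρ is the Pearson correlation of the ranks rather than the raw values, which suits skewed counts such as tweets or followers. A minimal sketch, with ties given their average rank:

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation of two equal-length sequences."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (sd_x * sd_y)
```

For example, `spearman(tweet_counts, f2f_degrees)` would give the first row of the table, where both argument names are placeholders for per-user lists.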

[Figure: Twitter Followers, H-index and Tweets per user; x-axis: users]

Scientific seniority vs Twitter followers


•  Comparison between people’s scientific seniority and the number of people following them on Twitter

People who have the highest number of Twitter followers are not necessarily the most scientifically senior, although they do have high visibility and experience

Twitter users                  Correlation
H-index – Twitter Followers    0.32
H-index – Tweets               -0.13

Conference Chairs

                                     all participants 2009    chairs 2009    all participants 2010    chairs 2010
average degree                       55                       77.7           54                       77.6
average strength (s)                 8590                     19590          7807                     22520
average weight (s)                   159                      500            141                      674
average number of events per edge    3.44                     8              3.37                     12

•  Conf chairs interact with more distinct people (larger average degree)

•  Conf chairs spend more time in F2F interaction (almost three times as much as a random participant)

Networking with online and offline ‘friends’

Characteristics                      all users    coauthors    Facebook friends    Twitter followers
average contact duration (s)         42           75           63                  72
average edge weight (s)              141          4470         830                 1010
average number of events per edge    3.37         60           13                  14

•  Individuals sharing an online or professional social link meet much more often than other individuals

•  Average number of encounters, and total time spent in interaction, is highest for co-authors

F2F contacts with Facebook & Twitter friends were respectively 50% and 71% longer, and 286% and 315% more frequent, than with others. F2F contacts with co-authors were 79% longer, and attendees met their co-authors 1680% more times than they met non co-authors.
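These percentages follow from the table above by comparing each group with the "all users" column. A short check, using only the table values:

```python
def percent_increase(value, baseline):
    """Relative increase of value over baseline, as a rounded percentage."""
    return round((value / baseline - 1) * 100)


# Values copied from the table: "all users" is the baseline column.
baseline_duration, baseline_events = 42, 3.37

longer = {
    "Facebook friends": percent_increase(63, baseline_duration),
    "Twitter followers": percent_increase(72, baseline_duration),
    "coauthors": percent_increase(75, baseline_duration),
}
more_frequent = {
    "Facebook friends": percent_increase(13, baseline_events),
    "Twitter followers": percent_increase(14, baseline_events),
    "coauthors": percent_increase(60, baseline_events),
}
```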

Twitterers vs Non-Twitterers

•  Time spent in conference rooms
– Twitter users spent on average 11.4% more time in the conf rooms than non-Twitter users (mean is 26% higher)

•  Number of people met F2F during the conference
– Twitter users met on average 9% more people F2F (mean 8% higher)

•  Duration of F2F contacts
– Twitter users spent on average 63% more time in F2F contact than non-Twitter users (mean is 20% higher)

Analysis of behaviour in online communities

Behaviour of individuals – micro level analysis

[Figure: H-index, F2F Degree and F2F Strength per user. Annotations: healthy scientific & social profiles; in LSS team: good scientific, and social signals; shy scientist?; outsider, high profile; students and developers: who's the next star researcher?]

Why monitor behaviour?

•  Understand impact of behaviour on community evolution

•  Forecast community future

•  Learn when intervention might be needed

•  Learn which behaviour should be encouraged or discouraged

•  Find what could trigger certain behaviours

•  What is the best mix of behaviour to increase engagement in the community

•  To see which users need more support, which ones should be confined, and which ones should be promoted


Behaviour analysis

•  Behaviour compositions in Boards.ie:

Jeffrey Chan, Conor Hayes, and Elizabeth Daly. Decomposing discussion forums using common user roles. In Proc. Web Science Conf. (WebSci10), Raleigh, NC: US, 2010

Ontology

Encoding Rules in Ontologies with SPIN

Approach for inferring User Roles


Structural, social network, reciprocity, persistence, participation

Feature levels change with the dynamics of the community

Associate Roles with a collection of feature-to-level Mappings e.g. in-degree -> high, out-degree -> high

Run our rules over each user’s features and derive the role composition
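The approach above (bin each feature into a level, match levels against per-role rules, then count roles) can be sketched in plain Python. This is an illustration only: the slides encode the rules in an ontology with SPIN, and the rule table, cut-off values and the pairing of role names with levels below are hypothetical. Only the example "in-degree -> high, out-degree -> high" comes from the slide, and the slide does not say which role it belongs to.

```python
def to_level(value, low_cut, high_cut):
    """Map a feature value to a level using two cut-offs.

    The cut-offs would be recomputed per time window, since feature
    levels change with the dynamics of the community.
    """
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "mid"
    return "high"


# Hypothetical feature-to-level mappings, one per role.
ROLE_RULES = {
    "popular participant": {"in_degree": "high", "out_degree": "high"},
    "ignored": {"in_degree": "low", "posts_replied": "low"},
}


def infer_role(levels):
    """Return the first role whose rule matches the user's levels."""
    for role, rule in ROLE_RULES.items():
        if all(levels.get(feature) == level for feature, level in rule.items()):
            return role
    return "unclassified"


def role_composition(users, cuts):
    """Share of users per role for one community and time window."""
    counts = {}
    for features in users.values():
        levels = {name: to_level(v, *cuts[name]) for name, v in features.items()}
        role = infer_role(levels)
        counts[role] = counts.get(role, 0) + 1
    return {role: n / len(users) for role, n in counts.items()}


cuts = {"in_degree": (0.33, 0.66), "out_degree": (0.33, 0.66),
        "posts_replied": (0.33, 0.66)}
users = {
    "u1": {"in_degree": 0.9, "out_degree": 0.8, "posts_replied": 0.7},
    "u2": {"in_degree": 0.1, "out_degree": 0.5, "posts_replied": 0.1},
}
composition = role_composition(users, cuts)
```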

Data from Boards.ie

•  Forum 246 (Commuting and Transport): Demonstrates a clear increase in activity over time.

•  Forum 388 (Rugby): Exhibits periodic increase and decrease in activity and hence it provides good examples of healthy/unhealthy evolutions.

•  Forum 411 (Mobile Phones and PDAs): Increase in activity over time with some fluctuation - i.e. reduction and increase over various time windows.

•  Time period: 2004-01 to 2006-12

Features

•  In-degree Ratio: The proportion of users U that reply to user υi, thus indicating the concentration of users that reply to υi

•  Posts Replied Ratio: Proportion of posts by user υi that yield a reply, used to gauge the popularity of the user’s content based on replies

•  Thread Initiation Ratio: Proportion of threads that have been started by υi.

•  Bi-directional Threads Ratio: Proportion of threads where user υi replies to a user and receives a reply, thus forming a reciprocal communication

•  Bi-directional Neighbours Ratio: The proportion of neighbours where a reciprocal interaction has taken place - e.g. υi replied to υj and υj replied to υi.

•  Average Posts per Thread: The average number of posts made in every thread that user υi has participated in

•  Standard Deviation of Posts per Thread: The standard deviation of the number of posts in every thread that user υi has participated in. This gauges the distribution of the discussion lengths.
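A few of these ratios can be computed directly from a post table. The sketch below assumes a hypothetical record format (`id`, `author`, `thread`, `reply_to`) and makes two labelled assumptions about denominators: the in-degree ratio is taken over all other users, and the thread initiation ratio over all threads in the window.

```python
def user_features(posts, user):
    """Compute three of the listed ratios for one user.

    posts: list of dicts with keys id, author, thread, reply_to
    (reply_to is the id of the parent post, or None for a thread starter).
    """
    by_id = {p["id"]: p for p in posts}
    users = {p["author"] for p in posts}
    own = [p for p in posts if p["author"] == user]

    repliers, replied_posts = set(), set()
    for p in posts:
        parent = by_id.get(p["reply_to"])
        if parent and parent["author"] == user and p["author"] != user:
            repliers.add(p["author"])
            replied_posts.add(parent["id"])

    threads = {p["thread"] for p in posts}
    started = {p["thread"] for p in own if p["reply_to"] is None}
    return {
        # Assumption: denominator is every other user in the window.
        "in_degree_ratio": len(repliers) / max(len(users) - 1, 1),
        "posts_replied_ratio": len(replied_posts) / max(len(own), 1),
        # Assumption: denominator is every thread in the window.
        "thread_initiation_ratio": len(started) / max(len(threads), 1),
    }


posts = [
    {"id": 1, "author": "ann", "thread": "t1", "reply_to": None},
    {"id": 2, "author": "bob", "thread": "t1", "reply_to": 1},
    {"id": 3, "author": "cat", "thread": "t1", "reply_to": 1},
    {"id": 4, "author": "ann", "thread": "t1", "reply_to": 2},
    {"id": 5, "author": "bob", "thread": "t2", "reply_to": None},
]
features = user_features(posts, "ann")
```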

Role Skeleton

Results

•  Correlation of individual features in each of the three forums

Commuting and Transport Rugby Mobile Phones and PDAs

Results

[Figure panels: (a) Forum 246: Commuting and Transport; (b) Forum 388: Rugby; (c) Forum 411: Mobile Phones and PDAs]

•  Variation in behaviour composition & activity

•  Behaviour composition in/stability influences forum activity

Prediction analysis – preliminary results!

•  Predicting rise/fall in post submission numbers

•  Binary classification

•  Features : Community composition, roles and percentages of users associated with each

•  Cross-community predictions are less reliable than individual community analysis due to the idiosyncratic behaviour observed in each individual community

Forum P R F1 ROC

246 0.799 0.769 0.780 0.800

388 0.603 0.615 0.605 0.775

411 0.765 0.692 0.714 0.617

All 0.583 0.667 0.607 0.466

Observations so far

•  Growing communities contain more elitists and popular participants

•  Shrinking communities contain many taciturns and ignored users

•  A stable composition, with a mix of roles, is associated with increased community activity

•  Different communities may require different behaviour compositions to increase activity/health

What features make online communities tick


•  How many do you recognise? Use?

•  Which ones still exist?

•  Which are strong and healthy?

•  Which are aging and withering?

•  What health signs should we look for?

•  How can we predict their future evolution?

Rise and fall of social networks


Predicting engagement

•  Which posts will receive a reply? – What are the most influential features here?

•  How much discussion will it generate? – What are the key factors of lengthy discussions?


Common online community features


initial tweet that generates a reply. Features which describe seed posts can be divided into two sets: user features - attributes that define the user making the post; and, content features - attributes that are based solely on the post itself. We wish to explore the application of such features in identifying seed posts, to do this we train several machine learning classifiers and report on our findings. However we first describe the features used.

4.1 Feature Extraction

The likelihood of posts eliciting replies depends upon popularity, a highly subjective term influenced by external factors. Properties influencing popularity include user attributes - describing the reputation of the user - and attributes of a post’s content - generally referred to as content features. In Table 1 we define user and content features and study their influence on the discussion “continuation”.

Table 1. User and Content Features

User Features
In Degree: Number of followers of U #
Out Degree: Number of users U follows #
List Degree: Number of lists U appears on. Lists group users by topic #
Post Count: Total number of posts the user has ever posted #
User Age: Number of minutes from user join date #
Post Rate: Posting frequency of the user: $\frac{PostCount}{UserAge}$

Content Features
Post length: Length of the post in characters #
Complexity: Cumulative entropy of the unique words in post p, where $\lambda$ is the total word length, $i \in [1,n]$ indexes the unique words and $p_i$ is the frequency of each word: $\frac{1}{\lambda}\sum_{i \in [1,n]} p_i(\log \lambda - \log p_i)$
Uppercase count: Number of uppercase words #
Readability: Gunning fog index using average sentence length (ASL) [7] and the percentage of complex words (PCW): $0.4(ASL + PCW)$
Verb Count: Number of verbs #
Noun Count: Number of nouns #
Adjective Count: Number of adjectives #
Referral Count: Number of @user #
Time in the day: Normalised time in the day measured in minutes #
Informativeness: Terminological novelty of the post wrt other posts. The cumulative tfIdf value of each term t in post p: $\sum_{t \in p} tfidf(t, p)$
Polarity: Cumulation of polar term weights in p (using Sentiwordnet3 lexicon) normalised by polar terms count: $\frac{Po + Ne}{|terms|}$
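As a rough illustration of two of the content features, the sketch below computes the complexity of a post as the entropy of its word counts and the Gunning fog readability from ASL and PCW. It assumes whitespace tokenisation, natural logarithms, and the normalised form of the complexity formula (sum divided by the total word count); none of these choices is stated in the excerpt.

```python
from collections import Counter
from math import log


def complexity(post):
    """Entropy of the word-count distribution of a post.

    Computes (1 / total) * sum(c_i * (log(total) - log(c_i))) over the
    unique words, where c_i is the count of word i.
    """
    words = post.lower().split()  # assumption: whitespace tokenisation
    total = len(words)
    if total == 0:
        return 0.0
    counts = Counter(words)
    return sum(c * (log(total) - log(c)) for c in counts.values()) / total


def gunning_fog(asl, pcw):
    """Gunning fog index from average sentence length and % complex words."""
    return 0.4 * (asl + pcw)
```

A post that repeats one word scores 0, while a post of four distinct words scores log 4, so higher values indicate a more varied vocabulary.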

4.2 Experiments

Experiments are intended to test the performance of different classification models in identifying seed posts. Therefore we used four classifiers: discriminative classifiers Perceptron and SVM, the generative classifier Naive Bayes and the decision-tree classifier J48. For each classifier we used three feature settings: user features, content features and user+content features.

Datasets For our experiments we used two datasets of tweets available on the Web: Haiti earthquake tweets⁴ and the State of the Union Address tweets.⁵ The

⁴ http://infochimps.com/datasets/twitter-haiti-earthquake-data
⁵ http://infochimps.com/datasets/tweets-during-state-of-the-union-address

•  How do all these features influence activity generation in an online community? –  Such knowledge leads to better use and management of the community

Experiment for identifying seed posts

•  Twitter data on the Haiti earthquake, and the State of the Union Address

•  Evaluated a binary classification task –  Is this post a seed post or not?


Dataset Users Tweets Seeds Non-seeds Replies

Haiti 44,497 65,022 1,405 60,686 2,931

Union Address 66,300 80,272 7,228 55,169 17,875

Identifying seeds with different types of features


use f-measure, as defined in Equation 1 as the harmonic mean between precision and recall, setting $\beta = 1$ to weight precision and recall equally. We also plot the Receiver Operator Curve of our trained models to show graphical comparisons of performance.

$$F_\beta = \frac{(1 + \beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R} \qquad (1)$$
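The f-measure with β = 1 can be checked against the reported results. The sketch below recomputes F1 from two precision and recall pairs listed for the J48 classifier in Table 3.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall."""
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# J48 with user features on the Haiti dataset: P = 0.906, R = 0.679
haiti_user_j48 = round(f_measure(0.906, 0.679), 3)
# J48 with all features on the Union Address dataset: P = 0.890, R = 0.810
union_all_j48 = round(f_measure(0.890, 0.810), 3)
```

Both values agree with the F1 column of the table (0.776 and 0.848).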

For our experiments we divided each dataset up into 3 sets: a training set, a validation set and a testing set using a 70/20/10 split. We trained our classification models using the training split and then applied them to the validation set, labelling the posts within this split. From these initial results we performed model selection by choosing the best performing model - based on maximising the F1 score - and used this model together with the best performing features, using a ranking heuristic, to classify posts contained within our test split. We first report on the results obtained from our model selection phase, before moving onto our results from using the best model with the top-k features.

Table 3. Results from the classification of seed posts using varying feature sets and classification models

(a) Haiti Dataset
Features    Classifier    P        R        F1       ROC
User        Perc          0.794    0.528    0.634    0.727
User        SVM           0.843    0.159    0.267    0.566
User        NB            0.948    0.269    0.420    0.785
User        J48           0.906    0.679    0.776    0.822
Content     Perc          0.875    0.077    0.142    0.606
Content     SVM           0.552    0.727    0.627    0.589
Content     NB            0.721    0.638    0.677    0.769
Content     J48           0.685    0.705    0.695    0.711
All         Perc          0.794    0.528    0.634    0.726
All         SVM           0.483    0.996    0.651    0.502
All         NB            0.962    0.280    0.434    0.852
All         J48           0.824    0.775    0.798    0.836

(b) Union Address Dataset
Features    Classifier    P        R        F1       ROC
User        Perc          0.658    0.697    0.677    0.673
User        SVM           0.510    0.946    0.663    0.512
User        NB            0.844    0.086    0.157    0.707
User        J48           0.851    0.722    0.782    0.830
Content     Perc          0.467    0.698    0.560    0.457
Content     SVM           0.650    0.589    0.618    0.638
Content     NB            0.762    0.212    0.332    0.649
Content     J48           0.740    0.533    0.619    0.736
All         Perc          0.630    0.762    0.690    0.672
All         SVM           0.499    0.990    0.664    0.506
All         NB            0.874    0.212    0.341    0.737
All         J48           0.890    0.810    0.848    0.877

4.3 Results

Our findings from Table 3 demonstrate the effectiveness of using solely user features for identifying seed posts. In both the Haiti and Union Address datasets training a classification model using user features shows improved performance over the same models trained using content features. In the case of the Union dataset we are able to achieve an F1 score of 0.782, coupled with high precision, when using the J48 decision-tree classifier - where the latter figure (precision) indicates conservative estimates using only user features. We also achieve similar high-levels of precision when using the same classifier on the Haiti dataset. The plots of the Receiver Operator Characteristic (ROC) curves in Figure 2 show similar levels of performance for each classifier over the two corpora. When using solely user features J48 is shown to dominate the ROC space, subsuming the plots from the other models. A similar behaviour is exhibited for the Naive Bayes classifier where SVM and Perceptron are each outperformed. The plots also demonstrate the poor recall levels when using only content features, where each model fails to yield the same performance as the use of only user features.

•  User features are most important in Twitter

•  But combining user & content features gives best results

Impact of different features

•  What features have the highest impact on identification of seed posts?

•  Rank features by information gain ratio wrt seed post class label
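Information gain ratio divides a feature's information gain by the entropy of the feature itself, which penalises features that split the data into many small groups. A minimal sketch for features already discretised into levels (continuous features such as in-degree would need binning first; the level names below are illustrative):

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (bits) of a list of discrete values."""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of a feature wrt the labels, over its split info."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return 0.0 if split_info == 0 else info_gain / split_info


labels = ["seed", "seed", "non-seed", "non-seed"]
informative = gain_ratio(["high", "high", "low", "low"], labels)
uninformative = gain_ratio(["high", "low", "high", "low"], labels)
```

Sorting features by this score in descending order gives a ranking of the kind shown in Table 4.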


which we found to be 0.674 indicating a good correlation between the two lists and their respective ranks.

Table 4. Features ranked by Information Gain Ratio wrt Seed Post class label. The feature name is paired with its IG in brackets.

Rank    Haiti                              Union Address
1       user-list-degree (0.275)           user-list-degree (0.319)
2       user-in-degree (0.221)             content-time-in-day (0.152)
3       content-informativeness (0.154)    user-in-degree (0.133)
4       user-num-posts (0.111)             user-num-posts (0.104)
5       content-time-in-day (0.089)        user-post-rate (0.075)
6       user-post-rate (0.075)             user-out-degree (0.056)
7       content-polarity (0.064)           content-referral-count (0.030)
8       user-out-degree (0.040)            user-age (0.015)
9       content-referral-count (0.038)     content-polarity (0.015)
10      content-length (0.020)             content-length (0.010)
11      content-readability (0.018)        content-complexity (0.004)
12      user-age (0.015)                   content-noun-count (0.002)
13      content-uppercase-count (0.012)    content-readability (0.001)
14      content-noun-count (0.010)         content-verb-count (0.001)
15      content-adj-count (0.005)          content-adj-count (0.0)
16      content-complexity (0.0)           content-informativeness (0.0)
17      content-verb-count (0.0)           content-uppercase-count (0.0)

Fig. 3. Contributions of top-5 features to identifying Non-seeds (N) and Seeds (S). Upper plots are for the Haiti dataset and the lower plots are for the Union Address dataset.

The top-most ranks from each dataset are dominated by user features including the list-degree, in-degree, num-of-posts and post-rate. Such features describe a user’s reputation, where higher values are associated with seed posts. Figure 3 shows the contributions of each of the top-5 features to class decisions in the training set, where the list-degree and in-degree of the user are seen to correlate heavily with seed posts. Using these rankings our next experiment explored the effects of training a classification model using only the top-k features, observing

Positive/negative impact of features

•  What is the correlation between seed posts and features?


[Figure panels: Haiti, Union Address]

Identifying Seed Posts


•  Can we identify seed posts using the top-k features?

– Stability is reached with 5 features

– Classification with 5 features is sufficient for identifying posts that generate responses

Predicting Discussion Activity

•  Reply rates:
– Haiti 1-74 responses, Union Address 1-75 responses

•  Compare rankings
– Ground truth vs predicted

•  Experiments
– Using Haiti and Union Address datasets
– Evaluate predicted rank k where k={1,5,10,20,50,100}
– Support Vector Regression with user, content, user+content features


Dataset          Training size    Test size    Test Vol Mean    Test Vol SD
Haiti            980              210          1.664            3.017
Union Address    5,067            1,161        1.761            2.342

Predicting Discussion Activity


Haiti dataset Union Address dataset

•  Content features are key for top ranks

•  User features are more important for higher ranks


Identifying Seed Posts in Boards.ie

•  Used the same features as before
– User features
   •  In-degree, out-degree, post count, user age, post rate
– Content features
   •  Post Length, complexity, readability, referral count, time in day, informativeness, polarity

•  New features designed to capture user affinity
– Forum Entropy
   •  Concentration of forum activity
   •  Higher entropy = large forum spread
– Forum Likelihood
   •  Likelihood of forum post given user history
   •  Combines post history with incoming data
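Forum entropy can be illustrated as the Shannon entropy of how a user's past posts are spread across forums. This is a sketch under that assumption (base-2 logarithm, forum ids as the only input); the slides do not give the exact formula, and forum likelihood is not covered here.

```python
from collections import Counter
from math import log2


def forum_entropy(forum_ids):
    """Entropy (bits) of a user's post distribution across forums.

    forum_ids: one forum id per past post by the user.
    0 means all activity is in a single forum; higher values mean the
    user's posts are spread over more forums.
    """
    total = len(forum_ids)
    if total == 0:
        return 0.0
    counts = Counter(forum_ids)
    return sum((c / total) * log2(total / c) for c in counts.values())


focused_user = forum_entropy([388, 388, 388, 388])
spread_user = forum_entropy([246, 388, 411, 7])
```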


Experiment for identifying seed posts

•  Used all posts from Boards.ie in 2006

•  Built features using a 6-month window prior to seed post date

•  Evaluated a binary classification task
– Is this post a seed post or not?
– Precision, Recall, F1 and Accuracy
– Tested: user, content, focus features, and their combinations

Posts Seeds Non-Seeds Replies Users

1,942,030 90,765 21,800 1,829,465 29,908


Identifying seeds with different types of features

activity levels, and because it has already been used in other investigations (e.g., [14]).

Boards.ie does not provide explicit social relations between community members, unlike for example Facebook and Twitter. We followed the same strategy proposed in [3] for extracting social networks from Digg, and built the Boards.ie social network for users, weighting edges cumulatively by the number of replies between any two users.

TABLE I
DESCRIPTION OF THE BOARDS.IE DATASET

Posts        Seeds     Non-Seeds    Replies      Users
1,942,030    90,765    21,800       1,829,465    29,908

In order to derive our features we required a window of n-days from which the social graph can be compiled and relevant measurements taken. Based on previous work over the same dataset in [14], we used a similar window of 188 days (roughly 6-months) prior to the post date of a given seed or non-seed post. For instance, if a seed post p is made at time t, then our window from which the features (i.e., user and focus features) are derived is from t − 188 to t − 1. In using this heuristic we ensure that the features compiled for each post are independent of future outcomes and will not bias our predictions - for example a user may increase their activity following the seed post which would not be a true indicator of their behaviour at the time the post was made. Table I summarises the dataset and the number of posts (seeds, non-seeds and replies) and users contained within.
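The window from t − 188 to t − 1 can be expressed with day-level dates as below. The `(date, author)` history format is a hypothetical stand-in for the Boards.ie post records.

```python
from datetime import date, timedelta

WINDOW_DAYS = 188  # roughly six months, as stated in the text


def feature_window(post_date, window_days=WINDOW_DAYS):
    """Inclusive (start, end) dates of the window preceding a post."""
    return post_date - timedelta(days=window_days), post_date - timedelta(days=1)


def posts_in_window(history, post_date):
    """Keep only history entries that fall inside the feature window.

    history: list of (date, author) tuples.
    Entries on or after post_date are excluded, so features never
    depend on what happened after the post was made.
    """
    start, end = feature_window(post_date)
    return [entry for entry in history if start <= entry[0] <= end]


history = [
    (date(2005, 12, 31), "ann"),  # too old
    (date(2006, 1, 1), "ann"),    # first day of the window
    (date(2006, 7, 7), "bob"),    # last day of the window
    (date(2006, 7, 8), "bob"),    # same day as the post: excluded
]
selected = posts_in_window(history, date(2006, 7, 8))
```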

V. CLASSIFICATION: DETECTING SEED POSTS

Predicting discussion activity levels is often hindered by including posts that yield no replies. We alleviate this problem by differentiating between seed posts and non-seeds through a binary classification task. Once seed posts have been identified we then attempt to predict the level of discussion that such posts will generate. To this end, we look for the best classifier for identifying seed and non-seed posts and then search for the features that played key roles in distinguishing seed posts from non-seeds, thereby observing key features that are associated with discussions.

A. Experimental SetupFor our experiments we are using the previously described

dataset collected from Boards.ie containing both seeds andnon-seeds throughout 2006. For our collection of posts webuilt the content, user, and focus features listed in section IIIfrom the past 6 months of data leading up to the date on whichthe post was published - thereby ensuring no bias from futureevents in our dataset. We split the dataset into 3 sets using a70/20/10% random split, providing a training set, a validationset and a test set.

Our first task was to perform model selection by testing four different classifiers: SVM, Naive Bayes, Maximum Entropy and J48 decision tree, when trained on various individual feature sets and their combinations: user features, content features and focus features. This model selection phase was performed by training each classifier, together with the combination of features, using the 70% training split and labelling instances in the held-out 20% validation split.

Once we had identified the best performing model - i.e., the classifier and combination of feature sets that produces the highest F1 value - our second task was to perform feature assessment, thereby identifying key features that contribute significantly to seed post prediction accuracy. For this we trained the best performing model from the model selection phase over the training split and tested its classification accuracy over the 10% test split, dropping individual features from the model and recording the reduction in accuracy following the omission of a given feature. Given that we are performing a binary classification task we use the standard performance measures for such a scenario: precision, recall and F-measure, setting β = 1 for an equal weighting of precision and recall. We also measure the area under the Receiver Operating Characteristic curve to gauge the relationship between recall and fallout - i.e., the false positive rate.
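For reference, precision, recall and the F-measure with β = 1 can be computed from the counts of a single class as in the sketch below. The function name and the example counts are illustrative assumptions. How the reported scores were aggregated across the seed and non-seed classes is not stated in this excerpt, so this sketch is not expected to reproduce the table values.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Compute precision, recall and F-beta from per-class counts.

    tp, fp and fn are true positives, false positives and false negatives.
    With beta = 1, precision and recall are weighted equally (F1).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hypothetical counts: 8 seeds found, 2 false alarms, 2 seeds missed.
p, r, f1 = precision_recall_fbeta(tp=8, fp=2, fn=2)
```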

TABLE II. RESULTS FROM THE CLASSIFICATION OF SEED POSTS USING VARYING FEATURE SETS AND CLASSIFICATION MODELS

Feature set       Classifier    P      R      F1     ROC
User              SVM           0.775  0.810  0.774  0.581
User              Naive Bayes   0.691  0.767  0.719  0.540
User              Max Ent       0.776  0.806  0.722  0.556
User              J48           0.778  0.809  0.734  0.582
Content           SVM           0.739  0.804  0.729  0.511
Content           Naive Bayes   0.730  0.794  0.740  0.616
Content           Max Ent       0.758  0.806  0.730  0.678
Content           J48           0.795  0.822  0.783  0.617
Focus             SVM           0.649  0.805  0.719  0.500
Focus             Naive Bayes   0.710  0.737  0.722  0.588
Focus             Max Ent       0.649  0.805  0.719  0.586
Focus             J48           0.649  0.805  0.719  0.500
User + Content    SVM           0.790  0.808  0.727  0.509
User + Content    Naive Bayes   0.712  0.772  0.732  0.593
User + Content    Max Ent       0.767  0.807  0.734  0.671
User + Content    J48           0.795  0.821  0.779  0.675
User + Focus      SVM           0.776  0.810  0.776  0.583
User + Focus      Naive Bayes   0.699  0.778  0.724  0.585
User + Focus      Max Ent       0.771  0.806  0.722  0.607
User + Focus      J48           0.777  0.810  0.742  0.617
Content + Focus   SVM           0.750  0.805  0.729  0.511
Content + Focus   Naive Bayes   0.732  0.787  0.746  0.658
Content + Focus   Max Ent       0.762  0.807  0.731  0.692
Content + Focus   J48           0.798  0.823  0.787  0.662
All               SVM           0.791  0.808  0.727  0.510
All               Naive Bayes   0.724  0.780  0.740  0.637
Content + Focus and All rows use the same four classifiers.
All               Max Ent       0.768  0.808  0.733  0.688
All               J48           0.798  0.824  0.792  0.692
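The model selection step, which picks the classifier and feature set with the highest F1, can be replayed over the F1 column of Table II. The values below are copied from that table; the dictionary layout and function name are assumptions made for this sketch.

```python
# F1 values copied from Table II: (feature set, classifier) -> F1
F1_SCORES = {
    ("User", "SVM"): 0.774, ("User", "Naive Bayes"): 0.719,
    ("User", "Max Ent"): 0.722, ("User", "J48"): 0.734,
    ("Content", "SVM"): 0.729, ("Content", "Naive Bayes"): 0.740,
    ("Content", "Max Ent"): 0.730, ("Content", "J48"): 0.783,
    ("Focus", "SVM"): 0.719, ("Focus", "Naive Bayes"): 0.722,
    ("Focus", "Max Ent"): 0.719, ("Focus", "J48"): 0.719,
    ("User + Content", "SVM"): 0.727, ("User + Content", "Naive Bayes"): 0.732,
    ("User + Content", "Max Ent"): 0.734, ("User + Content", "J48"): 0.779,
    ("User + Focus", "SVM"): 0.776, ("User + Focus", "Naive Bayes"): 0.724,
    ("User + Focus", "Max Ent"): 0.722, ("User + Focus", "J48"): 0.742,
    ("Content + Focus", "SVM"): 0.729, ("Content + Focus", "Naive Bayes"): 0.746,
    ("Content + Focus", "Max Ent"): 0.731, ("Content + Focus", "J48"): 0.787,
    ("All", "SVM"): 0.727, ("All", "Naive Bayes"): 0.740,
    ("All", "Max Ent"): 0.733, ("All", "J48"): 0.792,
}


def best_model(scores):
    """Return ((feature set, classifier), F1) for the highest F1 value."""
    return max(scores.items(), key=lambda entry: entry[1])


(best_features, best_classifier), best_f1 = best_model(F1_SCORES)
```

On these values the selection returns J48 trained on all features (F1 = 0.792).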

B. Results: Model Selection

1) Model Selection with Individual Features: The results from our first experiments are shown in Table II. Looking first at individual feature sets - e.g., SVM together with user features - we see that content features yield improved predictive performance over user and focus features. On discussion forums content appears to play a more central role

Positive/negative impact of features on Boards.ie

•  What are the most important features for predicting seed posts?

•  Correlations:
– Referral counts (non-seeds)
– Forum likelihood (seeds)
– Informativeness (non-seeds)
– Readability (seeds)
– User age (non-seeds)


TABLE III. REDUCTION IN F1 LEVELS AS INDIVIDUAL FEATURES ARE DROPPED FROM THE J48 CLASSIFIER

Feature Dropped     F1
-                   0.815
Post Count          0.815
In-Degree           0.811*
Out-Degree          0.811*
User Age            0.807***
Post Rate           0.815
Forum Entropy       0.815
Forum Likelihood    0.798***
Post Length         0.810**
Complexity          0.811**
Readability         0.802***
Referral Count      0.793***
Time in Day         0.810**
Informativeness     0.801***
Polarity            0.808***

Signif. codes: p-value < 0.001 ***, 0.01 **, 0.05 *, 0.1 .
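To make the feature assessment easier to read, the F1 reductions can be computed from Table III and ranked. The values are copied from the table; the names and the rounding to three decimals are assumptions made for this sketch, and the significance markers are left out.

```python
BASELINE_F1 = 0.815  # F1 of the J48 model with no feature dropped (Table III)

# F1 after dropping each feature, copied from Table III
F1_AFTER_DROP = {
    "Post Count": 0.815, "In-Degree": 0.811, "Out-Degree": 0.811,
    "User Age": 0.807, "Post Rate": 0.815, "Forum Entropy": 0.815,
    "Forum Likelihood": 0.798, "Post Length": 0.810, "Complexity": 0.811,
    "Readability": 0.802, "Referral Count": 0.793, "Time in Day": 0.810,
    "Informativeness": 0.801, "Polarity": 0.808,
}


def rank_by_reduction(baseline, after_drop):
    """Return (feature, F1 reduction) pairs, largest reduction first."""
    reductions = {
        name: round(baseline - f1, 3) for name, f1 in after_drop.items()
    }
    return sorted(reductions.items(), key=lambda entry: entry[1], reverse=True)


ranked = rank_by_reduction(BASELINE_F1, F1_AFTER_DROP)
top_five = [name for name, _ in ranked[:5]]
```

The five largest reductions come from Referral Count, Forum Likelihood, Informativeness, Readability and User Age.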

hyperlinks (e.g., ads and spam). This contrasts with work on Twitter which found that tweets containing many links were more likely to get ‘retweeted’ [11].

The boxplot for Forum Likelihood shows a correlation between seed posts and higher values of the likelihood measure, suggesting that users who frequently post in the same forums are more likely to start a discussion. Also, if a user often posts in discussion forums, while concentrating on only a few select forums, then the likelihood that a new post is within one of those forums is high.

Fig. 3. Boxplots showing the correlation of feature values with seed and non-seed posts within the training split

VI. REGRESSION: PREDICTING DISCUSSION ACTIVITY

Early detection of lengthy discussions helps analysts and managers to focus attention on where activity and topical debates are about to occur. In this section we predict the level of discussion activity that seed posts will generate and identify which features are key indicators of lengthy discussions. We use regression models that induce a function describing the relationship between the level of discussion activity and our user, content and focus features. By learning such a function we can identify patterns in the data and correlations between our dependent variable and the range of predictor variables that we have.

Fig. 4. Discussion Activity Length Distribution

A. Experimental Setup

Forecasting the exact number of replies (discussion activity) is limited if the distribution of known reply lengths has a large skew to either the minimum or maximum. For predicting popular tweets, Hang et al [12] adopted a multiclass classification setting to deal with the large skew in the dataset by predicting retweet count ranges. We have a similar scenario in our Boards.ie dataset, where a large number of seed posts yield fewer than 20 replies (Figure 4). In such cases, standard regression error measures such as Relative Absolute Error produce inaccurate assessments of the predictions, because they compare against a simple predictor based on the mean of the target variable.

In our experiments we instead use the Normalised Discounted Cumulative Gain (nDCG) at varying rank positions, looking at the performance of our predictions over the top-k documents where k = {1, 5, 10, 20, 50, 100}. nDCG is derived by dividing the Discounted Cumulative Gain (DCG) of the predicted ranking by the DCG of the actual ranking (iDCG). DCG is well suited to our setting, given that we wish to predict the most popular posts and then expand that selection to assess growing ranks, as the measure penalises elements in the ranking that appear lower down when in fact they should be higher up. We define DCG formally, based on the definition from [9], as:

$$ DCG_k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(1 + i)} \qquad (5) $$
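The DCG definition in Equation (5), together with the normalisation by iDCG described above, can be sketched as follows. Using the actual reply count of each post as its relevance rel_i is an assumption made for this example, as are the function names and the toy data.

```python
import math


def dcg_at_k(relevances, k):
    """DCG over the first k positions: sum of rel_i / log2(1 + i), i from 1."""
    return sum(
        rel / math.log2(1 + i)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(predicted_scores, true_replies, k):
    """nDCG@k: DCG of the predicted ranking divided by the ideal DCG.

    Posts are ranked by predicted score (highest first); the relevance of
    each post is taken to be its actual reply count (an assumption).
    """
    order = sorted(
        range(len(true_replies)),
        key=lambda j: predicted_scores[j],
        reverse=True,
    )
    ranked_relevances = [true_replies[j] for j in order]
    ideal_relevances = sorted(true_replies, reverse=True)
    ideal = dcg_at_k(ideal_relevances, k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Toy example: three seed posts with 3, 2 and 1 actual replies.
true_replies = [3, 2, 1]
perfect = ndcg_at_k([0.9, 0.5, 0.1], true_replies, k=3)   # same order as truth
reversed_ = ndcg_at_k([0.1, 0.5, 0.9], true_replies, k=3)  # worst ordering
```

A ranking that matches the true ordering scores 1.0, and any ranking that places popular posts lower down scores less.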

For our experiments we first identify the best performing regression model before moving on to analysing the coefficients of that model and the patterns in the data that lead to increased discussion activity. For our model selection phase we test three regression models: Linear regression, Isotonic

•  Can we predict the level of discussion activity?

Predicting Discussion Activity in Boards.ie


•  What impact do features have on discussion length?
– Assessed Linear Regression model with focus and content features
– Forum Likelihood (pos)
– Content Length (+/neutral)
– Complexity (pos)
– Readability (+/neutral)
– Referral Count (neg)
– Time in Day (+/neutral)
– Informativeness (-/neutral)
– Polarity (neg)

Predicting Discussion Activity in Boards.ie

Stay tuned

•  More communities
– SAP, IBM, StackOverflow, Reddit
– Compare impact of features on their dynamics

•  Better behaviour analysis
– Fewer features, more forums/communities, more graphs!
– Healthy? posts, reciprocation, discussions, sentiment mixture

•  Churn analysis
– Correlation of features/behaviour to ‘bounce rate’

•  Intervention!
– Opportunities and mechanisms to influence behaviour


Upcoming events


Intelligent Web Services Meet Social Computing AAAI Spring Symposium 2012,

March 26-28, Stanford, California

http://vitvar.com/events/aaai-ss12
Deadline: October 7, 2011

Social Object Networks IEEE Social Computing, 2011 October 9-10, Boston, USA

http://ir.ii.uam.es/socialobjects2011/
Deadline: August 5, 2011

Questionnaire on user needs

http://socsem.open.ac.uk/limesurvey/index.php?sid=55487

The questionnaire aims to identify the needs that users have within online communities and to learn the factors and issues that influence those needs.


My social semantics team

Thanks to


Sofia Angeletou Research Associate

Matthew Rowe Research Associate

Alain Barrat CPT Marseille & ISI

Martin Szomszor CeRC, City University, UK

Wouter van Den Broeck ISI, Turin

Ciro Cattuto ISI, Turin

Live Social Semantics team

Gianluca Correndo, Uni Southampton Ivan Cantador, UAM, Madrid

STI International ESWC09/10 & HT09 chairs and organisers

All LSS participants

Acknowledgements
