
BCS SGAI Workshop on Social Media Analysis, 10th December 2013

Mining Newsworthy Topics from Social Media

Carlos Martin, David Corney and Ayse Goker (Robert Gordon University)
Andrew MacFarlance (City University London)


Outline

• Introduction & Motivation
• BNgram approach
• Further modifications
• Experiments
• Demo
• Conclusions and Future work
• References


Introduction & Motivation

• Newsworthy stories are increasingly shared through social networking platforms such as Twitter and Reddit.

• Journalists use social media to rapidly discover stories and eyewitness accounts.


Introduction & Motivation

• Other tools to detect newsworthy stories:
  – Twitter trends – http://www.twitter.com
  – Trendsmap – http://trendsmap.com/
  – NewsWhip – http://www.newswhip.com/


Introduction & Motivation

• Gap in the market:
  – Story descriptions are incomplete or unclear (based on the use of hashtags and entities)
  – Use of mainstream media

• Proposal: an approach to detect newsworthy stories in real time from Twitter, where the story description is complete and posts from social network users are associated with each story
  – Journalists and news readers don't get overwhelmed.


BNgram approach

• Detection of the most representative topics in a timeslot, with special emphasis on the temporal dimension of the data.

1. Detection of emerging phrases (word n-grams) based on the df-idf_t score, a variant of tf-idf.

Ranking of n-grams per timeslot, sorted by df-idf_t and avoiding overlaps. Boost factor: named entity recognition (Stanford), using a 3-class classifier (person, location and organization).


$$
\text{df-idf}_t = \frac{df_i + 1}{\log\left(\frac{\sum_{j=i}^{t} df_{i-j}}{t} + 1\right) + 1} \cdot boost
$$
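As a rough illustration of the df-idf_t score, the sketch below divides the current slot's document frequency (plus one) by a logarithm of the average frequency over the t previous slots, then applies the boost. The function name, the use of the natural logarithm and the example boost of 1.5 are assumptions made for illustration, not values given on the slide.

```python
import math

def df_idf_t(df_current, df_history, boost=1.0):
    """Score one n-gram for the current slot.

    df_current: number of tweets in the current slot containing the n-gram.
    df_history: counts for the same n-gram in the t previous slots.
    boost: multiplier for n-grams with a named entity (value is an assumption).
    """
    t = len(df_history)
    # Average document frequency over the previous t slots (0 if no history).
    mean_past = sum(df_history) / t if t else 0.0
    return (df_current + 1) / (math.log(mean_past + 1) + 1) * boost

# An n-gram that suddenly appears in many tweets scores high ...
bursty = df_idf_t(50, [1, 0, 2, 1])
# ... while one that was already frequent in earlier slots scores lower.
steady = df_idf_t(50, [48, 52, 50, 49])
# Hypothetical boost of 1.5 for an n-gram containing a named entity.
boosted = df_idf_t(50, [1, 0, 2, 1], boost=1.5)
```

The denominator is what makes the score temporal: the same current frequency counts for less when the n-gram was already common in earlier slots.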


BNgram approach

2. Hierarchical clustering of the top k n-grams with the highest df-idf_t scores. The score of a topic is the maximum df-idf_t of its n-grams.
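A minimal sketch of this step, under stated assumptions: n-grams are compared by the fraction of tweets they share, clusters are merged with group-average linkage, and merging stops at a distance threshold of 0.5. The distance measure, the linkage, the threshold and the scores are all illustrative choices, not details taken from the slide. The tweet ids come from the US elections example used later in the talk.

```python
def distance(a, b):
    """1 minus the share of common tweets, relative to the smaller set (assumed measure)."""
    return 1.0 - len(a & b) / min(len(a), len(b))

def cluster_ngrams(tweets_by_ngram, threshold=0.5):
    """Agglomerative clustering; each n-gram ends up in exactly one cluster."""
    clusters = [[g] for g in tweets_by_ngram]

    def avg_dist(c1, c2):
        total = sum(distance(tweets_by_ngram[x], tweets_by_ngram[y])
                    for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Closest pair of clusters under group-average linkage.
        d, i, j = min((avg_dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d >= threshold:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

tweets_by_ngram = {
    "Barack Obama wins": {1, 2, 4, 6, 7, 8, 9, 10},
    "wins Wisconsin": {1, 2, 4, 6},
    "wins California": {7, 8, 9, 10},
}
scores = {"Barack Obama wins": 9.0, "wins Wisconsin": 6.0, "wins California": 5.5}  # hypothetical

clusters = cluster_ngrams(tweets_by_ngram)
# Topic score: the maximum df-idf_t among the n-grams of the cluster.
topic_scores = [max(scores[g] for g in c) for c in clusters]
```

Because each n-gram can join only one cluster, "wins California" is left on its own here even though it shares all of its tweets with "Barack Obama wins".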


BNgram approach

• Evaluation benchmark: comparison with four other TDT approaches (document-pivot and feature-pivot) and a baseline (LDA) – TMM paper (Aiello et al., 2013)

• User-centred evaluation:
  – Collections: FA Cup, Super Tuesday and US Elections (collected by tracking keywords).
  – Ground truth: a manually selected set of representative topics for different timeslots, taken from mainstream media (MSM).
    Timeslot size: FA Cup, 1 min.; Super Tuesday and US Elections, 10 min.
    Topics: 13 for FA Cup, 22 for Super Tuesday and 64 for US Elections.


BNgram approach

• Collections:


BNgram approach

• Results – TMM paper


Method                                     T-REC@2, FA Cup  T-REC@10, Super Tuesday  T-REC@10, US Elections
Latent Dirichlet Allocation (baseline)     0.6923           0                        0.1094
Document-pivot topic detection             0.7692           0.2273                   0.2344
Graph-based feature-pivot topic detection  0                0.0455                   0.0781
Frequent pattern mining                    0.3077           0.1364                   0
Soft frequent pattern mining               0.6154           0.1818                   0.3594
BNgram                                     0.7692           0.5                      0.4844
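For readers unfamiliar with the metric, the sketch below computes topic recall at N (T-REC@N) as the fraction of ground-truth topics found among the top N detected topics. The matching rule used here (every ground-truth keyword must appear in a detected topic) and the toy data are assumptions for illustration; the evaluation in the talk may match topics differently.

```python
def topic_recall_at_n(ranked_topics, ground_truth, n):
    """ranked_topics: list of term sets, best first.
    ground_truth: list of keyword sets, one per ground-truth topic."""
    top = ranked_topics[:n]
    # A ground-truth topic counts as detected if some top-N topic contains all its keywords.
    hits = sum(1 for keywords in ground_truth
               if any(keywords <= topic for topic in top))
    return hits / len(ground_truth)

detected = [
    {"romney", "wins", "virginia", "primary"},
    {"obama", "four", "more", "years"},
    {"santorum", "tennessee"},
]
truth = [{"romney", "virginia"}, {"gingrich", "georgia"}]  # toy ground truth

recall = topic_recall_at_n(detected, truth, n=2)
```

The reported values are consistent with the ground-truth sizes: for example, 10 of 13 FA Cup topics gives 0.7692 and 31 of 64 US Elections topics gives 0.4844.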


BNgram approach

• Examples of topics


FA Cup
  Detected topic: over line saved super cech claiming went @chelseafc carroll header liverpool #cfcwembley #facupfinal sl
  Corresponding story: Liverpool nearly score. Andy Carroll takes a shot. Petr Cech makes a fantastic save.
  Sample tweet: Liverpool nearly score. Andy Carroll takes a shot. Petr Cech makes a fantastic save.

Super Tuesday
  Detected topic: romney wins virginia republican presidential primary mitt @ap breaking
  Corresponding story: Fox/NBC is projecting Mitt Romney has won the Virginia primary.
  Sample tweet: @ap: BREAKING: Mitt Romney wins the Virginia Republican presidential primary. -RAS

US Elections
  Detected topic: @barackobama four more years
  Corresponding story: Several television networks report Obama has been reelected; Obama tweeted “Four more years”
  Sample tweet: @MessyNelle: @barackobama four more years http://t.co/6ortbfqt


Further modifications

• BNgram approach modifications:
  – Study of different types of n-grams.
  – Timeslots vs. number of tweet slots.
  – Clustering techniques tested for the BNgram approach: Apriori and GMM algorithms.
  – A new topic ranking technique has been considered.


N-grams

• Word order is often essential to meaning. For example, 'dog bites man' is not news, but 'man bites dog' is. A bag-of-words approach cannot distinguish these cases.
• N-grams are popular in NLP.
• In this work, n-grams refer to sequences of up to n consecutive terms.
• Copies of posts and retweets (RTs) are very frequent in Twitter space, and posts are focused because they are limited to 140 characters.
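The point about word order can be sketched in a few lines. The helper below, whose name and whitespace tokenisation are assumptions for illustration, extracts all word n-grams of up to n consecutive terms and shows that the two headlines share a bag of words but not their longer n-grams.

```python
def word_ngrams(text, n=3):
    """Return all sequences of 1 to n consecutive lower-cased terms."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + k])
            for k in range(1, n + 1)
            for i in range(len(tokens) - k + 1)]

news = word_ngrams("man bites dog")
not_news = word_ngrams("dog bites man")

# Same bag of words, different 2-grams and 3-grams.
same_bag = sorted("man bites dog".split()) == sorted("dog bites man".split())
shared_long = {g for g in news if " " in g} & {g for g in not_news if " " in g}
```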


Timeslots vs. Number of tweet slots

• What's the best timeslot size?
  – A small slot size leads to missed stories.
  – A large slot size delays some stories (refresh rate).
• Alternative: slots with a fixed number of tweets instead of a fixed time span. This needs only minimal changes to the approach.
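The difference between the two slotting schemes can be sketched as follows. Both helper names and the toy stream are hypothetical; tweets are assumed to arrive as (timestamp in seconds, text) pairs sorted by time.

```python
def slots_by_time(tweets, seconds):
    """Group tweets into consecutive windows of a fixed duration."""
    slots = {}
    for ts, text in tweets:
        slots.setdefault(ts // seconds, []).append(text)
    return [slots[k] for k in sorted(slots)]

def slots_by_count(tweets, size):
    """Group tweets into consecutive slots holding a fixed number of tweets."""
    texts = [text for _, text in tweets]
    return [texts[i:i + size] for i in range(0, len(texts), size)]

# A bursty toy stream: 2 tweets in the first minute, 6 in the second.
stream = [(0, "a"), (30, "b"), (60, "c"), (65, "d"),
          (70, "e"), (75, "f"), (80, "g"), (90, "h")]

time_sizes = [len(s) for s in slots_by_time(stream, 60)]
count_sizes = [len(s) for s in slots_by_count(stream, 4)]
```

With fixed-size slots every slot carries the same amount of evidence, and slots close faster when the tweet rate rises.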


Clustering approaches

• Weakness detected in our clustering technique:
  – Example: US elections n-gram ranking (sorted by df-idf_t):

    Position  N-gram             Docs
    #1        Barack Obama wins  1,2,4,6,7,8,9,10
    #2        wins Wisconsin     1,2,4,6
    #3        wins California    7,8,9,10

• Basic hierarchical clustering: incomplete stories.
  – From our example, the candidate clusters could be:
    • Cluster 1: Barack Obama wins + wins Wisconsin (complete)
    • Cluster 2: wins California (incomplete: who?)

• New grouping techniques where one n-gram can be assigned to different clusters.


Clustering approaches – Gaussian Mixture Models (GMM)

• Unsupervised method.
• Assigns probabilities (or strengths) of membership of each n-gram to each cluster: partial membership.
• Iterative approach: it tries to find the parameters of the probability distribution that maximise the likelihood of the observed attributes.
• Input: number of clusters, chosen with the Bayesian Information Criterion (BIC).


Clustering approaches – Gaussian Mixture Models (GMM)

• Expectation-Maximisation, two steps:
  – E-step: estimates the probability that each point belongs to each cluster.
  – M-step: re-estimates the parameter vector of the probability distribution of each class.
• The algorithm finishes when the distribution parameters converge or the maximum number of iterations is reached.
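The two steps can be sketched on one-dimensional data. This is a toy, pure-Python version for illustration only: the feature representation of n-grams used in the talk is not given here, and the starting means, unit starting variances, tolerance and data points are all assumptions.

```python
import math

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(points, means, iterations=50, tol=1e-6):
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: probability that each point belongs to each cluster.
        resp = []
        for x in points:
            p = [weights[c] * gaussian(x, means[c], variances[c]) for c in range(k)]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: re-estimate weight, mean and variance of each component.
        new_means = []
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            mean = sum(r[c] * x for r, x in zip(resp, points)) / n_c
            var = sum(r[c] * (x - mean) ** 2 for r, x in zip(resp, points)) / n_c
            weights[c] = n_c / len(points)
            variances[c] = max(var, 1e-6)  # guard against a collapsing component
            new_means.append(mean)
        # Stop when the means no longer move, or after the iteration limit.
        converged = max(abs(a - b) for a, b in zip(means, new_means)) < tol
        means = new_means
        if converged:
            break
    return means, variances, weights, resp

points = [0.9, 1.0, 1.1, 2.5, 4.9, 5.0, 5.1]
means, variances, weights, resp = em_gmm_1d(points, means=[0.0, 6.0])
```

Each row of `resp` holds the membership probabilities of one point across the clusters, which is the partial membership that lets an n-gram contribute to more than one topic.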


Clustering approaches - Apriori algorithm

• Explores associations between n-grams based on the number of tweets they share.

• Number of n-grams per association: each association contains between 1 n-gram and the number of n-grams considered from the ranking.

• An association is kept if the number of tweets shared by its n-grams is greater than a threshold (the support value).

• In a later step, the maximal associations are extracted to avoid overlaps.


Clustering approaches - Apriori algorithm

• From the previous example (with a threshold of 3):
  – Candidate associations: #1, #2, #3, #1#2, #1#3
  – Maximal associations: #1#2, #1#3
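The example can be reproduced with a small Apriori-style search over the tweet ids from the US elections ranking. This is a sketch rather than the implementation used in the talk: the function names are hypothetical, and the join step relies on the support check alone to discard candidates with an infrequent subset.

```python
from itertools import combinations

docs = {
    "#1": {1, 2, 4, 6, 7, 8, 9, 10},  # Barack Obama wins
    "#2": {1, 2, 4, 6},               # wins Wisconsin
    "#3": {7, 8, 9, 10},              # wins California
}

def frequent_associations(docs, threshold):
    """Return every set of n-grams whose number of shared tweets exceeds the threshold."""
    frequent = []
    level = [frozenset([g]) for g in docs if len(docs[g]) > threshold]
    while level:
        frequent.extend(level)
        # Join step: candidates one n-gram larger, built from the surviving sets.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates
                 if len(set.intersection(*(docs[g] for g in c))) > threshold]
    return frequent

def maximal(associations):
    """Keep only associations that are not contained in a larger one."""
    return [a for a in associations if not any(a < b for b in associations)]

found = frequent_associations(docs, threshold=3)
topics = maximal(found)
```

Here "#1" ends up in both maximal associations, so "wins California" is no longer separated from "Barack Obama wins".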


Topic ranking

• Taking the maximum df-idf_t of a topic's n-grams is not the best alternative for these new clustering techniques, where one n-gram can belong to several topics.

• It is unsuitable for slots with active and diverse topics.

[Figure: n-gram ranking (n-gram1, n-gram2) mapped to topic ranking (topic1 to topic5)]


Topic ranking

• Weighted topic-length approach:

where s_t is the score of topic t, L_t is the length of the topic, L_max is the maximum number of terms in any topic from the current slot, N_t is the number of tweets in topic t and N_s is the number of tweets in the slot. Finally, α is a weighting term.
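One way to combine these quantities, assumed here purely for illustration, is a weighted sum of the relative topic length and the relative tweet volume: s_t = α · L_t / L_max + (1 − α) · N_t / N_s. The form of the expression, the function name and the example values are assumptions based on the variable definitions, not a formula quoted from the slide.

```python
def topic_score(length, max_length, topic_tweets, slot_tweets, alpha):
    """Assumed weighted topic-length score (illustrative form only)."""
    length_part = length / max_length        # L_t / L_max
    volume_part = topic_tweets / slot_tweets  # N_t / N_s
    return alpha * length_part + (1 - alpha) * volume_part

# Hypothetical topics in a slot of 1500 tweets whose longest topic has 12 terms.
long_small = topic_score(12, 12, 150, 1500, alpha=0.5)
short_big = topic_score(3, 12, 600, 1500, alpha=0.5)
```

Under this reading, α near 1 favours long, descriptive topics and α near 0 favours topics backed by many tweets.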


Evaluation

• We have estimated the starting and ending times of each event in the ground-truth

[Figure: timeline of slots i-3 to i between the starting and ending times of an event; the top m topics of each slot are merged to evaluate the event]
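The merging step can be sketched as follows: for an event with estimated start and end times, take the top m topics from every slot that overlaps the event. The function name, the slot representation and the overlap rule are assumptions made for illustration.

```python
def topics_for_event(slots, start, end, m):
    """slots: list of (slot_start, slot_end, ranked_topics), times in seconds."""
    merged = []
    for slot_start, slot_end, topics in slots:
        # Keep slots whose time span overlaps the event's estimated span.
        if slot_start < end and slot_end > start:
            merged.extend(topics[:m])
    return merged

slots = [
    (0, 60, ["a1", "a2", "a3"]),
    (60, 120, ["b1", "b2"]),
    (120, 180, ["c1"]),
    (180, 240, ["d1", "d2"]),
]

candidates = topics_for_event(slots, start=70, end=130, m=2)
```

The merged list is then compared against the ground-truth description of the event.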


Experiments – n-grams

• Topic recall for different types of n-grams on the three datasets, using hierarchical clustering and maximum n-gram topic ranking, with the slot size fixed to 1000 tweets (similar patterns were observed with other configurations).


Experiments – n-grams

• Normalised area under the curve for the three datasets and their weighted average.
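A normalised area under a topic-recall curve, and its weighted average over datasets, could be computed as sketched below. The trapezoidal rule, the normalisation by the width of the N range, the weighting by the number of ground-truth topics (13, 22 and 64) and the recall values are all assumptions for illustration.

```python
def normalised_auc(ns, recalls):
    """Trapezoidal area under recall-vs-N, divided by the area of a perfect curve."""
    area = sum((ns[i + 1] - ns[i]) * (recalls[i] + recalls[i + 1]) / 2
               for i in range(len(ns) - 1))
    return area / (ns[-1] - ns[0])  # recall 1.0 at every N gives 1.0

def weighted_average(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical recall curves at N = 1, 5, 10 for the three datasets.
ns = [1, 5, 10]
aucs = [
    normalised_auc(ns, [0.4, 0.7, 0.8]),  # FA Cup
    normalised_auc(ns, [0.1, 0.3, 0.5]),  # Super Tuesday
    normalised_auc(ns, [0.1, 0.3, 0.5]),  # US Elections
]
overall = weighted_average(aucs, weights=[13, 22, 64])
```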


Experiments – slot size

• Topic recall for different slot sizes using hierarchical clustering and weighted topic-length topic ranking (3-grams).

• Possible correlation between slot size and tweet rate in tweets per minute (Super Tuesday: 832 tpm, FA Cup: 1293 tpm, US elections: 2209 tpm).

• Consider the refresh rate of the UI.


Experiments – clustering and topic ranking techniques

• Topic recall for different clustering techniques on the three datasets, using both topic ranking techniques (3-grams and slot size = 1500 tweets)


Experiments – clustering and topic ranking techniques

• Normalised area under the curve


Demo

• Social Sensor project – http://www.socialsensor.eu


Conclusions and Future work

• New TDT approach based on the temporal dimension of data and n-grams in Twitter space
• Improve tracking issues – ongoing
• Trust and verification based on following newshounds – ongoing
• Improve topic titles – ongoing
• Better association of tweets to topics – ongoing
• Improve evaluation methods/metrics
• Smoothing techniques for df-idf_t computation
• Entity recognition – other approaches (Illinois NLP tools, …)
• Participation in TDT challenges (SNOW14)


References

• Aiello, L., Petkos, G., Martin, C., Corney, D., Papadopoulos, S., Skraba, R., Goker, A., Kompatsiaris, I., Jaimes, A.: Sensing trending topics in twitter. Multimedia, IEEE Transactions on 15(6) (2013) 1268–1282

• Martin, C., Corney, D., Goker, A.: Finding newsworthy topics on Twitter. IEEE Computer Society Special Technical Community on Social Networking E-Letter 1(3) (September 2013)

• Schifferes, S., Newman, N., Thurman, N., Corney, D., Göker, A., Martin, C.: Identifying and verifying news through social media: Developing a user-centred tool for professional journalists. In: The Future of Journalism Conference 2013, Cardiff, UK (2013)

• Spot the ball: Detecting sports events on Twitter. In proceedings of ECIR 2014, Amsterdam, Netherlands. (To appear)


Thank you!