University of Sheffield, NLP Module 6: Summarisation and Rumour Detection © The University of Sheffield, 1995-2013 This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike Licence

University of Sheffield, NLP

Module 6: Summarisation and Rumour Detection

© The University of Sheffield, 1995-2013. This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike Licence

About this tutorial

This tutorial will be a mixture of explanation, demos and hands-on work

Things for you to try yourself are in red

It assumes basic familiarity with the GATE GUI and with ANNIE and JAPE; no Java expertise is needed

Summarising Social Media

Information Overload and Social Media

Do we suffer from information overload?

Home users don't feel overloaded (Hargittai 2012)

But they can't keep up with everything and need help with filtering noise (Bontcheva 2013)

When overload arises (Hargittai 2012):

• Time sensitivity: Limitation of time for reviewing available information

• Decision requirement: Time constraints on actual decision making

• Structure of information: The extent to which information is structured, to help retrieval of relevant information

• Quality of information: Filter failure (Clay Shirky) or signal-to-noise ratio

What is Text Summarisation?

Types of Text Summarisation

• What is being summarised:

– Single document summarisation

– Multi-document summarisation

• What is in the summary:

– Extractive summarisation

– Abstractive summarisation

How are extractive summaries created?

1. Score textual units (e.g. sentences) according to some representation of the document or document set

2. Generate summaries by selecting high scoring textual units until some desired compression ratio has been achieved.
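The two steps above can be sketched in a few lines of Python. This is only an illustration, not SUMMA's implementation: the function names, the naive sentence splitter and the choice of average word frequency as the score are all assumptions made for the example.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    return re.findall(r"[a-z0-9']+", sentence.lower())


def summarise(text, ratio=0.3):
    """Return roughly ratio * (number of sentences) sentences, in document order."""
    sentences = split_sentences(text)
    if not sentences:
        return []

    # The "representation of the document" here is just a word-frequency table.
    freq = Counter(tok for s in sentences for tok in tokenize(s))

    def score(sentence):
        toks = tokenize(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in toks) / len(toks) if toks else 0.0

    # Step 1: score every textual unit (sentence).
    scored = [(score(s), i) for i, s in enumerate(sentences)]

    # Step 2: keep the highest scoring units until the compression ratio is met.
    budget = max(1, round(ratio * len(sentences)))
    best = sorted(scored, key=lambda p: (-p[0], p[1]))[:budget]
    return [sentences[i] for _, i in sorted(best, key=lambda p: p[1])]
```

A real system would normally remove stop words and redundant sentences first; both are left out here to keep the sketch short.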

Scoring methods

1. Frequency based methods

2. Sentence position in the document

3. Centrality scores
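As a rough illustration of methods 2 and 3, the sketch below scores sentences by position and by centrality. The formulas are assumptions chosen for simplicity (reciprocal rank for position, average cosine similarity to the other sentences for centrality); they are not taken from the slides or from any particular toolkit.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def position_scores(sentences):
    # Earlier sentences score higher: 1, 1/2, 1/3, ...
    return [1.0 / (i + 1) for i in range(len(sentences))]


def centrality_scores(sentences):
    # A sentence is central if it is, on average, similar to all the others.
    vectors = [bag_of_words(s) for s in sentences]
    scores = []
    for i, v in enumerate(vectors):
        sims = [cosine(v, w) for j, w in enumerate(vectors) if j != i]
        scores.append(sum(sims) / len(sims) if sims else 0.0)
    return scores
```

The individual scores can then be combined, for example as a weighted sum, before the top sentences are selected.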

SUMMA: Text Summarisation in GATE

• Implemented by Horacio Saggion (now at UPF)

• GATE Plugin available from http://www.taln.upf.edu/pages/summa.upf/

• Both single- and multi-document extractive summarisation

Single-document summarisation example

Summary of 4 news articles: Example

• Multi-document summarisation example

• 4 news articles (BBC, Guardian, Independent, Telegraph) on the passport crisis to be summarised

• 10% of all sentences to be selected

Google Product Review Summaries

Sentiment Summarisation in Twitter

Sentiment Summarisation in Twitter

Micropinions (1)

• Concise opinion snippets

– Non-redundant, readable, representative

Kavita Ganesan, ChengXiang Zhai, and Evelyne Viegas. Micropinion generation: an unsupervised approach to generating ultra-concise summaries of opinions. In Proceedings of the 21st International Conference on World Wide Web, WWW ’12, pages 869–878, 2012

Micropinions (2)

• How:

– Find high frequency unigrams

– From these generate all possible bigrams

– Merge overlapping bigrams to produce longer phrases as long as they are:

• Readable

• Representative
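The three steps can be sketched as follows. This is a toy illustration and not the method of Ganesan et al: the readability and representativeness tests used here (the phrase must actually occur in the input, and a bigram must occur at least min_bigram times) are simplifying stand-ins, and all names and thresholds are hypothetical.

```python
import re
from collections import Counter
from itertools import permutations


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def micropinion_candidates(texts, min_unigram=2, min_bigram=2):
    docs = [tokenize(t) for t in texts]
    unigram_counts = Counter(tok for d in docs for tok in d)
    observed_bigrams = Counter(pair for d in docs for pair in zip(d, d[1:]))
    observed_trigrams = {tri for d in docs for tri in zip(d, d[1:], d[2:])}

    # Step 1: high frequency unigrams.
    frequent = [w for w, c in unigram_counts.items() if c >= min_unigram]

    # Step 2: all ordered pairs of distinct frequent unigrams, kept only if the
    # pair is seen often enough (stand-in for "representative").
    bigrams = [p for p in permutations(frequent, 2)
               if observed_bigrams[p] >= min_bigram]

    # Step 3: merge overlapping bigrams (a b) + (b c) -> (a b c), kept only if
    # the merged phrase occurs verbatim in the input (stand-in for "readable").
    phrases = set()
    for a, b in bigrams:
        for b2, c in bigrams:
            if b == b2 and (a, b, c) in observed_trigrams:
                phrases.add(" ".join((a, b, c)))

    return sorted(" ".join(p) for p in bigrams), sorted(phrases)
```

Repeating the merge step on its own output would grow phrases beyond three words.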

Micropinions: Summarising Opinions

• Concise opinion snippets (Ganesan et al, 2012)

• Non-redundant, readable, representative

• How:

• Find high frequency unigrams

• From these generate all possible bigrams

• Merge overlapping bigrams to produce longer phrases iff they are readable and representative

• Example:

Summarising Tweets

The challenge: making sense of streams

Why is Social Media Summarization Hard?

Short messages (microtexts), URLs, #tags

Noisy:

– Unusual spelling (2moro)

– Bad capitalisation

– Emoticons

– Idiosyncratic abbreviations (ROFL)

– Large variance in styles

Temporal

Social context:

– Our relationship to the tweet’s author influences importance

User-generated:

– Gender

– Location

– Age

Multi-lingual: fewer than 50% of tweets are in English

Key Questions in Tweet Summarization

• What is a summary for a set of tweets?

• Select phrases/entities from the tweets, as a high level overview

• Extract opinions and summarise these

• Select the most representative subset

• Temporal aspect:

Are more recent tweets on a topic more important?

• How to make use of natural topic groupings like user lists and hash tags?

Term/Entity Clouds as Frequency-based Summaries

Pros:

– Give a high level topic and entity summary/overview of disparate tweets

– Easily understood and widely used

– Good starting point for interactive summarisation

Cons:

– Do not show opinions

– Frequency ≠ Interesting/Important

Took a random sample of 450 tweets from news agencies (BBC, Guardian, CNN)

Ran NE recognition and then plotted based on frequency

Improvements: normalise names (Nixon vs Nixson), handle co-reference (Murdoch vs James Murdoch)
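The counting behind such a cloud, together with the two suggested improvements, might look like the sketch below. It assumes NE recognition has already produced a list of entity mentions; the alias table and the surname-folding rule are hypothetical simplifications, not part of GATE or ANNIE.

```python
from collections import Counter


def entity_frequencies(mentions, aliases=None):
    """Count entity mentions for a term/entity cloud.

    mentions: entity strings produced by an NE recogniser (assumed input).
    aliases:  hypothetical hand-made map from spelling variants to a canonical name.
    """
    aliases = aliases or {}

    # Improvement 1: normalise names, e.g. "Nixson" -> "Nixon".
    counts = Counter(aliases.get(m, m) for m in mentions)

    # Improvement 2 (very crude co-reference): fold a single-word mention into
    # a longer name ending in that word, but only if exactly one such name exists.
    full_names = [name for name in counts if " " in name]
    merged = Counter()
    for name, c in counts.items():
        target = name
        if " " not in name:
            matches = [f for f in full_names if f.split()[-1] == name]
            if len(matches) == 1:
                target = matches[0]
        merged[target] += c

    # Most frequent first; the counts would drive the font size in the cloud.
    return merged.most_common()
```

The uniqueness check matters: if both "James Murdoch" and "Rupert Murdoch" occur, a bare "Murdoch" is left unmerged rather than guessed.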

Tweet Ranking and Summarisation

Tweet User Geolocation

Rumour Detection

Veracity: the 4th V of Big Data

• We coined the term phemes

– memes are thematic motifs that spread through social media in ways analogous to genetic traits

– phemes add truthfulness and deception to the mix

– named after Pheme of ancient Greek mythology, “embodiment of fame and notoriety, her favour being notability, her wrath being scandalous rumours”

The PHEME project focuses on detecting rumours in social media http://pheme.eu

http://en.wikipedia.org/wiki/Pheme

Social Media is Rife with Phemes

Social Media is Rife with Phemes (2)

Social Media is Rife with Phemes (3)

The UK riots study

• 2.6M tweets harvested from Twitter ‘fire hose’ matching specified #tags.

• 700,000 individual accounts.

• What the corpus can reveal about:

– Reactions to events, both general and specific

– How information flows through social media

– Kinds of ‘actors’ involved and how they shape discourse

– How social media is used to inform, organise, etc

Procter, R., Vis, F. and Voss, A. (2013). Reading the riots on Twitter: methodological innovation for the analysis of big data. International Journal of Social Research Methodology, Special Issue on Computational Social Science: Research Strategies, Design & Methods.

Procter, R., Crump, J., Karstedt, S., Voss, A., & Cantijoch, M. (2013). Reading the riots: what were the police doing on Twitter? Policing and Society, 1-24.

Rumour analysis

• Technological challenges

– Analysis is post-hoc, on 7 known rumours

– Analysis and visualisations took months of researcher and programmer effort

• Rumours are challenging

– Some rumours could take days, weeks or even months to die out

– Ill-meaning humans can currently outsmart computers and appear genuine

Available Datasets

Observations:

• Misinformation and disinformation tend to be questioned more than facts, attract more affirmations and denials/refutations, and result in deeper conversation threads (Mendoza et al, 2011)

Datasets:

• Small set of tweets annotated with 5 rumours, classified as confirm, deny, and question (Qazvinian et al, 2011)

– Many of the tweets have since been deleted (!)

– 30% for one of the rumours, which makes results replication hard

• The seven rumours from the London riots tweets

Qazvinian et al 2011 & our ongoing work

• Used POS tags, n-grams, URL features, retweets, etc.

• We added new features: document-intrinsic (stylistic, personality, sentiment, entities)

• Experimenting on the London riots data too
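To give a feel for the surface features listed above, here is a minimal extractor for n-gram, URL and retweet features. It is an assumed illustration, not the feature set of Qazvinian et al or of our own work; the feature names are made up, and POS tags are omitted because they need a tagger.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")


def tweet_features(text):
    """Map one tweet to a sparse feature dictionary (feature name -> value)."""
    features = Counter()

    # URL and retweet features.
    features["has_url"] = int(bool(URL_RE.search(text)))
    features["is_retweet"] = int(text.startswith("RT @"))
    features["num_hashtags"] = len(re.findall(r"#\w+", text))

    # N-gram features over lower-cased tokens, with URLs stripped first.
    tokens = re.findall(r"[a-z0-9#@']+", URL_RE.sub(" ", text).lower())
    for tok in tokens:
        features["uni=" + tok] += 1
    for a, b in zip(tokens, tokens[1:]):
        features["bi=" + a + "_" + b] += 1

    return features
```

Dictionaries of this kind can be passed to any classifier that accepts sparse features in order to label tweets as confirm, deny or question.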

QUESTIONS?