tools for (almost) real-time social media analysis
Post on 17-Jul-2015
217 Views
Preview:
TRANSCRIPT
University of Sheffield, NLP
Tools for (Almost) Real-Time Social Media Analysis
Dr. Diana Maynard
Dept of Computer ScienceUniversity of Sheffield, UK
19 March 2015, Vienna
University of Sheffield, NLP
We are all connected to each other...
● Information, thoughts and opinions are shared prolifically on the social web these days
● 72% of online adults use social networking sites
University of Sheffield, NLP
Your grandmother is three times as likely to use a social networking site now as in 2009
University of Sheffield, NLP
There are hundreds of tools for social media analytics
● Most of them are commercial and not freely available
● The research tools tend to focus on specific topics and scenarios, and aren't easily adaptable
● The analysis they do often doesn't go much beyond number crunching, e.g.
– look at number of tweets, retweets, favourites
– filter by hashtag or keyword for topic categorisation
– use off-the-shelf sentiment tools
– use counts of word length, POS categories etc
– very little semantics, don't deal with variation, ambiguity, slang, sarcasm etc.
University of Sheffield, NLP
Analysing Social Media is harder than it sounds
There are lots of things to think about!
University of Sheffield, NLP
Analysing language in social media is hard
● Grundman:politics makes #climatechange scientific issue,people don’t like knowitall rational voice tellin em wat 2do
● @adambation Try reading this article , it looks like it would be really helpful and not obvious at all. http://t.co/mo3vODoX
● Want to solve the problem of #ClimateChange? Just #vote for a #politician! Poof! Problem gone! #sarcasm #TVP #99%
● Human Caused #ClimateChange is a Monumental Scam! http://www.youtube.com/watch?v=LiX792kNQeE … F**k yes!! Lying to us like MOFO's Tax The Air We Breath! F**k Them!
University of Sheffield, NLP
It is difficult to access unstructured information efficiently
Information extraction tools can help you:
● Save time and money on management of text and data from multiple sources
● Find hidden links scattered across huge volumes of diverse information
● Integrate structured data from variety of sources
● Interlink text and data
● Collect information and extract new facts
University of Sheffield, NLP
What is Entity Recognition?● Entity Recognition is about recogising and classifying key Named
Entities and terms in the text
● A Named Entity is a Person, Location, Organisation, Date etc.
● A term is a key concept or phrase that is representative of the text
● Entities and terms may be described in different ways but refer to the same thing. We call this co-reference.
Mitt Romney, the favorite to win the Republican nomination for president in 2012
DatePerson Term
The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior.
Organisation
co-reference
Location
University of Sheffield, NLP
What is Event Recognition?
● An event is an action or situation relevant to the domain expressed by some relation between entities or terms.
● It is always grounded in time, e.g. the performance of a band, an election, the death of a person
Mitt Romney, the favorite to win the Republican nomination for president in 2012
Event DatePerson
Relation Relation
University of Sheffield, NLP
Why are Entities and Events Useful?
● They can help answer the “Big 5” journalism questions (who, what, when, where, why)
● They can be used to categorise the texts in different ways
– look at all texts about Obama.
● They can be used as targets for opinion mining
– find out what people think about President Obama
● When linked to an ontology and/or combined with other information, they can be used for reasoning about things not explicit in the text
– seeing how opinions about different American presidents have changed over the years
University of Sheffield, NLP
Approaches to Information Extraction
Knowledge Engineering rule based developed by
experienced language engineers
make use of human intuition
easier to understand results
development could be very time consuming
some changes may be hard to accommodate
Learning Systems use statistics or other
machine learning developers do not need
LE expertise requires large amounts of
annotated training data some changes may
require re-annotation of the entire training corpus
University of Sheffield, NLP
Seems like we need a tool to do this clever stuff for us.
How about GATE?
University of Sheffield, NLP
What is GATE?
GATE is an NLP toolkit developed at the University of Sheffield over the last 20 years.
It's open source and freely available. http://gate.ac.uk
• components for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages...
• tools for visualising and manipulating text, annotations, ontologies, parse trees, etc.
• various information extraction tools
• evaluation and benchmarking tools
University of Sheffield, NLP
GATE components● Language Resources (LRs), e.g. lexicons, corpora,
ontologies
● Processing Resources (PRs), e.g. parsers, generators, taggers
● Visual Resources (VRs), i.e. visualisation and editing components
● Algorithms are separated from the data, which means:
– the two can be developed independently by users with different expertise.
– alternative resources of one type can be used without affecting the other, e.g. a different visual resource can be used with the same language resource
University of Sheffield, NLP
ANNIE
• ANNIE is GATE's rule-based IE system
• It uses the language engineering approach (though we also have tools in GATE for ML)
• Distributed as part of GATE
• Uses a finite-state pattern-action rule language, JAPE
• ANNIE contains a reusable and easily extendable set of components:– generic preprocessing components for tokenisation,
sentence splitting etc
– components for performing NE on general open domain text
University of Sheffield, NLP
Named Entity Grammars • Hand-coded rules written in JAPE applied to annotations to
identify NEs
• Phases run sequentially and constitute a cascade of FSTs over annotations
• Annotations from format analysis, tokeniser. splitter, POS tagger, morphological analysis, gazetteer etc.
• Because phases are sequential, annotations can be built up over a period of phases, as new information is gleaned
• Standard named entities: persons, locations, organisations, dates, addresses, money
• Basic NE grammars can be adapted for new applications, domains and languages
University of Sheffield, NLP
Right, so we have a technology, and we have a tool to apply the technology.
Now how do we do it?
University of Sheffield, NLP
Framework
● Data collection (via Twitter streaming API)
● Documents stored as JSON and processed (annotated) via GCP
● Documents indexed via MIMIR
● Search and visualisation via MIMIR/Prospector
University of Sheffield, NLP
Live streaming (coming soon)
● If we want to process the tweets in real time, we can use the Twitter streaming client to feed the incoming tweets to a message queue.
● A separate process then reads messages from the queue, annotates them and pushes them into Mimir.
● If the rate of incoming tweets exceeds the capacity of the processing side, we can simply launch more instances of the message consumer across different machines to scale the capacity.
● Query and visualisation can then be performed as before on whatever data we currently have available
University of Sheffield, NLP
Let's look at some examples
(For anyone who grew up in the UK): “Here's one I prepared earlier”
University of Sheffield, NLP
DecarboNet project: what do people think about climate change?
And how much do we really know about it?
How do we know what's really true?
“It's cold in my flat“
University of Sheffield, NLP
Political Futures Tracker Application
● Example of using the technology on a real scenario - analysing political tweets in the run-up to the UK elections
● Project funded by Nesta http://www.nesta.org.uk/
● Series of blog posts about the project, leading up to the election, see e.g.
http://www.nesta.org.uk/blog/silver-surfers-and-westminster-twitterati
University of Sheffield, NLP
Twitter collection
● collected Tweets using Twitter's “statuses/filter” streaming API
● can follow up to 5000 user IDs and receive in real time
● collected all tweets and retweets posted by these users
● also retweets of, and replies to, any tweet posted by these users
University of Sheffield, NLP
Twitter collection (2)
● Initial list of 506 UK MPs' Twitter accounts, extracted from a CSV file made available by BBC News Labs and cleaned
● Also added list of UK election candidates collected and made available at https://yournextmp.com, and updated periodically
– 1,504 on 13th January 2015
– 1,811 on 2nd February 2015
● Added 21 official party accounts
● Total number of accounts followed at 16th Feb: 1,894
– 444 MPs standing again are included in both the MP and candidate lists
University of Sheffield, NLP
Tweets per hour collected
Government U-turn on fracking
Douglas Carswell's accidental “Hello Kitty” tweet
University of Sheffield, NLP
Longer web documents
● Also crawled websites of UK political parties (Con, Lab, LD, Green, UKIP, BNP, SNP, PC, plus the NI parties and various smaller parties)
● Initial crawl on 28th-29th October retrieved 375MB (compressed)
● Re-crawled regularly to pick up new pages
University of Sheffield, NLP
Politician and candidate annotation
● Acquired and corrected a list of UK MPs and election candidates and their affiliations, twitter accounts and DBpedia URIs
● Converted to gazetteers so that MPs in various forms (name or twitter handle) can be recognised in tweets and annotated with the relevant info (URI, full name, constituency etc.)
University of Sheffield, NLP
Topic Recognition● A set of themes was taken from the categories used on http://www.gov.uk
● For each theme, a gazetteer list was developed containing typical keywords and phrases representative of that theme
● e.g. “asylum seeker” indicates the topic “borders and immigration”
● Each list was expanded via:
– automatic term recognition (based on tf.idf) over a corpus of manifestos and other political documents
– manual additions
● Each list also contains potential head terms and modifiers which can be expanded into longer terms on the fly from the text during analysis stage
● e.g. “terrorist” can modify many other words (terrorist attack, terrorst threat, ...)
University of Sheffield, NLP
Topic recognition
This term is found by first recognising the head word “job” from a list under the theme “employment” and matching against its root form in the text, i.e. “job”.
It is then extended to include the adjectival modifier “British”, which is not present in a list anywhere.
University of Sheffield, NLP
Sentiment annotation
● Annotations are created over the whole sentence and contain the following features:
– sentiment_kind: optimism / pessimism
– holder: the person holding the opinion (MP's name)
– holder_URI: the URI fo the holder
– target: the target of the opinion, e.g. MP or topic
– target_URI: if appropriate, the URI of the target
– score: a positive/negative value reflecting the strength of opinion
– sarcasm: yes / no (whether sarcasm is present)
– sentiment_string: the main word(s) that contain sentiment
● These annotations and features will be used as input to MIMIR to facilitate analysis/aggregation
University of Sheffield, NLP
GATE Mimir
● can be used to index and search over text, annotations, semantic metadata (concepts and instances)
● allows queries that arbitrarily mix full-text, structural, linguistic and semantic annotations
● is open source
University of Sheffield, NLP
Show me:
● all documents mentioning a temperature between 30 and 90 degrees F (expressed in any unit)
● all abstracts written in French on Patent Documents from the last 6 months which mention any form of the word “transistor” in the English abstract
● the names of the patent inventors of those abstracts
● all documents mentioning steel industries in the UK, along with their location
University of Sheffield, NLP
Document Indexing with MIMIR
● MIMIR allows for indexing and querying text, annotations and semantic knowledge
– this gives a rich source of data for analysis
● Currently we have used MIMIR to index
– the raw collected text
– annotations provided by Twitter and by the applications
University of Sheffield, NLP
Examples of Mimir queries on our corpus
● Get all documents which talk about the borders/immigration topic
{Topic theme = "borders_and_immigration"}
● Get all documents where the author of the document is a candidate
{DocumentAuthor sparql = "?c nesta:candidate ?author_uri"}
● Get all documents where the author is an MP standing for re-election for the same seat
{DocumentAuthor sparql="?c nesta:candidate ?author_uri . ?c dbp-prop:mp ?author_uri"}
● Get all documents where the author is a candidate for the Sheffield Hallam constituency
{DocumentAuthor sparql="<http://dbpedia.org/resource/Sheffield_Hallam_(UK_Parliament_constituency)> nesta:candidate ?author_uri"}
University of Sheffield, NLP
What do the different parties talk about? Conservative vs Labour
Transport, Europe and employment are mentioned more by Conservatives
NHS is mentioned more by Labour
University of Sheffield, NLP
Topics mentioned by the Green Party
Expected topics are high on the list
University of Sheffield, NLP
Measuring engagement about climate change
● We also used the tools to measure how engaged both MPs and the public are about the topic of climate change
● Comparison of climate change with other political topics
● Theory is that people are quite apathetic about most political topics in general
● But people are more enthusiastic about climate change, because it's something they can actively do something about
● Results showed that climate change is not frequently tweeted about by most politicians apart from the Green Party, but is in top 3 topics for most of the engagement criteria we applied (retweets, replies, sentiment, optimism, @mentions, URLs)
● Climate change tweets contained the highest number of URLs - direct engagement with additional information
University of Sheffield, NLP
Summary
● Once you have the indexed data, you can carry on doing all kinds of interesting comparisons and analysis.
● Simple analysis tools can give you pretty pictures, but you can do much more interesting things when you delve a bit deeper and make use of information not explicit in the text
● For this you need both NLP and Linked Open Data
● Our tools are all freely available and open source
top related