Making Sense of Millions of Thoughts: Finding Patterns in the Tweets
DESCRIPTION
I gave this presentation at the Workshop on Interactive Language Learning, Visualization, and Interfaces at ACL 2014 in Baltimore, MD on June 27, 2014. http://nlp.stanford.edu/events/illvi2014/index.html
ABSTRACT
Every day on Twitter, millions of thoughts are captured and shared with the world in the form of 140-character messages, or Tweets. There are many things we could learn from these thoughts if we could figure out a way to digest this gigantic dataset. Visualization is one of the many ways to extract information from these Tweets. In this presentation, I will talk about several visualizations based on Tweets, as well as share experiences and challenges from working with Tweet data.
TRANSCRIPT
Making Sense of Millions of Thoughts
Finding patterns in the Tweets
“Knowing comes from learning, finding from seeking.” (50 characters)
“What we call chaos is just patterns we haven't recognized.” (58 characters)
“I am looking for a needle in the haystack.” (42 characters)
“140-character text messages, called Tweets” (42 characters)
Krist Wongsuphasawat
X-Men
Prof. X
Ability: Telepathy (mind reading)
Cerebro
Enhances telepathy
Prof. X
Cerebro
With this power…
What are you thinking?
What are people thinking about x?
Product, Event, Person, etc.
Reality
Cerebro
Internet
Platform
thought
thought
thought
thought
thought
crowdsourcing social networks
Data
Twitter
tweet
tweet
tweet
tweet
tweet
Tweets
• 140 characters
• text + media
• geo
• time
What can we learn from these Tweets?
visual-insights@twitter
@miguelrios @philogb @trebor @kristw
World Cup
Election
Oscars
Pure Curiosity
Grammy
TV Shows
New Year
Breaking news
Earthquake
DATA (Tweets) => Insights, Stories
with limited time
Audience: general public
Tools
• Hadoop
• Apache Pig
• Vertica
• node.js, python
• d3 & co.
Pig
DATA (Tweets) => Filter => Insights, Stories
Having all Tweets
How people think I feel. How I really feel.
Filter data
Want only relevant Tweets
Good news: Have all Tweets
Bad news: Too many Tweets
Filter data (2)
• #hashtags, e.g. #worldcup
  • easy to filter
  • hashtags must be present
  • typo?
• keywords, e.g. goal
  • broader
  • can be ambiguous
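The two filters above can be sketched in a few lines of Python. This is only an illustration, assuming each Tweet is available as a plain string; the function names and sample Tweets are hypothetical and not the pipeline used at Twitter.

```python
import re

HASHTAG_RE = re.compile(r"#(\w+)")

def has_hashtag(text, tag):
    """True if the Tweet contains exactly this hashtag (case-insensitive)."""
    wanted = tag.lstrip("#").lower()
    return any(h.lower() == wanted for h in HASHTAG_RE.findall(text))

def has_keyword(text, keyword):
    """True if the keyword appears as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

# Made-up sample Tweets.
tweets = [
    "What a match! #WorldCup",
    "GOAL!!! Brazil scores",
    "My goal this year is to read more books",  # ambiguous keyword
    "Watching the #WorlCup final",              # typo in the hashtag
]

by_hashtag = [t for t in tweets if has_hashtag(t, "#worldcup")]
by_keyword = [t for t in tweets if has_keyword(t, "goal")]
```

The hashtag filter misses the Tweet with the misspelled tag, and the keyword filter also picks up the Tweet about a personal goal, which matches the trade-offs listed on the slide.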
Filter data (3)
• Combine with other attributes
  • Time
    • during the first half of World Cup final
  • Location
    • Tweets from Brazil
    • Not every Tweet is geotagged.
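Combining a time window with a location filter could look like the sketch below. It assumes each Tweet is a dict with a `created_at` datetime and an optional `coordinates` pair; those field names, the placeholder timestamps, and the rough bounding box are all assumptions made for illustration.

```python
from datetime import datetime, timezone

# Rough bounding box around Brazil (approximate values, illustration only;
# a box like this also covers parts of neighbouring countries).
BRAZIL_BOX = {"south": -33.8, "west": -74.0, "north": 5.3, "east": -34.8}

def in_time_window(tweet, start, end):
    """True if the Tweet was created in the half-open interval [start, end)."""
    return start <= tweet["created_at"] < end

def in_box(tweet, box):
    """True if the Tweet is geotagged and falls inside the bounding box."""
    coords = tweet.get("coordinates")  # None when the Tweet is not geotagged
    if coords is None:
        return False
    lat, lon = coords
    return box["south"] <= lat <= box["north"] and box["west"] <= lon <= box["east"]

def at(hour, minute):
    """Placeholder timestamp helper (not real match times)."""
    return datetime(2014, 1, 1, hour, minute, tzinfo=timezone.utc)

first_half_start, first_half_end = at(12, 0), at(12, 45)

tweets = [
    {"text": "Kickoff!", "created_at": at(12, 5), "coordinates": (-23.55, -46.63)},
    {"text": "Watching from Berlin", "created_at": at(12, 10), "coordinates": (52.52, 13.40)},
    {"text": "No location here", "created_at": at(12, 20), "coordinates": None},
    {"text": "Second half begins", "created_at": at(13, 5), "coordinates": (-22.91, -43.17)},
]

selected = [
    t for t in tweets
    if in_time_window(t, first_half_start, first_half_end) and in_box(t, BRAZIL_BOX)
]
```

Tweets without a geotag are silently dropped by the location filter, which is why the last bullet matters: a location filter only sees the geotagged subset.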
Filter data (4)
• Languages
• Sometimes use only English Tweets
• Future
• Translation?
DATA (Tweets) => Filter => Clean => Insights, Stories
Clean data
• Typo (Mobile input)
• Abbreviation (due to 140-character limit)
• Exaggeration (e.g. GOOOOALLLL)
• Twitter-specific conventions, e.g. old-style retweet “RT …”
• Inappropriate content
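Two of the cleaning steps above lend themselves to a small sketch: squeezing exaggerated spellings and detecting old-style retweets. The helper names and the regular expressions are assumptions chosen for illustration, not the rules used in the talk.

```python
import re

def squeeze_elongation(text, max_repeat=2):
    """Collapse any run of the same character to at most max_repeat copies."""
    pattern = r"(.)\1{%d,}" % max_repeat
    return re.sub(pattern, lambda m: m.group(1) * max_repeat, text)

# Old-style retweets start with "RT @user" followed by the quoted text.
OLD_RT_RE = re.compile(r"^\s*RT\s+@(\w+):?\s*(.*)$", re.DOTALL)

def split_old_style_retweet(text):
    """Return (original_author, original_text), or (None, text) if not a retweet."""
    match = OLD_RT_RE.match(text)
    if match is None:
        return None, text
    return match.group(1), match.group(2)

squeezed = squeeze_elongation("GOOOOALLLL")
author, body = split_old_style_retweet("RT @kristw: GOOOOALLLL")
```

Squeezing to two repeats turns "GOOOOALLLL" into "GOOALL", which reduces the number of variants but does not recover "GOAL"; squeezing to one repeat would, at the cost of damaging legitimately doubled letters in other words.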
DATA (Tweets) => Filter => Clean => Visualize => Insights, Stories
DATA: TEXT (What?), GEO (Where?), TIME (When?) (+ media: photos, videos)
Visualize Data: TEXT (What?), GEO (Where?), TIME (When?)
TIME Tweets/second
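A Tweets/second series can be derived from raw timestamps as in the sketch below. It assumes the timestamps are already filtered to the event of interest; the function name and sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

def tweets_per_second(timestamps):
    """Return [(second, count), ...] covering every second from first to last Tweet."""
    counts = Counter(ts.replace(microsecond=0) for ts in timestamps)
    if not counts:
        return []
    current, last = min(counts), max(counts)
    series = []
    while current <= last:
        series.append((current, counts.get(current, 0)))  # keep empty seconds as 0
        current += timedelta(seconds=1)
    return series

base = datetime(2014, 1, 1, 12, 0, 0)  # placeholder start time
timestamps = [
    base + timedelta(milliseconds=100),
    base + timedelta(milliseconds=500),
    base + timedelta(milliseconds=2200),
]
series = tweets_per_second(timestamps)
```

Filling in seconds with zero Tweets keeps the time axis evenly spaced, so a line chart of the series does not hide quiet periods.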
TIME Tweets/second + Annotation
http://www.flickr.com/photos/twitteroffice/5681263084/
TIME Tweets/second + Annotation
Manual
To automate: Top Tweets (most Retweets, Favs)
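One way to pick annotation candidates automatically is to rank the Tweets inside a spike by engagement. The sketch below assumes each Tweet carries `retweets` and `favorites` counts and scores by their sum; the field names and the scoring rule are assumptions, not the method used in the talk.

```python
from datetime import datetime

def top_tweets(tweets, start, end, k=1):
    """Return the k Tweets created in [start, end) with the highest engagement."""
    in_window = [t for t in tweets if start <= t["created_at"] < end]
    ranked = sorted(in_window, key=lambda t: t["retweets"] + t["favorites"], reverse=True)
    return ranked[:k]

# Made-up sample data with placeholder timestamps.
tweets = [
    {"text": "What a goal!", "created_at": datetime(2014, 1, 1, 12, 0, 10),
     "retweets": 120, "favorites": 80},
    {"text": "Anyone watching?", "created_at": datetime(2014, 1, 1, 12, 0, 20),
     "retweets": 2, "favorites": 1},
    {"text": "Final whistle", "created_at": datetime(2014, 1, 1, 12, 30, 0),
     "retweets": 500, "favorites": 300},
]

spike_start = datetime(2014, 1, 1, 12, 0, 0)
spike_end = datetime(2014, 1, 1, 12, 1, 0)
annotation = top_tweets(tweets, spike_start, spike_end)
```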
GEO Heatmap
Low density
High density
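A density heatmap boils down to counting Tweets per grid cell and mapping the counts to a color scale. The sketch below uses a plain latitude/longitude grid; the cell size, function names, and sample points are assumptions for illustration.

```python
import math
from collections import Counter

def density_grid(points, cell_size_deg=0.01):
    """Count (lat, lon) points per grid cell of cell_size_deg degrees."""
    counts = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))
        counts[cell] += 1
    return counts

def normalize(counts):
    """Scale counts to (0, 1] so they can drive a low-to-high density color scale."""
    peak = max(counts.values())
    return {cell: n / peak for cell, n in counts.items()}

# Made-up geotagged points: two in New York City, one in San Francisco.
points = [(40.7128, -74.0060), (40.7130, -74.0055), (37.7749, -122.4194)]
grid = density_grid(points)
density = normalize(grid)
```

With real data the counts are heavily skewed, so a logarithmic scale is often a better fit than the linear normalization shown here.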
GEO New York City
flickr.com/photos/twitteroffice/8798020541
GEO San Francisco
flickr.com/photos/twitteroffice/8798020541
GEO San Francisco
Rebuild the world based on tweet volumes
twitter.github.io/interactive/andes/
TIME + GEO
blog.twitter.com/2011/global-pulse
youtu.be/SybWjN9pKQk
Japan Earthquake 2011
TIME + GEO Tweet pattern [Rios & Lin 2012]
Night
Late night
Daytime
TEXT Trends
TEXT WordTree [Wattenberg & Viégas 2008]
www.jasondavies.com/wordtree
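The core data structure behind a word tree is a prefix tree of the words that follow a chosen root word. The sketch below is a simplified version that only tracks branching, not the occurrence counts that a full WordTree uses to size its text; the function name and sample sentences are made up.

```python
def build_word_tree(sentences, root):
    """Nested dict of the word sequences that follow each occurrence of root."""
    tree = {}
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            if word == root:
                node = tree
                for following in words[i + 1:]:
                    node = node.setdefault(following, {})
    return tree

sentences = [
    "I love this game",
    "I love this team",
    "we love football",
]
tree = build_word_tree(sentences, "love")
```

Here `tree` branches into "this" (followed by "game" or "team") and "football", which is the structure the visualization lays out from left to right.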
TEXT
• Now
  • Derived information: Sentiment, Topic
  • Combine with other information (geo & time) + context
• Future
  • Better techniques + more NLP involved, e.g. key phrases, etc.
TEXT Descriptive Keyphrases [Chuang et al. 2012]
TEXT
• Challenge
  • Scale
GEO + TEXT Real-time Tweet map
most frequent term
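Finding the most frequent term per map region can be sketched as a grid of word counters. This is a batch approximation of the idea, not the real-time system; the stopword list, cell size, and sample Tweets are assumptions, and the regex tokenizer only works for languages that separate words with spaces.

```python
import math
import re
from collections import Counter, defaultdict

STOPWORDS = {"a", "the", "is", "in", "for", "to", "of", "and"}  # tiny illustrative list

def top_term_per_cell(tweets, cell_size_deg=1.0):
    """Map each grid cell to its most frequent non-stopword term."""
    terms = defaultdict(Counter)
    for lat, lon, text in tweets:
        cell = (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))
        words = re.findall(r"[a-z']+", text.lower())
        terms[cell].update(w for w in words if w not in STOPWORDS)
    return {cell: c.most_common(1)[0][0] for cell, c in terms.items() if c}

# Made-up geotagged Tweets as (lat, lon, text).
tweets = [
    (37.77, -122.42, "Gmail is down"),
    (37.80, -122.27, "is gmail down for everyone?"),
    (37.33, -122.03, "gmail outage again"),
    (51.50, -0.12, "RIP Nelson Mandela"),
    (51.52, -0.10, "mandela rip"),
    (51.48, -0.15, "Mandela"),
]
top_terms = top_term_per_cell(tweets)
```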
GEO + TEXT Real-time Tweet map
Gmail went down Jan 24, 2014
GEO + TEXT Real-time Tweet map
Nelson Mandela passed away Dec 5, 2013
GEO + TEXT Real-time Tweet map
• Next:
  • Involves more NLP
    • Tokenization: languages without spaces between words
    • etc.
• Challenge:
  • Real-time
TIME + TEXT
http://www.babynamewizard.com/voyager
Baby Name Voyager
TIME + TEXT
UEFA Champions League
Biggest Tournament for European soccer clubs
Many Tweets during the matches
TIME + TEXT UEFA Champions League
Dortmund Bayern Munich
Count Tweets mentioning the teams every minute
Team 1 Team 2
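Counting Tweets that mention each team per minute can be sketched as below. The keyword lists, the substring matching, and the sample Tweets are assumptions for illustration; a Tweet that mentions both teams is counted once for each.

```python
from collections import Counter
from datetime import datetime

# Hypothetical keyword lists; real ones would need nicknames, hashtags, etc.
TEAM_KEYWORDS = {
    "Dortmund": ["dortmund", "bvb"],
    "Bayern Munich": ["bayern"],
}

def mentions_per_minute(tweets, team_keywords):
    """Count Tweets mentioning each team, keyed by (minute, team)."""
    counts = Counter()
    for created_at, text in tweets:
        minute = created_at.replace(second=0, microsecond=0)
        lowered = text.lower()
        for team, keywords in team_keywords.items():
            if any(keyword in lowered for keyword in keywords):
                counts[(minute, team)] += 1
    return counts

# Made-up Tweets with placeholder timestamps.
tweets = [
    (datetime(2014, 1, 1, 12, 0, 5), "Come on Dortmund!"),
    (datetime(2014, 1, 1, 12, 0, 40), "Bayern look strong tonight"),
    (datetime(2014, 1, 1, 12, 0, 55), "#BVB vs Bayern, here we go"),
    (datetime(2014, 1, 1, 12, 1, 10), "GOAL for Bayern!"),
]
counts = mentions_per_minute(tweets, TEAM_KEYWORDS)
```

Each team's per-minute counts form one of the two streams, with Team 1 plotted on one side and Team 2 on the other.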
TIME + TEXT UEFA Champions League
+ “goal” count + context
TIME + TEXT UEFA Champions League
+ “offside”
TIME + TEXT UEFA Champions League
+ players
Competition Tree
A vs B => A
C vs D => C
A vs C => C
TIME + TEXT UEFA Champions League
• Challenges
  • Filter relevant Tweets
  • Multiple matches at the same time
  • Ambiguous words: “goal”, “red”, “yellow”
  • Tweets mentioning both teams, e.g. “#GER 2-2 #GHA”
TIME + GEO + TEXT State of the Union
1) Timeline + topics from Tweets
2) Context (speech)
3) Volume of Tweets by topic during the selected part of the SOTU
4) Density map of Tweets about the selected topic
twitter.github.io/interactive/sotu2014
TIME + GEO + TEXT New Year 2014
twitter.github.io/interactive/newyear2014/
Recap
What can we learn from these Tweets?
many, many things.
better
the examples in this talk
imagine…
DATA (Tweets) => Insights, Stories
DATA (Tweets) => Filter => Clean => Visualize => Insights, Stories
DATA (Tweets) => Filter => Clean => Process & Visualize => Insights, Stories
NLP
Visualize data: TEXT (What?), GEO (Where?), TIME (When?)
DATA (Tweets) => Filter => Clean => Process & Visualize => Insights, Stories
Research
Working together
• Raw data
• Computer (one machine, cloud, MapReduce, etc.)
• Processed information / Ignored information
• Aggregated information
• Human
• NLP: Make computers think more like humans.
• VIS: Help people consume information.
• HCI: User interactions or feedback. Bridge the gap. Connect human & computer.
Advanced techniques vs. Scalability
LifeFlow (Research) => Flying Sessions (System at Twitter)
Summary
• Thoughts are captured in Tweets: what, where, when
• Finding patterns from text + geo + time
• Opportunities for NLP + HCI + VIS collaboration
• Better techniques vs. scalability + real-time
@kristw / interactive.twitter.com
Questions?
Thank you