Characterizing the Life Cycle of Online News Stories Using Social Media Reactions
Carlos Castillo, Mohammed El-Haddad, Matt Stempeck, Jürgen Pfeffer
Twitter: @ChaToX
Carlos Castillo – @chatox
http://www.chato.cl/research/
Outline
• Determining classes of news articles
• Predicting traffic using social media
Usage analysis in online news
• Aikat (1998)
– Short dwell times, weekday+, weekend-, bursty traffic.
• Crane and Sornette (2008), Yang and Leskovec (2011), Lehmann et al. (2012)
– Behavioral classes of attention online
Analysis of social media responses
• SocialFlow whitepaper (Lotan, Gaffney, and Meyer 2011)
– Al Jazeera, BBC News, CNN, The Economist, Fox News and The New York Times
• Hu et al. (2011)
– Tweets during speech of US president
Predictive Web Analytics (references)
Data collection
• Three weeks in October 2012
• “Beacon” embedded in Al Jazeera pages
– Real-time data processing
– Apache S4 application for online processing
– Cassandra (NoSQL database) for storage
≈ 3M visits
≈ 200K social media reactions
Summary of dataset
News
Examples:
• US state of Maryland abolishes death penalty (May 2nd, 2013)
• Hundreds arrested in China over 'fake' meat (May 3rd, 2013)
In-Depth
Examples:
• Spirits of Japan shrine haunt Asian relations (May 2nd, 2013)
• Interactive: Powering the Gulf (May 2nd, 2013)
News (322) In-Depth (139)
Tag clouds extracted from titles of articles
Typical visitation profiles (12 hours)
Decreasing (78%)
Steady (9%)
Increasing (3%)
Rebounding (10%)
Examples
Decreasing (78%):
● Almost all breaking news
● Sometimes delayed due to timezone differences, e.g. Hurricane Sandy
Steady or Increasing (12%):
● Ongoing news: Obama/Romney, Worker strikes in SA, Syrian unrest
● Articles updated with supporting content
Rebounding (10%):
● Articles picked up by external sources or social media (typically single source of traffic)
● Background articles to new developments
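
The four profiles above can be told apart with a simple rule of thumb. The sketch below is only an illustration, not the method used in the talk: the function name `classify_profile`, the split into thirds, and the 20% `tolerance` are all assumptions introduced here.

```python
from statistics import mean
from typing import Sequence


def classify_profile(visits: Sequence[float], tolerance: float = 0.2) -> str:
    """Label a visit series as decreasing, steady, increasing, or rebounding.

    Hypothetical rule (an assumption, not the authors' method): compare the
    mean traffic in the first, middle, and last third of the window.
    """
    n = len(visits)
    if n < 3:
        raise ValueError("need at least three observations")
    k = n // 3
    first = mean(visits[:k])
    middle = mean(visits[k:n - k])
    last = mean(visits[n - k:])

    # Rebounding: traffic drops in the middle of the window, then recovers.
    if middle < first * (1 - tolerance) and last > middle * (1 + tolerance):
        return "rebounding"

    # No traffic at the start: any later traffic counts as an increase.
    if first == 0:
        return "increasing" if last > 0 else "steady"

    # Otherwise compare the end of the window with its start.
    ratio = last / first
    if ratio < 1 - tolerance:
        return "decreasing"
    if ratio > 1 + tolerance:
        return "increasing"
    return "steady"
```

With twelve hourly counts as input, a breaking-news story whose traffic fades would come out as "decreasing", while an article picked up later by an external source would come out as "rebounding".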
Prediction of visits
• Short-term traffic is to a large extent correlated with long-term traffic
• Social media signals are correlated with traffic and shelf-life
More reactions → more traffic
More discussion → longer shelf-life
• Can we predict traffic over 7 days from the first 30 minutes of data?
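
One common baseline for this kind of question is a least-squares fit in log space between early and final visit counts. The sketch below shows that baseline only; it is not the model from the talk, which also draws on social media signals. The function names `fit_log_linear` and `predict_final` and the +1 smoothing are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(early: Sequence[float], final: Sequence[float]) -> Tuple[float, float]:
    """Fit log(final + 1) = intercept + slope * log(early + 1) by least squares.

    `early` holds visits in the first 30 minutes and `final` holds visits
    after 7 days, one pair per past article (hypothetical inputs).
    """
    if len(early) != len(final) or len(early) < 2:
        raise ValueError("need at least two (early, final) pairs")
    xs = [math.log(e + 1) for e in early]
    ys = [math.log(f + 1) for f in final]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("early counts must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_final(early_visits: float, slope: float, intercept: float) -> float:
    """Predict 7-day visits for a new article from its 30-minute visits."""
    return math.exp(intercept + slope * math.log(early_visits + 1)) - 1
```

Fitting on past articles and then calling `predict_final` with a new article's 30-minute count gives the estimate; extra predictors such as reaction counts would require a multivariate fit instead.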
Predicting traffic and shelf-life online has a long history
• Predicting long-term behavior and half-life from short-term observations
– Observations = comments, visits, votes, …
– Behavior = total comments, total visits, …
– 10+ papers specifically on web traffic
• Bit.ly (2011, 2012)
– Studies half-life per topic and platform
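
To make the notion of half-life concrete, the sketch below measures the time by which a given fraction of all observed events (visits, clicks, reactions) has arrived. It assumes half-life means the time to reach 50% of the total; the studies cited above may define it differently. The function name `time_to_fraction` and its inputs are hypothetical.

```python
import math
from typing import Sequence


def time_to_fraction(hours_since_publication: Sequence[float], fraction: float = 0.5) -> float:
    """Return the time by which `fraction` of all observed events had occurred.

    Each input value is the age of the article, in hours, at the moment of
    one event. With fraction=0.5 the result is a half-life under the
    assumed definition (time to reach half of the total).
    """
    if not hours_since_publication:
        raise ValueError("need at least one event")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(hours_since_publication)
    # Number of events that must have arrived to reach the requested fraction.
    needed = max(1, math.ceil(fraction * len(ordered)))
    return ordered[needed - 1]
```

Raising `fraction` above 0.5 gives a longer horizon in the same spirit as a shelf-life measure, though the exact threshold to use is a choice left to the analyst.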
Selected variables, traffic prediction
Results (shelf-life prediction)
Larger improvements for In-Depth articles
Still, this is a 12-hour error in predicting a quantity that averages 48-72 hours
http://fast.qcri.org/
What did we learn?
• Three visitation profiles: Decrease, Stay or Increase, Rebound
– Roughly 80:10:10 ratio
• News vs In-Depth: different behavior
• Social media signals are useful to understand and predict visits
Invitation: ECML/PKDD Discovery Challenge 2014
• Open competition on predictive Web Analytics
• Data provided by Chartbeat Inc.
Thank you!
Carlos Castillo · [email protected]
http://www.chato.cl/research/