social media news mining and automatic content analysis of news

Social Media News Mining Carlos Castillo Gilad Lotan @ChaToX @gilgul

Upload: carlos-castillo-chato

Post on 16-Apr-2017


Social Media News Mining & Automatic Content Analysis of News

Carlos Castillo – Qatar Computing Research Institute

Nov 14th, 2013

Carlos Castillo – [email protected] – http://www.chato.cl/research/

Outline

• Social media around news
1. Predictive analytics using social media
2. Crowds and curators

• Automatic content analysis of news
3. TV news via closed captions
4. Online news in international media


Communication scholars vs. computer scientists

• Media and communication scholars
– Start from high-level questions

• Computer scientists
– Start from low-level observations

• We need to find a middle ground
– To a large extent, we are still not there
– I am certainly still not there


Collaborators

• Gianmarco de Francisci Morales – Yahoo!
• Mohammed El-Haddad – Al Jazeera
• Sandra González-Bailón – University of Pennsylvania
• Nasir Khan – Al Jazeera
• Mounia Lalmas – Yahoo!
• Janette Lehmann – Pompeu Fabra University & Yahoo!
• Marcelo Mendoza – Yahoo!
• Jürgen Pfeffer – CMU
• Matt Stempeck – MIT Civic Media
• Diego Sáez-Trumper – Pompeu Fabra University
• Ethan Zuckerman – MIT Civic Media

Predictive analytics using social media

Carlos Castillo, Mohammed El-Haddad, Jürgen Pfeffer and Matt Stempeck:
Characterizing the Life Cycle of Online News Stories Using Social Media Reactions
To appear in Proc. of Computer Supported Cooperative Work and Social Computing (CSCW).
Baltimore, MD, USA. February 2014.

See also: demo at http://fast.qcri.org/

Topic 1 of 4

Pirates abduct ship’s crew off Nigerian coast
October 17th, 2012


Usage analysis (in news) online

• Aikat (1998)
– Bursts, short dwell times, weekday != weekend

• Crane and Sornette (2008), Yang and Leskovec (2011), Lehmann et al. (2012)
– Behavioral classes of attention online

• Lotan, Gaffney, and Meyer (SocialFlow, 2011)
– Al Jazeera, BBC, CNN, The Economist, Fox News, NY Times

• … and many others!

News In-Depth

News examples In-Depth examples

● Dozens killed in India bus-crash blaze (Oct 30th, 2013)

● Kenyan army admits soldiers looted mall (Oct 30th, 2013)

● Sex selective abortions worry Azerbaijanis (Oct 29th, 2013)

● Time to put an end to Israel's don't ask-don't tell nuclear policy (Oct 18th, 2013)

News: intense first hour
In-Depth: longer shelf-life


Average visitation/sharing profiles

News In-Depth


Types of news visitation profiles (12 h)

Decreasing (78%)

Steady (9%)

Increasing (3%)

Rebounding (10%)


Prediction of visits

• Short-term traffic is to a large extent correlated with long-term traffic

• Social media signals are correlated with traffic and shelf-life

More reactions → more traffic
More discussion → longer shelf-life

• Can we predict traffic over 7 days from the first 30 minutes of data?
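The prediction question above can be illustrated with a small sketch. It is not the model from the paper: it assumes a plain least-squares fit on log-transformed counts, and the feature names (`visits_30min`, `reactions_30min`) and the helper functions are hypothetical. It only shows how early visits and early social media reactions could be combined into one forecast of 7-day visits.

```python
import math

def features(visits_30min, reactions_30min):
    """Feature vector: intercept plus log-scaled early visits and reactions (assumed features)."""
    return [1.0, math.log1p(visits_30min), math.log1p(reactions_30min)]

def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations, solved by Gauss-Jordan elimination."""
    n = len(rows[0])
    # Build X^T X and X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(n)]
    aug = [xtx[i] + [xty[i]] for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]

def train(history):
    """history: list of (visits_30min, reactions_30min, visits_7days) tuples for past articles."""
    rows = [features(v, r) for v, r, _ in history]
    targets = [math.log1p(total) for _, _, total in history]
    return fit_linear(rows, targets)

def predict(weights, visits_30min, reactions_30min):
    """Predicted 7-day visits for a new article, mapped back from log space."""
    x = features(visits_30min, reactions_30min)
    return math.expm1(sum(w * v for w, v in zip(weights, x)))
```

In use, `train` would be re-run on recent articles (the slides mention re-training every 24 hours) and `predict` called again whenever new counts arrive for an article.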

Results (traffic predictions)

Improved predictions using social media variables

http://fast.qcri.org/


Predictions are updated as new information arrives. Predictive models are re-trained every 24 hours. Traffic to many (but not all) articles is easy to predict.

Don't remove over-achievers, promote under-achievers.


Take-home messages

• Decrease, stay or increase, rebound
– Roughly 80:10:10 ratio in first 12 hours

• News vs In-Depth: different behavior
– News pieces die out rapidly on the web
– In-Depth pieces live longer

• Visit forecasting can help make more informed editorial decisions

News crowds and news curators in social media

Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman:
Transient News Crowds in Social Media
In Proc. of International Conference on Weblogs and Social Media.
Cambridge, MA, USA, July 2013. See also: blog post.

Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman:
Finding News Curators in Twitter
Social News on the Web (SNOW) workshop.
Rio de Janeiro, Brazil, May 2013. See also: blog post.

Topic 2 of 4

Social media users that are highly engaged with news


Transient News Crowds


Empirical results

• Experiment with articles in BBC and AJE
• People who tweeted an article within 6 hours of publication → news crowd
– Follow the crowd for one week
– Divide time in 12-hour slices

• Most crowds disperse rapidly
– They tweeted once about the same thing
– Now they tweet about different things

• Some crowds re-group later
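The crowd definition above can be sketched in a few lines. This is an illustration rather than the authors' code: the tweet record layout (`user`, `url`, `time`) is hypothetical, and the sketch assumes that the one-week follow-up period starts at the article's publication time.

```python
from collections import defaultdict
from datetime import timedelta

CROWD_WINDOW = timedelta(hours=6)    # tweet within 6 hours of publication
SLICE_LENGTH = timedelta(hours=12)   # 12-hour time slices
FOLLOW_PERIOD = timedelta(days=7)    # follow the crowd for one week

def news_crowd(tweets, article_url, published):
    """Users who tweeted the article within 6 hours of its publication."""
    return {
        t["user"]
        for t in tweets
        if t["url"] == article_url
        and published <= t["time"] <= published + CROWD_WINDOW
    }

def crowd_slices(tweets, crowd, published):
    """Group all tweets by crowd members into 12-hour slices over one week."""
    slices = defaultdict(list)
    for t in tweets:
        offset = t["time"] - published
        if t["user"] in crowd and timedelta(0) <= offset < FOLLOW_PERIOD:
            # timedelta // timedelta gives the integer slice index (0 to 13).
            slices[offset // SLICE_LENGTH].append(t)
    return dict(slices)
```

Comparing what the crowd tweets about in each slice against the original article is then one way to see whether it disperses or re-groups.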

Syria allows UN to step up food aid

French troops launch ground combat in Mali

13 Jan 2013

13 Jan 2013


How do we find the related ones?

• Machine-learning approach
• Important attributes
– Text similarity to original story
– Exclusivity of history to this crowd

• Finds 14% to 72% of related stories automatically (@ 2/3 precision)


Application to tracking a story


Focus on articles → focus on users

Twitter user       Followers   Tweets about ...
@RevolutionSyria      88,122   Syria
@KenanFreeSyria       13,388   Syria
@UP_food                 703   Food
@KeriJSmith            8,838   Breaking news/top stories
@BreakingNews      5,662,866   Breaking news/top stories

Example: which users with a large number of followers tweeted “Syria allows UN to step up food aid” (16 Jan 2013)


News curators

• Think Andy Carvin @acarvin, who was a “distant witness” of the Arab Spring


Do we have curators in Twitter?

Human vs. automatic, topic-unfocused vs. topic-focused:

• Topic-unfocused, human: topic-unfocused curator
– Disseminating news articles about diverse topics, usually breaking news/top stories
– Example: @KeriJSmith

• Topic-unfocused, automatic: news aggregators
– Collecting news articles (e.g. from RSS feeds) and automatically posting their headlines and URLs
– Example: @BreakingNews

• Topic-focused, human: topic-focused curator
– Collecting interesting information with a specific focus, usually a geographic region or a topic
– Example: @KenanFreeSyria

• Topic-focused, automatic: topic-focused aggregators
– Automatically disseminating news with a topical focus
– Examples: @UP_food, @RevolutionSyria


Which users do we care about?

• Topic-focused, human: topic-focused curator
– Collecting interesting information with a specific focus, usually a geographic region or a topic
– Example: @KenanFreeSyria

• Topic-focused, automatic: topic-focused aggregators
– Automatically disseminating news with a topical focus
– Examples: @UP_food, @RevolutionSyria


Manual annotation (200 users)

[Two pie charts. Legend for both: Focused - Human, Focused - Auto, Unfocused. Values shown: 13%, 8%, 79% (first chart) and 2%, 3%, 95% (second chart).]


Automatically finding curators

• Simple rules
– UserFracURL >= 85%: automatic
– UserSectionsQ >= 90%: unfocused

• Complex model (AUC > 0.90)
– Random forest
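The two simple rules can be written as a tiny classifier. This is a sketch under assumptions: the slides give only the feature names and thresholds, so the reading of `UserFracURL` as the fraction of a user's tweets that contain a URL is a guess, `UserSectionsQ` is left as an opaque score, both are assumed to be in the range 0 to 1, and the function name is hypothetical.

```python
AUTOMATIC_THRESHOLD = 0.85   # UserFracURL >= 85%: automatic
UNFOCUSED_THRESHOLD = 0.90   # UserSectionsQ >= 90%: unfocused

def classify_user(user_frac_url, user_sections_q):
    """Apply the two threshold rules independently.

    Returns a (kind, focus) pair, e.g. ("human", "focused").
    Both inputs are assumed to be fractions between 0 and 1.
    """
    kind = "automatic" if user_frac_url >= AUTOMATIC_THRESHOLD else "human"
    focus = "unfocused" if user_sections_q >= UNFOCUSED_THRESHOLD else "focused"
    return kind, focus
```

Under these rules, a user scoring below both thresholds comes out as ("human", "focused"), the topic-focused curator class the talk singles out.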


Take-home messages

• Twitter users quickly shift topics
– But sometimes return to a topic

• There are excellent news curators in Twitter
– Although many of them are automatic

• Automatic systems can help identify curators and follow-up news

Analysis of TV news via closed captions

Carlos Castillo, Gianmarco De Francisci Morales, Marcelo Mendoza, Nasir Khan:
Says Who? Automatic Text-based Content Analysis of Television News
Workshop on Mining Unstructured Data Using NLP (UnstructureNLP).
Burlingame, CA, USA. October 2013.

Topic 3 of 4


Acquiring closed captions

• We used data from Yahoo's IntoNow
– 140 TV channels
– 2 MB/channel/day
– Jan-Jun 2012

• Internet Archive: http://archive.org/details/tv


Text pre-processing: input

[1339302660] WHAT MORE CAN YOU ASK FOR?

[1339302662] >> THIS IS WHAT NBA

[1339302663] BASKETBALL IS ABOUT
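Input lines in this shape can be parsed with a short routine. The sketch below is illustrative, not the pipeline used in the study: it assumes the bracketed number is a timestamp in seconds and that ">>" marks a change of speaker, and it merges consecutive lines from the same speaker into one segment.

```python
import re

# "[timestamp] optional '>>' marker, then caption text"
CAPTION_LINE = re.compile(r"^\[(\d+)\]\s*(>>)?\s*(.*)$")

def parse_captions(lines):
    """Return a list of (start_timestamp, text) segments, one per speaker turn."""
    segments = []
    for line in lines:
        match = CAPTION_LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the expected format
        timestamp = int(match.group(1))
        speaker_change = match.group(2) is not None
        text = match.group(3)
        if speaker_change or not segments:
            segments.append([timestamp, text])   # start a new segment
        else:
            segments[-1][1] += " " + text        # continue the current one
    return [(ts, text) for ts, text in segments]

sample = [
    "[1339302660] WHAT MORE CAN YOU ASK FOR?",
    "[1339302662] >> THIS IS WHAT NBA",
    "[1339302663] BASKETBALL IS ABOUT",
]
segments = parse_captions(sample)
# segments holds two turns: the question, then the merged NBA sentence
```

Restoring case, tagging parts of speech, resolving entities and scoring sentiment would follow as separate steps on each segment.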


Text pre-processing: output

What/WP more/JJR can/MD you/PRP ask/VB for/IN ?/.

This/DT is/VBZ what/WDT NBA/NNP [entity: National_Basketball_Association] basketball/NN is/VBZ about/IN ./. [sentiment: 0.0]

Clusters by non-entity words

[Figure: news source clusters labeled General news, Sport news, General + entertainment, Sports, General + sports, Business + sports]

Clusters by linguistic style

[Figure: news source clusters labeled General + business, General + entertainment, Sports]

Sorting by average sentiment

[Figure: news sources sorted by average sentiment, with groups labeled Mixed and Sports]

Sentiment scores on TV captions go from neutral to positive.

Strong positive words are used more than strong negative words?


Automatic TV ↔ online news matching

• Same pre-processing is done over articles on the Yahoo! News website

• Genre classification (general, sports, business, entertainment) by
– Data from TV guide for closed captions
– Section in Yahoo! News for web news

Coverage by prominence

TV networks with more resources can cover more stories.

Some prefer to cover only prominent ones, others want some niche content.

US military to probe “marine abuse video”
January 12th, 2012

Breaking stories vs news matching

Average story duration

Sports stories tend to have a longer life


Newsmakers

• By professional activity
– Sentiments
– Distributions

• In relationship to news providers

• Everybody is a (potential) entertainer

Distributions of mentions per person

By professional activity

Athletes or entertainers?

Politicians or entertainers?


Take-home messages

• Closed captions are a goldmine of data for content analysis

• Automatic content analysis is feasible up to a certain extent
– But we still need to learn to use it

• Reduce subjectivity when trying to answer some research questions

Biases in online news in international news media

Diego Sáez-Trumper, Carlos Castillo and Mounia Lalmas:
Social Media News Communities: Gatekeeping, Coverage, and Statement Bias
In Proc. of Conference on Information and Knowledge Management (short paper).
Burlingame, CA, USA, October 2013.

Topic 4 of 4

Jonathan Stray
The Atlantic, Feb 2013

Wei-Hao Lin
PhD Thesis, CMU 2008


Selection bias

Coverage bias

Statement bias


Goal: discover bias in news media

• 60+ news sources in English
– BBC, CNN, Fox, Time, UPI, Herald Sun, Times of India, Euro News, DW English, etc.

• Follow news through RSS and Twitter
• Collect tweets pointing to news
• No a-priori information on conflicts or divisions → unsupervised methods


Method

• “Community” of a news source
– Users who tweeted at least 3 articles from that source in the last 3 days

• Collect all articles posted by each
– News source
– Community of a news source

• Compute distances and project in 2D

Community overlaps (J>0.03)
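The community definition and the overlap threshold can be sketched as follows. This is an illustration under assumptions: the tweet record layout (`user`, `source`, `article`, `time`) is hypothetical, "at least 3 articles" is read as 3 distinct articles, and J is taken to be the Jaccard coefficient between the user sets of two communities.

```python
from datetime import timedelta

def community(tweets, source, now, min_articles=3, window=timedelta(days=3)):
    """Users who tweeted at least min_articles distinct articles from source within the window."""
    articles_by_user = {}
    for t in tweets:
        if t["source"] == source and now - window <= t["time"] <= now:
            articles_by_user.setdefault(t["user"], set()).add(t["article"])
    return {user for user, arts in articles_by_user.items() if len(arts) >= min_articles}

def jaccard(set_a, set_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B|, defined as 0.0 for two empty sets."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def overlapping_pairs(communities, threshold=0.03):
    """Pairs of sources whose communities overlap with J above the threshold."""
    names = sorted(communities)
    return [
        (a, b, jaccard(communities[a], communities[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if jaccard(communities[a], communities[b]) > threshold
    ]
```

Here `communities` is a plain dict from source name to user set, so the pairs returned by `overlapping_pairs` can be drawn directly as edges of an overlap graph.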

Selection bias

Coverage bias

Measure the distribution of the number of words given to each news story.

Compute the 1-divergence between each pair of sources.

Coverage bias

In Twitter, coverage bias (as measured by number of tweets) is evident while selection bias is not.

Coverage bias and partisan politics

Sentiment analysis


Future work: find patterns like this?

“perusing TIME’s covers reveals countless examples of the publication tempting the world with critical events, ideas or figures, while dangling before Americans the chance to indulge in trite self-absorption” – David Harris Gershon


Take-home messages

• Encouraging results on fully unsupervised discovery
– But results are quite shallow for now

• It is frustratingly difficult to discover bias and framing
– We are not happy with only quantifying or analyzing known conflicts

Closing remarks


Journalism needs

Data availability

Computing capabilities


Journalism needs

Data availability

Computing capabilities

Overexploited

Finding common ground is not easy.

AI-complete problems

Poorly planned projects


Data analysis is easy, fun and addictive.

Without good research questions, it is often useless.

Computer science to support a key function of society = Applied computing at its best!

Thank you!

Carlos Castillo · [email protected]

http://www.chato.cl/research/


Shouldn't traditional news outlets resent social media?

• We did not take their lunch
• I am not pointing fingers but …

• … online classified ads are to “blame”


Data sample from Al Jazeera English

• October 2012
– ≈ 3M visits
– ≈ 606 articles
– ≈ 200K social media reactions

• Open Source Web Analytics beacon
– High-performance process (S4+Cassandra).

News: less shared on Facebook
In-Depth: more shared on Facebook

Examples (mid-2012)

Decreasing (78%):
● Almost all breaking news
● Sometimes delayed due to timezone differences, e.g. Hurricane Sandy

Steady or Increasing (12%):
● Ongoing news: Obama/Romney, Worker strikes in SA, Syrian unrest
● Articles updated with supporting content

Rebounding (10%):
● Articles picked up by external sources or social media (typically single source of traffic)
● Background articles to new developments


Predicting traffic and shelf-life online has a long history

• Predicting long-term behavior and half-life from short-term observations
– Observations = comments, visits, votes, …
– Behavior = total comments, total visits, …
– 10+ papers specifically on web traffic

• Bit.ly (2011, 2012)
– Studies half-life per topic and platform

Results (shelf-life prediction)

Larger improvements for In-Depth articles

Still, this is a 12-hour error in predicting something with an average of 48-72 hours


Social media users engaged with news

• To what extent can they contribute to the journalistic process?

• What kind of roles do they play?

• 47% of journalists from 15 countries (n=478) said Twitter is a source of information for them [source]


Manual annotation

• 200 users in 20 articles
• Crowdsourcing workers see:
– Title of news article
– Profile and description of user
– Sample of 10 tweets of the user

In relation to news providers

Projection in 2D of the second component of a 3-way decomposition with a 3x2x2 core of the tensor of sources x newsmakers x style.

The first component separates football from basketball.


Text pre-processing: steps

• Determine paragraph boundaries
– Speech change markers, heuristics based on text and time

• Apply a part-of-speech tagger
– Stanford NLP tagger

• Find named entity mentions
• Apply sentiment analysis


News sources

• Non-entity words
• Linguistic style
– Prevalence of different part-of-speech classes

• Overall sentiment
• Coverage
• Timeliness


News matching (model)

• Target: {same story, different story}
• Example features:
– Dot product of aboutness scores of resolved entities in the title, body
– Jaccard coefficient of unresolved entities in the title, body

• Logistic regression
• 4 models in total, one per genre
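The example features and the logistic model can be sketched as below. This is illustrative only: the document layout (`resolved` and `unresolved` entities per `title` and `body`) is a hypothetical representation, and the weights passed to `same_story_probability` are placeholders that would in practice be learned from labeled pairs, once per genre.

```python
import math

FIELDS = ("title", "body")

def dot_product(scores_a, scores_b):
    """Dot product of two sparse {entity: aboutness score} vectors."""
    return sum(score * scores_b.get(entity, 0.0) for entity, score in scores_a.items())

def jaccard(set_a, set_b):
    """Jaccard coefficient of two sets, defined as 0.0 when both are empty."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def match_features(doc_a, doc_b):
    """Four features: entity dot products (title, body), then Jaccard coefficients (title, body)."""
    feats = [dot_product(doc_a["resolved"][f], doc_b["resolved"][f]) for f in FIELDS]
    feats += [jaccard(doc_a["unresolved"][f], doc_b["unresolved"][f]) for f in FIELDS]
    return feats

def same_story_probability(feats, weights, bias):
    """Logistic regression score: probability that the two documents cover the same story."""
    z = bias + sum(w * f for w, f in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))
```

A caption segment and a web article would each be converted to this layout, and a pair would be labeled "same story" when the probability from the genre's model exceeds a chosen cut-off.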