data pipeline for sentiment analysis project

SENTIMENT ANALYSIS OF TWITTER DATA & NEWSPAPER LETTERS-TO-THE-EDITOR TO DETECT LEARNED HELPLESSNESS IN THE PUBLIC SPHERE
Mary van Valkenburg & Thea Ledbetter

Upload: mary-van-valkenburg

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Data Pipeline for Sentiment Analysis Project

SENTIMENT ANALYSIS OF TWITTER DATA & NEWSPAPER LETTERS-TO-THE-EDITOR TO DETECT LEARNED HELPLESSNESS IN THE PUBLIC SPHERE

Mary van Valkenburg & Thea Ledbetter

Page 2: Data Pipeline for Sentiment Analysis Project

Hypothesis: As a society, we have developed learned helplessness in regard to mass school shootings

Page 3: Data Pipeline for Sentiment Analysis Project

“Somehow this has become routine. The reporting is routine. My response here at this podium ends up being routine. The conversation in the aftermath of it. We've become numb to this.”

President Barack Obama 10/01/2015

Page 4: Data Pipeline for Sentiment Analysis Project

Jürgen Habermas’ idea of the Public Sphere

“socially engaged in critical public debate”

https://en.wikipedia.org/wiki/Occupy_movement

Page 5: Data Pipeline for Sentiment Analysis Project

Demographic Differences

Twitter: Younger (under 40). More likely politically and socially liberal.

Letters-to-the-Editor: Older (over 40). More likely politically and socially conservative.

Page 6: Data Pipeline for Sentiment Analysis Project

The data pipeline: Acquire Data → Wrangle Data → Explore Data → Analyze Data

Acquire Data

379 Letters
• Columbine High School (91)
• Amish School (18)
• Virginia Tech (158)
• Sandy Hook Elementary (93)
• Umpqua Community College (19)

1,349,765 Tweets
• Virginia Tech (1,180)
• Sandy Hook Elementary (1,139,751)
• Umpqua Community College (208,834)

Page 7: Data Pipeline for Sentiment Analysis Project

The data pipeline: Wrangle Data (Clean)

1. Remove retweets:
• Virginia Tech (1,180 → 1,180)
• Sandy Hook Elementary (1,139,751 → 597,835)
• Umpqua Community College (208,834 → 82,000)

2. Translate emojis (😡 → “angry”)

3. Convert to lowercase (Disgusted! → disgusted!)

4. Remove stop words (174 common words: 20 days of anguish! → 20 days anguish!)

5. Remove non-alphabetic characters (20 days anguish! → days anguish)
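The cleaning steps listed on this slide can be sketched in Python roughly as follows. This is a minimal illustration and not the authors' code (the snippets in the deck are R): the stop-word set and emoji mapping are tiny hypothetical stand-ins for the 174-word list and the emoji translation the slide mentions, and the "RT @" prefix test for retweets is an assumption.

```python
import re

# Hypothetical stand-ins: the slides mention 174 stop words and an emoji
# translation step, but do not list either mapping.
STOP_WORDS = {"of", "at", "the", "a", "an", "in", "to", "and"}
EMOJI_WORDS = {"\U0001F621": "angry"}  # U+1F621 is the pouting (angry) face


def is_retweet(text: str) -> bool:
    """Assumed convention: a retweet starts with 'RT @'."""
    return text.startswith("RT @")


def clean(text: str) -> str:
    # Step 2: translate emojis to words.
    for emoji, word in EMOJI_WORDS.items():
        text = text.replace(emoji, f" {word} ")
    # Step 3: convert to lowercase.
    text = text.lower()
    # Step 4: remove stop words.
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Step 5: remove non-alphabetic characters, dropping tokens left empty.
    tokens = [re.sub(r"[^a-z]", "", t) for t in tokens]
    return " ".join(t for t in tokens if t)


tweets = [
    "RT @someone: 20 days of anguish!",   # step 1 drops this retweet
    "20 days of anguish! \U0001F621",
]
cleaned = [clean(t) for t in tweets if not is_retweet(t)]
print(cleaned)
```

Applying the stop-word filter before stripping punctuation follows the order on the slide. One consequence is that a stop word with punctuation attached (for example "the,") would survive the filter.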

Page 8: Data Pipeline for Sentiment Analysis Project

The data pipeline: Wrangle Data (Organize)

1. Get NRC sentiment values for each tweet or letter (Anger, Disgust, Fear, Sadness, Surprise)

2. Calculate word counts (tweets ranged from 1 to 56 words, letters from 10 to 271 words)

3. Calculate adjusted sentiment values

4. Calculate the proportion of tweets that are in response to each event, by date:
• 2007 (Virginia Tech) – approx. 5,000 tweets per day
• 2012 (Sandy Hook) – approx. 400,000,000 tweets per day
• 2015 (Umpqua) – approx. 500,000,000 tweets per day
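The proportion step can be illustrated with a short Python sketch. It assumes the per-day figures quoted on the slide are approximate totals for all of Twitter and serve as the denominator, which the slide does not state outright. The function name and the daily counts in the example are hypothetical.

```python
# Approximate tweets per day, as quoted on the slide (assumed here to be
# the overall Twitter volume for that year).
DAILY_TOTALS = {
    "Virginia Tech": 5_000,
    "Sandy Hook": 400_000_000,
    "Umpqua": 500_000_000,
}


def proportion_by_date(event: str, counts_by_date: dict) -> dict:
    """Divide each day's event-related tweet count by the daily total."""
    total = DAILY_TOTALS[event]
    return {date: n / total for date, n in counts_by_date.items()}


# Hypothetical daily counts, for illustration only.
example = proportion_by_date(
    "Virginia Tech", {"2007-04-16": 40, "2007-04-17": 25}
)
print(example)
```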

Page 9: Data Pipeline for Sentiment Analysis Project

Sample Letter (Des Moines Register – 4/18/2007)

Our culture of violence toward people comes through in music, books, video games, movies, the Iraq war, divorce, theft, illegal immigration and lying on our income taxes. Then, we are surprised when the tragedy of Virginia Tech takes place. When we eliminate God from our lives, the animalistic behavior that lies within each of us is allowed to come out in all its destructive fury. Don't be surprised; it will only get worse.

word count (after processing) = 42
anger = 7, adjusted anger score = 0.1667
fear = 10, adjusted fear score = 0.2381
sadness = 9, adjusted sadness score = 0.2143
disgust = 5, adjusted disgust score = 0.1190
surprise = 2, adjusted surprise score = 0.0476
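The figures reported for this letter are consistent with dividing each raw sentiment count by the processed word count (for example, 7 / 42 ≈ 0.1667). The slides do not spell the formula out, so the Python sketch below is an inference from the numbers, and the function name is hypothetical.

```python
def adjusted_scores(counts: dict, word_count: int) -> dict:
    """Normalize raw sentiment counts by the processed word count."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return {name: round(n / word_count, 4) for name, n in counts.items()}


# Raw counts and word count taken from the sample letter on this slide.
letter = adjusted_scores(
    {"anger": 7, "fear": 10, "sadness": 9, "disgust": 5, "surprise": 2},
    word_count=42,
)
print(letter)
```

Normalizing by length is what lets a 42-word letter be compared with a 7-word tweet on the same scale.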

Page 10: Data Pipeline for Sentiment Analysis Project

Sample Tweet (4/18/2007)

mass shooting at umpqua community college 😱

word count (after processing) = 7
anger = 3, adjusted anger score = 0.4286
fear = 3, adjusted fear score = 0.4286
sadness = 0, adjusted sadness score = 0
disgust = 1, adjusted disgust score = 0.1429
surprise = 0, adjusted surprise score = 0

Page 11: Data Pipeline for Sentiment Analysis Project

The data pipeline: Explore Data

1. Create summary views:
• Message counts by date, event, source and source type
• Mean adjusted sentiment scores by event and source type

2. Create quick plots
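The summary views described on this slide amount to grouped counts and grouped means. The Python sketch below uses only the standard library. The record layout, field names and the two helper functions are hypothetical, and the sample records are made up for illustration.

```python
from collections import Counter, defaultdict
from statistics import mean


def count_by(records, keys):
    """Message counts grouped by the given fields."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def mean_by(records, keys, value):
    """Mean of one numeric field, grouped by the given fields."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[k] for k in keys)].append(r[value])
    return {group: mean(values) for group, values in groups.items()}


# Hypothetical records; real rows would carry all five adjusted scores.
records = [
    {"event": "Virginia Tech", "source_type": "letter", "anger": 0.1667},
    {"event": "Virginia Tech", "source_type": "letter", "anger": 0.1000},
    {"event": "Umpqua", "source_type": "tweet", "anger": 0.4286},
]

counts = count_by(records, ["event", "source_type"])
mean_anger = mean_by(records, ["event", "source_type"], "anger")
print(counts)
print(mean_anger)
```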

Page 12: Data Pipeline for Sentiment Analysis Project

The data pipeline: Analyze Data

How many people are tweeting about mass school shootings?

Page 13: Data Pipeline for Sentiment Analysis Project
Page 14: Data Pipeline for Sentiment Analysis Project
Page 15: Data Pipeline for Sentiment Analysis Project

Proportion of users tweeting

t.test(vt_tweets.by_date$proportion, sh_tweets.by_date$proportion, alternative = "g")
t = 2.6578, df = 27.001, p-value = 0.006526
mean of x      mean of y
0.0084285714   0.0000515375

t.test(ucc_tweets.by_date$proportion, sh_tweets.by_date$proportion, alternative = "l")
t = -4.0666, df = 31.246, p-value = 0.0001501
mean of x      mean of y
5.655172e-06   5.153750e-05
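R's t.test runs Welch's unequal-variance test by default, which is why the degrees of freedom above are fractional. As a rough illustration of what is being computed, here is a standard-library Python sketch of the Welch statistic and the Welch-Satterthwaite degrees of freedom. It is not the authors' code, the sample data are invented, and the p-value is left out because it needs a t-distribution CDF that the standard library does not provide.

```python
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, df) for Welch's two-sample t-test."""
    nx, ny = len(x), len(y)
    # Squared standard error of each sample mean (sample variance / n).
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Invented data, for illustration only.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t_stat, dof)
```

With alternative = "g" the test is one-sided, so a large positive t supports the claim that the first sample's mean is greater. With alternative = "l", a large negative t supports the claim that it is smaller.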

Page 16: Data Pipeline for Sentiment Analysis Project

The data pipeline: Analyze Data

How has sentiment changed over time?

Page 17: Data Pipeline for Sentiment Analysis Project
Page 18: Data Pipeline for Sentiment Analysis Project

The data pipeline: Analyze Data

Is the increase in anger significant?

Page 19: Data Pipeline for Sentiment Analysis Project

t-test for significance in increased anger after Umpqua as compared to anger after Sandy Hook

t.test(Umpqua.anger, SandyHook.anger, alternative = "g")

Welch Two Sample t-test

data: Umpqua.anger and SandyHook.anger
t = 93.156, df = 101120, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.02285762 Inf
sample estimates:
mean of x    mean of y
0.06516975   0.04190128

Page 20: Data Pipeline for Sentiment Analysis Project

• Oregon college shooting: Shock, fear and confusion after attack at Umpqua Community College

• Another gun massacre in the States

• senseless. https://t.co/22WtC4Jfll

• holy hell the #Umpqua shooting...

• Horrible. https://t.co/I30SEODIUB

• Damn crazy at Umpqua .been there before when I stayed up there.

• Heard about umpqua college shooting attack. Crazy !

• im so in shock because of the umpqua shooting i do not even have words

Page 21: Data Pipeline for Sentiment Analysis Project

The data pipeline: Analyze Data

Is there a difference in the sentiment content of letters vs tweets?

Page 22: Data Pipeline for Sentiment Analysis Project

t = 9.1543, p-value < 2.2e-16 t = 8.7403, p-value < 2.2e-16 t = 10.972, p-value < 2.2e-16

t = 9.6111, p-value < 2.2e-16 t = 9.2149, p-value < 2.2e-16

Page 23: Data Pipeline for Sentiment Analysis Project

Concluding Thoughts

• Letters show more intense sentiment than tweets, even after controlling for differences in word counts.

• Anger in response to mass school shootings appears to have trended downward through the first four mass shootings studied. However, anger rose significantly with the last event, the shooting at Umpqua Community College.

• The proportion of tweets about mass school shootings declined from Virginia Tech in 2007 to Sandy Hook in 2012, and it decreased again with the Umpqua Community College shooting in 2015.

Page 24: Data Pipeline for Sentiment Analysis Project

“Sentiment analysis is a very specific tool that's useful in certain situations, but it isn't magic.” 

-David Robinson

(quoted in Scientific American 8/18/2016)