Towards Detecting Influenza Epidemics by Analyzing Twitter Messages


Upload: emelda

Post on 25-Feb-2016

Page 1: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Aron Culotta

Jedsada Chartree

Page 2: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Introduction
• Growing interest in monitoring disease outbreaks.
• Growing number of Twitter users

- February 2010: 50 million tweets/day
- June 2010: 65 million tweets/day (750 tweets/s)

- 190 million users

Source: http://en.wikipedia.org/wiki/Twitter

Page 3: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Introduction

• Twitter is a website that offers a social networking and micro-blogging service.
- Users send and read messages called “tweets” (up to 140 characters).

Page 4: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Introduction
• Advantages of Twitter for this research

- Full messages provide more information than a search query.
- Twitter profiles contain more detail to analyze (city, state, gender, age).
- Twitter users are diverse.

Page 5: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Methodology
• Data

- Collected 574,643 messages over 10 weeks (February 12, 2010 to April 24, 2010).
- The US Centers for Disease Control and Prevention (CDC) publishes statistics from the US Outpatient Influenza-like Illness Surveillance Network (ILINet).

Page 6: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Methodology

Ground-truth ILI rates obtained from the CDC statistics

Page 7: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Methodology
• Regression Models
1. Simple linear regression

$$\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(W, D)) + \beta_2 + \varepsilon$$

where
- $P$ = the proportion of the population exhibiting ILI symptoms
- $\beta_1, \beta_2$ = the coefficients
- $\varepsilon$ = the error term
- $Q(W, D) = D_W / |D|$ = the fraction of documents in $D$ that match $W$
- $D$ = a document collection ($|D|$ is its size)
- $D_W$ = the document frequency for word $W$
- $\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$
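The simple model above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes NumPy, fits the two coefficients by ordinary least squares on logit-transformed values, and uses made-up weekly numbers. The helper names (`fit_simple`, `predict_simple`) are hypothetical.

```python
import numpy as np

def logit(x):
    """Log-odds transform ln(x / (1 - x)), defined for 0 < x < 1."""
    x = np.asarray(x, dtype=float)
    return np.log(x / (1.0 - x))

def inv_logit(z):
    """Inverse of logit, mapping real values back into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def fit_simple(q, p):
    """Least-squares fit of logit(p) = beta1 * logit(q) + beta2."""
    # np.polyfit returns coefficients from highest degree down: slope, intercept.
    beta1, beta2 = np.polyfit(logit(q), logit(p), deg=1)
    return beta1, beta2

def predict_simple(q, beta1, beta2):
    """Estimated ILI proportion for keyword-match fractions q."""
    return inv_logit(beta1 * logit(q) + beta2)

# Hypothetical weekly values, for illustration only (not from the slides).
q = np.array([0.010, 0.012, 0.015, 0.011, 0.009])  # Q(W, D) per week
p = np.array([0.020, 0.023, 0.028, 0.021, 0.018])  # ILI proportion P per week

beta1, beta2 = fit_simple(q, p)
p_hat = predict_simple(q, beta1, beta2)
```

Working in logit space keeps every back-transformed estimate strictly between 0 and 1, which suits a quantity that is a proportion.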

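Page 8: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Methodology
• Regression Models
2. Multiple linear regression

$$\operatorname{logit}(P) = \beta_1 \operatorname{logit}(Q(\{W_1\}, D)) + \dots + \beta_k \operatorname{logit}(Q(\{W_k\}, D)) + \beta_{k+1} + \varepsilon$$

where
- $P$ = the proportion of the population exhibiting ILI symptoms
- $\beta_1, \dots, \beta_{k+1}$ = the coefficients
- $\varepsilon$ = the error term
- $Q(\{W_i\}, D) = D_{W_i} / |D|$ = the fraction of documents in $D$ that match $W_i$
- $D$ = a document collection ($|D|$ is its size)
- $D_{W_i}$ = the document frequency for word $W_i$
- $\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$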
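The multiple-regression variant can be sketched the same way, with one logit-transformed column per keyword plus an intercept. This is an illustrative sketch assuming NumPy and an ordinary least-squares fit; the function names and the matrix layout (weeks in rows, keywords in columns) are assumptions, not taken from the slides.

```python
import numpy as np

def logit(x):
    """Log-odds transform ln(x / (1 - x)), defined for 0 < x < 1."""
    x = np.asarray(x, dtype=float)
    return np.log(x / (1.0 - x))

def design_matrix(Q):
    """Columns logit(Q[:, i]) for each keyword, followed by a column of ones."""
    Q = np.asarray(Q, dtype=float)
    return np.column_stack([logit(Q), np.ones(Q.shape[0])])

def fit_multiple(Q, p):
    """Least-squares fit; returns beta of length k + 1, intercept last."""
    beta, *_ = np.linalg.lstsq(design_matrix(Q), logit(p), rcond=None)
    return beta

def predict_multiple(Q, beta):
    """Estimated ILI proportion for each row (week) of Q."""
    z = design_matrix(Q) @ beta
    return 1.0 / (1.0 + np.exp(-z))
```

With only about ten weekly observations, as in this data set, adding many keywords risks overfitting, so a sketch like this is best tried with a small k.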

Page 9: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Methodology
• Keyword Selection
1. Correlation coefficient

- Evaluates the fit of the simple linear regression model.

2. Residual sum of squares (RSS)

- Measures the discrepancy between the data and an estimation model.

$$\mathrm{RSS}(P, \hat{P}) = \sum_i (p_i - \hat{p}_i)^2$$
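Both selection criteria are easy to compute once each candidate keyword has a series of estimated ILI rates. The sketch below assumes NumPy and a hypothetical `estimates` dictionary mapping each keyword to its estimated series; `select_keyword` is an illustrative helper, not a function from the original work.

```python
import numpy as np

def correlation(p, p_hat):
    """Pearson correlation coefficient between observed and estimated rates."""
    return float(np.corrcoef(p, p_hat)[0, 1])

def rss(p, p_hat):
    """Residual sum of squares: sum over i of (p_i - p_hat_i)^2."""
    p = np.asarray(p, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    return float(np.sum((p - p_hat) ** 2))

def select_keyword(p, estimates, criterion="rss"):
    """Return the keyword whose estimates fit the observed rates best.

    estimates: dict mapping keyword -> sequence of estimated rates.
    criterion: "rss" picks the lowest RSS; anything else picks the
    highest correlation.
    """
    if criterion == "rss":
        return min(estimates, key=lambda w: rss(p, estimates[w]))
    return max(estimates, key=lambda w: correlation(p, estimates[w]))
```

The two criteria can disagree: correlation ignores scale and offset, while RSS penalizes any absolute deviation from the observed rates.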

Page 10: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Methodology
• Keyword Generation
1. Hand-chosen keywords (flu, cough, sore throat, headache)

2. Most frequent keywords
- Search for all documents containing any of the hand-chosen keywords.
- Find the top 5,000 most frequently occurring words.
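The two keyword-generation steps can be sketched with the standard library alone. The tokenizer (a simple lowercase regular expression) and the substring test for seed keywords are simplifying assumptions; for example, "flu" would also match "influence". The function name `frequent_keywords` is hypothetical.

```python
import re
from collections import Counter

# The hand-chosen seed keywords listed on the slide.
HAND_CHOSEN = ("flu", "cough", "sore throat", "headache")

def frequent_keywords(messages, seeds=HAND_CHOSEN, top_n=5000):
    """Return the top_n most frequent words among messages containing a seed."""
    counts = Counter()
    for text in messages:
        lowered = text.lower()
        # Step 1: keep only messages that contain any hand-chosen keyword.
        if any(seed in lowered for seed in seeds):
            # Step 2: count every word in the retained message.
            counts.update(re.findall(r"[a-z']+", lowered))
    return [word for word, _ in counts.most_common(top_n)]
```

For instance, `frequent_keywords(tweets, top_n=10)` on a hypothetical list `tweets` would return the ten most common words in flu-related messages.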

Page 11: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Methodology
• Document Filtering
- Apply logistic regression to predict whether a Twitter message is reporting an ILI symptom.

$$p(y_i = 1 \mid x_i; \theta) = \frac{1}{1 + e^{-x_i \cdot \theta}}$$

where
- $y_i$ = a binary random variable (1 if document $D_i$ is positive, 0 otherwise)
- $x_i = \{x_{ij}\}$, where $x_{ij}$ = the number of times word $j$ appears in document $i$
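A bag-of-words logistic regression classifier of this form can be sketched with NumPy. This is an illustrative sketch only: the training procedure (plain gradient ascent on the log-likelihood, no bias term, no regularization), the whitespace tokenizer, and all function names are assumptions rather than details from the slides.

```python
import numpy as np

def bag_of_words(messages, vocab):
    """Build X where X[i, j] counts how often vocab word j appears in message i."""
    index = {word: j for j, word in enumerate(vocab)}
    X = np.zeros((len(messages), len(vocab)))
    for i, text in enumerate(messages):
        for token in text.lower().split():
            if token in index:
                X[i, index[token]] += 1
    return X

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.5, steps=2000):
    """Fit theta by gradient ascent on the average log-likelihood."""
    y = np.asarray(y, dtype=float)
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta += lr * X.T @ (y - sigmoid(X @ theta)) / len(y)
    return theta

def predict_proba(X, theta):
    """p(y_i = 1 | x_i; theta) for each row of X."""
    return sigmoid(X @ theta)
```

Messages whose predicted probability exceeds a chosen threshold (0.5 is a common default) would be kept as reports of ILI symptoms before computing the keyword fractions.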

Page 12: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Methodology

Page 13: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Methodology
• Classification evaluation

- Accuracy
- Precision
- Recall
- F-measure

$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
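The four measures follow directly from the counts of true and false positives and negatives. The sketch below uses the standard definitions for binary labels (1 = positive, 0 = negative); the function name `evaluate` and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall, and F-measure for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

Because the F-measure is the harmonic mean of precision and recall, it stays low unless both are reasonably high.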

Page 14: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Results

• Document Filtering

Evaluation of message classification, with standard error in parentheses

Page 15: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Results

• Regression

The 10 different systems evaluated

Page 16: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Results

• Regression

The correlation coefficient (r), residual sum of squares (RSS), and standard error of each system

Page 17: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Results

Results for multi-hand-rss(2)

Results for classification-hand

Page 18: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Results

Results for multi-freq-rss(3)

Results for simple-hand-rss(1)

Page 19: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Results

Correlation results for simple-hand-rss and multi-hand-rss

Correlation results for simple-hand-corr and multi-hand-corr

Page 20: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Results

Correlation results for simple-freq-rss and multi-freq-rss

Correlation results for simple-freq-corr and multi-freq-corr

Page 21: Towards Detecting Influenza Epidemics by Analyzing Twitter Messages

Conclusion
• Several methods are used to identify influenza-related messages.
• A number of regression models are compared to correlate the messages with CDC statistics.
• The best model achieves a correlation of .78.