
19 April 2013

Monitoring social media for communicable disease surveillance: An Australian study

Matthew HAMLETᵃ, Guido ZUCCONᵃ,ᵇ, Sankalp KHANNAᵃ,ᵇ, Anthony NGUYENᵃ,ᵇ, Justin BOYLEᵃ,ᵇ, Mark CAMERONᵇ

ᵃ The Australian e-Health Research Centre, CSIRO, Australia
ᵇ CSIRO ICT Centre, Australia

Motivation

2 | Monitoring social media for communicable disease surveillance: An Australian Study | Sankalp Khanna

Big Data to the Rescue


The (current) Solution

• Victorian Infectious Diseases Reference Laboratory (VIDRL), Queensland Health and other state based agencies.

• Australian Sentinel Practices Research Network (ASPREN)

• National Health Call Centre Network

• FluTracking

• Google Flu Trends


Google Flu Trends

The Problem

+ time, money, and resources ...


Why Twitter?

There are 200 million active users on Twitter! And those users post an average of 400 million Tweets every day.¹


1 https://business.twitter.com/audiences-twitter

2 http://www.website-monitoring.com/blog/2012/11/07/twitter-2012-facts-and-figures-infographic/

3 http://www.adcorp.com.au/news-blog/social-media-statistics-july-2012,-australia-new-z

Top 10 Countries by tweet volume (July 2012)²

Top 15 Social Media Sites in Australia (July 2012)³

Data

• 3.5 months ... May to August 2011

• Victorian Tweets – using CSIRO’s ESA-AWTM architecture¹

• 13.5+ million tweets

• Filtered on Keywords (see table)²,³

• Retweets were removed

• 100,000 potentially influenza related tweets remaining


1 Jie Yin, et al., “ESA: Emergency Situation Awareness via Microbloggers”, CIKM 2012 (2012).

2 Sadilek A, Kautz H, Silenzio V. Predicting disease transmission from geo-tagged micro-blog data. Conf Proc 26th AAAI Conference on Artificial Intelligence 2012, 136-42.

3 Signorini A, Segre AM, Polgreen PM. The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS One. 2011 May 4;6(5):e19467.

Flu | Sick | Headache | Fever
Ache | Cough | Throat | Cold
Stomach | Runny | Sneeze | Pneumonia
Influenza | Stuffy | Tylenol | Diarrhea
Snot | Tissues | Antibiotics | Shivering
Unwell | Chills | Doctor | Fatigued
Down With | Vomit | Nausea | Vicks
Not:doctor who | Not:jab | Not:shot | Not:pandemic
Not:fully sick | Not:sick of | Not:vaccine | Not:Bieber
Not:weather

Table: List of keywords used for filtering
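The filtering step (keep tweets that mention a keyword, drop those that hit a "Not:" term, remove retweets) might look roughly like the sketch below. The term lists are copied from the table. Whole-word, case-insensitive matching and the "RT @" retweet test are assumptions made for illustration; the slides do not say how matching or retweet detection was actually done. The names `is_retweet` and `is_candidate` are hypothetical.

```python
import re

# Terms copied from the keyword table (lower-cased).
INCLUDE = [
    "flu", "sick", "headache", "fever", "ache", "cough", "throat", "cold",
    "stomach", "runny", "sneeze", "pneumonia", "influenza", "stuffy",
    "tylenol", "diarrhea", "snot", "tissues", "antibiotics", "shivering",
    "unwell", "chills", "doctor", "fatigued", "down with", "vomit",
    "nausea", "vicks",
]
# The "Not:" entries from the same table.
EXCLUDE = [
    "doctor who", "jab", "shot", "pandemic", "fully sick", "sick of",
    "vaccine", "bieber", "weather",
]


def _compile(terms):
    # Assumption: whole-word, case-insensitive matching.
    body = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


INCLUDE_RE = _compile(INCLUDE)
EXCLUDE_RE = _compile(EXCLUDE)


def is_retweet(text):
    # Assumption: a retweet is recognised by a leading "RT @" marker.
    return text.lstrip().upper().startswith("RT @")


def is_candidate(text):
    """Return True if the tweet passes the keyword filter."""
    if is_retweet(text):
        return False
    if EXCLUDE_RE.search(text):
        return False
    return INCLUDE_RE.search(text) is not None


if __name__ == "__main__":
    for tweet in [
        "I swear I can smell cold and flu bugs on this tram ride.",
        "so sick of this traffic",
        "RT @someone: home with the flu",
    ]:
        print(is_candidate(tweet), "-", tweet)
```

A keyword filter like this only selects candidates. A tweet about "footy fever" still passes, which is why the remaining tweets are described as "potentially" influenza related and then rated by hand.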

Manual Classification for Validation

• 10,483 tweets

• 17 AEHRC volunteers

• Scale of 0-100 with respect to likelihood of describing an influenza case

• 5% duplicates ... to measure inter-assessor agreement


Our international students had their first taste of footy fever on the weekend. http://bit.ly/kUCHLC Big thanks to @NorthKangaroos!

I swear I can smell cold and flu bugs on this tram ride. Be gone evil germs, I'll have none of you!

we didnt really get to talk:/ & but isaid kiss me please & he said im sick i cant & im like idc il get sick& he still said

I just remembered that I’m going to be 30 in 2 weeks. I’m going to embrace this fully. Now pass me my royal sick bag.

Manual Classification for Validation

• Standard deviation between duplicate tweet ratings was 4.89

• Shorter tweets had higher standard deviation on average.

• Average score used for duplicate tweets
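As a rough illustration of how duplicate ratings could be reconciled, the sketch below averages the scores given to the same tweet and reports the mean standard deviation across tweets that were rated more than once. The slides do not state how the 4.89 figure was computed, so the use of the population standard deviation per tweet, averaged over duplicated tweets, is an assumption. `aggregate_ratings` and the tweet ids are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def aggregate_ratings(ratings):
    """ratings: iterable of (tweet_id, score) pairs, score on the 0-100 scale.

    Returns (scores, mean_duplicate_sd), where scores maps each tweet id to
    the average of its ratings and mean_duplicate_sd is the mean per-tweet
    standard deviation over tweets rated more than once.
    """
    by_tweet = defaultdict(list)
    for tweet_id, score in ratings:
        by_tweet[tweet_id].append(score)

    # Average score is used when a tweet was rated more than once.
    scores = {tweet_id: mean(vals) for tweet_id, vals in by_tweet.items()}

    # Assumption: population SD per duplicated tweet, then averaged.
    sds = [pstdev(vals) for vals in by_tweet.values() if len(vals) > 1]
    mean_duplicate_sd = mean(sds) if sds else 0.0
    return scores, mean_duplicate_sd


if __name__ == "__main__":
    example = [("t1", 80), ("t1", 90), ("t2", 0), ("t3", 50), ("t3", 50)]
    print(aggregate_ratings(example))
```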


Figure. Manually classified tweets (bucketed by ‘likelihood of flu’ score). Axes: Score (buckets 0, 1-25, 26-50, 51-75, 76-99, 100) against Number of Tweets (0 to 8000).

A Model for Automatic Flu Classification


• Tri-gram model (N-gram language model family) used

• 5 models representing different levels of tolerance, i.e. cut-off score t = {0, 25, 50, 75, 100}

• 243,121 features (or word combinations) were used to classify new tweets.

• Feature frequency thresholds of 0, 3, 5, 7 and 10 were tested.

• 10-fold cross validation used for evaluating model performance

• The effect of removing punctuation and capitalisation was evaluated
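The feature-extraction side of such a model can be sketched as follows: split each tweet on whitespace, generate word patterns of one to three words, optionally strip punctuation and capitalisation, and keep only patterns that occur often enough. This is a minimal sketch and not the authors' implementation. Whitespace tokenisation, the `count >= min_count` rule for the feature frequency threshold, and the `score >= t` rule for the cut-off are all assumptions, and every function name is hypothetical.

```python
import string
from collections import Counter

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalise(text, strip_punct_and_case=False):
    """Split a tweet into whitespace tokens, optionally lower-casing it and
    removing ASCII punctuation first."""
    if strip_punct_and_case:
        text = text.lower().translate(_PUNCT_TABLE)
    return text.split()


def ngrams(tokens, max_n=3):
    """All word patterns of 1..max_n consecutive tokens."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(tweets, min_count=0, strip_punct_and_case=False):
    """Keep patterns seen at least min_count times (assumed threshold rule)."""
    counts = Counter()
    for text in tweets:
        counts.update(ngrams(normalise(text, strip_punct_and_case)))
    return {feature for feature, c in counts.items() if c >= min_count}


def is_flu(score, t):
    # Assumption: a manual score at or above the cut-off t counts as flu.
    return score >= t


if __name__ == "__main__":
    print(ngrams("have the flu".split()))
    print(sorted(build_vocabulary(["have the flu", "the flu is bad"], min_count=2)))
```

Varying `t`, `min_count` and `strip_punct_and_case` corresponds to the three settings the slide lists as having been tested.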

Word pattern | Frequency | Word pattern | Frequency
this flu | 1.0 | sick in bed. | 0.83
like shit. | 1.0 | sick today, | 0.83
flu is | 0.9 | & flu | 0.83
have the flu | 0.89 | work tomorrow | 0.83
sick to go | 0.88 | the weekend | 0.83
cough | 0.86 | #manflu | 0.83
lemon | 0.86 | nose and | 0.83
getting sick :( | 0.86 | cold & flu | 0.83
better soon. | 0.86 | with the flu | 0.82

Table: List of features with high probability in the n-gram model with t=50
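One plausible reading of the "Frequency" column is the fraction of rated tweets containing a word pattern that scored at or above the cut-off t. The slides do not define the column, so this interpretation is an assumption, as are the `score >= t` rule, the whitespace tokenisation and the function names in the sketch below.

```python
from collections import Counter


def ngram_set(text, max_n=3):
    """Distinct word patterns of 1..max_n consecutive whitespace tokens."""
    tokens = text.split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def feature_probabilities(rated_tweets, t=50, min_count=1):
    """rated_tweets: iterable of (text, score) pairs.

    For each pattern seen in at least min_count tweets, return the share of
    those tweets whose score is at or above t.
    """
    total = Counter()
    positive = Counter()
    for text, score in rated_tweets:
        features = ngram_set(text)
        total.update(features)
        if score >= t:  # assumed cut-off rule
            positive.update(features)
    return {
        feature: positive[feature] / n
        for feature, n in total.items()
        if n >= max(min_count, 1)
    }


if __name__ == "__main__":
    rated = [("have the flu", 90), ("flu shot day", 10), ("have the flu again", 80)]
    probs = feature_probabilities(rated, t=50)
    for feature in ("have the flu", "flu", "shot"):
        print(feature, round(probs[feature], 2))
```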

Results – The Model


Word pattern | Frequency
this flu | 1.0
have the flu | 0.89
#manflu | 0.83

Table 3. List of features with high probability in the n-gram model with t=75.


Results – The Model


Figure (three panels; y-axis in each panel: Correctly Identified (%), 50 to 100).
Panel x-axes, in order: Threshold t (0, 25, 50, 75, 100); Feature Frequency Threshold (0, 3, 5, 7, 10); Threshold t (0, 25, 50, 75, 100).
(a) different certainty scores (b) different feature frequency (c) different certainty scores cutoffs (t=100) punctuation and capitalisation removed
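The "Correctly Identified (%)" values come from 10-fold cross validation. A generic harness for that kind of evaluation is sketched below. It is not the authors' code: the shuffling, the fold assignment and the `majority_baseline` stand-in classifier are assumptions chosen to keep the example self-contained, and any real model would be passed in through the hypothetical `train_fn` parameter.

```python
import random


def k_fold_accuracy(examples, train_fn, k=10, seed=0):
    """Mean percentage of held-out examples classified correctly.

    examples: list of (text, label) pairs.
    train_fn: takes a training list and returns a predict(text) function.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]  # k roughly equal folds

    accuracies = []
    for i, held_out in enumerate(folds):
        if not held_out:
            continue
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        predict = train_fn(train)
        correct = sum(predict(text) == label for text, label in held_out)
        accuracies.append(100.0 * correct / len(held_out))
    return sum(accuracies) / len(accuracies)


def majority_baseline(train):
    """Stand-in classifier: always predicts the most common training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return lambda text: majority


if __name__ == "__main__":
    data = [("tweet %d" % i, i < 5) for i in range(20)]  # 5 flu, 15 not flu
    print(k_fold_accuracy(data, majority_baseline))
```

A majority-class baseline like this is a useful floor when reading accuracy figures, since most manually rated tweets fall in the lowest score bucket.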

The Future

• Improve automatic flu classification model

• Investigate correlation with VIDRL confirmed influenza cases.

• Incorporate other data sources (social media, weather data, etc.)


CSIRO Australian e-Health Research Centre Sankalp Khanna Postdoctoral Research Fellow

t +61 7 3253 3629 e [email protected] w www.aehrc.com

THE AUSTRALIAN E-HEALTH RESEARCH CENTRE

Productivity isn’t everything, but in the long run it is almost everything.

Paul Krugman, 1991 Professor Princeton University, Nobel Prize in Economics 2008