illness surveillance on twittercobweb.cs.uga.edu/~squinn/mmd_s15/presentations/influenza.pdf ·...

Influenza-Like Illness Surveillance on Twitter through Automated Learning of Naïve Language

Presented by Joey Ruberti

Francesco Gesualdo, Giovanni Stilo, Eleonora Agricola, Michaela V. Gonfiantini, Elisabetta Pandolfi, Paola Velardi, and Alberto E. Tozzi

Overview

1. Problem

2. Previous Approaches

3. Proposed Approach

4. System Details

5. System Evaluation Strategies

6. Results

7. Reliability

8. Effectiveness

9. Limitations

10. Conclusions

Problem: Twitter Mining Potential

• The general public shares personal information on social networks and microblogs like Twitter

• How can this data be utilized?

• This information is a potential source of real-time data directly from individuals that can be used for disease surveillance and public health

• Tweets often accompanied by location indicators

• Syndromic surveillance systems

• What is the best way to aggregate and analyze this data?

• 288 million monthly active users on Twitter

• 500 million Tweets sent per day

Previous Approaches: Measuring Specific Keywords

• Measure the occurrence of specific disease-related search keywords vs. disease trends

• Google Flu Trends, a Google service, used this technique to estimate and predict influenza activity by aggregating search query volumes

• Suffers from a high level of noise because search peaks are often completely unrelated to the incidence of a disease

Previous Approaches: Measuring Specific Keywords

• These approaches usually look for the name of the clinical condition or its synonyms (e.g., H1N1 or swine flu)

• Sometimes these keywords are arbitrarily chosen by the authors but are related to the clinical syndrome (e.g., flu or vaccine)

Problems with this type of approach:

1. In blogs and forums, people are motivated by a communication need rather than an information need, so naïve language is often used instead of technical language

2. Most users will describe a combination of symptoms rather than a diagnosis. Looking at disease-related keywords can miss a large volume of messages that include signs/symptoms

New Approach: Goals

• Analyze Twitter messages as a source of data for syndromic surveillance but take into account the use of non-medical language by Twitter users

• Use a combination of symptoms rather than a suspected or final diagnosis keyword like previous approaches

• Use Twitter’s geolocation data to narrow down results to locations in the United States

New Approach: Design Overview

1. Develop a minimally supervised algorithm that learns technical term-naïve term pairs based on pattern generalization and complete-linkage clustering

2. Apply the algorithm to a group of technical terms extracted from the European Centre for Disease Prevention and Control (ECDC) case definition for influenza-like illness (ILI)

3. Construct a Boolean query based on the ECDC case definition for ILI, using both technical and related jargon terms identified by the algorithm from step 1

4. Collect 2 sets of Twitter messages matching the query

5. Compare the trends of these messages with traditional surveillance data for influenza in the US

Complete-linkage clustering: the similarity of 2 clusters is the similarity of their most dissimilar members

Results in clusters with minimum similarity

http://nlp.stanford.edu/IR-book/html/htmledition/single-link-and-complete-link-clustering-1.html
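The complete-link criterion above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the token-overlap (Jaccard) similarity, the merge threshold, and the sample patterns are all assumptions chosen for the example.

```python
from itertools import combinations


def complete_link_similarity(cluster_a, cluster_b, sim):
    """Similarity of two clusters = similarity of their MOST DISSIMILAR members."""
    return min(sim(a, b) for a in cluster_a for b in cluster_b)


def complete_linkage(items, sim, threshold):
    """Agglomerative clustering: repeatedly merge the most similar pair of
    clusters until no pair reaches the (assumed) similarity threshold."""
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best_pair, best_sim = None, threshold
        for i, j in combinations(range(len(clusters)), 2):
            s = complete_link_similarity(clusters[i], clusters[j], sim)
            if s >= best_sim:
                best_pair, best_sim = (i, j), s
        if best_pair is None:
            break  # no remaining pair is similar enough to merge
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def jaccard(p, q):
    """Token-overlap similarity between two patterns (assumed measure)."""
    a, b = set(p.split()), set(q.split())
    return len(a & b) / len(a | b)


# Hypothetical generalized patterns, for illustration only.
patterns = [
    "inflammation of BODYPART",
    "swelling of BODYPART",
    "pain in BODYPART",
    "ache in BODYPART",
]
clusters = complete_linkage(patterns, jaccard, threshold=0.5)
for cluster in clusters:
    print(cluster)
```

Because the cluster similarity is the minimum over all member pairs, a merge only happens when every member of one cluster is sufficiently similar to every member of the other, which tends to produce compact clusters.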

Algorithm Development: extraction of naïve-medical jargon

• In order to overcome the clinical term bias of the previous approaches, an algorithm was developed that automatically maps all naïve terms related to a specific medical term using www.freebase.com/view/medicine/disease

• The algorithm starts with a small initial learning set of medical conditions, made up of term pairs (1 technical term and 1 naïve term, e.g., emesis-vomiting), to extract basic patterns from the web, and then generalizes, clusters, and weights these patterns based on another small set of pairs

• Generalized patterns are learned for sentence fragments of naïve terms and for multi-word expressions describing medical conditions (eg. “inflammation of the nose” -> “inflammation of BODYPART”)
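The generalization step in the last bullet can be illustrated with a small sketch. It only reproduces the slide's example ("inflammation of the nose" to "inflammation of BODYPART"); the body-part lexicon, the `generalize` function, and the regex approach are assumptions, not the resources or method used in the paper.

```python
import re

# Hypothetical mini-lexicon; the paper's actual lexical resources are not
# listed in these slides.
BODY_PARTS = ["nose", "throat", "stomach", "head", "lung"]


def generalize(phrase, body_parts=BODY_PARTS, placeholder="BODYPART"):
    """Replace a body-part mention (with an optional leading 'the' and an
    optional plural 's') by a generic placeholder."""
    alternatives = "|".join(re.escape(part) for part in body_parts)
    pattern = re.compile(
        r"\b(?:the\s+)?(?:" + alternatives + r")s?\b", re.IGNORECASE
    )
    return pattern.sub(placeholder, phrase)


print(generalize("inflammation of the nose"))
print(generalize("pain in the throat"))
```

Word boundaries keep the replacement from firing inside longer words, so a term such as "headache" is left untouched.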

Query Development: Aggregation of Symptoms

• A Boolean query was developed to look for Tweets based on an aggregation of symptoms using the following ECDC case definition for an influenza-like illness:

Sudden onset of symptoms AND at least one of the following 4 systemic symptoms (fever or feverishness, malaise, headache, myalgia) AND at least one of the following 3 respiratory symptoms (cough, sore throat, shortness of breath)

Applying the Algorithm

The algorithm was applied to a set of 8 symptom-related medical conditions expressed as technical terms derived from the case definition

Set of naïve terms obtained by the algorithm:

Generating the Boolean Query

Using the naïve terms discovered by the algorithm and the original technical terms, the influenza-like illness case definition was transformed into a Boolean query

( (fever) OR (feverishness) OR (malaise) OR (headache) OR (myalgia) ) AND ( (cough) OR (pharyngitis) OR (dyspnea) )
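A query of this shape (one OR-group per symptom class, joined by AND) can be checked against a tweet as sketched below. This is a hedged illustration: the function names are hypothetical, the sample tweets are invented, and the only naïve terms included are "sore throat" and "shortness of breath", taken from the ECDC wording above, since the full list of naïve terms learned by the algorithm is not reproduced in these slides.

```python
import re

# Technical terms from the query above, plus the two lay expressions that
# appear in the ECDC case definition itself.
SYSTEMIC = ["fever", "feverishness", "malaise", "headache", "myalgia"]
RESPIRATORY = ["cough", "pharyngitis", "dyspnea",
               "sore throat", "shortness of breath"]


def contains_any(text, terms):
    """True if at least one term occurs in the text as a whole word or phrase."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
        for term in terms
    )


def matches_ili_query(text):
    """(any systemic symptom) AND (any respiratory symptom)."""
    return contains_any(text, SYSTEMIC) and contains_any(text, RESPIRATORY)


# Invented example tweets, for illustration only.
print(matches_ili_query("Woke up with a fever and a sore throat"))  # both groups
print(matches_ili_query("Such a headache after that meeting"))      # one group only
```

Whole-word matching is a design choice here: it avoids matching inside unrelated words, but it also means inflected forms such as "coughing" would need to be listed as separate terms.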

Extracting Twitter Data: The Datasets

Twitter data was analyzed on two different datasets

Dataset 1

From November 11, 2012 to April 27, 2013, the first dataset was derived from a 1% sample of the worldwide Twitter traffic using the Twitter API

Dataset 2

From January 27, 2013 to May 5, 2013, the second dataset was derived from all the Tweets including at least one of the singleton terms composing the influenza-like illness query and 3 additional queries based on other case definitions (Cold, Gastroenteritis, Allergy)

17 technical keywords and 65 jargon keywords

Geolocalization: How to identify Tweets from the US?

• 3 different geo-localization strategies were used to identify tweet trends localized in the US

1. US-GEO - tweets providing US GPS coordinates

2. US-WIDE - tweets responding to 1 of the following:

• US GPS coordinates

• Explicit US place code

• US related time zone

• US place indicated in user’s profile

3. US-NARROW - same as US-WIDE excluding all tweets reporting a US time zone but a non-US place code

• This approach allows for a larger number of tweets to be identified rather than just using GPS coordinates alone
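The three strategies can be expressed as simple predicates over a tweet's location metadata. The sketch below assumes a hypothetical `TweetGeo` record with four pre-computed fields; these field names are illustrative and do not correspond to actual Twitter API fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TweetGeo:
    """Hypothetical, pre-processed location metadata for one tweet."""
    gps_in_us: bool = False            # GPS coordinates fall inside the US
    place_code: Optional[str] = None   # explicit place code, e.g. "US" or "IT"
    us_time_zone: bool = False         # account uses a US-related time zone
    profile_place_in_us: bool = False  # profile location names a US place


def is_us_geo(t: TweetGeo) -> bool:
    """US-GEO: only tweets providing US GPS coordinates."""
    return t.gps_in_us


def is_us_wide(t: TweetGeo) -> bool:
    """US-WIDE: any one of the four US indicators is enough."""
    return (t.gps_in_us or t.place_code == "US"
            or t.us_time_zone or t.profile_place_in_us)


def is_us_narrow(t: TweetGeo) -> bool:
    """US-NARROW: US-WIDE minus tweets with a US time zone but a non-US place code."""
    conflicting = (t.us_time_zone
                   and t.place_code is not None
                   and t.place_code != "US")
    return is_us_wide(t) and not conflicting


tweet = TweetGeo(us_time_zone=True, place_code="IT")
print(is_us_geo(tweet), is_us_wide(tweet), is_us_narrow(tweet))
```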

Query Evaluation: Will the query work?

• 100 tweets that matched the query on influenza-like illness were extracted from the second dataset

• A random sample of 500 tweets not matching the query, but including at least one symptom, was also extracted

• These Tweets were independently examined by the authors to test the consistency of extracted tweets for the case definition

• The Tweet examination yielded a 3% false positive rate with a precision of 0.97
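The two figures in the last bullet are linked by the definition of precision. The sketch below assumes that the 3% refers to the 100 query-matching tweets, i.e. 3 tweets judged not to describe ILI and 97 judged correct; these counts are inferred from the percentages and are not stated in the slides.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of query-matching tweets that were judged to be true ILI reports."""
    return true_positives / (true_positives + false_positives)


# Assumed counts: 100 matching tweets, 3 of them judged not to be ILI.
matching_tweets = 100
false_positives = 3
true_positives = matching_tweets - false_positives

p = precision(true_positives, false_positives)
print(f"precision = {p:.2f}, share of false positives = {1 - p:.2f}")
```

Read this way, the 3% is the complement of precision (the share of retrieved tweets that were wrong) rather than a false positive rate computed over all non-ILI tweets.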

Source of Influenza-like illness data

US Influenza-like illness trend data

• Obtained from reports by the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet)

• Their weekly reports were sent to the CDC and contain the number of patient visits for influenza-like illness by age group

• The CDC defines an influenza-like illness as fever (temperature of 100°F or greater) and a cough and/or a sore throat without a known cause other than influenza

Control series

• Some models built on Twitter series can fit the data even using keywords not related to ILI

• In order to measure the correlation of unrelated data, a series of tweets containing keywords not related to ILI was used

• The ILI non-related keywords were:

• "zombie" OR "zed" OR "undead" OR "living dead"

• This data was used to compare non-ILI trend with the ILINet data

Statistical analysis

• Results of Tweet trends are reported as the number of ILI positive tweets (or the number of ILI negative tweets for the control series) per unit of time (week)

• Results of Tweet trends, ILINet data, and Google Flu Trends data are expressed as z-scores

• Pearson correlation coefficients were used to compare US surveillance data with Twitter traffic consistent with the ILI case definition and with Twitter traffic not consistent with the ILI case definition (non-ILI tweets)
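The two statistics named above can be computed as in the following sketch. The weekly numbers are invented purely to make the example runnable and are not data from the study; using the population standard deviation for the z-scores is also an assumption, since the slides do not say which variant was used.

```python
from math import sqrt
from statistics import mean, pstdev


def z_scores(series):
    """Standardize a series: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    return [(x - mu) / sigma for x in series]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented weekly values, for illustration only.
weekly_ili_tweets = [120, 180, 260, 310, 240, 150]   # tweet counts per week
weekly_ilinet = [2.1, 3.0, 4.4, 5.2, 4.1, 2.6]       # surveillance indicator per week

r = pearson_r(z_scores(weekly_ili_tweets), z_scores(weekly_ilinet))
print(f"r = {r:.3f}")
```

Expressing each series as z-scores puts counts and surveillance rates on a common scale for plotting; it does not change the Pearson coefficient, which is invariant to shifting and positive rescaling.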

Statistical analysis

Twitter traffic expressed as:

• Total available Tweet traffic

• US-GEO Tweets

• US-WIDE Tweets

• US-NARROW Tweets

Total available traffic series for the 1% sample dataset and US-NARROW series for the second dataset were smoothed by Loess function

Tweets from the 1% sample and US-NARROW tweets consistent with the ILI case definition were also compared with Google Trends data and with trends generated by tweets reporting the words “flu” OR “influenza”
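For readers unfamiliar with Loess, the smoothing mentioned above fits a small weighted linear regression around each point. The sketch below is a simplified stand-in written for illustration, not the function the authors used: the `loess` name, the `frac` span parameter, and the sample series are assumptions.

```python
def loess(xs, ys, frac=0.6):
    """Simplified Loess: for each x, fit a tricube-weighted straight line to
    the nearest `frac` share of the points and evaluate it at x."""
    n = len(xs)
    k = min(n, max(2, int(round(frac * n))))  # points in each local window
    smoothed = []
    for x0 in xs:
        distances = sorted(abs(x - x0) for x in xs)
        h = distances[k - 1] or 1.0  # bandwidth: distance to the k-th nearest point
        weights = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        total = sum(weights)
        mx = sum(w * x for w, x in zip(weights, xs)) / total
        my = sum(w * y for w, y in zip(weights, ys)) / total
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        smoothed.append(my + slope * (x0 - mx))
    return smoothed


# Invented noisy weekly series, for illustration only.
weeks = list(range(10))
counts = [5, 9, 6, 14, 11, 18, 12, 9, 10, 4]
print([round(v, 1) for v in loess(weeks, counts)])
```

A larger `frac` averages over more neighbouring weeks and gives a smoother curve; a smaller one follows the raw weekly counts more closely.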

Results: Dataset 1

447,597,718 Tweets were extracted between November 11, 2012 and April 27, 2013 (1% sample of the total worldwide Twitter traffic)

• From the extracted tweets, 5,508 satisfied the conditions set by the query for influenza-like illness

• The sample of ILI tweets responding to the geo-localization criteria was too small, so the total ILI tweet series was used

• Twitter and traditional surveillance trends for the US were compared, and the correlation coefficient was high (0.981, p<0.001)

[Figure: Comparison between weekly ILI tweets, ILINet data, Google Flu Trends, and tweets containing the words “flu” or “influenza”, November 2012 to May 2013. Series shown as z-scores: CDC’s reported ILI; tweets satisfying the ILI query; tweets including the words “flu” or “influenza”; Google Flu Trends data.]

Tweets satisfying the ILI query do not overestimate the actual flu peak, whereas the Google Flu Trends series and the series of tweets containing “flu” or “influenza” do

Results: Dataset 1 (non-related keywords)

The ILINet data was also compared with the control series of tweets containing ILI non-related keywords, and the correlation coefficient was very low (0.292, p=0.159)

The ILI non-related keywords were:

"zombie" OR "zed" OR "undead" OR "living dead"

Results: Dataset 2

232,452,510 tweets were extracted from January 27, 2013 to May 5, 2013 containing at least one of the terms included in the ILI case definition and in the 3 additional Influenzanet case definitions (Cold, Allergy, Gastroenteritis)

• 3,252,013 (1.3%) Tweets responded to the US-GEO criteria (Tweets with GPS coordinates)

• 85,381,987 (36%) responded to the US-WIDE criteria (Tweets with GPS coordinates, US place code, US related time zone, or US place indicated in profile)

• 11,040,587 (4.7%) responded to the US-NARROW criteria (same as US-WIDE, excluding all Tweets reporting a US time zone but a non-US place code)

262,853 tweets (0.11%) satisfied the conditions set by the query for ILI

[Figure: Weekly reported ILI (CDC) and Tweets satisfying the ILI query, January 2013 to May 2013, shown as z-scores. Panels: A. All tweets (regardless of location); B. US-GEO (GPS localized tweets); C. US-WIDE Tweets; D. US-NARROW Tweets. Correlation coefficients shown on the panels: r=0.769, r=0.974, r=0.980 (marked as the highest correlation coefficient), and r=0.977, each with p=0.001.]

Results: Dataset 2

When smoothed by Loess function, the comparison of ILINet data with US-NARROW yielded the highest correlation coefficient (r=0.997, p<0.001)

[Figure: Comparing ILINet data with Tweets containing the word “flu” or the word “influenza”, January 2013 to May 2013. Series shown as z-scores: CDC’s reported ILI; tweets including the words “flu” or “influenza”, geolocalized with the extended narrow localization pattern.]

The “flu” or “influenza” tweet series has a lower correlation coefficient (r=0.944, p<0.001) than the tweet trend consistent with the ECDC case definition

Reliability

• The results show a very high correlation between tweet trends and traditional US surveillance data (higher than Google Flu Trends for the same time period)

• This approach did not overestimate the actual flu peak in the 2012-2013 flu season like Google Flu Trends and the series of tweets containing “flu” or “influenza”

• The system has a very low rate of false positives (3%) yielded by the manual examination of the sample tweets

How has this approach proved useful?

Demonstrated the importance of

• Accounting for naïve language when performing syndromic surveillance

• Improves the detection of health-related concepts to produce a large body of evidence

• E.g., the technical term "pharyngitis" appeared in only 26 tweets, while the corresponding naïve terms occurred 234,951 times

• Using a combination of symptoms to analyze words as they appear in specific contexts instead of relying on final-diagnosis keywords for query development

• Allows for a variety of natural language analyses and sense disambiguation techniques to be performed that could potentially reduce noise and more accurately detect disease indicators

How has this approach proved useful?

• The system can be applied to different country settings and languages

• By introducing other disease ontologies, the system can be applied to other kinds of syndromic surveillance (emerging diseases/allergies)

• Allows for the discovery of associations between symptoms and specific exposures

• System cost is low and the data can be acquired quickly compared to traditional surveillance systems

What were the limitations to this approach?

• Twitter surveillance, like search-related surveillance used in Google Flu Trends, may be influenced by news and media reports

• The second dataset only obtained Tweets from the second phase of the influenza season

• Twitter users are not representative of the entire US population

• This might show a trend towards a restricted population group

• Restricting the analysis to geo-localized tweets may introduce a selection bias

• E.g., users who allow GPS coordinates or include localization information in their profile may differ from other Twitter users

• System only tested on 1 influenza season

Conclusions

• Twitter mining techniques focused on disease surveillance can be improved by mining Tweets with Boolean queries derived from disease case definitions and by including naïve terms in the queries

• This technique proved less sensitive to media reports compared to other approaches like Google’s Flu Trends

• Using Twitter’s geolocation data allows for more precise information to be extracted for syndromic surveillance and disease mapping

References

http://www.plosone.org/article/fetchObject.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0082489&representation=PDF
