1216 IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, VOL. 19, NO. 4, JULY 2015

Predicting Asthma-Related Emergency Department Visits Using Big Data

Sudha Ram, Member, IEEE, Wenli Zhang, Max Williams, and Yolande Pengetnze

Abstract—Asthma is one of the most prevalent and costly chronic conditions in the United States, and it cannot be cured. However, accurate and timely surveillance data could allow for timely and targeted interventions at the community or individual level. Current national asthma disease surveillance systems can have data availability lags of up to two weeks. Rapid progress has been made in gathering nontraditional, digital information to perform disease surveillance. We introduce a novel method of using multiple data sources for predicting the number of asthma-related emergency department (ED) visits in a specific area. Twitter data, Google search interests, and environmental sensor data were collected for this purpose. Our preliminary findings show that our model can predict the number of asthma ED visits based on near-real-time environmental and social media data with approximately 70% precision. The results can be helpful for public health surveillance, ED preparedness, and targeted patient interventions.

Index Terms—Asthma, big data, emergency department (ED) visits, environmental sensors, predictive modeling, social media.

I. INTRODUCTION

Asthma is one of the most prevalent and costly chronic conditions in the United States, with 25 million people affected [1]. Asthma accounts for about two million emergency department (ED) visits, half a million hospitalizations, and 3 500 deaths [1], and incurs more than 50 billion dollars in direct medical costs annually [2]. Moreover, asthma is a leading cause of lost productivity, with nearly 11 million missed school days and more than 14 million missed work days every year due to asthma [1].
Although asthma cannot be cured, many of its adverse events can be prevented by appropriate medication use and avoidance of environmental triggers [3]. The prediction of population- and individual-level risk for asthma adverse events using accurate and timely surveillance data could guide timely and targeted interventions, to reduce the societal burden of asthma.

Manuscript received September 29, 2014; revised December 4, 2014; accepted February 6, 2015. Date of publication February 19, 2015; date of current version July 23, 2015. This work was supported in part by the National Center for Advancing Translational Sciences under Grant 1UL1TR001105-01, National Institutes of Health, and by Grant 10-D-028227 from the W.W. Caruth, Jr. Foundation at Communities Foundation of Texas, to PCCI.

S. Ram and W. Zhang are with the Department of Management Information Systems, University of Arizona, Tucson, AZ 85721 USA (e-mail: [email protected]; [email protected]).

M. Williams is with the Parkland Center for Clinical Innovation, Dallas, TX 75247 USA (e-mail: [email protected]).

Y. Pengetnze is with the Parkland Center for Clinical Innovation, Dallas, TX 75247 USA and also with the Children's Medical Center of Dallas, Dallas, TX 75235 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JBHI.2015.2404829

At the population level, current national asthma disease surveillance programs rely on weekly reports to the Centers for Disease Control and Prevention (CDC) of data collected from various local resources by state health departments [4]. Such data notoriously lag by weeks and therefore provide only retrospective information that is not amenable to proactive and timely preventive interventions.
At the individual level, known predictors of asthma ED visits and hospitalizations include past acute care utilization, medication use, and sociodemographic characteristics [5]–[7]. Common data sources for these variables include electronic medical records (EMR), medical insurance claims data, and population surveys, all of which are also subject to significant time lags. In an ongoing quality improvement project for asthma care, researchers at the Parkland Center for Clinical Innovation (PCCI), Dallas, TX, USA, have built an asthma predictive model relying on a combination of EMR and claims data to predict the risk for asthma-related ED visits within three months of data collection [Unpublished reports from PCCI]. Although the model performance (C-statistic 72%) and prediction timeframe (three months) are satisfactory, a narrower prediction timeframe could potentially provide additional risk stratification for more efficiency and timeliness in resource deployment. For instance, resources might be prioritized to first serve patients at high risk for an asthma ED visit within two weeks of data collection, while being safely deferred for patients with a later predicted high-risk period.

Novel sources of timely data on population- and individual-level asthma activities are needed to provide additional temporal and geographical granularity to asthma risk stratification. Short of collecting information directly from individual patients (a time- and resource-intensive endeavor), readily available public data will have to be repurposed intelligently to provide the required information.

There has been increasing interest in gathering nontraditional, digital information to perform disease surveillance. These include diverse datasets such as those stemming from social media, internet search, and environmental data.
Twitter is an online social media platform that enables users to post and read 140-character messages called "tweets." It is a popular data source for disease surveillance using social media since it can provide nearly instant access to real-time social opinions. More importantly, tweets are often tagged by geographic location and time stamps, potentially providing information for disease surveillance [8], [9]. Another notable nontraditional disease surveillance system has been a data-aggregating tool called Google Flu Trends, which uses aggregated search data to estimate flu activity [10], [11]. Google Trends was quite successful in its estimation of influenza-like illness.

2168-2194 © 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Google Trends is based on Google's search engine, which tracks how often a particular search term is entered relative to the total search volume across a particular area. This enables access to the latest data from web search interest trends on a variety of topics, including diseases like asthma. Air pollutants are known triggers for asthma symptoms and exacerbations [12]. The United States Environmental Protection Agency (EPA) provides access to monitored air quality data collected at outdoor sensors across the country, which could be used as a data source for asthma prediction. Meanwhile, as health reform progresses, the quantity and variety of health records being made available electronically are increasing dramatically [13]. In contrast to the traditional disease surveillance systems, these new data sources have the potential to enable health organizations to respond to chronic conditions, like asthma, in real time. This in turn implies that health organizations can appropriately plan for staffing and equipment availability in a flexible manner. They can also provide early warning signals to the people at risk for asthma adverse events, and enable timely, proactive, and targeted preventive and therapeutic interventions.

Our research objective is to leverage social media, internet search, and environmental air quality data to estimate ED visits for asthma in a relatively discrete geographic area (a metropolitan area) within a relatively short time period (days).
To this end, we have gathered asthma-related ED visits data, social media data from Twitter, internet users' search interests from Google, and pollution sensor data from the EPA, all from the same geographic area and time period, to create a model for predicting asthma-related ED visits. This study is different from extant studies that typically predict the spread of contagious diseases using social media such as Twitter. Unlike influenza or other viral diseases, asthma is a noncommunicable health condition, and we demonstrate the utility and value of linking big data from diverse sources in developing predictive models for noncommunicable diseases with a specific focus on asthma.

II. BACKGROUND

A number of research studies have explored the use of novel data sources to propose rapid, cost-effective health status surveillance methodologies. Some of the early studies rely on document classification, suggesting that Twitter data can be highly relevant for early detection of public health threats [14]. Others employ more complex linguistic analysis, such as the Ailment Topic Aspect Model [15], which is useful for syndrome surveillance. This type of analysis is useful for demonstrating the significance of social media as a promising new data source for health surveillance.

Other recent studies have linked social media data with real-world disease incidence to generate actionable knowledge useful for making health care decisions. These include [16], which analyzed Twitter messages related to influenza and correlated them with reported CDC statistics. Similarly, a study by Chew [17] during the 2009 H1N1 flu pandemic validated Twitter as a real-time content, sentiment, and public attention trend-tracking tool. Collier [18] employed supervised classifiers (SVM and Naive Bayes) to classify tweets into four self-reported protective behavior categories.
This study adds to evidence supporting a high degree of correlation between prediagnostic social media signals and diagnostic influenza case data.

While these disease surveillance systems, including Google Flu Trends [10], based on novel data sources have shown significant promise, other studies have challenged the accuracy of these systems for two reasons [19], [25]. 1) Anomalous media spikes: People searching flu terms may have had symptoms, but many users might have been simply looking for news stories about an anomalous season. Media attention might increase tweets about specific diseases but may not necessarily reflect actual disease affliction. 2) Misleading information: Tweets indicating awareness of disease, e.g., "Hope I don't get asthma," or using disease as rhetoric, e.g., "He is so cute I think I got asthma," are clearly about a specific disease but are not about actual disease affliction. These kinds of signals can be misleading and can mask signs of actual disease affliction. These issues and challenges are being addressed by several studies including [19] and [20]. For instance, Google Flu Trends engineers announced a redesign of their algorithm by dampening media attention and using Elastic Net, rather than regression, for estimation [19]. Commonly used Twitter techniques such as keyword matching or linear regression can widely overestimate the prevalence of disease. Broniatowski and Paul [8] have specifically addressed this issue and built models to determine if tweets were relevant to health, influenza, or an actual infection. In addition, they used geographic information associated with tweets for influenza surveillance.

Building on these techniques, our work uses a combination of data from multiple sources to predict the number of asthma-related ED visits in near real time. In doing so, we exploit geographic information associated with each dataset.
We describe the techniques to process multiple types of datasets, to extract signals from each, integrate them, and feed them into a prediction model using machine learning algorithms, and we demonstrate the feasibility of such a prediction.

III. METHODS

We have examined the feasibility of using multiple data sources for predicting the number of asthma-related ED visits. A preliminary prediction model was built for this purpose. Semisupervised classification models were applied on data streams stemming from Twitter to distinguish tweets indicating asthma affliction from tweets just including keywords related to asthma. We also processed air quality data obtained from sensors, historical electronic health records indicating asthma-related visits to an emergency room, and Google search trends, all from the same specific geographic area, during the same time period.

A. Data Collection and Processing

1) ED Data: Deidentified aggregate data on ED visits for asthma as a primary diagnosis (International Classification of Disease Ninth [ICD9] code 493.00 to 493.99) to the Children's Medical Center (CMC) of Dallas were collected between October 2013 and December 2013 for a Quality Improvement initiative (see Fig. 1). Additional data were collected between November and December 2013 on ED visits for constipation (ICD9 564.00 to 564.09) or abdominal pain (ICD9 789.00 to 789.09) to serve as controls for background variations in the number of ED visits unrelated to asthma activity. The two control conditions were chosen because they affect a different organ system than asthma and are unlikely to be related to asthma.

Fig. 1. Hospital administrative data on ED visits for asthma, Dallas, TX, USA, October 1, 2013–December 24, 2013.

Fig. 2. Examples of keywords and frequency of tweets (Asthma stream, October 11, 2013–December 31, 2013).
The study was approved by the CMC Institutional Review Board.

It should be noted that the average and standard deviation of the ED visits data from December (ave = 81, std = 40) are significantly different from the first two months' data (ave = 94, std = 25). So we excluded December's ED visits data from one of our correlation analyses.

2) Twitter Data: Twitter offers streaming APIs to give developers and researchers low latency access to its global stream of data. Public streams, which can provide access to the public data flowing through Twitter, were used in this study. Studies have estimated that using Twitter's Streaming API, researchers can expect to receive 1% of the tweets in near real time. Twitter4j, an unofficial Java library for the Twitter API, was used to access tweet information from the Twitter Streaming API. Two different Twitter datasets were collected in this study. 1) The general Twitter stream: a simple collection of JSON grabbed from the general Twitter stream. The general tweet counts were used to estimate the Twitter population and further normalize asthma tweet counts. 2) The asthma-related stream: to collect only tweets containing any of 19 related keywords that were suggested by our clinical collaborators from PCCI. The asthma stream is limited to 1% of full tweets as well.

Fig. 2 shows the number of tweets in our asthma stream for some of the keywords used in data collection. Our Twitter dataset for this study was collected from October 11, 2013, through December 31, 2013, and contains 464 845 785 general tweets and 1 315 390 asthma-related tweets. On average, 15 000 asthma-related tweets were generated from all over the world per day. This demonstrates that Twitter is a promising data source for asthma surveillance and should be explored further.

Fig. 3. (a) Global asthma-related tweets (Asthma stream, October 11, 2013–December 31, 2013). (b) Asthma-related tweets, United States (Asthma stream, October 11, 2013–December 31, 2013).

The geographic location of each tweet is identified via two fields: coordinates and location.
Coordinates indicate the longitude and latitude of the tweet's location, e.g., {"coordinates": [-97.51087576, 35.46500176]}. Unfortunately, only a small percentage of tweets expose their coordinates. Among all the asthma-related tweets we collected, only about 2.7% (35152/1315390) of the tweets revealed their coordinates, as shown in Fig. 3(a). Most of these tweets are from English-speaking countries and approximately 60% of them are from the United States [see Fig. 3(b)]. We further analyzed these tweets based on a subset of our keywords [see Fig. 4(a) and (b)].

Locations indicate the cities and states where the users who posted tweets reside, e.g., {"location": "San Francisco, CA, USA"}. This information was collected from Twitter users' public biographic profiles. To estimate the prevalence of asthma-related tweets in a geographic region, we only included tweets from that particular region. We confined our analysis to the Twitter streaming data collected from the Dallas-Fort Worth (DFW) Metropolitan area in Texas, to closely match the geographical origins of patients in our clinical data sample. The boundary of the region and geographical coordinates are shown in Fig. 5.

Fig. 4. (a) Global asthma-related tweets based on keywords (Asthma stream, October 11, 2013–December 31, 2013). (b) Asthma-related tweets based on keywords, United States (Asthma stream, October 11, 2013–December 31, 2013).

As previously mentioned, not many people divulge their location in their tweets; consequently, in our dataset, only 892 asthma-related tweets were actually identified to fall within the geographic boundaries of interest to our study. Hence, we examined the dataset in more detail and collected profiles of users tweeting about asthma. By examining these user profiles, we were able to extract users' locations from their profile information and identified additional tweets stemming from our location of interest.
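The coordinate-based part of this geographic filter can be sketched as follows. This is a minimal illustration and not the authors' code: the helper name `in_dfw_box` and the use of a simple rectangle are assumptions, the corner values are taken from the Fig. 5 caption with western longitudes written as negative numbers, and the [longitude, latitude] ordering follows the coordinates example in the text.

```python
# Hypothetical rectangular approximation of the DFW study region.
# The extreme latitude/longitude values come from the four corners in Fig. 5.
LAT_MIN, LAT_MAX = 32.353360, 33.228426
LON_MIN, LON_MAX = -97.767308, -96.105626


def in_dfw_box(coordinates):
    """Return True if a [longitude, latitude] pair falls inside the box."""
    if coordinates is None:  # most tweets expose no coordinates at all
        return False
    lon, lat = coordinates
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


# Made-up tweets for illustration only.
tweets = [
    {"id": 1, "coordinates": [-96.80, 32.78]},               # inside the box
    {"id": 2, "coordinates": [-97.51087576, 35.46500176]},   # north of the box
    {"id": 3, "coordinates": None},                           # no geotag
]
dfw_ids = [t["id"] for t in tweets if in_dfw_box(t["coordinates"])]
```

Tweets without coordinates are simply skipped here; in the study they are handled separately through the location field of the user profile.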
Through this profile-based step, we were able to identify 3 768 additional tweets from the asthma stream in the area of interest, and a total of 1 953 402 tweets from the general stream in the same area.

Fig. 5. Geo boundaries defined for coordinates of Twitter asthma stream. Northwest: Decatur, TX, USA, 33.228426, -97.597020; Northeast: Greenville, TX, 33.131878, -96.105626; Southwest: Lake Granbury, 32.446121, -97.767308; Southeast: Cedar Creek Reservoir, 32.353360, -96.171544.

One of the challenges we needed to address was to extract signal from the noisy Twitter dataset, i.e., to distinguish tweets that are relevant to asthma from tweets that mentioned asthma in an irrelevant context. Fig. 6 shows the process used for cleaning the Twitter dataset. First, non-English tweets and retweets were excluded. The exclusion of non-English tweets is not expected to have a major impact on the analysis as 95% of tweets originating in the USA, including our geographical region of interest, are in English [21]. We transformed tweets to lower case and removed all the special characters, targets (@), hashtags (#), URLs, and emoticons. Each tweet was then tokenized by splitting based on nonletter characters. The tokens were used to generate a vector numerically representing each tweet. TF-IDF (term frequency–inverse document frequency) [22] was used for this purpose. All the words were stemmed by applying Porter's algorithm [23] and English stop-words [24] were filtered out.

Fig. 6. Extracting signal from noisy Twitter dataset.

We then employed a machine learning classification technique called artificial neural networks (ANN) to accurately identify relevant tweets, using the process shown in Fig. 6. ANN is a supervised classification technique requiring a training dataset. First, we extracted a dataset containing 4 500 tweets from our asthma stream described earlier, i.e., each tweet in the training dataset contained at least one asthma-related keyword.
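The cleaning, tokenization, stop-word filtering, and TF-IDF weighting steps described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the pipeline used in the study: the function names, the tiny stop-word list, the choice to drop whole @target and #hashtag tokens, and the plain tf × log(N/df) weighting are all assumptions, and Porter stemming is omitted to keep the example within the Python standard library.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; the paper cites a standard list [24]).
STOP_WORDS = {"i", "my", "the", "a", "is", "so", "to", "and", "of", "me"}


def clean_and_tokenize(tweet):
    """Lowercase, drop URLs, @targets and #hashtags, then split on nonletters."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)       # targets and hashtags
    tokens = re.split(r"[^a-z]+", text)        # split on nonletter characters
    return [t for t in tokens if t and t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


# Made-up tweets for illustration only.
docs = [clean_and_tokenize(t) for t in [
    "My asthma is acting up, need my inhaler http://t.co/x",
    "He is so cute I think I got asthma @friend #lol",
]]
vectors = tfidf_vectors(docs)
```

With this weighting, a term that occurs in every tweet (here "asthma") receives weight zero, which illustrates why keyword presence alone carries little information once every collected tweet contains an asthma keyword.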
The training dataset was divided into three parts and each tweet was manually labeled by three researchers as "asthma relevant" or "asthma irrelevant." The annotation criteria for "asthma relevant" tweets included a statement that the individual has had asthma; supporting criteria included: 1) severe difficulty in breathing as part of a discrete attack, 2) shortness of breath triggered by exercise, stress, smoke, or irritants, 3) night time coughing duration greater than 1 month, 4) family history or childhood history of asthma, or 5) use of an inhaler.

The classifier was trained on this dataset. A tenfold cross validation was executed to evaluate the performance. The results showed a high overall accuracy of 85.78%. The precision for the "asthma irrelevant" class was 86.71% and the precision for the "asthma relevant" class was 66.67%. The recall for the true "asthma irrelevant" class was as high as 98.15%, whereas the recall for the true "asthma relevant" class was 19.72%, indicating that there is a lot of noise in the data. Despite the latter low recall, the large Twitter dataset provided a sufficient number of true asthma-related tweets for the analyses.

This complex cleaning process resulted in a dataset from which we were able to extract a sufficient number of asthma-related tweets in the geographic area of interest along with their specific timestamps.

Fig. 7. Google search trends for asthma-related keywords.

3) Google Data: Google Trends analyzes its search engine traffic to determine the usage frequency of specific search terms by individual users as compared to the total number of Google searches performed during a specific time period. To make it easier to compare data on different keywords, results in Google Trends are normalized using their total search traffic.
Using the keywords from our Twitter asthma stream collection process, we extracted data from Google Trends. To align with the ED visits data, Google search interests were accessed for the same time period and in the same location as the Twitter data (see Fig. 7).

We retrieved Google search interest data by accessing the Google Trends website (www.google.com/trends) on three specific dates, December 10, 2013, February 21, 2014, and September 20, 2014, with the same query. For reasons unknown to us, the results are different on each of the three days. We used each of the three datasets for our analysis as described later.

4) Sensor Data: Air pollution data were collected from the EPA databases (www.epa.gov). The dataset contains measures of six types of pollutants, i.e., particulate matter, ground-level ozone, carbon monoxide, sulfur oxides, nitrogen oxides, and lead. The air quality indexes (AQI) associated with these pollutants were used in our model. The higher the AQI value, the greater the level of air pollution and the greater the health concern. Along with the AQI, we were able to get the AQS-SITE-ID (Air Quality System, site identification) from the EPA database. A Site ID is associated with a specific physical location and address. The site latitude and longitude also are provided. Using this information, we collected AQI data from 27 sites in the DFW area. The sites closest to the zip codes of origin of asthma patients were retained for analysis. Using this data, we calculated daily average AQI for our prediction model.

B. Prediction Model

We first analyzed the association between the asthma-related ED visits and data from Twitter, Google trends, and Air Quality sensors, using the Pearson correlation coefficient. We also examined the association between asthma-related tweet counts and ED visit counts for abdominal pain/constipation patients, to control for nonasthma-specific variations in ED visit counts.
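As a point of reference, the Pearson correlation coefficient between two daily series can be computed as in the sketch below. The function name `pearson_r` and the sample numbers are hypothetical and are not the study data; significance testing, which the paper also reports, is not covered here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Made-up daily counts for illustration only (not the study data).
asthma_tweets = [12, 15, 9, 20, 18]
ed_visits = [80, 95, 70, 110, 100]
r = pearson_r(asthma_tweets, ed_visits)
```

The same function applies unchanged to the control comparison, by passing the abdominal pain/constipation ED visit counts as the second series.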
Then, we designed and implemented a prediction model to estimate the incidence of asthma ED visits at CMC using a combination of independent variables from the aforementioned data sources. Since each dataset is from a different source and has different levels of granularity with respect to time and location, we first performed some transformations on each dataset to make them compatible. An important transformation was to normalize each dataset using a standard normalization technique, i.e., z-score

$$z = \frac{x - \mu}{\sigma} \tag{1}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.

Additionally, Twitter data were normalized by calculating the ratio of asthma-related tweets to the total number of tweets in the general Twitter stream collected from the same geo-region in a given time.

For the prediction model, we employed four different classification methods: Decision tree, Naive Bayes, SVM, and ANN, and compared their classification accuracy. We also used techniques called adaptive boosting and stacking to reduce classification errors. The ED visit counts were converted from numerical to categorical values based on the z-values, where the observations were labeled as "High," "Medium," or "Low." Our model was used to classify the predicted variable, i.e., number of daily ED visits, into one of three complementary and mutually exclusive classes: High, Medium, or Low. The Naive Bayes technique requires nominal data; hence, another transformation was used to convert all numerical data values into categorical values based on the z-values, similar to the transformation used for ED visit counts.

IV. ANALYSIS AND RESULTS

A. Relationship Between ED Visits and Individual Types of Data

We first report on the relationship between asthma ED visits and each individual type of dataset, i.e., Twitter, Google trends, and Air Quality sensors.

Of note, Twitter data were only available beginning on October 11, 2013, and ED visits data were not available after December 24, 2013. We, therefore, performed the correlation analysis based on 74 days' worth of data.
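For reference, the preprocessing described in the Prediction Model subsection (tweet-ratio normalization, z-score standardization, and three-way labeling) can be sketched as follows. The paper does not state the z-value cut-offs used for the High/Medium/Low classes, so the ±0.5 threshold below is purely a hypothetical choice, as are the function names and the sample numbers; the population standard deviation is likewise an assumption.

```python
import statistics


def z_scores(values):
    """Standardize values: z = (x - mean) / standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population SD (assumed; sample SD is also plausible)
    return [(v - mu) / sigma for v in values]


def label(z, cut=0.5):
    """Map a z-score to High/Medium/Low; the cut-off of 0.5 is hypothetical."""
    if z > cut:
        return "High"
    if z < -cut:
        return "Low"
    return "Medium"


# Made-up daily counts for illustration only (not the study data).
asthma_tweets = [40, 55, 30, 70]
general_tweets = [20000, 22000, 20000, 25000]
# Ratio of asthma tweets to general tweets from the same region and day.
tweet_ratio = [a / g for a, g in zip(asthma_tweets, general_tweets)]

ed_visits = [70, 90, 95, 125]
ed_labels = [label(z) for z in z_scores(ed_visits)]
```

Dividing by the general tweet volume removes day-to-day swings in overall Twitter activity, so that a busy day on Twitter is not mistaken for a rise in asthma-related posting.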
Our results indicate that the absolute asthma tweet count is correlated with the asthma ED visit counts (r = 0.328, p < 0.01) (see Table I). After normalization of the number of asthma tweets using the daily number of general tweets, the correlation coefficient improved (r = 0.378, p < 0.01) (see Table I).

Given that the average and standard deviation of the asthma ED visits data for December 2013 (ave = 81, std = 40) are significantly different from the first two months' data (ave = 94, std = 25), we did a sensitivity analysis excluding data from December, which left us 50 observations. The 50 observations showed further improvement of correlation between the number of asthma tweets and asthma ED visit counts (see Table II).

TABLE I
CORRELATION RESULTS BETWEEN TWITTER DATA (74 OBSERVATIONS) AND ASTHMA ED VISITS

                                       # of asthma           # of normalized
                                       affiliation tweets    tweets
# of ED visits   Pearson Correlation   0.328                 0.378
                 Sig.                  0.004                 0.001
                 N                     74                    74

Correlation is significant at the 0.01 level.

TABLE II
CORRELATION RESULTS BETWEEN TWITTER DATA (50 OBSERVATIONS) AND ASTHMA ED VISITS

                                       # of Asthma    # of Normalized
                                       Tweets         Asthma Tweets
# of ED Visits   Pearson Correlation   0.409          0.363
                 Sig.                  0.003          0.009
                 N                     50             50

Correlation is significant at the 0.01 level.

TABLE III
CORRELATION RESULTS BETWEEN TWITTER DATA AND ABDOMINAL PAIN/CONSTIPATION ED VISITS

                                              # of Asthma    # of Normalized
                                              Tweets         Asthma Tweets
# of Abdominal Pain/    Pearson Correlation   0.084          0.075
Constipation Patients   Sig.                  0.697          0.729
ED visits               N                     24             24

Note: We were only able to get the abdominal pain/constipation ED visits data from December 01–December 24, 2013.

TABLE IV
CORRELATION RESULTS BETWEEN AIR POLLUTION DATA AND ASTHMA ED VISITS

                                       CO       NO2      PM2.5
# of ED Visits   Pearson Correlation   0.332    0.316    0.239
                 Sig.                  0.002    0.003    0.027
                 N                     85       85       85

Correlation is significant at the 0.01 level (CO, NO2) or at the 0.05 level (PM2.5).

Meanwhile, as a control, we also examined the relationship between asthma tweets count and abdominal pain/constipation ED visit counts.
The absence of correlation speaks to the specificity of the association between asthma-related tweets and asthma ED visits (see Table III).

We next report on the correlation between the pollutant indexes and asthma ED visits (pollutant data after December 24, 2013 were removed since there was no ED visits data available after that date). Three pollutant indexes, i.e., CO, NO2, and PM2.5, show significant correlation with asthma ED visits (see Table IV and Fig. 8).

Fig. 8. Air pollution data, Twitter data, and asthma ED visits.

As stated earlier, we had three different datasets collected from Google trends. We examined the relationship between each dataset and asthma ED visits (Table V). The asthma ED visits were only correlated with one of the Google datasets; hence, Google data were not included in the final prediction model.

TABLE V
CORRELATION RESULTS BETWEEN GOOGLE DATA AND ASTHMA ED VISITS

# of ED visits           Google 12/10/2013   Google 02/21/2014   Google 09/20/2014
  Pearson Correlation    0.298**             0.049               0.049
  Sig.                   0.002               0.654               0.658
  N                      61                  85                  85
** Correlation is significant at the 0.01 level.

B. Prediction Results

Based on the results from the correlation analysis, asthma tweets, CO, NO2, and PM2.5 were selected as inputs into our prediction model. We are only reporting results for the Decision Tree and ANN techniques, as the Naive Bayes and SVM techniques did not yield good prediction results.

First, a backward feature selection algorithm was used to examine whether the addition of Twitter data would improve the prediction. As shown in Table VI, combining air quality data with tweets matched or exceeded the accuracy obtained from either data source alone for every method: the Decision Tree accuracy was unchanged relative to air quality data alone, while the three ANN-based configurations improved.

TABLE VI
BACKWARD FEATURE SELECTION TO DEMONSTRATE USEFULNESS OF ATTRIBUTES

                                   Attributes
Classification Methods             CO+NO2+PM2.5   Tweets    ALL
Decision Tree                      65.18%         63.93%    65.18%
ANN                                61.25%         62.68%    66.25%
ANN (Adaptive Boosting)            62.50%         62.14%    66.43%
Stacking (ANN + Decision Tree)     61.07%         64.86%    66.25%
Evaluation metric: Accuracy; Design: tenfold cross validation.

Additionally, we evaluated prediction precision. Given that our prediction task is a three-way classification, each technique resulted in different precision in different classes (see Table VII). Decision Tree performed well in predicting the "High" class, while ANN, after Adaptive Boosting, worked well for the "Low" class. Stacking the two techniques performed well for the "Medium" class.

TABLE VII
PREDICTION RESULTS

Classification Methods             Class     Class Precision
Decision Tree                      High      72.73%
ANN                                Low       71.43%
ANN (Adaptive Boosting)            Low       72.73%
Stacking (ANN + Decision Tree)     Medium    75.00%
Evaluation metric: Precision; Design: tenfold cross validation.

The results of our analysis are promising: the models achieve a fairly high level of accuracy overall. As noted in the Introduction, traditional asthma ED visit models are useful for predicting events in a three-month window and have an accuracy of approximately 70%. It is to be noted that traditional models estimate a risk score for an asthma ED visit for each individual patient, whereas our Twitter/Environmental data model predicts the risk of the daily number of ED visits being High, Low, or Medium. The former is an individual-level risk model, while the latter is a population-level risk model.
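To make the class labels and the per-class precision reported in Table VII concrete, the following minimal sketch labels daily ED visit counts from their z-scores and then scores a set of predictions. The cutoff of 0.5 is a hypothetical threshold, since the text does not state the cutoffs that were used. The counts and predicted labels are invented toy values. The sketch does not reproduce the classifiers or the tenfold cross validation.

```python
import math
from collections import Counter


def zscore(values):
    """Standardize a series as z = (x - mu) / sigma (population sigma assumed)."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    if sigma == 0:
        raise ValueError("z-score is undefined for a constant series")
    return [(x - mu) / sigma for x in values]


def label_level(z, cutoff=0.5):
    """Map a z-score to 'High', 'Medium', or 'Low'.

    The cutoff is a hypothetical choice made for illustration only.
    """
    if z > cutoff:
        return "High"
    if z < -cutoff:
        return "Low"
    return "Medium"


def class_precision(y_true, y_pred):
    """Per-class precision: correct predictions of a class / all predictions of it.

    Classes that are never predicted are omitted from the result.
    """
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / predicted[label] for label in predicted}


def accuracy(y_true, y_pred):
    """Share of days whose predicted class equals the actual class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical daily ED visit counts (not the study data).
ed_visits = [60, 85, 90, 120, 95, 70, 130, 88]
actual = [label_level(z) for z in zscore(ed_visits)]

# Hypothetical classifier output for the same eight days.
predicted = ["Low", "Medium", "High", "High", "Medium", "Medium", "High", "Low"]

precision = class_precision(actual, predicted)
for label in ("High", "Medium", "Low"):
    print(f"{label}: precision = {precision[label]:.2%}")
print(f"accuracy = {accuracy(actual, predicted):.2%}")
```

Precision is computed per predicted class, which is why a method can score well on one class (for example "High") while a different method is the better choice for another.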
Our population-level asthma risk prediction model has the potential to complement current individual-level models, and may lead to a shorter time window and better accuracy of prediction. This in turn has implications for better planning and proactive treatment in specific geo-locations at specific time periods.

V. DISCUSSION, IMPLICATIONS, AND LIMITATIONS

Although preliminary, the findings of this study are very promising for many reasons. As asthma prevalence continues to rise, novel and coordinated strategies are required at the public health and clinical levels to curb the societal burden of asthma adverse outcomes. Readily available, real-time or near-real-time, environmental and internet-based data offer a unique opportunity for early identification of clusters of patients or communities at high risk for severe asthma events at a given time. Interventions could be prioritized in time and place to reduce the risk for asthma ED visits. For instance, public health resources could be used to reach out to patients from high-risk clusters or communities at any given time, and direct them toward less costly and more efficient care sites such as their primary care provider offices. Moreover, predicted risks could be spatially and temporally visualized, and made available to community stakeholders through various media sources. Clinical resources could be prioritized to offer earlier clinic appointments to patients with impending risk for failure, and later slots to patients with deferred risk. Additionally, hospitals and EDs could use such risk stratification for optimal resource planning, such as ED staffing or opening observation units.

The limitations of our study include the following. First, this study is limited to English tweets, which might not accurately represent the Twitter activity of non-English speakers, although the latter are likely represented in our ED sample. However, we do not expect this to have a significant impact on the final results given that non-English tweets represent less than 5% of tweets in the USA, as discussed in the Methods section.
Second, this study was limited to ED visits data from one hospital only, which did not allow us to examine variations across different clinical care sites. A larger study is underway to validate these preliminary findings in a larger clinical sample spanning a wider geographical area over a longer timeframe. Third, the current model is designed for a noncommunicable disease (asthma) with a significant prevalence in the community. Although similar models could be developed for other noncommunicable diseases such as diabetes and chronic obstructive pulmonary disease (COPD), this model might not be suitable for communicable diseases or for diseases with low prevalence and high social media activity reflecting public awareness rather than actual disease activity.

VI. CONCLUSION AND FUTURE RESEARCH

In this study, we have provided preliminary evidence that social media and environmental data can be leveraged to accurately predict asthma ED visits at a population level.

We are in the process of confirming these preliminary findings by collecting larger clinical datasets across different seasons and multiple hospitals. Our continued work is focused on extending this research to propose a temporal prediction model that analyzes the trends in tweets and AQI changes, and estimates the time lag between these changes and the number of asthma ED visits. We are also collecting AQI data over a longer time period to examine the effects of seasonal variations. In addition, we would like to explore the effect of relevant data from other types of social media interactions, e.g., blogs and discussion forums, on our asthma visit prediction model. Additional studies are needed to examine how combining real-time or near-real-time social media and environmental data with more traditional data might affect the performance and timing of the current individual-level prediction models for asthma, and eventually, for other chronic conditions.
In future projects, we intend to extend our work to diseases with geographical and temporal variability, e.g., COPD and diabetes.

ACKNOWLEDGMENT

The authors would like to thank Dr. P. Jamieson and Ms. A. Decrescendo from PCCI for helping with the annotation and cleaning of asthma tweets.

Sudha Ram (M'85) received the Ph.D. degree from the University of Illinois at Urbana-Champaign, Champaign, IL, USA, in 1985.
She is the Anheuser-Busch Endowed Professor of management information systems, and entrepreneurship & innovation in the Eller College of Management, University of Arizona, Tucson, AZ, USA. She holds a joint faculty appointment as a Professor of computer science. She is the Director of the Advanced Database Research Group and a Codirector of INSITE: Center for Business Intelligence and Analytics (www.insiteua.org), University of Arizona. Her research interests include the areas of enterprise data management, business intelligence, large scale networks, and data analytics. Her work uses different methods such as machine learning, statistical approaches, ontologies, and conceptual modeling.
She has published articles in journals such as the Communications of the ACM, IEEE INTELLIGENT SYSTEMS, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, Information Systems, Information Systems Research, Management Science, and MIS Quarterly. Her research has been highlighted in several media outlets including NPR news.
Dr. Ram was a Speaker for a TED talk in December 2013 on "Creating a Smarter World with Big Data."

Wenli Zhang is currently working toward the Ph.D. degree in management information systems at the University of Arizona, Tucson, AZ, USA.
Her main research interests include big data in healthcare and the analysis and mining of social networks.

Max Williams received the B.Sc. degree in biomedical sciences from Texas A&M University, College Station, TX, USA, in 2013.
He is currently working as a Research Fellow at the Parkland Center for Clinical Innovation, Dallas, TX. He plans to pursue a career in public health.

Yolande Pengetnze received the M.D. degree in 1998 from the University of Yaounde, Yaounde, Cameroon. She completed Pediatric Residency training in 2008 at Maimonides Medical Center, New York, NY, USA, then General Pediatric/Health Services Research Fellowship training combined with a Masters of Sciences in Clinical Sciences in 2013 at the University of Texas Southwestern Medical Center (UTSW), Dallas, TX, USA.
She joined Parkland Center for Clinical Innovation (PCCI), Dallas, in December 2013, as a Physician Scientist and holds a Clinical Faculty position at UTSW. Her research interests include the use of advanced predictive analytics integrating traditional data sources and novel "Big data" sources to improve health outcomes at the individual and population level. She is currently leading multiple projects at PCCI, including population health quality improvement projects in pediatric asthma using both traditional and nontraditional data sources.