predicting the flu from instagram - arxiv.org · related instagram posts showed that majority of...

8
Predicting the Flu from Instagram Oguzhan Gencoglu, Miikka Ermes Abstract—Conventional surveillance systems for monitoring infectious diseases, such as influenza, face challenges due to short- age of skilled healthcare professionals, remoteness of communities and absence of communication infrastructures. Internet-based approaches for surveillance are appealing logistically as well as economically. Search engine queries and Twitter have been the primarily used data sources in such approaches. The aim of this study is to assess the predictive power of an alternative data source, Instagram. By using 317 weeks of publicly available data from Instagram, we trained several machine learning algorithms to both nowcast and forecast the number of official influenza- like illness incidents in Finland where population-wide official statistics about the weekly incidents are available. In addition to date and hashtag count features of online posts, we were able to utilize also the visual content of the posted images with the help of deep convolutional neural networks. Our best nowcasting model reached a mean absolute error of 11.33 incidents per week and a correlation coefficient of 0.963 on the test data. Forecasting models for predicting 1 week and 2 weeks ahead showed statistical significance as well by reaching correlation coefficients of 0.903 and 0.862, respectively. This study demonstrates how social media and in particular, digital photographs shared in them, can be a valuable source of information for the field of infodemiology. Index Terms—infectious diseases, deep neural networks, ma- chine learning, influenza, epidemiology I. I NTRODUCTION Infectious diseases concern public health. Some infectious diseases, even with small number of initial cases, can cause huge epidemics; therefore, continuous monitoring and early detection is important [1]–[3]. However, identifying, diagnos- ing and reporting infectious diseases is still a challenge for many countries [1]. Surveillance systems for communicable diseases aim to ensure that the diseases are monitored ef- ficiently and effectively [4]. During past 2 decades, expert groups have recommended that public health surveillance should be enhanced in order to provide early warnings of emerging infectious diseases, but the systems are still limited and fragmented with uneven global coverage [2], [5]. While studies have shown that both developed and developing coun- tries face challenges with surveillance systems [4], affordable and logistically appealing internet-based methods could be especially valuable in the developing countries. Seasonal influenza is one of the most common infec- tious diseases. It is an acute respiratory infection caused by influenza viruses, and it can spread easily from person to person [6]. Influenza continues to be a major cause of acute respiratory illnesses, increasing morbidity and mortality O. Gencoglu is with Faculty of Medicine and Health Tech- nology, Tampere University, Tampere, 33014, Finland e-mail: ([email protected]). M. Ermes is with Tieto Ltd., P.O.Box 449, 33101, Tampere, Finland throughout the world, increasing financial costs and conse- quentially causing significant economic burden [7]. While yearly epidemics affect all populations, hospitalization and death occur mainly among high-risk populations including individuals with chronic conditions, pregnant women, elderly, children aged 6-59 months and healthcare workers [6]. Esti- mates indicate that worldwide annual epidemics result in about 3 to 5 million cases of severe illnesses and about 290,000 to 650,000 deaths [6]. Influenza epidemics possess certain easily identifiable characteristics, which have allowed their identification throughout history. These characteristics include immense attack rates and explosive spread of the disease. Symptoms of influenza, such as cough and fever, are very characteristic as well [7]. Financial and productivity losses are due to increased levels of work and school absenteeism and sudden peak in demand for healthcare services [6]. Utilizing Internet data for estimating incidents of influenza and influenza-like illnesses (ILI) dates back to 2006, in which anonymous search engine queries were used for the task [8]. Since then, recent developments in Internet technologies have provided novel ways to detect and even predict the outbreak of epidemics. In particular, Internet search engine queries [9], Twitter [10], online blogs [11], electronic health records [12] or other sources such as Wikipedia article access logs [13], weather data [14], restaurant table reservations [14] etc. have been used to develop models to monitor and/or forecast official influenza cases. So far, the proposed methods employ tabular, time-series and textual data to build the predictive models. Recent advancements in the field of machine learning, i.e., deep learning [15], especially in computer vision and image understanding, enables one to build novel approaches that can utilize image data from social media for disease surveillance. One promising social media platform for this purpose is Instagram [16], [17]. Instagram is a photo and video-sharing social network with 500 million daily active users as of 2018 [18]. 35% of adults in United States state that they follow Instagram online or on their cell phone [19]. That percentage is around 71% for the age group 18-24 [19]. Publicly available Instagram data has been successfully used for identification of predictive markers for depression [20], potential drug interaction monitoring [21] and extracting nutritional and calorific information of food posts [22]. In this work, we propose a machine learning pipeline to nowcast (estimation of current week) as well as forecast official weekly ILI cases in Finland using publicly available data from Instagram. In addition to timestamp and textual features of the Instagram posts such as hashtags, we also utilize the visual content (i.e. images) in them with the help of deep neural network models. We train and compare arXiv:1811.10949v1 [cs.LG] 27 Nov 2018

Upload: lyhanh

Post on 11-Aug-2019

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Predicting the Flu from Instagram - arxiv.org · related Instagram posts showed that majority of such posts by youth depict alcohol in a positive social context [52]. Topic modeling

Predicting the Flu from InstagramOguzhan Gencoglu, Miikka Ermes

Abstract—Conventional surveillance systems for monitoringinfectious diseases, such as influenza, face challenges due to short-age of skilled healthcare professionals, remoteness of communitiesand absence of communication infrastructures. Internet-basedapproaches for surveillance are appealing logistically as well aseconomically. Search engine queries and Twitter have been theprimarily used data sources in such approaches. The aim of thisstudy is to assess the predictive power of an alternative datasource, Instagram. By using 317 weeks of publicly available datafrom Instagram, we trained several machine learning algorithmsto both nowcast and forecast the number of official influenza-like illness incidents in Finland where population-wide officialstatistics about the weekly incidents are available. In addition todate and hashtag count features of online posts, we were able toutilize also the visual content of the posted images with the help ofdeep convolutional neural networks. Our best nowcasting modelreached a mean absolute error of 11.33 incidents per week anda correlation coefficient of 0.963 on the test data. Forecastingmodels for predicting 1 week and 2 weeks ahead showedstatistical significance as well by reaching correlation coefficientsof 0.903 and 0.862, respectively. This study demonstrates howsocial media and in particular, digital photographs shared inthem, can be a valuable source of information for the field ofinfodemiology.

Index Terms—infectious diseases, deep neural networks, ma-chine learning, influenza, epidemiology

I. INTRODUCTION

Infectious diseases concern public health. Some infectiousdiseases, even with small number of initial cases, can causehuge epidemics; therefore, continuous monitoring and earlydetection is important [1]–[3]. However, identifying, diagnos-ing and reporting infectious diseases is still a challenge formany countries [1]. Surveillance systems for communicablediseases aim to ensure that the diseases are monitored ef-ficiently and effectively [4]. During past 2 decades, expertgroups have recommended that public health surveillanceshould be enhanced in order to provide early warnings ofemerging infectious diseases, but the systems are still limitedand fragmented with uneven global coverage [2], [5]. Whilestudies have shown that both developed and developing coun-tries face challenges with surveillance systems [4], affordableand logistically appealing internet-based methods could beespecially valuable in the developing countries.

Seasonal influenza is one of the most common infec-tious diseases. It is an acute respiratory infection causedby influenza viruses, and it can spread easily from personto person [6]. Influenza continues to be a major cause ofacute respiratory illnesses, increasing morbidity and mortality

O. Gencoglu is with Faculty of Medicine and Health Tech-nology, Tampere University, Tampere, 33014, Finland e-mail:([email protected]).

M. Ermes is with Tieto Ltd., P.O.Box 449, 33101, Tampere, Finland

throughout the world, increasing financial costs and conse-quentially causing significant economic burden [7]. Whileyearly epidemics affect all populations, hospitalization anddeath occur mainly among high-risk populations includingindividuals with chronic conditions, pregnant women, elderly,children aged 6-59 months and healthcare workers [6]. Esti-mates indicate that worldwide annual epidemics result in about3 to 5 million cases of severe illnesses and about 290,000to 650,000 deaths [6]. Influenza epidemics possess certaineasily identifiable characteristics, which have allowed theiridentification throughout history. These characteristics includeimmense attack rates and explosive spread of the disease.Symptoms of influenza, such as cough and fever, are verycharacteristic as well [7]. Financial and productivity losses aredue to increased levels of work and school absenteeism andsudden peak in demand for healthcare services [6].

Utilizing Internet data for estimating incidents of influenzaand influenza-like illnesses (ILI) dates back to 2006, in whichanonymous search engine queries were used for the task [8].Since then, recent developments in Internet technologies haveprovided novel ways to detect and even predict the outbreakof epidemics. In particular, Internet search engine queries [9],Twitter [10], online blogs [11], electronic health records [12]or other sources such as Wikipedia article access logs [13],weather data [14], restaurant table reservations [14] etc. havebeen used to develop models to monitor and/or forecast officialinfluenza cases. So far, the proposed methods employ tabular,time-series and textual data to build the predictive models.Recent advancements in the field of machine learning, i.e.,deep learning [15], especially in computer vision and imageunderstanding, enables one to build novel approaches that canutilize image data from social media for disease surveillance.One promising social media platform for this purpose isInstagram [16], [17].

Instagram is a photo and video-sharing social network with500 million daily active users as of 2018 [18]. 35% of adultsin United States state that they follow Instagram online or ontheir cell phone [19]. That percentage is around 71% for theage group 18-24 [19]. Publicly available Instagram data hasbeen successfully used for identification of predictive markersfor depression [20], potential drug interaction monitoring [21]and extracting nutritional and calorific information of foodposts [22]. In this work, we propose a machine learningpipeline to nowcast (estimation of current week) as well asforecast official weekly ILI cases in Finland using publiclyavailable data from Instagram. In addition to timestamp andtextual features of the Instagram posts such as hashtags, wealso utilize the visual content (i.e. images) in them with thehelp of deep neural network models. We train and compare

arX

iv:1

811.

1094

9v1

[cs

.LG

] 2

7 N

ov 2

018

Page 2: Predicting the Flu from Instagram - arxiv.org · related Instagram posts showed that majority of such posts by youth depict alcohol in a positive social context [52]. Topic modeling

the performance of 9 machine learning methods for the task.To the best of our knowledge, this is the first study to employimages in social media for forecasting the influenza epidemics.

II. RELATED WORK

There has been a significant number of studies regardingmonitoring and predicting ILI from the Internet data. Severalmachine learning and statistical estimation methods have beenproposed to nowcast and forecast official ILI cases usingvarious different data sources such as search engine queries,Twitter, online blogs, Wikipedia article access logs etc. Theproposed metrics and consequently the reported results forevaluating and comparing these approaches differ as well. Ex-tensive reviews of influenza forecasting studies include [23]–[25], the one by Alessa and Faezipour [25] being the mostrecent one.

When it comes to data sources to infer the count of officialILI cases from, Internet search engine queries and Twitter havebeen the most prevalent data sources. The main reason is thepossibility of implementing estimation models with high pre-dictive power due to high number of users creating substantialamount of data in these platforms. Search engine queries, e.g.,Google Trends, Baidu Index, Google Flu Trends, [8], [9], [12],[14], [26]–[34], Twitter [10]–[12], [14], [29], [35]–[42], onlineblogs [11], [29], electronic health records [12], [32], [43] orother sources such as Wikipedia articles access logs [11], [13],[44], [45], weather data [14], restaurant table reservations [14]etc. have been used as data sources for flu forecasting. Severalstudies combine more than one data source as well to enhancethe predictive power of their models [11], [14], [29], [32], [46],[47].

Proposed methods include conventional time-series fore-casting techniques such as auto-regressive moving aver-age (ARMA), auto-regressive integrated moving average(ARIMA) or variants of these [10], [12], [28], [31], [32],[47]. Different machine learning and statistical estimationtechniques has been used as well, such as linear regression [9],[26], [27], [38], [46], ridge regression [30], [46], least absoluteshrinkage and selection operator (LASSO) [29]–[31], [35],[45], support vector machine (SVM) regression [29], [34],[36], [40], k-nearest neighbor (kNN) regression [14] andneural networks [31], [39], [41].

Most frequent metrics to evaluate the performance of theproposed predictive models are Mean Absolute Error (MAE),Root Mean Squared Error (RMSE), Mean Absolute PercentError (MAPE) and Pearson’s correlation (extensive reviewin [24]). Due to different population sizes under examination,MAE and RMSE can not be used for an objective comparisonbetween studies and MAPE can not be used if the ground truthdata has zero values due to its mathematical definition. Overall,several studies report a Pearson’s correlation coefficient greaterthan 0.90 on their nowcasting predictions on the test (hold-out)data [9], [10], [12], [29], [32], [40], [41], [46].

Publicly available Instagram data has not been utilized forforecasting epidemics yet; however, data collected from Insta-gram has been used for several other health-related analysis.

In [20], Instagram data from 166 individuals has been analyzedfor identification of predictive markers for depression. Correiaet al. collected close to 7,000 Instagram user timelines tomonitor potential drug interaction [21]. Extracting nutritionaland calorific information of food posts has also been per-formed [22]. Similarly, 3 million food related posts shared onInstagram has been analyzed to depict dietary choices in fooddeserts - urban neighborhoods or rural towns characterized bypoor access to healthy and affordable food [48]. Cherian et al.identified hashtags and searchable text phrases associated withcodeine misuse by analyzing the content of 1,156 sequentialInstagram posts over the course of 2 weeks [49]. To exploreyoung women’s smoking behaviors Cortese et al. analyzedInstagram posts, revealing dangerous trends counteracting pub-lic health efforts such as normalization of tobacco use [50].Similarly, content analysis of marijuana-related posts on Insta-gram revealed potential influence on social norms surroundingmarijuana use [51]. Findings of content analysis of alcohol-related Instagram posts showed that majority of such posts byyouth depict alcohol in a positive social context [52]. Topicmodeling analysis of 96,426 Instagram posts has been carriedout for characterizing the health topics that are prominentlydiscussed on Instagram [53]. Shuai et al. compared LogisticRegression (LR), SVM, decision tree and their own tensormodel to detect users with a social network mental disorderon Instagram and Facebook [54]. LR and SVM models havealso been proposed for age group prediction of Instagramusers [55].

III. METHODS

A. Data Collection

1) Surveillance data: In Finland, the National InfectiousDiseases Register is run by the virology unit of the NationalInstitute for Health and Welfare (THL). There are specificguidelines and notification templates compiled by THL fordoctors, healthcare centers, hospital districts and laboratoriesfor mandatory reporting of about 70 diseases such as influenza.If the conditions for notification are met, healthcare profes-sionals are obliged to report infectious disease cases to theregister within seven days. The original notification is canceledor complemented if the notification is discovered false later on.The official statistics of the registry are publicly accessible andcan be used freely under the open data license [56].

For this study, weekly ILI incidents reported by publicprimary healthcare register in Finland between the dates 30April 2012 and 27 May 2018 (in total of 317 weeks) wereused. The data is publicly available and accessible [57].

2) Instagram data: We identified 7 keywords in Finnishlanguage to be searched from the hashtags of the Instagramposts, namely cough, fever, flu, influenza, muscle ache, sick,throat ache. These keywords correspond to the most commonsymptoms of ILI and we hypothesized that they would be oftenused in social media posts associated with ILI. We collectedpublicly available Instagram posts containing at least one ofthese hashtags between the dates 30 April 2012 and 27 May2018 (in total of 317 weeks). A Python crawler was employed

Page 3: Predicting the Flu from Instagram - arxiv.org · related Instagram posts showed that majority of such posts by youth depict alcohol in a positive social context [52]. Topic modeling

to collect only the URL to the image, main text and thetimestamp of the post for each post containing one of thehashtags. Other fields such as username, comments, location,number of likes etc. were not collected. Posts containingvideos were excluded as well. In order to preserve privacy,instead of storing the images, we stored only the URL pointersto the images that are hosted on Instagram. Reading of theimages and further processing such as feature extraction wasdone programmatically without storing the images even thoughall posts were publicly available.

The number of Instagram posts containing a given hashtagcan be examined from Table I. The rows of the table are notmutually exclusive, i.e., a post may contain several of thesehashtags at the same time. Consequently, number of posts inTable I sum up to 22,257 while the total number of uniqueposts is 20,994. On the average, posts contains 10.9 words(non-hashtag) and 7.2 hashtags. Table I also shows the averageword (non-hashtag) and hashtags counts for posts containinga given hashtag.

TABLE ILIST OF HASHTAGS FOR SEARCHING INSTAGRAM POSTS, NUMBER OF

POSTS, AVERAGE WORD AND HASHTAG COUNTS

Hashtag EnglishTranslation

Numberof Posts

AverageWordCount

AverageHashtagCount

#yska cough 661 10.9 8.1#kuume fever 3,863 9.6 6.9#flunssa flu 14,251 11.1 7.1#influenssa influenza 970 14.5 6.2#lihaskipu muscle ache 101 20.2 7.7#kipea sick 1,861 9.5 8.2#kurkkukipu throat ache 550 10.5 7.4

B. Feature Extraction

Each week corresponds to a single observation, i.e., a singledata point. We used the week number (between 1 and 52),month number (between 1 and 12) and the year, as numericfeatures. These 3 features will be referred as date features.

For each hashtag (in total 7), we counted the weekly numberof posts containing that particular hashtag, later referred to ascount features. Hashtags can be examined from Table I.

We selected 4 reference images that contributed to thedefinition of 4 image features. The images were collectedusing Google Images search engine and all of them arereleased under public domain enabling full permission forusage as is. The images were searched using terms ”boxes ofdrugs/medicine”, ”boxes of drugs/medicine and pills”, ”mint,ginger and lemons” and ”ginger and lemon tea”, respec-tively [58] and they are shown collectively in Figure 2.The search terms were defined by the authors’ notion thatimages containing items like these were typically found inInstagram posts related to ILI. In order to count the weeklynumber of images similar to reference images on Instagram,we employed a pretrained deep convolutional neural network(CNN) model, i.e., Inception-ResNet-v2 [59]. This CNN ar-chitecture combines the two powerful architectures that has

been proved to be successful for numerous computer visiontasks: Inception [60] and ResNet [61]. The model is 164layers deep and has been pretrained on the well knownImageNet dataset [62]. For each reference image, we obtainedthe vector representations out of the final layer before thefully-connected layers (an average pooling layer), V i

ref ∈ R,resulting in a vector of length 1536 for each image. Similarly,we extracted the vector representations of each image collectedfrom Instagram (from hashtag search), V j

post ∈ R. For eachreference image vector, we computed the cosine distance, dij(i = 1, ..., 4, j = 1, ..., 22257), to every other Instagram imagevector.

dij = 1−V iref · V

jpost∥∥∥V i

ref

∥∥∥∥∥∥V jpost

∥∥∥ (1)

Cosine similarity/distance has been frequently used in content-based image retrieval tasks [63], [64]. Smaller distance be-tween the two representation vectors means those images aremore likely to have similar visual content. For each referenceimage, we computed the mean, µi, and the standard deviation,σi, of the distances (resulting in 4 mean and standard deviationpairs, one for each reference image). If the cosine distancebetween an Instagram image and a reference image is less then2 unit standard deviations less from the mean, we incrementedthe weekly count corresponding to that reference image. Thismeans, that particular Instagram image is very likely to havesimilar content to our reference image. Essentially, this is asimilar image search scheme and image features are simplythe weekly counts of Instagram images similar to the selectedreference images.

In total, concatenation of all features (date, count, andimage) resulted in a feature vector of length 14. Technically,both count and image features are count-based features, i.e.,count of Instagram posts by hashtags and count of imagesby similarity to reference images. A visual depiction of theproposed feature extraction scheme can be examined fromFigure 1. Each feature is normalized to zero mean and unitystandard deviation (with respect to the training data) beforeregression modeling.

C. Modeling

We trained 9 different machine learning algorithms fornowcasting the official weekly ILI counts in Finland. Thesealgorithms include linear regression (also known as ordinaryleast squares), ridge regression, elastic net, LASSO, k-nearestneighbor regression, support vector machine, random forest,AdaBoost and XGBoost. For each algorithm, we used date andcount features for modeling. Furthermore, with XGBoost, weperformed an extensive evaluation of the effect of differentfeature group combinations. XGBoost is a re-implementationof gradient boosting methods with enhanced regularizationand speed functionality, gaining popularity since its announce-ment [65]. XGBoost can also return the feature importancesinherently which we reported. Additionally, we performedforecasting up to 3 weeks with the XGBoost model.

Page 4: Predicting the Flu from Instagram - arxiv.org · related Instagram posts showed that majority of such posts by youth depict alcohol in a positive social context [52]. Topic modeling

Username

Comments

Number of Likes

Publicly Available Instagram

Posts

- cough- fever- flu- influenza- muscle ache- sick- throat ache

Username

Username

Weekly Count

Username

Username

COUNT FEATURES Pretrained CNN

Cosine Distance

Thresholding

Weekly CountIMAGE FEATURES

Hashtag Search

Reference Images

Image Similarity Search

Keywords

1

2

4

3

DATE FEATURES

Fig. 1. Proposed feature extraction scheme utilizing the date, text and image content of publicly available Instagram posts. 1 - search Instagram postscontaining any of the selected keywords as hashtags, 2 - perform weekly counts of the collected posts for each keyword, 3 - extract the date features of thecorresponding weeks, 4 - perform weekly counts of the posted images similar to reference images

Fig. 2. Selected 4 reference images for counting occurrence of similar imagesin Instagram posts.

Implementation was done in Python (version 3.6) usingscikit-learn, xgboost and TensorFlow libraries [65]–[67] on a64-bit Ubuntu 16.04 workstation. All computations related toneural network models were performed on a single NVIDIATitan Xp GPU. The rest of the machine learning algorithms

were trained on a 20-core CPU in a parallel processing fashionto speed up the hyper-parameter search and cross-validation(CV).

D. Evaluation Scheme

We used weekly data from 30 April 2012 to 22 May2017 (265 weeks) as the training data, i.e., hyper-parameteroptimization and model comparison. In order to report the per-formance of the trained models, data from one year was usedas the test (hold-out) data, i.e., weekly data from 29 May 2017to 27 May 2018 (52 weeks). The evaluation metric was chosento be Mean Absolute Error when comparing the models. To beable to successfully assess a machine learning model with agiven set of hyper-parameters, we performed a 10-fold cross-validation on the training set. We ran our hyper-parametersearch for each algorithm, calculating the average (over 10folds) MAE. After best performing (achieving lowest MAE)hyper-parameters were set, we performed a final training of themodel on the whole training set and evaluated the algorithmson the test data. Predictions of negative incidence counts wereclipped to zero. We report MAE, coefficient of determination

Page 5: Predicting the Flu from Instagram - arxiv.org · related Instagram posts showed that majority of such posts by youth depict alcohol in a positive social context [52]. Topic modeling

TABLE IINOWCASTING RESULTS ACHIEVED BY DIFFERENT MACHINE LEARNING MODELS AND CORRESPONDING INPUT FEATURES

Model Feature Extraction Number ofFeatures

MAE(10-fold CV)

MAE(test)

R2

(test)Pearson’sCorrelation (test)

1 - Elastic Net date + count 10 27.10 31.83 0.609 0.8052 - XGBoost date 3 28.28 29.33 0.552 0.7633 - XGBoost count 7 23.55 26.64 0.710 0.9104 - SVM date + count 10 18.72 24.30 0.657 0.8265 - AdaBoost date + count 10 18.11 18.35 0.781 0.9156 - Random Forest date + count 10 17.04 17.61 0.861 0.9387 - LASSO date + count 10 21.94 16.20 0.877 0.9498 - Ridge Regression date + count 10 22.12 14.75 0.892 0.9429 - Linear Regression date + count 10 22.55 14.66 0.886 0.94310 - kNN Regression date + count 10 18.11 14.00 0.895 0.94811 - XGBoost date + count 10 15.67 13.83 0.897 0.95412 - Deep ConvNet + XGBoost date + count + image 14 13.14 11.33 0.925 0.963

(R2) and Pearson’s correlation coefficient achieved on the testset.

TABLE IIIPERFORMANCE OF THE FORECASTING MODELS ON THE TEST DATA

MAE R2 Pearson’s Correlationnowcast (current week) 11.33 0.925 0.9631-week forecast 17.84 0.814 0.9032-weeks forecast 23.38 0.687 0.8623-weeks forecast 35.54 0.205 0.685

IV. RESULTS

Prediction results for nowcasting are shown in Table II. Thehyper-parameters that resulted in the lowest MAE on the 10-fold CV of the training data are as follows: regularizationcoefficient, α, of 10.0 for ridge regression; regularizationcoefficient, α, of 10.0 and `1 ratio of 0.9, for elastic net;regularization coefficient, α, of 1.0 for LASSO; number ofneighbors of 6 and distance metric of euclidean for kNNregression; regularization coefficient, C, of 100.0 and kernelof radial basis function for SVM; number of estimators of300 and maximum number of features of 3 for random forest;number of estimators of 300, loss function of linear and learn-ing rate of 0.001 for AdaBoost; number of estimators of 300,booster of type tree, learning rate of 0.3 and `1 regularizationcoefficient of 10.0 for XGBoost. R2 scores and correlationcoefficients were found to be statistically significant (p<0.001)for every model.

The official ILI statistics and the predictions of the best-performing nowcasting model, i.e., model number 12 involvingXGBoost with date, count and image (extracted with a DeepConvNet) features, can be seen in Figure 3. As XGBoost isbased on gradient tree boosting, it can inherently calculatethe contribution of each feature to the decision. Such featureimportance analysis results can be seen in Figure 4. In addi-tion, performance of the XGBoost model (model 12) on testdata when predicting the future weeks, i.e, forecasting, can beexamined in Table III. R2 scores and correlation coefficientswere found to be statistically significant (p<0.001) for theforecasting models as well.

V. DISCUSSION

Overall, we show that Instagram can be considered as asignificant source of information for Internet-based monitoringand forecasting of influenza epidemics. Furthermore, we showthat the visual content of the posted images can also beutilized as input features with the help of a deep convolutionalneural network, increasing the prediction performance. Amean absolute error of 11.33 incidents per week and Pearson’scorrelation of 0.963 were achieved with XGBoost algorithmwhen several modalities of Instagram posts (date, count,image) have been used as an input for nowcasting the officialinfluenza-like illness counts in Finland (see Table II). Theachieved MAE corresponds to 5.3% of the average numberof incidents observed in the highest 3 weeks of the test data,i.e., week 8, week 9 and week 6 of 2018 with 226, 210 and206 incidents, respectively. For the forecasting model, accuratepredictions up to 2 weeks were shown to be feasible as wellwith a correlation of 0.862 as shown in Table III.

The proposed approach of image analysis can be utilizedalso for other social media platforms that contain images aspart of their media. For instance, performance of communi-cable disease prediction models trained on Twitter data canbe enhanced by incorporating content analysis of images inTwitter posts. Same idea can be extended to video content aswell. Our results show that adding image features improvesthe prediction performance by decreasing the test MAE from13.83 to 11.33. Even though the content of posted imageson Instagram varies a lot, informative patterns for a giventask can be filtered out with the proposed image similaritysearch. Our feature importance analysis show that the relativecontribution of image features to the predictions correspondsto 12.1% of all features (see Figure 4). Considering only 4reference images were used in this work to extract imagefeatures from, contribution of visual content of images to thepredictive models can further be enhanced with an extensiveset of reference images.

Robustness is a desired aspect of Internet-based diseasesurveillance systems. Different data sources can possess dif-ferent robustness risks in terms of being a foundation fordata-driven models. For instance, it was shown that in 2012-2013 flu season, Google Flu Trends failed to predict the

Page 6: Predicting the Flu from Instagram - arxiv.org · related Instagram posts showed that majority of such posts by youth depict alcohol in a positive social context [52]. Topic modeling

2013 2014 2015 2016 2017 2018

Date

0

50

100

150

200

250C

ount

Official ILI IncidentsPredictions (Training)Predictions (Test)

Fig. 3. Weekly predictions of the best-performing nowcasting XGBoost model (model 12) and the official statistics.

Date Features Count Features Image FeaturesType

0

10

20

30

40

50

Feat

ure

Impo

rtanc

e (%

)

38.7

49.2

12.1

Fig. 4. Relative feature importances of different feature modalities fornowcasting of ILI.

number of flu incidents by estimating more than double theactual numbers [68]–[70]. The cause was attributed to theheightened media attraction of the tool [68]. Difficulty ofreplicability, therefore transparency, is another drawback ofapproaches based on search engine queries due to proprietary

nature of the data. Social media platforms such as Twitter andInstagram possess robustness risks as well. Content analysisof Instagram images during the West African Ebola outbreakin 2014 showed that significant portion of the posted imageswere jokes or irrelevant to the outbreak itself [71].

Our study was conducted by searching keywords in Finnishlanguage and by using the influenza statistics of Finland as thereference. Focusing on a single language and country allowedus to fairly accurately identify the Instagram posts posted fromthis country while also obtaining accurate reference about theoccurrence of influenza in the same country. Finnish language,unlike English, is spoken predominantly by Finnish peopleand people residing in Finland. Employing our approach insituations where a connection between a single language andcountry cannot be drawn (e.g. English-speaking countries) willrequire further filtering of posts by location which may not beavailable for all posts. In addition, only the publicly availableInstagram data can be utilized for the proposed approach. Notethat, public availability is a user-defined setting that can bechanged anytime, rendering Instagram a dynamic source ofinformation. This limitation holds true for other social mediaplatforms such as Twitter or Facebook as well.

Future work will enable improvements by incorporatingother input features derived from Instagram posts such assentiment content of the post text and comments, location etc.The modeling performance can be enhanced with an extensiveimage similarity search as well as with ensembling severalmachine learning models together. Furthermore, normalizationof the weekly counts with respect to the total number ofweekly Instagram posts in the population under study (if

Page 7: Predicting the Flu from Instagram - arxiv.org · related Instagram posts showed that majority of such posts by youth depict alcohol in a positive social context [52]. Topic modeling

available) may enhance the predictive performance by takingthe popularity aspect of the platform into account. The ultimategoal is to implement robust and reliable predictive models thatcan use numerous data sources (Instagram, Twitter, Facebook,Wikipedia, search engines, blogs, weather data etc.) simultane-ously to perform timely and accurate prediction of epidemics.

VI. CONCLUSION

In summary, we show that Instagram is a valuable source ofinformation for predicting influenza outbreaks. In addition todate and hashtag counts in publicly available Instagram posts,with the help of recent advancements in machine learningresearch, i.e., deep neural networks, we also utilized the imagecontent of the posts in order to develop more accurate modelsfor predicting influenza-like illnesses incidents in Finland.Future work includes validating our approach in wider scopegeographically. We believe our work serves as an advancementin the field of infodemiology by highlighting a novel datasource and advanced machine learning methods for enhancedaccuracy and reliability of infectious disease surveillance.

ACKNOWLEDGMENT

We thank Heidi Simila, Marja Harjumaa and Mikko Virta-nen who provided expertise that greatly assisted the research.

REFERENCES

[1] WHO, “Global surveillance during an influenza pandemic: version 1(updated draft april 2009),” 2009.

[2] K. Fukuda, Pandemic influenza preparedness and response: a WHOguidance document. World Health Organization, 2009.

[3] WHO, “Who recommended surveillance standards,” 1999.[4] N. Sahal, R. Reintjes, and A. R. Aro, “Communicable diseases surveil-

lance lessons learned from developed and developing countries: Litera-ture review,” Scandinavian journal of public health, vol. 37, no. 2, pp.187–200, 2009.

[5] S. S. Morse, “Public health surveillance and infectious disease detec-tion,” Biosecurity and bioterrorism: biodefense strategy, practice, andscience, vol. 10, no. 1, pp. 6–16, 2012.

[6] W. Influenza, “Fact sheet no. 211,” World Health Organization, 2014.[7] J. J. Treanor, “Influenza (including avian influenza and swine influenza),”

in Mandell, Douglas, and Bennett’s Principles and Practice of InfectiousDiseases (Eighth Edition). Elsevier, 2015, pp. 2000–2024.

[8] G. Eysenbach, “Infodemiology: tracking flu-related searches on the webfor syndromic surveillance,” in AMIA Annual Symposium Proceedings,vol. 2006. American Medical Informatics Association, 2006, p. 244.

[9] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski,and L. Brilliant, “Detecting influenza epidemics using search enginequery data,” Nature, vol. 457, no. 7232, p. 1012, 2009.

[10] H. Achrekar, A. Gandhe, R. Lazarus, S.-H. Yu, and B. Liu, “Predictingflu trends using twitter data,” in Computer Communications Workshops(INFOCOM WKSHPS), 2011 IEEE Conference on. IEEE, 2011, pp.702–707.

[11] Z. Ertem, D. Raymond, and L. A. Meyers, “Optimal multi-sourceforecasting of seasonal influenza,” PLoS computational biology, vol. 14,no. 9, p. e1006236, 2018.

[12] F. S. Lu, S. Hou, K. Baltrusaitis, M. Shah, J. Leskovec, R. Sosic,J. Hawkins, J. Brownstein, G. Conidi, J. Gunn et al., “Accurate influenzamonitoring and forecasting using novel internet data streams: a casestudy in the boston metropolis,” JMIR public health and surveillance,vol. 4, no. 1, 2018.

[13] K. S. Hickmann, G. Fairchild, R. Priedhorsky, N. Generous, J. M.Hyman, A. Deshpande, and S. Y. Del Valle, “Forecasting the 2013–2014 influenza season using wikipedia,” PLoS computational biology,vol. 11, no. 5, p. e1004239, 2015.

[14] P. Chakraborty, P. Khadivi, B. Lewis, A. Mahendiran, J. Chen, P. Butler,E. O. Nsoesie, S. R. Mekaru, J. S. Brownstein, M. V. Marathe et al.,“Forecasting a moving target: Ensemble models for ili case countpredictions,” in Proceedings of the 2014 SIAM international conferenceon data mining. SIAM, 2014, pp. 262–270.

[15] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521,no. 7553, p. 436, 2015.

[16] “Instagram,” https://www.instagram.com/, accessed: 2018-11-13.[17] T. P. Gauthier and E. Spence, “Instagram and clinical infectious dis-

eases,” Clinical infectious diseases, vol. 61, no. 1, pp. 135–136, 2015.[18] C. Smith, “By the numbers: 250+ interesting instagram statistics,”

Retrieved May, 2016.[19] A. Smith and M. Anderson, “Social media use in 2018,” Pew Res Cent

Internet Sci Tech, 2018.[20] A. G. Reece and C. M. Danforth, “Instagram photos reveal predictive

markers of depression,” EPJ Data Science, vol. 6, no. 1, p. 15, 2017.[21] R. B. Correia, L. Li, and L. M. Rocha, “Monitoring potential drug inter-

actions and reactions via network analysis of instagram user timelines,”in Biocomputing 2016: Proceedings of the Pacific Symposium. WorldScientific, 2016, pp. 492–503.

[22] S. S. Sharma and M. De Choudhury, “Measuring and characterizingnutritional information of food and ingestion content in instagram,” inProceedings of the 24th International Conference on World Wide Web.ACM, 2015, pp. 115–116.

[23] E. O. Nsoesie, J. S. Brownstein, N. Ramakrishnan, and M. V. Marathe,“A systematic review of studies on forecasting the dynamics of influenzaoutbreaks,” Influenza and other respiratory viruses, vol. 8, no. 3, pp.309–316, 2014.

[24] J.-P. Chretien, D. George, J. Shaman, R. A. Chitale, and F. E. McKenzie,“Influenza forecasting in human populations: a scoping review,” PloSone, vol. 9, no. 4, p. e94130, 2014.

[25] A. Alessa and M. Faezipour, “A review of influenza detection andprediction through social networking sites,” Theoretical Biology andMedical Modelling, vol. 15, no. 1, p. 2, 2018.

[26] P. M. Polgreen, Y. Chen, D. M. Pennock, F. D. Nelson, and R. A.Weinstein, “Using internet searches for influenza surveillance,” Clinicalinfectious diseases, vol. 47, no. 11, pp. 1443–1448, 2008.

[27] Q. Yuan, E. O. Nsoesie, B. Lv, G. Peng, R. Chunara, and J. S.Brownstein, “Monitoring influenza epidemics in china with search queryfrom baidu,” PloS one, vol. 8, no. 5, p. e64323, 2013.

[28] A. Domnich, D. Panatto, A. Signori, P. L. Lai, R. Gasparini, andD. Amicizia, “Age-related differences in the accuracy of web query-based predictions of influenza-like illness,” PloS one, vol. 10, no. 5, p.e0127754, 2015.

[29] H. Woo, Y. Cho, E. Shim, J.-K. Lee, C.-G. Lee, and S. H. Kim,“Estimating influenza outbreaks using both search engine query data andsocial media data in south korea,” Journal of medical Internet research,vol. 18, no. 7, 2016.

[30] P. Guo, J. Zhang, L. Wang, S. Yang, G. Luo, C. Deng, Y. Wen, andQ. Zhang, “Monitoring seasonal influenza epidemics by using internetsearch data with an ensemble penalized regression model,” Scientificreports, vol. 7, p. 46469, 2017.

[31] Q. Xu, Y. R. Gel, L. L. R. Ramirez, K. Nezafati, Q. Zhang, and K.-L.Tsui, “Forecasting influenza in hong kong with google search queriesand statistical model fusion,” PloS one, vol. 12, no. 5, p. e0176690,2017.

[32] S. Yang, M. Santillana, J. S. Brownstein, J. Gray, S. Richardson, andS. Kou, “Using electronic health records and internet search informationfor accurate influenza forecasting,” BMC infectious diseases, vol. 17,no. 1, p. 332, 2017.

[33] Y. Zhang, H. Bambrick, K. Mengersen, S. Tong, and W. Hu, “Usinggoogle trends and ambient temperature to predict seasonal influenzaoutbreaks,” Environment international, vol. 117, pp. 284–291, 2018.

[34] F. Liang, P. Guan, W. Wu, and D. Huang, “Forecasting influenza epi-demics by integrating internet search queries and traditional surveillancedata with the support vector machine regression model in liaoning, from2011 to 2015,” PeerJ, vol. 6, p. e5134, 2018.

[35] V. Lampos, T. De Bie, and N. Cristianini, “Flu detector-tracking epi-demics on twitter,” in Joint European conference on machine learningand knowledge discovery in databases. Springer, 2010, pp. 599–602.

[36] E. Aramaki, S. Maskawa, and M. Morita, “Twitter catches the flu: detect-ing influenza epidemics using twitter,” in Proceedings of the conferenceon empirical methods in natural language processing. Association forComputational Linguistics, 2011, pp. 1568–1576.

Page 8: Predicting the Flu from Instagram - arxiv.org · related Instagram posts showed that majority of such posts by youth depict alcohol in a positive social context [52]. Topic modeling

[37] D. A. Broniatowski, M. J. Paul, and M. Dredze, “National and localinfluenza surveillance through twitter: an analysis of the 2012-2013influenza epidemic,” PloS one, vol. 8, no. 12, p. e83672, 2013.

[38] E.-K. Kim, J. H. Seok, J. S. Oh, H. W. Lee, and K. H. Kim, “Use ofhangeul twitter to track and predict human influenza infection,” PloSone, vol. 8, no. 7, p. e69305, 2013.

[39] S. Volkova, E. Ayton, K. Porterfield, and C. D. Corley, “Forecastinginfluenza-like illness dynamics for military populations using neuralnetworks and social media,” PloS one, vol. 12, no. 12, p. e0188941,2017.

[40] K. Zhang, R. Arablouei, and R. Jurdak, “Predicting prevalence ofinfluenza-like illness from geo-tagged tweets,” in Proceedings of the 26thInternational Conference on World Wide Web Companion. InternationalWorld Wide Web Conferences Steering Committee, 2017, pp. 1327–1334.

[41] K. Lee, A. Agrawal, and A. Choudhary, “Forecasting influenza levelsusing real-time social media streams,” in Healthcare Informatics (ICHI),2017 IEEE International Conference on. IEEE, 2017, pp. 409–414.

[42] Q. Zhang, N. Perra, D. Perrotta, M. Tizzoni, D. Paolotti, and A. Vespig-nani, “Forecasting seasonal influenza fusing digital indicators and amechanistic disease model,” in Proceedings of the 26th InternationalConference on World Wide Web. International World Wide WebConferences Steering Committee, 2017, pp. 311–319.

[43] B. Michiels, V. K. Nguyen, S. Coenen, P. Ryckebosch, N. Bossuyt,and N. Hens, “Influenza epidemic surveillance and prediction based onelectronic health record data from an out-of-hours general practitionercooperative: model development and validation on 2003–2015 data,”BMC infectious diseases, vol. 17, no. 1, p. 84, 2017.

[44] N. Generous, G. Fairchild, A. Deshpande, S. Y. Del Valle, and R. Pried-horsky, “Global disease monitoring and forecasting with wikipedia,”PLoS computational biology, vol. 10, no. 11, p. e1003892, 2014.

[45] D. J. McIver and J. S. Brownstein, “Wikipedia usage estimates preva-lence of influenza-like illness in the united states in near real-time,”PLoS computational biology, vol. 10, no. 4, p. e1003581, 2014.

[46] B. Bardak and M. Tan, “Prediction of influenza outbreaks by integratingwikipedia article access logs and google flu trend data,” in Bioin-formatics and Bioengineering (BIBE), 2015 IEEE 15th InternationalConference on. IEEE, 2015, pp. 1–6.

[47] D. A. Broniatowski, M. Dredze, M. J. Paul, and A. Dugas, “Using socialmedia to perform local influenza surveillance in an inner-city hospital: aretrospective observational study,” JMIR public health and surveillance,vol. 1, no. 1, 2015.

[48] M. De Choudhury, S. Sharma, and E. Kiciman, “Characterizing dietarychoices, nutrition, and language in food deserts via social media,”in Proceedings of the 19th acm conference on computer-supportedcooperative work & social computing. ACM, 2016, pp. 1157–1170.

[49] R. Cherian, M. Westbrook, D. Ramo, and U. Sarkar, “Representationsof codeine misuse on instagram: content analysis,” JMIR public healthand surveillance, vol. 4, no. 1, 2018.

[50] D. K. Cortese, G. Szczypka, S. Emery, S. Wang, E. Hair, and D. Vallone,“Smoking selfies: Using instagram to explore young womens smokingbehaviors,” Social Media+ Society, vol. 4, no. 3, p. 2056305118790762,2018.

[51] P. A. Cavazos-Rehg, M. J. Krauss, S. J. Sowles, and L. J. Bierut,“Marijuana-related posts on instagram,” Prevention Science, vol. 17,no. 6, pp. 710–720, 2016.

[52] H. Hendriks, B. Van den Putte, W. A. Gebhardt, and M. A. Moreno,“Social drinking on social media: Content analysis of the social aspectsof alcohol-related posts on facebook and instagram,” Journal of MedicalInternet Research, vol. 20, no. 6, p. e226, 2018.

[53] S. Muralidhara and M. J. Paul, “#healthy selfies: Exploration of healthtopics on instagram,” JMIR public health and surveillance, vol. 4, no. 2,2018.

[54] H.-H. Shuai, C.-Y. Shen, D.-N. Yang, Y.-F. Lan, W.-C. Lee, P. S. Yu,and M.-S. Chen, “Mining online social data for detecting social networkmental disorders,” in Proceedings of the 25th International Conferenceon World Wide Web. International World Wide Web ConferencesSteering Committee, 2016, pp. 275–285.

[55] K. Han, S. Lee, J. Y. Jang, Y. Jung, and D. Lee, “Teens are frommars, adults are from venus: analyzing and predicting age groups withbehavioral characteristics in instagram,” in Proceedings of the 8th ACMConference on Web Science. ACM, 2016, pp. 35–44.

[56] “Finland national institute for health and welfare (thl) - opendata license,” https://thl.fi/fi/web/thlfi-en/statistics/statistical-databases/open-data, accessed: 2018-11-13.

[57] “Finland national institute for health and welfare (thl) - official influenzastatistics,” https://thl.fi/ttr/gen/kuvaaja/ILI.csv, accessed: 2018-11-13.

[58] “Reference images,” https://bit.ly/2By8yrQ, https://bit.ly/2zr5IUh, https://bit.ly/2KDYdha, https://bit.ly/2zseJMF Accessed: 2018-11-26.

[59] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4,inception-resnet and the impact of residual connections on learning.” inAAAI, vol. 4, 2017, p. 12.

[60] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”in Proceedings of the IEEE conference on computer vision and patternrecognition, 2015, pp. 1–9.

[61] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” in Proceedings of the IEEE conference on computer visionand pattern recognition, 2016, pp. 770–778.

[62] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet:A large-scale hierarchical image database,” in Computer Vision andPattern Recognition, 2009. CVPR 2009. IEEE Conference on. Ieee,2009, pp. 248–255.

[63] Q. Bao and P. Guo, “Comparative studies on similarity measures forremote sensing image retrieval,” in Systems, Man and Cybernetics, 2004IEEE International Conference on, vol. 1. IEEE, 2004, pp. 1112–1116.

[64] A. Babenko and V. Lempitsky, “Aggregating local deep features forimage retrieval,” in Proceedings of the IEEE international conferenceon computer vision, 2015, pp. 1269–1277.

[65] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,”in Proceedings of the 22nd acm sigkdd international conference onknowledge discovery and data mining. ACM, 2016, pp. 785–794.

[66] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al.,“Scikit-learn: Machine learning in python,” Journal of machine learningresearch, vol. 12, no. Oct, pp. 2825–2830, 2011.

[67] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: a system for large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.

[68] P. Copeland, R. Romano, T. Zhang, G. Hecht, D. Zigmond, andC. Stefansen, “Google disease trends: an update,” Nature, vol. 457, pp.1012–1014, 2013.

[69] D. Butler, “When google got flu wrong,” Nature, vol. 494, no. 7436, p.155, 2013.

[70] D. Lazer, R. Kennedy, G. King, and A. Vespignani, “The parable ofgoogle flu: traps in big data analysis,” Science, vol. 343, no. 6176, pp.1203–1205, 2014.

[71] E. K. Seltzer, N. Jean, E. Kramer-Golinkoff, D. A. Asch, and R. Mer-chant, “The content of social media’s shared images about ebola: aretrospective study,” Public health, vol. 129, no. 9, pp. 1273–1277, 2015.