
Water Research 53 (2014) 26–34

Available online at www.sciencedirect.com

ScienceDirect

journal homepage: www.elsevier.com/locate/watres

Public health and pipe breaks in water distribution systems: Analysis with internet search volume as a proxy

Julie E. Shortridge*, Seth D. Guikema

Department of Geography & Environmental Engineering, Johns Hopkins University, USA

Article info

Article history:

Received 19 September 2013

Received in revised form 1 December 2013

Accepted 9 January 2014

Available online 21 January 2014

Keywords:

Distribution network

Pipe breaks

Gastrointestinal illness

Non-linear regression

* Corresponding author. Tel.: +1 2026796535. E-mail address: [email protected] (J.E. Shortridge).

0043-1354/$ – see front matter © 2014 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.watres.2014.01.013

Abstract

Drinking water distribution infrastructure has been identified as a factor in waterborne

disease outbreaks and improved understanding of the public health risks associated with

distribution system failures has been identified as a priority area for research. Pipe breaks

may pose a risk, as their occurrence and repair can result in low or negative pressure,

potentially allowing contamination of drinking water from adjacent soils. However,

measuring this phenomenon is challenging because the most likely health impact is mild

gastrointestinal (GI) illness, which is unlikely to result in a doctor or hospital visit. Here we

present a novel method that uses data mining techniques and internet search volume to

assess the relationship between pipe breaks and symptoms of GI illness in two U.S. cities.

Weekly search volume for the terms diarrhea and vomiting was used as the response

variable with the number of pipe breaks in each city as a covariate as well as additional

covariates to control for seasonal patterns, search volume persistence, and other sources of

GI illness. The fit and predictive accuracy of multiple regression and data mining tech-

niques were compared, with the best performance obtained using random forest and

bagged regression tree models. Pipe breaks were found to be an important and positively

correlated predictor of internet search volume in multiple models in both cities, supporting

previous investigations that indicated an increased risk of GI illness from distribution

system disturbances.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

While drinking water in developed countries is consistently

treated to be compliant with health guidelines, the aging

condition of drinking water distribution system infrastructure

presents a risk of contaminant intrusion and negative impacts

on public health. Breaks and leaks in distribution pipelines

can allow pathogens present in surrounding soil or water to


enter the distribution system during low or negative pressure

events. It is estimated that anywhere from 10 to 50% of

waterborne disease outbreaks associated with treated drink-

ing water are attributable to distribution system deficiencies

(CDC, 2006; CDC, 2008; CDC, 2011). It is reasonable to assume

that the outbreaks analyzed in CDC (2006, 2008, 2011) repre-

sent only a small percentage of the overall disease attributable

to drinking water, as they require that multiple cases of illness be reported and linked to drinking water exposure (CDC, 2011),

and few people seek medical care for mild to moderate

gastrointestinal (GI) illness (Wheeler et al., 1999). Messner

et al. (2006) estimate that community water systems are

responsible for 16.4 million cases of acute GI illness per year in

the United States. While most of these are mild to moderate

cases that do not require a doctor or hospital visit, they can

still result in high societal costs; for example, in 1988 it was

estimated that mild GI illness resulted in $19.5 billion in lost

productivity annually (Garthright et al., 1988). Because of

these issues, improved understanding of the incidence and

severity of health impacts from water distribution systems

has been identified as a high priority research area (USEPA and

Water Research Foundation, 2010).

Despite the need for additional research on the public

health impacts of distribution system deficiencies, a number

of issues make detection and measurement of these impacts a

challenge. Exposure to pathogens through water distribution

systems requires a complex chain of events to occur. While a

conceptual model to support microbial risk assessment for

low-pressure events is presented by Besner et al. (2011),

existing data to support such an analysis is limited and sub-

ject to numerous uncertainties and assumptions. While sus-

tained low-pressure events caused by main breaks and

maintenance activities are relatively easy to identify, short-term pressure transients caused by changes in pump operation, power outages, and sudden changes in demand are unlikely to be identified without high-speed pressure monitoring

that is generally not in use in existing systems (Friedman

et al., 2004). Population exposure depends on the quantity of

pathogens able to enter the distribution system, their trans-

port and dilution throughout the system, and the number of

users who eventually consume the contaminated water. In-

dividuals may be exposed to contaminated water at many

places other than their homes, such as offices, schools, or

restaurants, confounding efforts to monitor illness at fine

spatial scales. Furthermore, only a small percentage of GI

illness results in a doctor or hospital visit, making health

outcomes difficult to track.

Because of these challenges, much existing research on

this topic has either focused on the occurrence of pressure

transients and external contamination in water distribution

systems or survey-based monitoring and intervention trials

that aim to estimate the incidence of GI illness attributable to

treated drinking water. Sampling studies have indicated that

pathogenic microorganisms are frequently present in soil and

water adjacent to drinking water pipes (Karim et al., 2003),

while low- and negative-pressure transients have been

documented in multiple systems (LeChevallier et al., 2003;

Karim et al., 2003). Water sampling studies have indicated

that distribution systems can allow introduction of viruses

into non-disinfected systems (Lambertini et al., 2012). Evalu-

ations of whether low-pressure events lead to measurable

increases in GI illness have been mixed. Rates of self-reported GI illness were found to increase following distribution system

disruptions in a study conducted in Norway (Nygard et al.,

2007) and following self-reported losses in water pressure in

the UK (Hunter et al., 2005). However, Malm et al. (2013)

monitored calls to a health care hotline system in Sweden

and found no statistically significant change in call volume

related to GI illness following distribution system disruptions.

While not specifically focused on low-pressure events, decreased incidence of GI illness has also been observed in water systems with less pipe length per person (Nygard et al., 2004; Tinker et al., 2009) and amongst study participants who drank water that had been bottled at a treatment plant rather than untreated tap water (Payment et al., 1997), indicating that

increased rates of illness could be a result of distribution

system deficiencies more generally.

The intervention trials and survey-based monitoring

evaluations above can provide important insights into health

risks associated with the studied distribution systems. How-

ever, extrapolating these insights to distribution systems

more generally is difficult due to the tremendous variability in

water system characteristics. The likelihood of contamination

in a given system is likely to depend heavily on factors such as

water source, treatment procedures, and distribution system

condition and characteristics. For example, the distribution

system evaluated by Payment et al. (1991, 1997) was found to

be highly susceptible to negative pressure events

(LeChevallier et al., 2003). Furthermore, the chance that a

contamination event results in observable illness depends on

the population served by the system, as certain demographic

groups, such as children, the elderly, and immunocompro-

mised individuals, are more likely to become sick after a given

exposure. Differences in system characteristics could partly

explain the higher rates of illness attributed to drinking water

from those studies, as well as differences in research design.

While conducting a similar monitoring effort on a larger scale

could provide valuable insights into how risks differ amongst

different water systems, survey-based monitoring and inter-

vention trials tend to be very resource-intensive and practical

only over relatively small scales. Therefore, new methods are

needed to support broader studies that can evaluate wide-

scale risks, as well as relative risks in different types of

systems.

The use of internet search query data has the potential to

prove useful in this regard. It is estimated that 37–52% of

Americans search for health information on the internet

(Brownstein et al., 2009). Internet search volume has already

been shown to be strongly correlated with traditional disease

monitoring data in a number of cases. Search volume for

influenza-related search terms is capable of providing early

detection of influenza epidemics (Ginsberg et al., 2008;

Polgreen et al., 2008). This ability has also been demon-

strated for a number of GI illnesses, with strong correlations

between search volume and confirmed infections of rotavirus

(Desai et al., 2012), salmonella (Brownstein et al., 2009), and

gastroenteritis (Pelat et al., 2009). These results show that

internet search volume has the potential to be an easily and

rapidly accessible source of information regarding disease

incidence over large areas and long time scales where traditional monitoring may be infeasible. Surveillance data can

also easily be collected through time to support longitudinal

evaluations, and thus avoid the difficulties associated with

cross-sectional comparisons between or within water service

areas.

The objective of this paper is to assess whether a statistical

relationship exists between pipe breaks in municipal drinking

water distribution systems and GI illness at the metropolitan

scale as estimated by internet search volume. We use a novel


approach where weekly internet search volume for terms

related to GI illness is the response variable and we test the

ability of various parametric models and non-parametric

data-mining techniques to model the relationship between

search volume, pipe breaks and other environmental factors.

Results from two cities are compared to assess whether

observed relationships are consistent.

2. Materials and methods

Two cities, referred to in the following sections as City A and

City B, were used as study areas. Both cities are located in the

mid-Atlantic region of the United States and have metro area

populations ranging from 2 to 5 million. The water systems in

each city were established over 200 years ago, and like many

cities in the Eastern United States, many components of the

water systems are reaching or have surpassed their planned

lifespans. The two cities have temperate climates with

average summer highs approaching 90 °F and average winter lows of approximately 30 °F.

In each city, metro-area weekly internet search volume for

the term “diarrhea vomiting-dog” was obtained through the

Google Trends website and used as a response variable. This

term captures the volume of searches for the words diarrhea

or vomiting, while removing those that also contained the

word “dog,” as this was identified by Google as a common

related search. We obtained pipe break and leak data directly

from the two cities. For City A, we receive information directly

from their work order system; each time they investigate a

possible pipe problem or fix a pipe break, we receive a notifi-

cation directly from a senior engineer in the organization

responsible for the water system performance. This data

captures all pipe breaks, leaks, and repairs that the City A

system managers are aware of, meaning that we capture a

range of break sizes for City A, from small leaks to cata-

strophic failures of large pipes. For City B we received a

database of historic breaks from the 1970s through 2006

directly from the City’s water department. This database

included all known breaks and leaks for the most recent years

but was incomplete for the earlier years. As with City A, this

database contains information about all pipe breaks and pipe

leaks that the system managers were aware of. There is not a

clear distinction made between a pipe break and a pipe leak in either of the databases used. Our analysis thus includes information on both full breaks of pipes and small leaks. In reality, there is a continuum of problems, from a pipe being completely severed down to small, slow leaks. Of course, neither database contains information about undetected leaks or breaks.

Due to availability of pipe break data, weekly data from January 2011 to February 2013 (109 weeks of observations with a total of 1970 pipe breaks) was used in City A, and from January 2005 to March 2006 in City B (62 weeks of observations with a total of 908 breaks). The City B data was constrained by both the completeness of the record and the availability of Google search volume data for the earlier dates.

The temporal resolution of available search volume data depends on the volume of searches for the term of interest. For our search term, the maximum temporal resolution was weekly for both cities. The search data available from Google does not present a total number of searches, but instead measures the number of searches for the term of interest relative to the total number of searches in that week on a scale of 0–100. This is done to account for times with higher or lower internet activity generally. Other search volume terms, including diarrhea and vomiting as separate searches, and searches controlling for the term “pregnant” (a term that was commonly combined with diarrhea and vomiting) were also evaluated. These terms were found to be highly correlated with our final search term and did not lead to significantly different results. Covariates included:

- Season, average daily temperature, and average daily precipitation to control for seasonal and climatic variations in GI incidence. Climate data was taken from NOAA National Weather Service Monthly Weather Summaries.

- Counts of pipe breaks from the week of interest and the prior week (lagged pipe breaks) to account for disease incubation periods.

- Sewer overflow events (including both combined sewer overflows and sanitary sewer overflows) from the week of interest and the prior week to control for potential illness from sewer overflows (City A only due to data availability), obtained from the state environmental agency.

- Internet search volume from the prior week, to control for potential persistence in internet search volume.

Table 1 – Summary of response variable and covariate data. Search volume numbers refer to the relative number of searches in each week for the term of interest relative to the total number of searches in that week, not the actual number of searches.

| Variable | Description | City A Min | City A Mean | City A Max | City B Min | City B Mean | City B Max |
|---|---|---|---|---|---|---|---|
| Response variable | | | | | | | |
| Search volume | Weekly search volume for “diarrhea vomiting-dog” in metro area of interest | 42 | 65.4 | 95 | 40 | 66.7 | 99 |
| Covariates | | | | | | | |
| Lagged search | Search volume from previous week | 42 | 65.3 | 95 | 40 | 66.7 | 99 |
| Season | Categorical variable on season | NA | NA | NA | NA | NA | NA |
| Temperature (avg) | Average daily temperature for the week | 26.5 | 57.2 | 85.6 | 19.6 | 53.6 | 83.1 |
| Precipitation (avg) | Average daily precipitation for the week | 0 | 0.13 | 1.21 | 0 | 0.11 | 0.86 |
| Pipe breaks | Number of pipe breaks in each city per week | 3 | 18.1 | 79 | 3 | 14.6 | 56 |
| Pipe breaks lagged | Number of pipe breaks from previous week | 3 | 18.3 | 79 | 5 | 15.0 | 56 |
| SSO | Number of sewer overflow events in City A per week | 3 | 14.2 | 47 | NA | NA | NA |
| SSO lagged | Number of sewer overflow events from previous week | 1 | 14.1 | 47 | NA | NA | NA |

Table 2 – Mean absolute in-sample and out-of-sample errors for each model. Bold-italic values indicate statistically significant improvement over null model (Bonferroni-corrected p-value < 0.05).

| Model name | City A in-sample | City A out-of-sample | City B in-sample | City B out-of-sample |
|---|---|---|---|---|
| GLM | 7.59 | 7.83 | 8.96 | 10.99 |
| GAM | 7.33 | 8.21 | 8.61 | 12.03 |
| MARS | 7.58 | 8.97 | 8.58 | 11.55 |
| CART | 6.28 | 9.64 | 7.14 | 11.64 |
| Random forest | 3.81 | 7.87 | 4.51 | 10.25 |
| Bagged CART | 5.18 | 7.90 | 6.14 | 10.28 |
| Null model | 8.65 | 8.44 | 12.43 | 12.47 |

A summary of internet search volume and covariate data

in each city is provided in Table 1.
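The lagged covariates listed above (prior-week search volume, pipe breaks, and sewer overflows) can be built by pairing each week with the week before it. The following is a minimal sketch, not the authors' code; the function name `add_lagged`, the field names, and the three-week values are hypothetical and chosen only for illustration.

```python
def add_lagged(rows, keys):
    """Return rows from week 2 onward, each extended with the previous week's values.

    rows: list of dicts, one per week, in chronological order.
    keys: names of the fields to lag by one week.
    """
    lagged = []
    for prev, cur in zip(rows, rows[1:]):
        new_row = dict(cur)
        for key in keys:
            # Copy the prior week's value under a "<name>_lag" field.
            new_row[key + "_lag"] = prev[key]
        lagged.append(new_row)
    return lagged


# Hypothetical three-week example (values are made up for illustration).
weeks = [
    {"search": 60, "breaks": 12},
    {"search": 64, "breaks": 20},
    {"search": 71, "breaks": 9},
]
model_rows = add_lagged(weeks, ["search", "breaks"])
```

The first week is dropped because it has no predecessor, so a series of N weeks yields N - 1 usable observations.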

The relationship between environmental exposure, dis-

ease incidence, and internet search volume is complex and

has the potential to exhibit non-linearity and interactions

between covariates. To find models that could capture these

complexities, we compared the fit and predictive accuracy of

multiple regression and data mining techniques. Each model

was fit to the full data set to evaluate goodness of fit based on

mean absolute error between actual and modeled search

volume. Additionally, holdout cross validation was used to compare the models’ out-of-sample predictive accuracy using a 50-fold holdout analysis. In each iteration, 90% of observations were randomly selected and used to fit the models,

which were then used to predict search volume in the

remaining 10% of observations. Predictive accuracy was

measured using mean absolute error between actual and

predicted search volume in these held-out samples.
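The holdout procedure described above can be sketched as a loop over random 90/10 splits. This is an illustrative sketch under stated assumptions: it scores only a mean-only (null) predictor, the function names and the seed are hypothetical, and the synthetic series stands in for real search volume data.

```python
import random
from statistics import mean


def mean_abs_error(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def null_model(train_y, n_test):
    """Null model: predict the training mean for every held-out week."""
    return [mean(train_y)] * n_test


def repeated_holdout_mae(y, n_iter=50, test_frac=0.10, seed=0):
    """Return the out-of-sample MAE of the null model for each random split."""
    rng = random.Random(seed)
    n_test = max(1, round(test_frac * len(y)))
    errors = []
    for _ in range(n_iter):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        train_y = [y[i] for i in train_idx]
        test_y = [y[i] for i in test_idx]
        errors.append(mean_abs_error(test_y, null_model(train_y, n_test)))
    return errors


# Synthetic constant series: the null model should have zero error on it.
flat_errors = repeated_holdout_mae([50.0] * 20)
```

A covariate-based model would replace `null_model` with a fit-and-predict step on the training rows; the splitting and scoring logic would stay the same.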

Six models were tested to compare their in-sample and

out-of-sample accuracy in each city. Because these models

use different functional forms and mathematical algorithms

to fit and predict data, comparing multiple models is a way to

identify which methods can best capture relationships that

exist between the response variable and covariates. The six

models included:

1. A Poisson-transformed generalized linear model (GLM)

with variable removal based on Akaike information crite-

rion (AIC minimization) (Cameron and Trivedi, 1998).

2. A Poisson-transformed generalized additive model (GAM)

with individual cubic regression splines applied over each

covariate. GAMs use smoothing functions applied over

covariates, allowing them to capture non-linear relation-

ships between covariates and the response variable.

Smoothing functions are fit using penalized likelihood

maximization to prevent overfitting of the model, and can

be penalized to zero for covariates that don’t improve

model fit (Hastie and Tibshirani, 1990).

3. Multivariate adaptive regression splines (MARS): Data is

represented using a non-linear, multivariate function

estimated by multivariate spline basis functions fit to

recursively partitioned segments of the data (Friedman,

1991).

4. Classification and Regression Tree (CART): A single

regression tree was fit to the data and then pruned to the

optimal size using cross validation (Breiman et al., 1984).

5. Bagged CART (BC): 50 regression trees are each trained on a

separate bootstrapped subset of the data, and the final

model prediction is the average of each individual tree

prediction (Hastie et al., 2009).

6. Random Forest (RF): 500 regression trees are each trained

on a separate bootstrapped subset of the data, and corre-

lation between trees is reduced through random splitting

of nodes (Breiman, 2001).
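The bagging idea behind models 5 and 6 can be illustrated with a deliberately simplified sketch. It assumes a single covariate and uses one-split "stumps" in place of full regression trees, so it is not the CART or random forest implementation used in the study; all function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(x, y):
    """Fit a one-split regression tree on one covariate by minimizing squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((yi - left_mean) ** 2 for yi in left) + sum(
            (yi - right_mean) ** 2 for yi in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values identical: no split is possible, so predict the mean.
        overall = mean(y)
        return (float("inf"), overall, overall)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_bagged(x, y, n_trees=50, seed=0):
    """Train each stump on a bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    stumps = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([x[i] for i in sample], [y[i] for i in sample]))
    return stumps


def predict_bagged(stumps, xi):
    """The bagged prediction is the average of the individual predictions."""
    return mean(predict_stump(s, xi) for s in stumps)


# Toy step function: low response for x < 10, high response otherwise.
toy_x = list(range(20))
toy_y = [60] * 10 + [70] * 10
toy_model = fit_bagged(toy_x, toy_y)
low_pred = predict_bagged(toy_model, 0)
high_pred = predict_bagged(toy_model, 19)
```

A random forest adds one further step not shown here: at each split only a random subset of covariates is considered, which reduces the correlation between trees.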

A null model was also included for comparative purposes,

in which search volume was simply estimated as equal to

mean search volume in all observations used to fit the model.

For example, for the entire dataset, the null model would

predict a search volume of 65.4 in City A and 66.7 in City B (the

mean search volume in each city). For the holdout analysis,

the null model predicts search volume by calculating the

mean search volume for the 90% of weeks selected to train the

models. Models were also evaluated against a persistence

model (where search volume was assumed to equal search

volume from the previous week); however, because this model

resulted in higher errors than all other models, the null model

was used as a comparison to evaluate model performance.
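The difference between the two baselines can be made concrete with a small example. This is a sketch on a made-up series, with hypothetical function names; it only illustrates why a persistence model can score worse than a mean-only model when week-to-week noise is large, and says nothing about the study's actual data.

```python
from statistics import mean


def null_mae(y):
    """MAE when every week is predicted by the overall mean."""
    center = mean(y)
    return mean(abs(v - center) for v in y)


def persistence_mae(y):
    """MAE when each week is predicted by the previous week's value."""
    return mean(abs(cur - prev) for prev, cur in zip(y, y[1:]))


# Made-up series that alternates between two levels every week.
series = [60, 70, 60, 70, 60, 70]
null_error = null_mae(series)
persistence_error = persistence_mae(series)
```

On this alternating series the mean is always 5 units away from the observed value, while the previous week's value is always 10 units away.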

3. Results

Table 2 presents the mean absolute error for each model in each city. The random forest model resulted in the lowest in-sample error in both cities and the lowest out-of-sample error in City B; in City A, its out-of-sample error (7.87) was within 0.04 of the lowest value (7.83, from the GLM). It was able to achieve statistically significant reduction in error when compared to the null model based on Bonferroni-corrected Wilcoxon rank-sum tests. The bagged CART model also resulted in a statistically significant reduction in both in-sample

and out-of-sample error in both cities compared to the null

model. Fig. 1 shows time series of actual and predicted search

volume in both cities for the random forest and bagged CART

models. Predicted search volume for each week was estimated by fitting the models to the whole data set with the week in question removed, and then generating a prediction for that week. That is, each point in the time series is a holdout estimate for that week. These time series indicate that the models are capable of capturing relatively slow trends in search volume, but are unable to capture some week-to-week variability, as well as some extreme values. While the effects of a given pipe break would generally be realized within a 2–3 day

period, there can be a significant delay between occurrence of

a pipe break and detection and repair, particularly for smaller

pipe breaks. Because of this, the pipe break data available

would not be sufficient for modeling short-term trends and

health impacts on, for example, a daily basis. The ability of the

model to capture the longer-term, weekly trends suggests that

it is appropriate for gaining insights into the relationship

Fig. 1 – Actual and predicted search volume for the random forest and bagged CART models.

between pipe breaks and health impacts at these longer time

scales. However, the model is not appropriate for real-time

monitoring of pipe breaks.
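The Bonferroni-corrected rank-sum comparison against the null model, mentioned earlier in this section, could be sketched as follows. This is a simplified illustration rather than the authors' procedure: it uses the normal approximation to the Wilcoxon rank-sum statistic without a tie correction in the variance, and the number of comparisons passed to `bonferroni` is a hypothetical value.

```python
import math


def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_p(a, b):
    """Two-sided p-value from the normal approximation (no tie correction)."""
    n1, n2 = len(a), len(b)
    ranks = midranks(list(a) + list(b))
    w = sum(ranks[:n1])  # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))


def bonferroni(p, n_tests):
    """Bonferroni correction: scale the p-value by the number of tests, capped at 1."""
    return min(1.0, p * n_tests)


# Made-up holdout errors for a candidate model and for the null model.
model_errors = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
null_errors = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
p_raw = rank_sum_p(model_errors, null_errors)
p_adj = bonferroni(p_raw, n_tests=6)  # 6 is an assumed number of comparisons
```

With heavily tied or very small samples an exact test or a tie-corrected variance would be more appropriate than this approximation.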

Because they significantly outperformed the null model in

terms of fit and predictive accuracy in both cities, the random

forest and bagged CART models were used to evaluate co-

variate influence in each city. Because typical measures of

influence that are used in parametric models (such as regression

coefficients and p-values) do not exist for tree-based models,

partial dependence plots were developed to assess covariate

importance and influence in each model.

Partial dependence plots measure the marginal influence

that changing each covariate of interest, while keeping all

other covariates equal, has on model predictions. A relatively

flat partial dependence plot indicates that the covariate of

interest has little influence on the model’s predictions, while a

large change in response variable values indicates that the

covariate has a large degree of influence in model predictions.

This variation was measured for each model by estimating the relative “swing” attributable to each covariate n, which consisted of the range of partial dependence values associated with the covariate of interest, divided by the total swing over all n covariates in that model (Equation (1)). A relative swing of 0 would indicate that the model did not use that covariate in its predictions at all, while a relative swing of 1 would indicate that the model relied entirely on one covariate.

$$\text{Relative Swing}_n = \frac{\max(PD_n) - \min(PD_n)}{\sum_{n} \text{Swing}_n} \qquad (1)$$
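The relative swing calculation can be sketched in a few lines. This is an illustration under stated assumptions: `partial_dependence` uses the standard approach of forcing one covariate to each grid value and averaging predictions over all observations, the toy linear predictor stands in for a fitted tree ensemble, and all names are hypothetical.

```python
def partial_dependence(predict, rows, col, grid):
    """Average prediction when column `col` is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[col] = value  # all other covariates keep their observed values
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def relative_swings(curves):
    """curves: dict mapping covariate name -> list of partial dependence values."""
    swings = {name: max(pd) - min(pd) for name, pd in curves.items()}
    total = sum(swings.values())  # assumes at least one covariate has nonzero swing
    return {name: swing / total for name, swing in swings.items()}


# Toy predictor standing in for a fitted model: response = 3*x0 + 1*x1.
def toy_predict(row):
    return 3 * row[0] + 1 * row[1]


toy_rows = [[0, 0], [1, 2], [2, 4]]
toy_curves = {
    "x0": partial_dependence(toy_predict, toy_rows, 0, [0, 1, 2]),
    "x1": partial_dependence(toy_predict, toy_rows, 1, [0, 2, 4]),
}
toy_rel = relative_swings(toy_curves)
```

In this toy case x0 has a swing of 6 and x1 a swing of 4, so their relative swings are 0.6 and 0.4, and the values across all covariates sum to 1 by construction.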

Table 3 shows the relative swing associated with each co-

variate compared to all other covariates, with a higher swing

associated with greater influence over model predictions. In

City A, the most influential covariate in both models is the

season (which is responsible for 29% of bagged CART vari-

ability and 18% of random forest variability), while the most

influential covariate in City B is lagged search volume

(responsible for 42% of bagged CART variability and 32% of

random forest variability). The influence of lagged pipe breaks

ranges from 4% in the City B bagged CART model to 17% in the

City A random forest model. Covariate influence varies

somewhat between the two cities, although a large degree of

agreement in the variable importance estimated from each of

Table 3 – Relative swing in partial dependence plots for the random forest and bagged CART models in each city. A greater value of relative swing indicates that the model relies more heavily on that covariate, while a low value indicates that a covariate is not very important in generating model predictions. A value of 0 would indicate that the covariate is not used in the model at all.

| City | Bagged CART covariate | Relative swing | Random forest covariate | Relative swing |
|---|---|---|---|---|
| City A | Season | 0.29 | Season | 0.18 |
| City A | SSO | 0.17 | BreaksLag | 0.17 |
| City A | BreaksLag | 0.14 | SSO | 0.17 |
| City A | SearchLag | 0.14 | SearchLag | 0.14 |
| City A | Temp | 0.08 | Temp | 0.12 |
| City A | SSOlag | 0.07 | Prec | 0.10 |
| City A | Breaks | 0.07 | Breaks | 0.07 |
| City A | Prec | 0.05 | SSOlag | 0.06 |
| City B | SearchLag | 0.42 | SearchLag | 0.32 |
| City B | Temp | 0.24 | Temp | 0.20 |
| City B | Season | 0.15 | Season | 0.19 |
| City B | Prec | 0.13 | BreakLag | 0.11 |
| City B | BreakLag | 0.04 | Breaks | 0.10 |
| City B | Breaks | 0.02 | Prec | 0.08 |

Fig. 2 – Normalized partial dependence plots for City A. Internal tick marks show 10% sample quantiles for each covariate. Plots show the marginal influence of changing the covariate of interest while all other covariates are held constant.

the models is observed in City A (Kendall’s W equal to 0.94)

and a moderate degree of agreement is observed in City B

(Kendall’s W equal to 0.57). Seasonal variables (month and

temperature) and search lag show a high degree of influence

overall. Environmental indicators (SSO and lagged breaks)

account for a large portion of variance in City A, but appear

less influential in City B.
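Kendall's W, the agreement measure quoted above, can be computed from the rank each model assigns to each covariate. The sketch below uses the textbook formula without a tie correction, so it would not by itself reproduce the reported values where covariates share the same relative swing; the function name and the example rankings are hypothetical.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for untied rankings.

    rankings: list of m lists, each assigning ranks 1..n to the same n items.
    """
    m = len(rankings)
    n = len(rankings[0])
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Two hypothetical models ranking four covariates by importance.
full_agreement = kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4]])
full_disagreement = kendalls_w([[1, 2, 3, 4], [4, 3, 2, 1]])
```

W ranges from 0 (no agreement between the rankings) to 1 (identical rankings), which is why 0.94 reads as strong agreement and 0.57 as moderate agreement.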

The partial dependence plots shown in Figs. 2 and 3 also

show some consistent patterns across models and cities. In all

instances, there is a clear positive relationship between search

volume and lagged search volume, indicating that there may be some persistence in search volume that lasts longer than the one-week periods evaluated in this study. In both cities, there is

a positive relationship between lagged pipe breaks and search

volume in both models. In City A, an increase from 0 to 20 pipe breaks results in an approximately 6% increase in search volume in the bagged CART model, and an 8% increase in the random forest model. However, increasing the

number of pipe breaks beyond 30 has no additional impact on

search volume. In City B, an increase from 0 to 55 breaks results

in an approximately 6% increase in search volume in the

random forest model, but only a 2% increase in the bagged

CART model. For current week pipe breaks, a negative rela-

tionship is evident in City A, while no clear relationship is

observed in City B. This is as expected, assuming a required

incubation period between exposure and effect. Both cities also

exhibit evidence of a seasonal pattern, with higher search

volume in the summer relative to the winter and in weeks with high temperatures. In City A, a positive relationship also exists between sewer overflows for the current week and search volume, but no such relationship is observed for lagged sewer overflows.

4. Discussion

All models tested were able to provide an improved fit

compared to the null model, but only the bagged CART and

random forest models were able to provide statistically sig-

nificant improvements in both fit and predictive accuracy in

both cities. Time series plots of predicted and actual search

volume in the highest performing models indicate that the

models were capable of capturing trends at a monthly time

scale, but not shorter term variability and particularly

extreme values. This unexplained variance is not surprising

considering (1) the low number of covariates explored here

when compared to the numerous factors that could contribute

to illness rates and online activity related to disease symp-

toms and (2) the possible lags between a break or leak and

detection of that event. It could also be the result of minor

Fig. 3 – Normalized partial dependence plots for City B. Internal tick marks show 10% sample quantiles for each covariate.

discrepancies in the geographic regions covered by the

different data sets; in particular, the metro areas represented

in the Google search data include some suburbs not included

in the pipe break service areas. However, population density is

highest in the areas covered by the pipe break data, which is

also where many people commute for work during the day.

Therefore, the disparity in populations covered by the two

data sets is expected to be minimal, and in any case would

result in underreporting of correlation between pipe breaks

and search volume and decreased model performance, rather

than the opposite. While this unexplained variance would

make the models unsuitable for activities requiring refined

predictions (such as real-time monitoring), the statistically

significant improvements demonstrated by the bagged CART

and random forest models make them suitable for generating

insights on covariate importance and influence.
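
A comparison against a null model can be sketched as follows, assuming the null model simply predicts the mean of the training data and that predictive accuracy is measured as mean absolute error on held-out weeks. Both assumptions, and all values below, are illustrative; the sketch does not test statistical significance.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Mean absolute difference between observed and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical weekly search-volume index values (not data from the study).
train = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0]
holdout = [104.0, 97.0, 101.0]

# Null model: predict the training mean for every held-out week.
null_pred = [mean(train)] * len(holdout)

# Hypothetical predictions standing in for a fitted model's output.
model_pred = [102.5, 98.5, 100.5]

null_mae = mean_absolute_error(holdout, null_pred)
model_mae = mean_absolute_error(holdout, model_pred)

# Fractional reduction in held-out error relative to the null model.
improvement = 1.0 - model_mae / null_mae
print(null_mae, model_mae, improvement)
```

A model that tracks monthly trends but misses extreme weeks can still reduce held-out error relative to the null model, which is the sense in which such models remain useful for studying covariate influence.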

In terms of covariate influence, the models in City B appear

to be largely informed by search volume persistence and

seasonal characteristics, with lagged search volume, season,

and temperature accounting for a total of 71–79% of the

variation in partial dependence values for these models. In

City A, the models were informed by a larger number of

covariates, with the top three covariates in each model

accounting for a total of 52–60% of partial dependence swing.

Partial dependence variability was largely determined by

season and environmental factors (lagged pipe breaks and

SSOs) in City A, with a moderate degree of variability attrib-

uted to lagged search volume. The direction of influence

showed consistent results between cities and models for

lagged search volume, temperature, season, and lagged pipe

breaks. Counts of lagged pipe breaks show a consistently

positive relationship with search volume in each city and

evaluated model. While this result cannot provide any proof of

a causal relationship between the two, the presence of a

similar relationship across two cities and multiple models is

consistent with previous studies that identified water distri-

bution system inadequacies as a contributor to GI illness.
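
The share of partial dependence swing attributed to a group of covariates can be sketched as below. This assumes that swing is the range (maximum minus minimum) of each covariate's partial dependence curve and that shares are normalized by the sum of swings over all covariates; the definition used in the analysis may differ. The curves and covariate names are hypothetical.

```python
def swing(curve):
    """Range of a partial dependence curve (assumed definition of swing)."""
    return max(curve) - min(curve)


# Hypothetical partial dependence curves, one per covariate.
pd_curves = {
    "lag_search": [98.0, 100.0, 104.0],
    "season": [99.0, 100.0, 102.0],
    "temperature": [99.5, 100.0, 101.5],
    "lag_breaks": [99.5, 100.0, 100.5],
    "current_breaks": [100.0, 100.0, 101.0],
}

swings = {name: swing(curve) for name, curve in pd_curves.items()}
total_swing = sum(swings.values())

# Fraction of total swing attributable to each covariate.
shares = {name: s / total_swing for name, s in swings.items()}

# Covariates ranked by share, and the combined share of the top three.
ranked = sorted(shares, key=shares.get, reverse=True)
top3 = ranked[:3]
top3_share = sum(shares[name] for name in top3)
print(top3, round(100 * top3_share, 1))
```

A high combined share for a few covariates, as in City B, means the fitted response varies mostly along those dimensions; a lower share, as in City A, means the variation is spread over more covariates.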

The use of internet search volume as a proxy for sub-

clinical GI illness and counts of all pipe breaks in a water

system are both novel methods that present an interesting

comparison to the existing literature on this topic. One

important distinction between this work and that of Malm

et al. (2013) and Nygard et al. (2007) is the magnitude of the

events analyzed. Those analyses evaluated the results of large

disturbances that affected hundreds to thousands of people,

whereas our evaluation makes no distinction based on pipe

size or number of affected customers. It is reasonable to as-

sume that the majority of pipe breaks in our dataset, as in

most cities, are rather small and did not result in total pres-

sure loss. Therefore, our results point more towards a health

impact associated with small scale disturbances that occur

relatively frequently in water systems, rather than large dis-

ruptions that may affect many customers but only occur

rarely. Additionally, it means that our results may be more

comparable to the work of Malm et al. (2013), as their dataset

includes breaks that resulted in varying degrees of pressure loss,

rather than only focusing on disturbances that resulted in

total pressure loss as in Nygard et al. (2007). It is also worth

noting that Sweden experiences an estimated 5000 pipe re-

pairs per year, which includes pipe breaks and leakage repairs

(Malm et al., 2013), compared to 240,000 main breaks alone in

the US (ASCE, 2013). This results in a per-capita repair rate of

0.0005 in Sweden, compared to 0.0008 for breaks alone in the

US. While it is impossible to say whether this is the case for

the specific cities evaluated, it seems reasonable to assume

that systems more prone to pipe failure would result in greater

health impacts.
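
The per-capita rates quoted above can be checked with simple arithmetic. The repair and break counts come from the text; the population figures are assumed approximate 2013 values that are not stated in the text.

```python
# Annual counts quoted in the text.
sweden_repairs = 5000        # pipe repairs per year (Malm et al., 2013)
us_main_breaks = 240_000     # main breaks per year (ASCE, 2013)

# Assumed approximate 2013 populations (not given in the text).
sweden_population = 9.6e6
us_population = 316e6

# Events per person per year.
sweden_rate = sweden_repairs / sweden_population
us_rate = us_main_breaks / us_population

print(round(sweden_rate, 4), round(us_rate, 4))
```

Under these assumed populations, the rounded values agree with the 0.0005 and 0.0008 rates given in the text.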

Aside from Malm et al. (2013), the majority of studies on

drinking water distribution systems and GI illness have relied

on self-reporting during intervention trials or interviews with

researchers following system disruptions, as in Nygard et al.

(2007). One primary difference between these studies and

ours is the spatial and temporal scale of evaluation. While

Nygard et al. (2004) evaluate Campylobacter infections across

Sweden in a cross sectional comparison between municipal-

ities, they do not evaluate any temporal changes in infection

rates. Similarly, intervention trials such as those presented by

Payment et al. (1997) and post-disturbance monitoring

(Nygard et al., 2007) compare health outcomes in a limited

number of households in a small geographic area. Scaling

these evaluations up to do wide-scale or long-term monitoring

would be very resource intensive, whereas internet search

data to support a longitudinal metropolitan-level evaluation is

freely available. Monitoring illness at the city-level also re-

duces the confounding issue of mobility, where participants

may be exposed to an illness in a place other than their resi-

dence. Therefore, internet search data has the potential to be

a tool in instances where traditional monitoring is infeasible,

allowing for evaluations to proceed over greater temporal and

spatial scales.

Nevertheless, there are some important limitations that

should be considered when using internet search data to es-

timate public health impacts. Naturally, it is unlikely that a

proxy measure such as internet search volume could provide

as accurate an estimate of illness rates as direct questioning.

Therefore, quantitative estimates of disease risk based on

internet search data would require additional analysis

relating search volume to monitoring data collected via

traditional means. Furthermore, direct questioning has the

advantage of allowing researchers to control for other expo-

sure factors that cannot be monitored by proxy (such as

contact with sensitive subpopulations or international travel).

Internet data might also be non-representative of the population

as a whole, particularly with regard to age and computer

access and literacy, and may be subject to bias when the

disease of interest is featured on the news (Lee, 2010). Because

of these issues, it is important to consider internet surveil-

lance data as a potential tool to be used in tandem with

traditional survey methods.

Despite these limitations, there are a number of possibil-

ities where these methods could support further research.

One simplification in our models was that all pipe breaks were

treated equally, regardless of the size of the pipe, the occur-

rence and magnitude of pressure transients, duration of the

leak and repair, and number of customers affected. This was a

necessary simplification in our work because the pipe break

data we used lacked information on the size of pipe or popu-

lation affected. More explicit modeling of these factors could

be useful in developing more accurate models and under-

standing which types of breaks are most likely to result in

health impacts. Our evaluation could also be scaled up to

include additional cities and water systems where break in-

formation was available. This would not only provide more

statistical power to support evidence of a relationship be-

tween pipe breaks and public health outcomes, but could also

allow insights into the systems and conditions where pipe

breaks have the greatest impact on illness. In addition, if

internet search volume data were made available at a more

geographically detailed scale such as the zip code level, this

could substantially enhance the ability to examine the rela-

tionship between internet search volume and pipe breaks,

which are often located at the scale of individual street seg-

ments in utility-provided data.

5. Conclusion

Disruptions and inadequacies in drinking water distribution

systems are recognized as an issue with potential public

health impacts and an important area for research. However,

measuring the health impacts associated with distribution

system disturbances, such as pipe failures, presents a number

of challenges. The physical mechanisms by which a pipe

failure could lead to pathogen exposure (low or negative

pressure events, contaminant intrusion, and transport to

water users) are difficult to monitor and model. Furthermore,

gastrointestinal health outcomes are likely to be sub-clinical,

meaning traditional monitoring of doctor and hospital visits

will only capture a small percentage of cases. We present a

novel method that compares weekly internet search volume

for symptoms of GI illness with pipe break counts, while

controlling for seasonal patterns, climatic fluctuations, and

other possible environmental factors. We observed a positive

relationship between search volume and counts of pipe

breaks from the previous week in both cities using multiple

models. These results support previous investigations indi-

cating that drinking water distribution system disruptions

contributed to higher rates of GI illness, and point towards the

potential importance of frequent, relatively small pipe breaks.

Our results also indicate that internet search data is very

promising in that it can be easily scaled up to conduct long-

term or wide-scale evaluations that are infeasible using

traditional monitoring.

Acknowledgments

This work was partially funded by NSF grants 1031046 (CMMI)

and 1069213 (IGERT). This support is gratefully acknowledged.

We also acknowledge and thank the two utilities that pro-

vided the pipe break data used in this research. All opinions

are those of the authors and do not necessarily reflect the

positions of the NSF or the participating utilities.

References

American Society of Civil Engineers (ASCE), 2013. 2013 Report Card for America’s Infrastructure. Retrieved August 9, 2013, from http://www.infrastructurereportcard.org/.

Besner, M., Prevost, M., Regli, S., 2011. Assessing the public health risk of microbial intrusion events in distribution systems: conceptual model, available data, and challenges. Water Res. 45 (3), 961–979.

Breiman, L., 2001. Random forests. Mach. Learn. 45 (1), 5–32.

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth & Brooks, Monterey, CA.

Brownstein, J.S., Freifeld, C.C., Madoff, L.C., 2009. Digital disease detection - harnessing the web for public health surveillance. N. Engl. J. Med. 360 (21), 2153–2157.

Cameron, A.C., Trivedi, P., 1998. Regression Analysis of Count Data. Cambridge University Press.

Centers for Disease Control and Prevention (CDC), 2006. Surveillance for Waterborne Disease and Outbreaks Associated with Drinking Water and Water Not Intended for Drinking - United States, 2003–2004. In: Surveillance Summaries, December 22, 2006, vol. 55. MMWR (No. SS-12).

CDC, 2008. Surveillance for Waterborne Disease and Outbreaks Associated with Drinking Water and Water Not Intended for Drinking - United States, 2005–2006. In: Surveillance Summaries, September 12, 2008, vol. 57. MMWR (No. SS-9).

CDC, 2011. Surveillance for Waterborne Disease and Outbreaks Associated with Drinking Water and Water Not Intended for Drinking - United States, 2007–2008. In: Surveillance Summaries, September 23, 2011, vol. 60. MMWR (No. RR-12).

Desai, R., Lopman, B.A., Shimshoni, Y., Harris, J.P., Patel, M.M., Parashar, U.D., 2012. Use of internet search data to monitor impact of rotavirus vaccination in the United States. Clin. Infect. Dis. 54 (9), e115–e118.

Friedman, J.H., 1991. Multivariate adaptive regression splines. Ann. Stat., 1–67.

Friedman, M., Radder, L., Harrison, S., Howie, D., Britton, M., Boyd, G., Wood, D., 2004. Verification and Control of Pressure Transients and Intrusion in Distribution Systems. AWWA Research Foundation and US Environmental Protection Agency.

Garthright, W.E., Archer, D.L., Kvenberg, J.E., 1988. Estimates of incidence and costs of intestinal infectious diseases in the United States. Public Health Rep. 103 (2), 107.

Ginsberg, J., Mohebbi, M.H., Patel, R.S., Brammer, L., Smolinski, M.S., Brilliant, L., 2008. Detecting influenza epidemics using search engine query data. Nature 457 (7232), 1012–1014.

Hastie, T., Tibshirani, R., 1990. Generalized Additive Models. Chapman & Hall, London.

Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction, second ed. Springer, New York.

Hunter, P.R., Chalmers, R.M., Hughes, S., Syed, Q., 2005. Self-reported diarrhea in a control group: a strong association with reporting of low-pressure events in tap water. Clin. Infect. Dis. 40 (4), e32–e34.

Karim, M.R., Abbaszadegan, M., LeChevallier, M., 2003. Potential for pathogen intrusion during pressure transients. J. Am. Water Works Assoc. 95 (5).

Lambertini, E., Borchardt, M.A., Kieke Jr., B.A., Spencer, S.K., Loge, F.J., 2012. Risk of viral acute gastrointestinal illness from nondisinfected drinking water distribution systems. Environ. Sci. Technol. 46 (17), 9299–9307.

LeChevallier, M., Gullick, R., Karim, M., Friedman, M., Funk, J., 2003. The potential for health risks from intrusion of contaminants into the distribution system from pressure transients. J. Water Health 1, 3–14.

Lee, B.K., 2010. Epidemiologic research and web 2.0 - the user-driven web. Epidemiology 21 (6), 760–763.

Malm, A., Axelsson, G., Barregard, L., Ljungqvist, J., Forsberg, B., Bergstedt, O., Pettersson, T.J., 2013. The association of drinking water treatment and distribution network disturbances with health call centre contacts for gastrointestinal illness symptoms. Water Res. 47 (13).

Messner, M., Shaw, S., Regli, S., Rotert, K., Blank, V., Soller, J., 2006. An approach for developing a national estimate of waterborne disease due to drinking water and a national estimate model application. J. Water Health 4 (Suppl. 2), 201–240.

Nygard, K., Andersson, Y., Røttingen, J., Svensson, A., Lindback, J., Kistemann, T., Giesecke, J., 2004. Association between environmental risk factors and campylobacter infections in Sweden. Epidemiol. Infect. 132 (02), 317–325.

Nygard, K., Wahl, E., Krogh, T., Tveit, O.A., Bøhleng, E., Tverdal, A., Aavitsland, P., 2007. Breaks and maintenance work in the water distribution systems and gastrointestinal illness: a cohort study. Int. J. Epidemiol. 36 (4), 873–880.

Payment, P., Siemiatycki, J., Richardson, L., Renaud, G., Franco, E., Prevost, M., 1997. A prospective epidemiological study of gastrointestinal health effects due to the consumption of drinking water. Int. J. Environ. Health Res. 7 (1), 5–31.

Payment, P., Richardson, L., Siemiatycki, J., Dewar, R., Edwardes, M., Franco, E., 1991. A randomized trial to evaluate the risk of gastrointestinal disease due to consumption of drinking water meeting current microbiological standards. Am. J. Public Health 81 (6), 703–708.

Pelat, C., Turbelin, C., Bar-Hen, A., Flahault, A., Valleron, A., 2009. More diseases tracked by using Google Trends. Emerg. Infect. Dis. 15 (8), 1327.

Polgreen, P.M., Chen, Y., Pennock, D.M., Nelson, F.D., Weinstein, R.A., 2008. Using internet searches for influenza surveillance. Clin. Infect. Dis. 47 (11), 1443–1448.

Tinker, S., Moe, C., Klein, M., Flanders, W., Uber, J., Amirtharajah, A., Tolbert, P., 2009. Drinking water residence time in distribution networks and emergency department visits for gastrointestinal illness in metro Atlanta, Georgia. J. Water Health 7 (2), 332–343.

United States Environmental Protection Agency (USEPA) and Water Research Foundation, 2010. Final Priorities of the Distribution System Research and Information Collection Partnership. April. http://www.epa.gov/safewater/disinfection/tcr/pdfs/tcrdsac/finpridsricp051010.pdf (accessed 17.07.12).

Wheeler, J.G., Sethi, D., Cowden, J.M., Wall, P.G., Rodrigues, L.C., Tompkins, D.S., Roderick, P.J., 1999. Study of infectious intestinal disease in England: rates in the community, presenting to general practice, and reported to national surveillance. Br. Med. J. 318 (7190), 1046–1050.