Water Research 53 (2014) 26–34
Available online at www.sciencedirect.com
ScienceDirect
journal homepage: www.elsevier.com/locate/watres
Public health and pipe breaks in water distribution systems: Analysis with internet search volume as a proxy
Julie E. Shortridge*, Seth D. Guikema
Department of Geography & Environmental Engineering, Johns Hopkins University, USA
Article info
Article history:
Received 19 September 2013
Received in revised form
1 December 2013
Accepted 9 January 2014
Available online 21 January 2014
Keywords:
Distribution network
Pipe breaks
Gastrointestinal illness
Non-linear regression
* Corresponding author. Tel.: +1 2026796535. E-mail address: [email protected] (J.E. Shortridge).
0043-1354/$ – see front matter © 2014 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.watres.2014.01.013
Abstract
Drinking water distribution infrastructure has been identified as a factor in waterborne
disease outbreaks and improved understanding of the public health risks associated with
distribution system failures has been identified as a priority area for research. Pipe breaks
may pose a risk, as their occurrence and repair can result in low or negative pressure,
potentially allowing contamination of drinking water from adjacent soils. However,
measuring this phenomenon is challenging because the most likely health impact is mild
gastrointestinal (GI) illness, which is unlikely to result in a doctor or hospital visit. Here we
present a novel method that uses data mining techniques and internet search volume to
assess the relationship between pipe breaks and symptoms of GI illness in two U.S. cities.
Weekly search volume for the terms diarrhea and vomiting was used as the response
variable with the number of pipe breaks in each city as a covariate as well as additional
covariates to control for seasonal patterns, search volume persistence, and other sources of
GI illness. The fit and predictive accuracy of multiple regression and data mining
techniques were compared, with the best performance obtained using random forest and
bagged regression tree models. Pipe breaks were found to be an important and positively
correlated predictor of internet search volume in multiple models in both cities, supporting
previous investigations that indicated an increased risk of GI illness from distribution
system disturbances.
© 2014 Elsevier Ltd. All rights reserved.
1. Introduction
While drinking water in developed countries is consistently
treated to be compliant with health guidelines, the aging
condition of drinking water distribution system infrastructure
presents a risk of contaminant intrusion and negative impacts
on public health. Breaks and leaks in distribution pipelines
can allow pathogens present in surrounding soil or water to
enter the distribution system during low or negative pressure
events. It is estimated that anywhere from 10 to 50% of
waterborne disease outbreaks associated with treated drinking
water are attributable to distribution system deficiencies
(CDC, 2006; CDC, 2008; CDC, 2011). It is reasonable to assume
that the outbreaks analyzed in CDC (2006, 2008, 2011) represent
only a small percentage of the overall disease attributable
to drinking water, as they require that multiple cases of illness
be reported and linked to drinking water exposure (CDC, 2011),
and few people seek medical care for mild to moderate
gastrointestinal (GI) illness (Wheeler et al., 1999). Messner
et al. (2006) estimate that community water systems are
responsible for 16.4 million cases of acute GI illness per year in
the United States. While most of these are mild to moderate
cases that do not require a doctor or hospital visit, they can
still result in high societal costs; for example, in 1988 it was
estimated that mild GI illness resulted in $19.5 billion in lost
productivity annually (Garthright et al., 1988). Because of
these issues, improved understanding of the incidence and
severity of health impacts from water distribution systems
has been identified as a high priority research area (USEPA and
Water Research Foundation, 2010).
Despite the need for additional research on the public
health impacts of distribution system deficiencies, a number
of issues make detection and measurement of these impacts a
challenge. Exposure to pathogens through water distribution
systems requires a complex chain of events to occur. While a
conceptual model to support microbial risk assessment for
low-pressure events is presented by Besner et al. (2011),
existing data to support such an analysis is limited and subject
to numerous uncertainties and assumptions. While sustained
low-pressure events caused by main breaks and
maintenance activities are relatively easy to identify, short-term
pressure transients caused by changes in pump operation,
power outages, and sudden changes in demand are unlikely
to be identified without high-speed pressure monitoring
that is generally not in use in existing systems (Friedman
et al., 2004). Population exposure depends on the quantity of
pathogens able to enter the distribution system, their transport
and dilution throughout the system, and the number of
users who eventually consume the contaminated water. Individuals
may be exposed to contaminated water at many
places other than their homes, such as offices, schools, or
restaurants, confounding efforts to monitor illness at fine
spatial scales. Furthermore, only a small percentage of GI
illness results in a doctor or hospital visit, making health
outcomes difficult to track.
Because of these challenges, much existing research on
this topic has either focused on the occurrence of pressure
transients and external contamination in water distribution
systems or survey-based monitoring and intervention trials
that aim to estimate the incidence of GI illness attributable to
treated drinking water. Sampling studies have indicated that
pathogenic microorganisms are frequently present in soil and
water adjacent to drinking water pipes (Karim et al., 2003),
while low- and negative-pressure transients have been
documented in multiple systems (LeChevallier et al., 2003;
Karim et al., 2003). Water sampling studies have indicated
that distribution systems can allow introduction of viruses
into non-disinfected systems (Lambertini et al., 2012). Evaluations
of whether low-pressure events lead to measurable
increases in GI illness have been mixed. Rates of self-reported
GI illness were found to increase following distribution system
disruptions in a study conducted in Norway (Nygard et al.,
2007) and following self-reported losses in water pressure in
the UK (Hunter et al., 2005). However, Malm et al. (2013)
monitored calls to a health care hotline system in Sweden
and found no statistically significant change in call volume
related to GI illness following distribution system disruptions.
While not specifically focused on low pressure events,
decreased incidence of GI illness has also been observed in
water systems with less pipe length per person (Nygard et al.,
2004; Tinker et al., 2009) and amongst study participants who
drank water that had been bottled at a treatment plant rather
than untreated tap water (Payment et al., 1997), indicating that
increased rates of illness could be a result of distribution
system deficiencies more generally.
The intervention trials and survey-based monitoring
evaluations above can provide important insights into health
risks associated with the studied distribution systems. However,
extrapolating these insights to distribution systems
more generally is difficult due to the tremendous variability in
water system characteristics. The likelihood of contamination
in a given system is likely to depend heavily on factors such as
water source, treatment procedures, and distribution system
condition and characteristics. For example, the distribution
system evaluated by Payment et al. (1991, 1997) was found to
be highly susceptible to negative pressure events
(LeChevallier et al., 2003). Furthermore, the chance that a
contamination event results in observable illness depends on
the population served by the system, as certain demographic
groups, such as children, the elderly, and immunocompromised
individuals, are more likely to become sick after a given
exposure. Differences in system characteristics could partly
explain the higher rates of illness attributed to drinking water
from those studies, as well as differences in research design.
While conducting a similar monitoring effort on a larger scale
could provide valuable insights into how risks differ amongst
different water systems, survey-based monitoring and intervention
trials tend to be very resource-intensive and practical
only over relatively small scales. Therefore, new methods are
needed to support broader studies that can evaluate wide-
scale risks, as well as relative risks in different types of
systems.
The use of internet search query data has the potential to
prove useful in this regard. It is estimated that 37–52% of
Americans search for health information on the internet
(Brownstein et al., 2009). Internet search volume has already
been shown to be strongly correlated with traditional disease
monitoring data in a number of cases. Search volume for
influenza-related search terms is capable of providing early
detection of influenza epidemics (Ginsberg et al., 2008;
Polgreen et al., 2008). This ability has also been demonstrated
for a number of GI illnesses, with strong correlations
between search volume and confirmed infections of rotavirus
(Desai et al., 2012), salmonella (Brownstein et al., 2009), and
gastroenteritis (Pelat et al., 2009). These results show that
internet search volume has the potential to be an easily and
rapidly accessible source of information regarding disease
incidence over large areas and long time scales where traditional
monitoring may be infeasible. Surveillance data can
also easily be collected through time to support longitudinal
evaluations, and thus avoid the difficulties associated with
cross-sectional comparisons between or within water service
areas.
The objective of this paper is to assess whether a statistical
relationship exists between pipe breaks in municipal drinking
water distribution systems and GI illness at the metropolitan
scale as estimated by internet search volume. We use a novel
approach where weekly internet search volume for terms
related to GI illness is the response variable and we test the
ability of various parametric models and non-parametric
data-mining techniques to model the relationship between
search volume, pipe breaks and other environmental factors.
Results from two cities are compared to assess whether
observed relationships are consistent.
2. Materials and methods
Two cities, referred to in the following sections as City A and
City B, were used as study areas. Both cities are located in the
mid-Atlantic region of the United States and have metro area
populations ranging from 2 to 5 million. The water systems in
each city were established over 200 years ago, and like many
cities in the Eastern United States, many components of the
water systems are reaching or have surpassed their planned
lifespans. The two cities have temperate climates with
average summer highs approaching 90 °F and average winter
lows of approximately 30 °F.
In each city, metro-area weekly internet search volume for
the term “diarrhea vomiting-dog” was obtained through the
Google Trends website and used as a response variable. This
term captures the volume of searches for the words diarrhea
or vomiting, while removing those that also contained the
word “dog,” as this was identified by Google as a common
related search. We obtained pipe break and leak data directly
from the two cities. For City A, we receive information directly
from their work order system; each time they investigate a
possible pipe problem or fix a pipe break, we receive a notifi-
cation directly from a senior engineer in the organization
responsible for the water system performance. This data
captures all pipe breaks, leaks, and repairs that the City A
system managers are aware of, meaning that we capture a
range of break sizes for City A, from small leaks to catastrophic
failures of large pipes. For City B we received a
database of historic breaks from the 1970s through 2006
directly from the City’s water department. This database
included all known breaks and leaks for the most recent years
but was incomplete for the earlier years. As with City A, this
database contains information about all pipe breaks and pipe
leaks that the system managers were aware of. There is not a
Table 1 – Summary of response variable and covariate data. Search volume numbers refer to the relative number of searches in each week for the term of interest relative to the total number of searches in that week, not the actual number of searches.

| Variable | Description |
|---|---|
| Response variable | |
| Search volume | Weekly search volume for “diarrhea vomiting-dog” in metro area of interest |
| Covariates | |
| Lagged search | Search volume from previous week |
| Season | Categorical variable on season |
| Temperature (avg) | Average daily temperature for the week |
| Precipitation (avg) | Average daily precipitation for the week |
| Pipe breaks | Number of pipe breaks in each city per week |
| Pipe breaks lagged | Number of pipe breaks from previous week |
| SSO | Number of sewer overflow events in City A per week |
| SSO lagged | Number of sewer overflow events from previous week |
clear distinction made between a pipe break and a pipe leak in
either of the databases used. Our analysis thus includes information
on both full breaks of pipes and small leaks. In reality,
there is a continuum of problems, from pipes being
completely severed down to small, slow leaks. Of course,
neither database contains information about undetected
leaks or breaks.
Due to availability of pipe break data, weekly data from
January 2011 to February 2013 (109 weeks of observations with
a total of 1970 pipe breaks) was used in City A, and from
January 2005 to March 2006 in City B (62 weeks of observations
with a total of 908 breaks). The City B data was constrained by
both the completeness of the record and the availability of
Google search volume data for the earlier dates.
The temporal resolution of available search volume data
depends on the volume of searches for the term of interest.
For our search term, the maximum temporal resolution was
weekly for both cities. The search data available from
Google does not present a total number of searches, but
instead measures the number of searches for the term of
interest relative to the total number of searches in that
week on a scale of 0–100. This is done to account for times
with higher or lower internet activity generally. Other
search volume terms, including diarrhea and vomiting as
separate searches, and searches controlling for the term
“pregnant” (a term that was commonly combined with
diarrhea and vomiting) were also evaluated. These terms
were found to be highly correlated with our final search
term and did not lead to significantly different results.
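The relative scaling described above can be illustrated with a minimal sketch. The paper does not document the exact normalization Google applies, so the function below is an assumption: it takes the weekly share of searches for the term and rescales it so the largest share in the window equals 100. The function name and all counts are hypothetical.

```python
def relative_search_index(term_counts, total_counts):
    """Rescale the weekly share of searches (term / total) to a 0-100 index.

    Assumption: the index is each week's share divided by the largest weekly
    share in the window, times 100. This mimics the idea of a relative index;
    it is not Google's actual procedure.
    """
    shares = [t / n for t, n in zip(term_counts, total_counts)]
    peak = max(shares)
    return [round(100 * s / peak) for s in shares]


# Hypothetical counts: week 3 has the most raw searches for the term, but
# overall internet activity is also higher, so its relative index is lower
# than that of week 2.
term = [40, 80, 90]
total = [10_000, 10_000, 20_000]
index = relative_search_index(term, total)
```

Dividing by total searches is what lets the index account for weeks with higher or lower internet activity generally.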
Covariates included:
- Season, average daily temperature, and average daily precipitation
to control for seasonal and climatic variations in
GI incidence. Climate data was taken from NOAA National
Weather Service Monthly Weather Summaries.
- Counts of pipe breaks from the week of interest and the
prior week (lagged pipe breaks) to account for disease incubation
periods.
- Sewer overflow events (including both combined sewer
overflows and sanitary sewer overflows) from the week of
interest and the prior week to control for potential illness
from sewer overflows (City A only due to data availability),
obtained from the state environmental agency.
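Table 1 (continued) – Minimum, mean, and maximum of each variable by city. Search volume values are relative to the total number of searches in each week, not actual search counts.

| Variable | City A Min | City A Mean | City A Max | City B Min | City B Mean | City B Max |
|---|---|---|---|---|---|---|
| Search volume | 42 | 65.4 | 95 | 40 | 66.7 | 99 |
| Lagged search | 42 | 65.3 | 95 | 40 | 66.7 | 99 |
| Season | NA | NA | NA | NA | NA | NA |
| Temperature (avg) | 26.5 | 57.2 | 85.6 | 19.6 | 53.6 | 83.1 |
| Precipitation (avg) | 0 | 0.13 | 1.21 | 0 | 0.11 | 0.86 |
| Pipe breaks | 3 | 18.1 | 79 | 3 | 14.6 | 56 |
| Pipe breaks lagged | 3 | 18.3 | 79 | 5 | 15.0 | 56 |
| SSO | 3 | 14.2 | 47 | NA | NA | NA |
| SSO lagged | 1 | 14.1 | 47 | NA | NA | NA |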
Table 2 – Mean absolute in-sample and out-of-sample errors for each model. Bold-italic values indicate statistically significant improvement over null model (Bonferroni-corrected p-value < 0.05).

| Model name | City A In-sample | City A Out-of-sample | City B In-sample | City B Out-of-sample |
|---|---|---|---|---|
| GLM | 7.59 | 7.83 | 8.96 | 10.99 |
| GAM | 7.33 | 8.21 | 8.61 | 12.03 |
| MARS | 7.58 | 8.97 | 8.58 | 11.55 |
| CART | 6.28 | 9.64 | 7.14 | 11.64 |
| Random forest | 3.81 | 7.87 | 4.51 | 10.25 |
| Bagged CART | 5.18 | 7.90 | 6.14 | 10.28 |
| Null model | 8.65 | 8.44 | 12.43 | 12.47 |
- Internet search volume from the prior week, to control for
potential persistence in internet search volume
A summary of internet search volume and covariate data
in each city is provided in Table 1.
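The lagged covariates listed above (prior-week search volume and prior-week pipe breaks) amount to shifting each weekly series by one week. A minimal sketch follows, assuming the data are held as plain Python lists of weekly values; the function names, dictionary keys, and numbers are hypothetical and not taken from the paper.

```python
def add_lag(series, lag=1):
    """Shift a weekly series by `lag` weeks (lag >= 1).

    The first `lag` weeks have no earlier observation, so they get None.
    """
    return [None] * lag + list(series[:-lag])


def build_rows(search, breaks):
    """Pair each week's response with its current and lagged covariates.

    Weeks without a lagged value (the first week) are dropped.
    """
    search_lag = add_lag(search)
    breaks_lag = add_lag(breaks)
    rows = []
    for t in range(len(search)):
        if search_lag[t] is None:
            continue
        rows.append({
            "search": search[t],          # response variable
            "search_lag": search_lag[t],  # persistence control
            "breaks": breaks[t],          # current-week pipe breaks
            "breaks_lag": breaks_lag[t],  # prior-week pipe breaks
        })
    return rows


# Hypothetical three-week series.
rows = build_rows(search=[60, 65, 70], breaks=[10, 20, 15])
```

Dropping the first week is one simple way to handle the missing lag; other choices, such as using an earlier observation from outside the study window, are equally plausible.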
The relationship between environmental exposure, dis-
ease incidence, and internet search volume is complex and
has the potential to exhibit non-linearity and interactions
between covariates. To find models that could capture these
complexities, we compared the fit and predictive accuracy of
multiple regression and data mining techniques. Each model
was fit to the full data set to evaluate goodness of fit based on
mean absolute error between actual and modeled search
volume. Additionally, holdout cross validation was used to
compare the models’ out-of-sample predictive accuracy using
a 50-fold holdout analysis. In each iteration, 90% of observa-
tions were randomly selected and used to fit the models,
which were then used to predict search volume in the
remaining 10% of observations. Predictive accuracy was
measured using mean absolute error between actual and
predicted search volume in these held-out samples.
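The repeated holdout procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: each observation is a hypothetical `(covariates, response)` pair, `fit` and `predict` are placeholder callables for any of the models, and a simple mean predictor is used as the stand-in model.

```python
import random


def mean_absolute_error(actual, predicted):
    """Mean absolute difference between paired actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def holdout_mae(rows, fit, predict, n_iter=50, train_frac=0.9, seed=0):
    """Average out-of-sample MAE over repeated random train/test splits.

    rows: list of (covariates, response) pairs.
    fit(train_rows) returns a model; predict(model, covariates) returns a value.
    """
    rng = random.Random(seed)
    n_train = round(train_frac * len(rows))
    errors = []
    for _ in range(n_iter):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        train, test = shuffled[:n_train], shuffled[n_train:]
        model = fit(train)
        actual = [y for _, y in test]
        predicted = [predict(model, x) for x, _ in test]
        errors.append(mean_absolute_error(actual, predicted))
    return sum(errors) / n_iter


# Stand-in model: always predict the mean response of the training rows.
def fit_mean(train_rows):
    return sum(y for _, y in train_rows) / len(train_rows)


def predict_mean(model, covariates):
    return model


# Hypothetical data: 20 weeks with an index as the only covariate.
constant_rows = [(week, 5.0) for week in range(20)]
varying_rows = [(week, float(week % 4)) for week in range(20)]
```

Calling `holdout_mae(varying_rows, fit_mean, predict_mean)` returns the average held-out error of the mean predictor over 50 random 90/10 splits; the data set must be large enough that the held-out 10% is not empty.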
Six models were tested to compare their in-sample and
out-of-sample accuracy in each city. Because these models
use different functional forms and mathematical algorithms
to fit and predict data, comparing multiple models is a way to
identify which methods can best capture relationships that
exist between the response variable and covariates. The six
models included:
1. A Poisson-transformed generalized linear model (GLM)
with variable removal based on Akaike information criterion
(AIC minimization) (Cameron and Trivedi, 1998).
2. A Poisson-transformed generalized additive model (GAM)
with individual cubic regression splines applied over each
covariate. GAMs use smoothing functions applied over
covariates, allowing them to capture non-linear relationships
between covariates and the response variable.
Smoothing functions are fit using penalized likelihood
maximization to prevent overfitting of the model, and can
be penalized to zero for covariates that don’t improve
model fit (Hastie and Tibshirani, 1990).
3. Multivariate adaptive regression splines (MARS): Data is
represented using a non-linear, multivariate function
estimated by multivariate spline basis functions fit to
recursively partitioned segments of the data (Friedman,
1991).
4. Classification and Regression Tree (CART): A single
regression tree was fit to the data and then pruned to the
optimal size using cross validation (Breiman et al., 1984).
5. Bagged CART (BC): 50 regression trees are each trained on a
separate bootstrapped subset of the data, and the final
model prediction is the average of each individual tree
prediction (Hastie et al., 2009).
6. Random Forest (RF): 500 regression trees are each trained
on a separate bootstrapped subset of the data, and correlation
between trees is reduced through random splitting
of nodes (Breiman, 2001).
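The bagging idea behind models 5 and 6 can be sketched in a few lines. To stay short, the sketch below swaps in a one-split regression tree (a "stump") on a single covariate as the base learner, which is a simplification and not the full regression trees used in the paper. It also omits the random restriction of split candidates that distinguishes a random forest. All function names and data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree on one covariate by minimizing squared error.

    Returns (threshold, left_mean, right_mean).
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # Only one distinct x value in the sample: predict the mean everywhere.
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged(xs, ys, n_trees=50, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return trees


def predict_bagged(trees, x):
    """The bagged prediction is the average of the individual tree predictions."""
    return sum(predict_stump(tree, x) for tree in trees) / len(trees)


# Hypothetical step-shaped data: response jumps from 10 to 20 at x = 10.
xs = list(range(20))
ys = [10.0 if x < 10 else 20.0 for x in xs]
trees = fit_bagged(xs, ys)
```

Averaging over bootstrap samples is what reduces the variance of a single tree, which is consistent with the lower out-of-sample errors of bagged CART relative to CART in Table 2.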
A null model was also included for comparative purposes,
in which search volume was simply estimated as equal to
mean search volume in all observations used to fit the model.
For example, for the entire dataset, the null model would
predict a search volume of 65.4 in City A and 66.7 in City B (the
mean search volume in each city). For the holdout analysis,
the null model predicts search volume by calculating the
mean search volume for the 90% of weeks selected to train the
models. Models were also evaluated against a persistence
model (where search volume was assumed to equal search
volume from the previous week); however, because this model
resulted in higher errors than all other models, the null model
was used as a comparison to evaluate model performance.
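The two baseline models can be compared with a short sketch. The weekly values below are made up for illustration and are not the study data; they include one spike to show why a persistence model can produce larger errors than the mean-based null model when week-to-week variability is high.

```python
def mean_absolute_error(actual, predicted):
    """Mean absolute difference between paired actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical weekly search volume with a single spike in week 3.
weekly_search = [60, 62, 90, 61, 63]

# Evaluate both baselines on the weeks that have a previous week.
actual = weekly_search[1:]

# Null model: predict the mean of all observations used to fit the model.
null_prediction = sum(weekly_search) / len(weekly_search)
null_mae = mean_absolute_error(actual, [null_prediction] * len(actual))

# Persistence model: predict that each week equals the previous week.
persistence_mae = mean_absolute_error(actual, weekly_search[:-1])
```

In this toy series the spike is penalized twice under persistence (once when it arrives and once when it leaves), which is one plausible reason the persistence model performed worse in the study.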
3. Results
Table 2 presents the mean absolute error for each model in
each city. The random forest model resulted in the lowest in-sample
error in both cities and the lowest out-of-sample error in City B;
in City A its out-of-sample error (7.87) was marginally higher than
that of the GLM (7.83). It achieved a statistically significant
reduction in error when compared to the null model based on
Bonferroni-corrected Wilcoxon rank-sum tests. The bagged CART model
also resulted in a statistically significant reduction in both in-sample
and out-of-sample error in both cities compared to the null
model. Fig. 1 shows time series of actual and predicted search
volume in both cities for the random forest and bagged CART
models. Predicted search volume for each week was estimated
by fitting the models to the whole data set with the week in
question removed, and then generating a prediction for that
week. That is, each point in the time series is a holdout estimate
for that week. These time series indicate that the models
are capable of capturing relatively slow trends in search volume,
but are unable to capture some week-to-week variability,
as well as some extreme values. While the effects of a
given pipe break would generally be realized within a 2–3 day
period, there can be a significant delay between occurrence of
a pipe break and detection and repair, particularly for smaller
pipe breaks. Because of this, the pipe break data available
would not be sufficient for modeling short-term trends and
health impacts on, for example, a daily basis. The ability of the
model to capture the longer-term, weekly trends suggests that
it is appropriate for gaining insights into the relationship
Fig. 1 – Actual and predicted search volume for the random forest and bagged CART models.
between pipe breaks and health impacts at these longer time
scales. However, the model is not appropriate for real-time
monitoring of pipe breaks.
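The significance testing mentioned in this section (Bonferroni-corrected Wilcoxon rank-sum tests) can be sketched as below. This is an illustration, not the authors' procedure: it assumes the two samples are the absolute errors of a candidate model and of the null model, uses the large-sample normal approximation without tie or continuity corrections, and treats the number of comparisons as a hypothetical input because the paper does not state it.

```python
import math


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def rank_sum_p_value(a, b):
    """Two-sided rank-sum p-value via the normal approximation."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    w = sum(ranks[:n1])                       # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2           # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    return math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))


def bonferroni(p_value, n_comparisons):
    """Bonferroni correction: scale the p-value by the number of tests, capped at 1."""
    return min(1.0, p_value * n_comparisons)


# Hypothetical absolute errors: the candidate model's errors are all smaller.
model_errors = list(range(1, 21))
null_errors = list(range(101, 121))
p = rank_sum_p_value(model_errors, null_errors)
p_corrected = bonferroni(p, n_comparisons=12)  # 12 is an assumed count
```

The correction guards against declaring a model better than the null model simply because many model and city combinations were tested.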
Because they significantly outperformed the null model in
terms of fit and predictive accuracy in both cities, the random
forest and bagged CART models were used to evaluate covariate
influence in each city. Because typical measures of
influence that are used in parametric models (such as regression
coefficients and p-values) do not exist for tree-based models,
partial dependence plots were developed to assess covariate
importance and influence in each model.
Partial dependence plots measure the marginal influence
that changing each covariate of interest, while keeping all
other covariates equal, has on model predictions. A relatively
flat partial dependence plot indicates that the covariate of
interest has little influence on the model’s predictions, while a
large change in response variable values indicates that the
covariate has a large degree of influence on model predictions.
This variation was measured for each model by estimating the
relative “swing” attributable to each covariate n, which consisted
of the range of partial dependence values associated
with the covariate of interest, divided by the total swing summed
over all covariates in that model (Equation (1)). A relative swing of
0 would indicate that the model did not use that covariate in
its predictions at all, while a relative swing of 1 would indicate
that the model relied entirely on one covariate.
$$\text{Relative Swing}_n = \frac{\max(PD_n) - \min(PD_n)}{\sum_{n} \text{Swing}_n} \qquad (1)$$
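Equation (1) and the partial dependence calculation it relies on can be sketched as follows. The sketch assumes a fitted model is available as a `predict` callable that accepts one observation as a dictionary; that interface, the toy linear model, and the covariate names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def partial_dependence(predict, rows, name, grid):
    """Average prediction over the data with covariate `name` set to each grid value.

    All other covariates keep their observed values in each row.
    """
    curve = []
    for value in grid:
        preds = [predict({**row, name: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve


def relative_swings(pd_curves):
    """Equation (1): each covariate's PD range divided by the sum of all ranges."""
    swings = {name: max(curve) - min(curve) for name, curve in pd_curves.items()}
    total = sum(swings.values())
    return {name: swing / total for name, swing in swings.items()}


# Toy stand-in for a fitted model (not a model from the paper).
def toy_predict(row):
    return 60 + 0.3 * row["breaks_lag"] + 0.1 * row["temp"]


rows = [{"breaks_lag": 5, "temp": 40}, {"breaks_lag": 15, "temp": 60}]
pd_curves = {
    "breaks_lag": partial_dependence(toy_predict, rows, "breaks_lag", [0, 20]),
    "temp": partial_dependence(toy_predict, rows, "temp", [30, 80]),
}
rel = relative_swings(pd_curves)
```

Here the swing is 6 for `breaks_lag` and 5 for `temp`, so the relative swings are 6/11 and 5/11. The function would fail with a division by zero if every partial dependence curve were flat, a case the paper does not discuss.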
Table 3 shows the relative swing associated with each covariate
compared to all other covariates, with a higher swing
associated with greater influence over model predictions. In
City A, the most influential covariate in both models is the
season (which is responsible for 29% of bagged CART vari-
ability and 18% of random forest variability), while the most
influential covariate in City B is lagged search volume
(responsible for 42% of bagged CART variability and 32% of
random forest variability). The influence of lagged pipe breaks
ranges from 4% in the City B bagged CART model to 17% in the
City A random forest model. Covariate influence varies
somewhat between the two cities, although a large degree of
agreement in the variable importance estimated from each of
Table 3 – Relative swing in partial dependence plots for the random forest and bagged CART models in each city. A greater value of relative swing indicates that the model relies more heavily on that covariate, while a low value indicates that a covariate is not very important in generating model predictions. A value of 0 would indicate that the covariate is not used in the model at all.

City A

| Bagged CART covariate | Relative swing | Random forest covariate | Relative swing |
|---|---|---|---|
| Season | 0.29 | Season | 0.18 |
| SSO | 0.17 | BreaksLag | 0.17 |
| BreaksLag | 0.14 | SSO | 0.17 |
| SearchLag | 0.14 | SearchLag | 0.14 |
| Temp | 0.08 | Temp | 0.12 |
| SSOlag | 0.07 | Prec | 0.10 |
| Breaks | 0.07 | Breaks | 0.07 |
| Prec | 0.05 | SSOlag | 0.06 |

City B

| Bagged CART covariate | Relative swing | Random forest covariate | Relative swing |
|---|---|---|---|
| SearchLag | 0.42 | SearchLag | 0.32 |
| Temp | 0.24 | Temp | 0.20 |
| Season | 0.15 | Season | 0.19 |
| Prec | 0.13 | BreakLag | 0.11 |
| BreakLag | 0.04 | Breaks | 0.10 |
| Breaks | 0.02 | Prec | 0.08 |
Fig. 2 – Normalized partial dependence plots for City A.
Internal tick marks show 10% sample quantiles for each
covariate. Plots show the marginal influence of changing
the covariate of interest while all other covariates are held
constant.
the models is observed in City A (Kendall’s W equal to 0.94)
and a moderate degree of agreement is observed in City B
(Kendall’s W equal to 0.57). Seasonal variables (month and
temperature) and search lag show a high degree of influence
overall. Environmental indicators (SSO and lagged breaks)
account for a large portion of variance in City A, but appear
less influential in City B.
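Kendall's W, used above to measure agreement between the covariate rankings of the two models, can be computed with the standard formula for complete rankings without ties. The rankings in the sketch are hypothetical; the paper does not say whether a tie correction was applied, so none is assumed here.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for m complete rankings of n items.

    rankings: list of m lists, each holding the ranks 1..n assigned to the
    same n items in the same order. No tie correction is applied.
    W = 12 * S / (m^2 * (n^3 - n)), where S is the sum of squared deviations
    of the per-item rank sums from their mean.
    """
    m = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(ranking[i] for ranking in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rank_sum - mean_sum) ** 2 for rank_sum in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical rankings of four covariates by two models.
identical = kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4]])
reversed_order = kendalls_w([[1, 2, 3, 4], [4, 3, 2, 1]])
```

W ranges from 0 (no agreement) to 1 (identical rankings), which gives a scale for reading the reported values of 0.94 and 0.57.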
The partial dependence plots shown in Figs. 2 and 3 also
show some consistent patterns across models and cities. In all
instances, there is a clear positive relationship between search
volume and lagged search volume, indicating that there may be
some persistence in search volume that lasts longer than the
one-week periods evaluated in this study. In both cities, there is
a positive relationship between lagged pipe breaks and search
volume in both models. In City A, an increase from 0 to 20 pipe
breaks results in an approximately 6% increase
in search volume in the bagged CART model, and an 8% increase
in the random forest model. However, increasing the
number of pipe breaks beyond 30 has no additional impact on
search volume. In City B, an increase from 0 to 55 breaks results
in an approximately 6% increase in search volume in the
random forest model, but only a 2% increase in the bagged
CART model. For current week pipe breaks, a negative rela-
tionship is evident in City A, while no clear relationship is
observed in City B. This is as expected, assuming a required
incubation period between exposure and effect. Both cities also
exhibit evidence of a seasonal pattern, with higher search
volume in the summer relative to the winter and in weeks with
high temperatures. In City A, a positive relationship also exists
between search volume and sewer overflows in the current week,
but no such relationship is observed for lagged sewer overflows.
4. Discussion
All models tested were able to provide an improved fit
compared to the null model, but only the bagged CART and
random forest models were able to provide statistically significant
improvements in both fit and predictive accuracy in
both cities. Time series plots of predicted and actual search
volume in the highest performing models indicate that the
models were capable of capturing trends at a monthly time
scale, but not shorter term variability and particularly
extreme values. This unexplained variance is not surprising
considering (1) the low number of covariates explored here
when compared to the numerous factors that could contribute
to illness rates and online activity related to disease symptoms
and (2) the possible lags between a break or leak and
detection of that event. It could also be the result of minor
Fig. 3 – Normalized partial dependence plots for City B.
Internal tick marks show 10% sample quantiles for each
covariate.
discrepancies in the geographic regions covered by the
different data sets; in particular, the metro areas represented
in the Google search data include some suburbs not included
in the pipe break service areas. However, population density is
highest in the areas covered by the pipe break data, which is
also where many people commute for work during the day.
Therefore, the disparity in populations covered by the two
data sets is expected to be minimal, and in any case would
result in underreporting of correlation between pipe breaks
and search volume and decreased model performance, rather
than the opposite. While this unexplained variance would
make the models unsuitable for activities requiring refined
predictions (such as real-time monitoring), the statistically
significant improvements demonstrated by the bagged CART
and random forest models make them suitable for generating
insights on covariate importance and influence.
In terms of covariate influence, the models in City B appear
to be largely informed by search volume persistence and
seasonal characteristics, with lagged search volume, season,
and temperature accounting for a total of 71–79% of the
variation in partial dependence values for these models. In
City A, the models were informed by a larger number of
covariates, with the top three covariates in each model
accounting for a total of 52–60% of partial dependence swing.
Partial dependence variability was largely determined by
season and environmental factors (lagged pipe breaks and
SSOs) in City A, with a moderate degree of variability attrib-
uted to lagged search volume. The direction of influence
showed consistent results between cities and models for
lagged search volume, temperature, season, and lagged pipe
breaks. Counts of lagged pipe breaks show a consistently
positive relationship with search volume in each city and
evaluated model. While this result cannot provide any proof of
a causal relationship between the two, the presence of a
similar relationship across two cities and multiple models is
consistent with previous studies that identified water distribution
system inadequacies as a contributor to GI illness.
The use of internet search volume as a proxy for sub-
clinical GI illness and counts of all pipe breaks in a water
system are both novel methods that present an interesting
comparison to the existing literature on this topic. One
important distinction between this work and that of Malm
et al. (2013) and Nygard et al. (2007) is the magnitude of the
events analyzed. Those analyses evaluated the results of large
disturbances that affected hundreds to thousands of people,
whereas our evaluation makes no distinction based on pipe
size or number of affected customers. It is reasonable to assume
that the majority of pipe breaks in our dataset, as in
most cities, are rather small and did not result in total pressure
loss. Therefore, our results point more towards a health
impact associated with small-scale disturbances that occur
relatively frequently in water systems, rather than large disruptions
that may affect many customers but only occur
rarely. Additionally, it means that our results may be more
comparable to the work of Malm et al. (2013), as their dataset
includes breaks that resulted in varying degrees of pressure loss,
rather than only focusing on disturbances that resulted in
total pressure loss as in Nygard et al. (2007). It is also worth
noting that Sweden experiences an estimated 5000 pipe repairs
per year, which includes pipe breaks and leakage repairs
(Malm et al., 2013), compared to 240,000 main breaks alone in
the US (ASCE, 2013). This corresponds to an annual per-capita repair rate of
approximately 0.0005 in Sweden, compared to 0.0008 for breaks alone in the
US. While it is impossible to say whether this difference holds for
the specific cities evaluated, it seems reasonable to assume
that systems more prone to pipe failure would result in greater
health impacts.
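The per-capita figures above can be reproduced with a short calculation. This is a minimal sketch: the repair and break counts are those quoted in the text, while the population values (roughly 9.6 million for Sweden and 316 million for the US around 2013) are assumptions introduced here for illustration and are not stated in the source.

```python
# Annual counts quoted in the text.
sweden_repairs = 5_000       # pipe breaks and leakage repairs (Malm et al., 2013)
us_main_breaks = 240_000     # main breaks alone (ASCE, 2013)

# Approximate 2013 populations: assumed values, not taken from the source.
sweden_population = 9.6e6
us_population = 316e6

sweden_rate = sweden_repairs / sweden_population
us_rate = us_main_breaks / us_population

print(f"Sweden: {sweden_rate:.4f} repairs per person per year")
print(f"US:     {us_rate:.4f} main breaks per person per year")
```

Under these assumed populations the rates round to the 0.0005 and 0.0008 given in the text. The US figure counts main breaks only, so the two rates are not strictly like for like.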
Aside from Malm et al. (2013), the majority of studies on
drinking water distribution systems and GI illness have relied
on self-reporting during intervention trials or interviews with
researchers following system disruptions, as in Nygard et al.
(2007). One primary difference between these studies and
ours is the spatial and temporal scale of evaluation. While
Nygard et al. (2004) evaluate campylobacter infections across
Sweden in a cross-sectional comparison between municipalities,
they do not evaluate any temporal changes in infection
rates. Similarly, intervention trials such as those presented by
Payment et al. (1997) and post-disturbance monitoring
(Nygard et al., 2007) compare health outcomes in a limited
number of households in a small geographic area. Scaling
these evaluations up to do wide-scale or long-term monitoring
would be very resource intensive, whereas internet search
data to support a longitudinal metropolitan-level evaluation is
freely available. Monitoring illness at the city-level also re-
duces the confounding issue of mobility, where participants
may be exposed to an illness in a place other than their resi-
dence. Therefore, internet search data has the potential to be
a tool in instances where traditional monitoring is infeasible,
allowing for evaluations to proceed over greater temporal and
spatial scales.
Nevertheless, there are some important limitations that
should be considered when using internet search data to es-
timate public health impacts. Naturally, it is unlikely that a
proxy measure such as internet search volume could provide
as accurate an estimate of illness rates as direct questioning.
Therefore, quantitative estimates of disease risk based on
internet search data would require additional analysis
relating search volume to monitoring data collected via
traditional means. Furthermore, direct questioning has the
advantage of allowing researchers to control for other expo-
sure factors that cannot be monitored by proxy (such as
contact with sensitive subpopulations or international travel).
Internet data might also be non-representative of the population
as a whole, particularly with regard to age and computer
access and literacy, and may be subject to bias when the
disease of interest is featured on the news (Lee, 2010). Because
of these issues, it is important to consider internet surveil-
lance data as a potential tool to be used in tandem with
traditional survey methods.
Despite these limitations, there are a number of ways in
which these methods could support further research.
One simplification in our models was that all pipe breaks were
treated equally, regardless of the size of the pipe, the occur-
rence and magnitude of pressure transients, duration of the
leak and repair, and number of customers affected. This was a
necessary simplification in our work because the pipe break
data we used lacked information on the size of pipe or popu-
lation affected. More explicit modeling of these factors could
be useful in developing more accurate models and under-
standing which types of breaks are most likely to result in
health impacts. Our evaluation could also be scaled up to
include additional cities and water systems where break information
is available. This would not only provide more
statistical power to support evidence of a relationship be-
tween pipe breaks and public health outcomes, but could also
allow insights into the systems and conditions where pipe
breaks have the greatest impact on illness. In addition, if
internet search volume data were made available at a more
geographically detailed scale such as the zip code level, this
could substantially enhance the ability to examine the rela-
tionship between internet search volume and pipe breaks,
which are often located at the scale of individual street seg-
ments in utility-provided data.
5. Conclusion
Disruptions and inadequacies in drinking water distribution
systems are recognized as an issue with potential public
health impacts and an important area for research. However,
measuring the health impacts associated with distribution
system disturbances, such as pipe failures, presents a number
of challenges. The physical mechanisms by which a pipe
failure could lead to pathogen exposure (low or negative
pressure events, contaminant intrusion, and transport to
water users) are difficult to monitor and model. Furthermore,
gastrointestinal health outcomes are likely to be sub-clinical,
meaning traditional monitoring of doctor and hospital visits
will only capture a small percentage of cases. We present a
novel method that compares weekly internet search volume
for symptoms of GI illness with pipe break counts, while
controlling for seasonal patterns, climatic fluctuations, and
other possible environmental factors. We observed a positive
relationship between search volume and counts of pipe
breaks from the previous week in both cities using multiple
models. These results support previous investigations indi-
cating that drinking water distribution system disruptions
contributed to higher rates of GI illness, and point towards the
potential importance of frequent, relatively small pipe breaks.
Our results also indicate that internet search data is very
promising in that it can be easily scaled up to conduct long-
term or wide-scale evaluations that are infeasible using
traditional monitoring.
Acknowledgments
This work was partially funded by NSF grants 1031046 (CMMI)
and 1069213 (IGERT). This support is gratefully acknowledged.
We also acknowledge and thank the two utilities that pro-
vided the pipe break data used in this research. All opinions
are those of the authors and do not necessarily reflect the
positions of the NSF or the participating utilities.
References
American Society of Civil Engineers (ASCE), 2013. 2013 Report Card for America’s Infrastructure. Retrieved August 9, 2013, from http://www.infrastructurereportcard.org/.
Besner, M., Prevost, M., Regli, S., 2011. Assessing the public health risk of microbial intrusion events in distribution systems: conceptual model, available data, and challenges. Water Res. 45 (3), 961–979.
Breiman, L., 2001. Random forests. Mach. Learn. 45 (1), 5–32.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth & Brooks, Monterey, CA.
Brownstein, J.S., Freifeld, C.C., Madoff, L.C., 2009. Digital disease detection – harnessing the web for public health surveillance. N. Engl. J. Med. 360 (21), 2153–2157.
Cameron, A.C., Trivedi, P., 1998. Regression Analysis of Count Data. Cambridge University Press.
Centers for Disease Control and Prevention (CDC), 2006. Surveillance for Waterborne Disease and Outbreaks Associated with Drinking Water and Water Not Intended for Drinking – United States, 2003–2004. In: Surveillance Summaries, December 22, 2006, vol. 55. MMWR (No. SS-12).
CDC, 2008. Surveillance for Waterborne Disease and Outbreaks Associated with Drinking Water and Water Not Intended for Drinking – United States, 2005–2006. In: Surveillance Summaries, September 12, 2008, vol. 57. MMWR (No. SS-9).
CDC, 2011. Surveillance for Waterborne Disease and Outbreaks Associated with Drinking Water and Water Not Intended for Drinking – United States, 2007–2008. In: Surveillance Summaries, September 23, 2011, vol. 60. MMWR (No. RR-12).
Desai, R., Lopman, B.A., Shimshoni, Y., Harris, J.P., Patel, M.M., Parashar, U.D., 2012. Use of internet search data to monitor impact of rotavirus vaccination in the United States. Clin. Infect. Dis. 54 (9), e115–e118.
Friedman, J.H., 1991. Multivariate adaptive regression splines. Ann. Stat., 1–67.
Friedman, M., Radder, L., Harrison, S., Howie, D., Britton, M., Boyd, G., Wood, D., 2004. Verification and Control of Pressure Transients and Intrusion in Distribution Systems. AWWA Research Foundation and US Environmental Protection Agency.
Garthright, W.E., Archer, D.L., Kvenberg, J.E., 1988. Estimates of incidence and costs of intestinal infectious diseases in the United States. Public Health Rep. 103 (2), 107.
Ginsberg, J., Mohebbi, M.H., Patel, R.S., Brammer, L., Smolinski, M.S., Brilliant, L., 2008. Detecting influenza epidemics using search engine query data. Nature 457 (7232), 1012–1014.
Hastie, T., Tibshirani, R., 1990. Generalized Additive Models. Chapman & Hall, London.
Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction, second ed. Springer, New York.
Hunter, P.R., Chalmers, R.M., Hughes, S., Syed, Q., 2005. Self-reported diarrhea in a control group: a strong association with reporting of low-pressure events in tap water. Clin. Infect. Dis. 40 (4), e32–e34.
Karim, M.R., Abbaszadegan, M., LeChevallier, M., 2003. Potential for pathogen intrusion during pressure transients. J. Am. Water Works Assoc. 95 (5).
Lambertini, E., Borchardt, M.A., Kieke Jr., B.A., Spencer, S.K., Loge, F.J., 2012. Risk of viral acute gastrointestinal illness from nondisinfected drinking water distribution systems. Environ. Sci. Technol. 46 (17), 9299–9307.
LeChevallier, M., Gullick, R., Karim, M., Friedman, M., Funk, J., 2003. The potential for health risks from intrusion of contaminants into the distribution system from pressure transients. J. Water Health 1, 3–14.
Lee, B.K., 2010. Epidemiologic research and web 2.0 – the user-driven web. Epidemiology 21 (6), 760–763.
Malm, A., Axelsson, G., Barregard, L., Ljungqvist, J., Forsberg, B., Bergstedt, O., Pettersson, T.J., 2013. The association of drinking water treatment and distribution network disturbances with health call centre contacts for gastrointestinal illness symptoms. Water Res. 47 (13).
Messner, M., Shaw, S., Regli, S., Rotert, K., Blank, V., Soller, J., 2006. An approach for developing a national estimate of waterborne disease due to drinking water and a national estimate model application. J. Water Health 4 (Suppl. 2), 201–240.
Nygard, K., Andersson, Y., Røttingen, J., Svensson, A., Lindback, J., Kistemann, T., Giesecke, J., 2004. Association between environmental risk factors and campylobacter infections in Sweden. Epidemiol. Infect. 132 (02), 317–325.
Nygard, K., Wahl, E., Krogh, T., Tveit, O.A., Bøhleng, E., Tverdal, A., Aavitsland, P., 2007. Breaks and maintenance work in the water distribution systems and gastrointestinal illness: a cohort study. Int. J. Epidemiol. 36 (4), 873–880.
Payment, P., Siemiatycki, J., Richardson, L., Renaud, G., Franco, E., Prevost, M., 1997. A prospective epidemiological study of gastrointestinal health effects due to the consumption of drinking water. Int. J. Environ. Health Res. 7 (1), 5–31.
Payment, P., Richardson, L., Siemiatycki, J., Dewar, R., Edwardes, M., Franco, E., 1991. A randomized trial to evaluate the risk of gastrointestinal disease due to consumption of drinking water meeting current microbiological standards. Am. J. Public Health 81 (6), 703–708.
Pelat, C., Turbelin, C., Bar-Hen, A., Flahault, A., Valleron, A., 2009. More diseases tracked by using Google Trends. Emerg. Infect. Dis. 15 (8), 1327.
Polgreen, P.M., Chen, Y., Pennock, D.M., Nelson, F.D., Weinstein, R.A., 2008. Using internet searches for influenza surveillance. Clin. Infect. Dis. 47 (11), 1443–1448.
Tinker, S., Moe, C., Klein, M., Flanders, W., Uber, J., Amirtharajah, A., Tolbert, P., 2009. Drinking water residence time in distribution networks and emergency department visits for gastrointestinal illness in metro Atlanta, Georgia. J. Water Health 7 (2), 332–343.
United States Environmental Protection Agency (USEPA) and Water Research Foundation, 2010. Final Priorities of the Distribution System Research and Information Collection Partnership. April. http://www.epa.gov/safewater/disinfection/tcr/pdfs/tcrdsac/finpridsricp051010.pdf (accessed 17.07.12).
Wheeler, J.G., Sethi, D., Cowden, J.M., Wall, P.G., Rodrigues, L.C., Tompkins, D.S., Roderick, P.J., 1999. Study of infectious intestinal disease in England: rates in the community, presenting to general practice, and reported to national surveillance. Br. Med. J. 318 (7190), 1046–1050.