
The Annals of Applied Statistics
2016, Vol. 10, No. 2, 575–595
DOI: 10.1214/16-AOAS908
© Institute of Mathematical Statistics, 2016

PREDICTIVE MODELING OF CHOLERA OUTBREAKS IN BANGLADESH

BY AMANDA A. KOEPKE∗,1, IRA M. LONGINI, JR.†,1, M. ELIZABETH HALLORAN∗,‡,1, JON WAKEFIELD‡,2 AND VLADIMIR N. MININ‡,3

Fred Hutchinson Cancer Research Center∗, University of Florida† and University of Washington‡

Despite seasonal cholera outbreaks in Bangladesh, little is known about the relationship between environmental conditions and cholera cases. We seek to develop a predictive model for cholera outbreaks in Bangladesh based on environmental predictors. To do this, we estimate the contribution of environmental variables, such as water depth and water temperature, to cholera outbreaks in the context of a disease transmission model. We implement a method which simultaneously accounts for disease dynamics and environmental variables in a Susceptible-Infected-Recovered-Susceptible (SIRS) model. The entire system is treated as a continuous-time hidden Markov model, where the hidden Markov states are the numbers of people who are susceptible, infected or recovered at each time point, and the observed states are the numbers of cholera cases reported. We use a Bayesian framework to fit this hidden SIRS model, implementing particle Markov chain Monte Carlo methods to sample from the posterior distribution of the environmental and transmission parameters given the observed data. We test this method using both simulation and data from Mathbaria, Bangladesh. Parameter estimates are used to make short-term predictions that capture the formation and decline of epidemic peaks. We demonstrate that our model can successfully predict an increase in the number of infected individuals in the population weeks before the observed number of cholera cases increases, which could allow for early notification of an epidemic and timely allocation of resources.

Received November 2014; revised December 2015.
1 Supported by NIH Grants R01-AI039129, U01-GM070749 and U54-GM111274.
2 Supported by NIH Grants R01-AI029168 and R01 CA095994-05A1.
3 Supported by NIH Grant R01-AI107034.
Key words and phrases. Hidden Markov model, particle filter, MCMC, Bayesian, SIR.

1. Introduction. In Bangladesh cholera is an endemic disease that demonstrates seasonal outbreaks [Huq et al. (2005), Koelle and Pascual (2004), Koelle et al. (2005), Longini et al. (2002)]. The burden of cholera is high in that country, with an estimated 352,000 cases and 3500 to 7000 deaths annually [International Vaccine Institute (2012)]. We seek to understand the dynamics of cholera and to develop a model that will be able to predict outbreaks several weeks in advance. If the timing and size of a seasonal epidemic could be predicted reliably, vaccines and other resources could be allocated effectively to curb the impact of the disease.


Specifically, we want to understand how the disease dynamics are related to environmental covariates. It is currently not known what triggers the seasonal cholera outbreaks in Bangladesh, but it has been shown that Vibrio cholerae, the causative bacterial agent of cholera, can be detected in the environment year round [Colwell and Huq (1994), Huq et al. (1990)]. Environmental forces are thought to contribute to the spread of cholera, as is evident from the many cholera disease dynamics models that incorporate the role of the aquatic environment in cholera transmission through an environmental reservoir effect [Codeço (2001), Tien and Earn (2010)]. One hypothesis is that proliferation of V. cholerae in the environment triggers the seasonal epidemic, feedback from infected individuals drives the epidemic, and then cholera outbreaks wane, either due to an exhaustion of the susceptibles or due to the deteriorating ecological conditions for propagation of V. cholerae in the environment. We probe this hypothesis using cholera incidence data and ecological data collected from multiple thanas (administrative subdistricts with a police station) in rural Bangladesh over sixteen years. There have been three phases of data collection so far, each lasting approximately three years and being separated by gaps of a few years; the current collection phase is ongoing. For a subset of these data, Huq et al. (2005) used Poisson regression to study the association between lagged predictors from a particular water body and cholera cases in that thana. This resulted in different lags and different significant covariates across multiple water bodies and thanas. Thus, it was hard to derive a cohesive model for predicting cholera outbreaks from the environmental covariates. Also, there is no easy way to account for disease dynamics in this Poisson regression framework. We want to measure the effect of the environmental covariates while accounting for disease dynamics via mechanistic models of disease transmission. Moreover, we want to see if we can make reliable short-term predictions with our model, a task that was not attempted by Huq et al. (2005).

Mechanistic infectious disease models use scientific understanding of the transmission process to develop dynamical systems that describe the evolution of the process [Bretó et al. (2009)]. Realistic models of disease transmission incorporate nonlinear dynamics [He, Ionides and King (2010)], which leads to difficulties with statistical inference under these models, specifically in the tractability of the likelihood. Keeling and Ross (2008) demonstrate some of these difficulties; they use an exact stochastic continuous-time, discrete-state model which evolves Markov processes using the deterministic Kolmogorov forward equations to express the probabilities of being in all possible states. However, that method only works for small populations due to computational limitations. To overcome this intractability, Finkenstädt and Grenfell (2000) develop a time-series Susceptible-Infected-Recovered (SIR) model which extends mechanistic models of disease dynamics to larger populations. A similar development is the auto-Poisson model of Held, Höhle and Hofmann (2005). To facilitate tractability of the likelihood, both of the above approaches make simplifying assumptions that are difficult to test. Moreover, these discrete-time approaches work only for evenly spaced data or require aggregating the data into evenly spaced intervals. Cauchemez and Ferguson (2008) develop a different, continuous-time, approach to analyze epidemiological time-series data, but assume the transmission parameter and number of susceptibles remain relatively constant within an observation period. Our current understanding of cholera disease dynamics leads us to think that this assumption is not appropriate for modeling endemic cholera with seasonal outbreaks.

To implement a mechanistic approach without these approximations, both maximum likelihood and Bayesian methods can be used. Maximum likelihood-based statistical inference techniques use Monte Carlo methods to allow maximization of the likelihood without explicitly evaluating it [Bhadra et al. (2011), Bretó et al. (2009), He, Ionides and King (2010), Ionides, Bretó and King (2006)]. Ionides, Bretó and King (2006) use this methodology to study how large-scale climate fluctuations influence cholera transmission in Bangladesh. Bhadra et al. (2011) use this framework to study malaria transmission in India. They are able to incorporate a rainfall covariate into their model and study how climate fluctuations influence disease incidence when one controls for disease dynamics, such as waning immunity. Under a Bayesian approach, approximate Bayesian computation (ABC) techniques exist which avoid computation of the likelihood [Rubin (1984)]. Toni et al. (2009) use ABC to estimate parameters of dynamical models, and McKinley, Cook and Deardon (2009) utilize ABC in the context of epidemic models. Alternatively, particle Markov chain Monte Carlo (MCMC) methods have been developed which require only an unbiased estimate of the likelihood [Andrieu, Doucet and Holenstein (2010)]. Rasmussen, Ratmann and Koelle (2011) use this particle MCMC methodology to simultaneously estimate the epidemiological parameters of a SIR model and past disease dynamics from time series data and gene genealogies. Using Google flu trends data [Ginsberg et al. (2008)], Dukic, Lopes and Polson (2012) implement a particle filtering algorithm which sequentially estimates the odds of a pandemic. Notably, Dukic, Lopes and Polson (2012) concentrate on predicting influenza activity. Analyzing the same data, Fearnhead, Giagos and Sherlock (2014) also develop a predictive model for flu outbreaks using a linear noise approximation [Ferm, Lötstedt and Hellander (2008), Komorowski et al. (2009), van Kampen (1992)]. Similarly, here we develop a model-based predictive framework for seasonal cholera epidemics in Bangladesh.

In this paper we use a combination of sequential Monte Carlo and MCMC methods. Specifically, we develop a hidden Susceptible-Infected-Recovered-Susceptible (SIRS) model for cholera transmission in Bangladesh, incorporating environmental covariates. We use a particle MCMC method to sample from the posterior distribution of the environmental and transmission parameters given the observed data, as described by Andrieu, Doucet and Holenstein (2010). Further, we predict future behavior of the epidemic within our Bayesian framework. Cholera transmission dynamics in our model are described by a continuous-time, rather than a discrete-time, Markov process to easily incorporate data with irregular observation times. Also, the continuous-time framework allows for greater parameter interpretability and comparability to models based on deterministic differential equations. We test our Bayesian inference procedure using simulated cholera data, generated from a model with a time-varying environmental covariate. We then analyze cholera data from Mathbaria, Bangladesh, similar to the data studied by Huq et al. (2005). Parameter estimates indicate that most of the transmission is coming from environmental sources. We test the ability of our model to make short-term predictions during different time intervals in the data observation period and find that the pattern of predictive distribution dynamics matches the pattern of changes in the reported number of cases. Moreover, we find that the predictive distribution of the hidden states, specifically the unobserved number of infected individuals, clearly pinpoints the beginning of an epidemic approximately two to three weeks in advance, making our methodology potentially useful during cholera surveillance in Bangladesh.

2. SIRS model with environmental predictors. We consider a compartmental model of disease transmission [Keeling and Rohani (2008), May and Anderson (1991)] where the population is divided into three disease states, or compartments: susceptible, infected and recovered. We model a continuous process observed at discrete time points. The vector $X_t = (S_t, I_t, R_t)$ contains the numbers of susceptible, infected and recovered individuals at time $t$, and we consider a closed population of size $N$ such that $N = S_t + I_t + R_t$ for all $t$. Individuals move between the compartments with different rates; for cholera transmission we consider the transition rates shown in Figure 1. In this framework, a susceptible individual's rate of infection is proportional to the number of infected people and the covariates that serve as proxy for the amount of V. cholerae in the environment. Thus, the hazard rate of infection, also called the force of infection, is $\beta I_t + \alpha(t)$ for each time $t$, where $\beta$ represents the infectious contact rate between infected individuals and susceptible individuals and $\alpha(t)$ represents the time-varying environmental force of infection. Possible mechanisms for infectious contact include direct person-to-person transmission of cholera and consumption of water that has been contaminated by infected individuals. If $I_t = 0$, as it might between seasonal cholera epidemics, the hazard rate of infection is just $\alpha(t)$, so all of the force of infection comes from the environment. Infected individuals recover from infection at a rate $\gamma$, where $1/\gamma$ is the average length of the infectious period. Once the infected individual has recovered from infection, they move to the recovered compartment. Recovered individuals develop a temporary immunity to the disease after infection. They move from the recovered compartment to the susceptible compartment with rate $\mu$, where $1/\mu$ is the average length of immunity. Similar to Codeço (2001) and Koelle and Pascual (2004), birth and death are incorporated into the system indirectly through the waning of immunity; thus, instead of representing natural loss of immunity only, $\mu$ also represents the loss of immunity through the death of recovered individuals and the birth of new susceptible individuals.

FIG. 1. State transitions for Susceptible-Infected-Recovered-Susceptible (SIRS) model for cholera. $S$, $I$ and $R$ denote the numbers of susceptible, infected and recovered individuals. From the current state $(S, I, R)$, the system can transition to one of three new states. These new states correspond to a susceptible becoming infected, an infected recovering from infection, or a recovered individual losing immunity to infection and becoming susceptible. The parameter $\beta$ is the infectious contact rate, $\alpha(t)$ is the time-varying environmental force of infection, $\gamma$ is the recovery rate, and $\mu$ is the rate at which immunity is lost.

Under this model, $X_t$ is an inhomogeneous Markov process [Taylor and Karlin (1998)] with infinitesimal rates

$$
\lambda_{(S,I,R),(S',I',R')}(t) =
\begin{cases}
(\beta I + \alpha(t))\,S, & \text{if } S' = S - 1,\ I' = I + 1,\ R' = R,\\
\gamma I, & \text{if } S' = S,\ I' = I - 1,\ R' = R + 1,\\
\mu R, & \text{if } S' = S + 1,\ I' = I,\ R' = R - 1,\\
0, & \text{otherwise,}
\end{cases}
\tag{1}
$$

where $X = (S, I, R)$ is the current state and $X' = (S', I', R')$ is a new state. Because $R_t = N - S_t - I_t$, we keep track of only susceptible and infected individuals, $S_t$ and $I_t$.
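As a minimal illustration of equation (1), the three nonzero rates can be computed from the current state as sketched below. This is not code from the paper; the function name `sirs_rates` and its argument names are assumptions made for illustration.

```python
def sirs_rates(S, I, N, beta, gamma, mu, alpha_t):
    """Nonzero transition rates of equation (1) at the current state.

    Hypothetical helper (not from the paper). R is recovered from the
    closed-population constraint N = S + I + R.
    """
    R = N - S - I
    infection = (beta * I + alpha_t) * S  # (S, I, R) -> (S - 1, I + 1, R)
    recovery = gamma * I                  # (S, I, R) -> (S, I - 1, R + 1)
    waning = mu * R                       # (S, I, R) -> (S + 1, I, R - 1)
    return infection, recovery, waning
```

Setting `I = 0` leaves an infection rate of `alpha_t * S`, which is the purely environmental force of infection described for the periods between seasonal epidemics.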

This type of compartmental model is similar to other cholera models in the literature. The time-series SIRS model of Koelle and Pascual (2004) also includes the effects of both intrinsic factors (disease dynamics) and extrinsic factors (environment) on transmission. King et al. (2008) examine both a regular SIRS model and a two-path model to include asymptomatic infections, and use a time-varying transmission term that incorporates transmission via the environmental reservoir and direct person-to-person transmission, but does not allow for feedback from infected individuals into the environmental reservoir. The SIWR model of Tien and Earn (2010) and Eisenberg, Robertson and Tien (2013) allows for infections from both a water compartment (W) and direct transmission, and considers the feedback created by infected individuals contaminating the water. To allow for the possibility of asymptomatic individuals, Longini et al. (2007) use a model with a compartment for asymptomatic infections; that model only considers direct transmission. Codeço (2001) uses an SIR model with no direct person-to-person transmission; infected individuals excrete directly into the environment and susceptible individuals are infected from exposure to contaminated water. Our SIRS model is not identical to any of the above models, but it borrows from them two important features: explicit modeling of disease transmission from either direct person-to-person transmission of cholera or consumption of water that has been contaminated by infected individuals and a time-varying environmental force of infection.

3. Hidden SIRS model. While the underlying dynamics of the disease are described by $X_t$, these states are not directly observed. The number $y_t$ of infected individuals observed at each time point $t$ is only a random fraction of the number of infected individuals. This fraction depends on both the number of infected individuals that are symptomatic and the fraction of symptomatic infected individuals that seek treatment and get reported (the reporting rate). Thus, $y_{t_i}$, the number of observed infections at time $t_i$ for observation $i \in \{0, 1, \ldots, n\}$, has a binomial distribution with size $I_{t_i}$, the number of infected individuals at time $t_i$, and success probability $\rho$, the probability of infected individuals seeking treatment, so $y_{t_i} \mid X_{t_i} = (S_{t_i}, I_{t_i}, R_{t_i}), \rho \sim \text{Binomial}(I_{t_i}, \rho)$. Given $X_{t_i}$, $y_{t_i}$ is independent of the other observations and other hidden states.
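The binomial observation model can be evaluated on the log scale, which is how it would typically enter a likelihood computation. The sketch below is an illustration only; the function name `binom_logpmf` is an assumption, and the code assumes $0 < \rho < 1$.

```python
import math

def binom_logpmf(y, n, rho):
    """log Pr(y | n, rho) for y ~ Binomial(n, rho), assuming 0 < rho < 1.

    Hypothetical helper: n plays the role of the hidden infected count
    I_{t_i} and y the reported case count y_{t_i}.
    """
    if y < 0 or y > n:
        return -math.inf  # more reported cases than infected is impossible
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return log_choose + y * math.log(rho) + (n - y) * math.log1p(-rho)
```

Returning negative infinity when the reported count exceeds the hidden infected count means that any hidden state inconsistent with the data receives zero weight.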

We use a Bayesian framework to estimate the parameters of the hidden SIRS model, where the unobserved states $X_t$ are governed by the infinitesimal rates in equation (1). The parameters that we want to estimate are $\beta$, $\gamma$, $\mu$, $\rho$ and the $k + 1$ parameters that will be incorporated into $\alpha(t)$, the time-varying environmental force of infection. We let $C_1(t), \ldots, C_k(t)$ denote the $k$ time-varying environmental covariates, and we assume $\alpha(t) = \exp(\alpha_0 + \alpha_1 C_1(t) + \cdots + \alpha_k C_k(t))$.

We assume independent Poisson initial distributions for $S_{t_0}$ and $I_{t_0}$, with means $\phi_S$ and $\phi_I$. The population size $N$ is assumed to be known, and we check sensitivity to this assumption. Parameters that are constrained to be greater than zero, such as $\beta$, $\gamma$, $\mu$, $\phi_S$ and $\phi_I$, are transformed to the log scale. A logit transformation is used for the probability $\rho$. We assume independent normal prior distributions on all of the transformed parameters, incorporating biological information into the priors where possible.
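A small sketch of these transformations follows, mapping the natural-scale parameters to an unconstrained vector and back. The function names and the choice to pass the environmental coefficients as a list are assumptions for illustration; the entries are stacked in the order used for $\theta$ in this section.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(x):
    # Numerically stable inverse logit.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def to_unconstrained(beta, gamma, mu, rho, alphas, phi_S, phi_I):
    """Stack (log beta, log gamma, log mu, logit rho, alpha_0..alpha_k,
    log phi_S, log phi_I). The alphas are already unconstrained."""
    return [math.log(beta), math.log(gamma), math.log(mu), logit(rho),
            *alphas, math.log(phi_S), math.log(phi_I)]

def to_natural(theta):
    """Invert to_unconstrained; returns a dict of natural-scale values."""
    return {
        "beta": math.exp(theta[0]),
        "gamma": math.exp(theta[1]),
        "mu": math.exp(theta[2]),
        "rho": expit(theta[3]),
        "alphas": list(theta[4:-2]),
        "phi_S": math.exp(theta[-2]),
        "phi_I": math.exp(theta[-1]),
    }
```

Working on the transformed scale lets a normal random walk proposal move freely without ever producing a negative rate or a probability outside (0, 1).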

We are interested in the posterior distribution $\Pr(\theta \mid y) \propto \Pr(y \mid \theta)\,p(\theta)$, where $y = (y_{t_0}, \ldots, y_{t_n})$, $\theta = (\log(\beta), \log(\gamma), \log(\mu), \text{logit}(\rho), \alpha_0, \ldots, \alpha_k, \log(\phi_S), \log(\phi_I))$, and

$$
\Pr(y \mid \theta) = \sum_{X} \left( \prod_{i=0}^{n} \Pr(y_{t_i} \mid I_{t_i}, \rho) \left[ \Pr(X_{t_0} \mid \phi_S, \phi_I) \prod_{i=1}^{n} p(X_{t_i} \mid X_{t_{i-1}}, \theta) \right] \right).
$$

Here $p(X_{t_i} \mid X_{t_{i-1}}, \theta)$ for $i = 1, \ldots, n$ are the transition probabilities of the continuous-time Markov chain (CTMC). However, this likelihood is intractable; there is no practical method to compute the finite time transition probabilities of the SIRS CTMC because the size of the state space of $X_t$ grows on the order of $N^2$. For the same reason, summing over $X$ with the forward-backward algorithm [Baum et al. (1970)] is not feasible. To use Bayesian inference despite this likelihood intractability, we turn to a particle marginal Metropolis-Hastings (PMMH) algorithm.

4. Particle filter MCMC.

4.1. Overview. The PMMH algorithm, introduced by Beaumont (2003) and studied in Andrieu and Roberts (2009) and Andrieu, Doucet and Holenstein (2010), constructs a Markov chain that targets the joint posterior distribution $\pi(\theta, X \mid y)$, where $X$ is a set of auxiliary or hidden variables, and requires only an unbiased estimate of the likelihood. To construct this likelihood estimate, we use a sequential Monte Carlo (SMC) algorithm, also known as a bootstrap particle filter [Doucet, de Freitas and Gordon (2001)]. Thus, the PMMH algorithm proceeds as follows: at each Metropolis-Hastings step [Hastings (1970), Metropolis et al. (1953)], a new $\theta^*$ is proposed from the proposal distribution $q(\cdot \mid \theta)$, an SMC algorithm is used to estimate the marginal likelihood of the data given the proposed set of parameters, $\hat{p}(y \mid \theta^*)$, and to obtain a sample $X^*_{t_{0:n}} \sim \hat{p}(\cdot \mid y, \theta^*)$, and $\theta^*$, $X^*_{t_{0:n}}$ and $\hat{p}(y \mid \theta^*)$ are accepted with a Metropolis-Hastings acceptance ratio which uses the estimated likelihood.

An SMC algorithm is used to generate and weight $K$ particle trajectories corresponding to the hidden state processes using the proposed parameter set $\theta^*$ as follows. Let the superscript $k \in \{1, \ldots, K\}$ denote the particle index, where $K$ is the total number of particles, and let the subscript $t_i \in \{t_0, \ldots, t_n\}$ denote the time; thus, $X^k_{t_i}$ denotes the $k$th particle at time $t_i$, and $X^k_{t_{0:i}} = (X^k_{t_0}, \ldots, X^k_{t_i})$. At time $t_i = t_0$, we simulate initial states $X^k_{t_0} = (S^k_{t_0}, I^k_{t_0})$ for $k = 1, \ldots, K$ from the initial density of the hidden Markov state process, specifically from Poisson distributions with means $\phi_S$ and $\phi_I$. We compute the $K$ weights $w(X^k_{t_0}) := \Pr(y_{t_0} \mid X^k_{t_0}, \theta^*)$, and set $W(X^k_{t_0}) = w(X^k_{t_0}) / \sum_{k'=1}^{K} w(X^{k'}_{t_0})$. For $i = 1, \ldots, n$, we resample $\bar{X}^k_{t_{i-1}}$ from $X^k_{t_{i-1}}$ with weights $W(X^k_{t_{i-1}})$. We sample $K$ particles $X^k_{t_i}$ from $p(\cdot \mid \bar{X}^k_{t_{i-1}})$. We assign weights $w(X^k_{t_i}) := \Pr(y_{t_i} \mid X^k_{t_i}, \theta^*)$, compute normalized weights $W(X^k_{t_i}) = w(X^k_{t_i}) / \sum_{k'=1}^{K} w(X^{k'}_{t_i})$, and set $X^k_{t_{0:i}} = (\bar{X}^k_{t_{0:i-1}}, X^k_{t_i})$.

The marginal likelihood is estimated by summing the weights of the SMC algorithm, since

$$
\hat{p}(y_{t_i} \mid y_{t_{0:i-1}}, \theta^*) = \frac{1}{K} \sum_{k=1}^{K} w(X^k_{t_i})
$$

is an approximation to the likelihood $p(y_{t_i} \mid y_{t_{0:i-1}}, \theta^*)$, and therefore an approximation to the total likelihood is

$$
\hat{p}(y \mid \theta^*) = \hat{p}(y_{t_0} \mid \theta^*) \prod_{i=1}^{n} \hat{p}(y_{t_i} \mid y_{t_{0:i-1}}, \theta^*).
$$
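The SMC steps and the likelihood estimate described in this subsection can be sketched as a generic bootstrap particle filter. The code below is an illustration under stated assumptions, not the authors' implementation: the function `bootstrap_filter` and the callables `init`, `propagate` and `obs_logpmf` are hypothetical names, weights are handled on the log scale for numerical stability, and the toy model at the end (a constant hidden state observed through a Binomial with probability 0.5) exists only to exercise the code.

```python
import math
import random

def bootstrap_filter(y, init, propagate, obs_logpmf, K, rng):
    """Return (log of the estimated likelihood, one sampled trajectory).

    init(rng)             -> hidden state at t_0
    propagate(x, i, rng)  -> hidden state at t_i given state x at t_{i-1}
    obs_logpmf(y_i, x)    -> log Pr(y_{t_i} | X_{t_i} = x)
    """
    paths = [[init(rng)] for _ in range(K)]
    loglik = 0.0
    W = None  # normalized weights from the previous observation time
    for i, y_i in enumerate(y):
        if i > 0:
            # Resample ancestors with the normalized weights, then propagate.
            ancestors = rng.choices(range(K), weights=W, k=K)
            paths = [paths[a] + [propagate(paths[a][-1], i, rng)] for a in ancestors]
        logw = [obs_logpmf(y_i, path[-1]) for path in paths]
        m = max(logw)
        if m == -math.inf:
            return -math.inf, None  # every particle is inconsistent with y_i
        w = [math.exp(lw - m) for lw in logw]
        total = sum(w)
        loglik += m + math.log(total / K)  # log of (1/K) * sum of weights
        W = [v / total for v in w]
    # Draw one trajectory according to the final normalized weights.
    chosen = rng.choices(range(K), weights=W, k=1)[0]
    return loglik, paths[chosen]

# Toy check: hidden state fixed at 10 infected, reporting probability 0.5.
def toy_logpmf(y_i, x):
    if y_i < 0 or y_i > x:
        return -math.inf
    return math.log(math.comb(x, y_i)) + x * math.log(0.5)

toy_loglik, toy_path = bootstrap_filter(
    [5, 4], lambda rng: 10, lambda x, i, rng: x, toy_logpmf,
    K=20, rng=random.Random(1))
```

Because every particle in the toy model carries the same state, the estimate coincides with the exact log-likelihood there; for the SIRS model, `propagate` would be a stochastic simulator of the hidden process between observation times.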

A proposed $X^*_{t_{0:n}} = (X^*_{t_0}, \ldots, X^*_{t_n})$ trajectory is sampled from the $K$ particle trajectories based on the final particle weights of the SMC algorithm, and the proposed $\theta^*$ and $X^*_{t_{0:n}}$ are accepted with probability

$$
\min\left\{1,\ \frac{\hat{p}(y \mid \theta^*)}{\hat{p}(y \mid \theta)} \, \frac{p(\theta^*)}{p(\theta)} \, \frac{q(\theta \mid \theta^*)}{q(\theta^* \mid \theta)} \right\},
$$

where $p(\theta)$ is the prior distribution for $\theta$. To sample $X_{t_i}$ from $p(\cdot \mid \bar{X}_{t_{i-1}})$, we simulate from a cholera transmission model with a time-varying environmental force of infection. CTMCs which incorporate time-varying transition rates are inhomogeneous. The details of the discretely-observed inhomogeneous CTMC simulations are now described.
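The acceptance probability is usually evaluated on the log scale, because the estimated likelihoods are products of many small terms. The following sketch is illustrative; the function names and the default of zero for the two proposal terms (appropriate for a symmetric random walk, where they cancel) are assumptions.

```python
import math
import random

def log_acceptance(loglik_new, loglik_old, logprior_new, logprior_old,
                   logq_reverse=0.0, logq_forward=0.0):
    """log of min{1, ratio} for the PMMH acceptance step.

    logq_reverse = log q(theta | theta*), logq_forward = log q(theta* | theta).
    """
    if loglik_new == -math.inf:
        return -math.inf  # proposal has zero estimated likelihood: reject
    log_ratio = ((loglik_new - loglik_old)
                 + (logprior_new - logprior_old)
                 + (logq_reverse - logq_forward))
    return min(0.0, log_ratio)

def accept(log_alpha, rng):
    """Accept with probability exp(log_alpha)."""
    return rng.random() < math.exp(log_alpha)
```

Note that the estimated likelihood of the current state is stored and reused in the ratio rather than recomputed, which is what allows the chain to target the exact posterior despite the Monte Carlo noise in the estimate.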

4.2. Simulating inhomogeneous SIRS using tau-leaping. Gillespie developed two methods for exact stochastic simulation of trajectories with constant rates: the direct method [Gillespie (1977)] and the first reaction method [Gillespie (1976)]. Details of these methods are given in Appendix A in the supplementary material [Koepke et al. (2016)]. The exact algorithms work for small populations, but for large state spaces these methods require a prohibitively long computing time. This is a common problem in the chemical kinetics literature, where an approximate method called the tau-leaping algorithm originated [Cao, Gillespie and Petzold (2005), Gillespie (2001)]. This method simulates CTMCs by jumping over a small amount of time $\tau$ and approximating the number of events that happen in this time using a series of Poisson distributions. As $\tau$ approaches zero, this approximation theoretically approaches the exact algorithm. The value of $\tau$ must be chosen such that the rates remain roughly constant over the period of time; this is referred to as the "leap condition."

Specifically, for our simulation, using the methods outlined in Cao, Gillespie and Petzold (2005), we define the rate functions $h_1(X_t) = (\beta I_t + \alpha(t)) S_t$, $h_2(X_t) = \gamma I_t$, and $h_3(X_t) = \mu R_t$, corresponding to the infinitesimal rates of the CTMC. Then $k_1 \sim \text{Poisson}(h_1(X_t)\tau)$ represents the number of infections in time $[t, t + \tau)$, $k_2 \sim \text{Poisson}(h_2(X_t)\tau)$ represents the number of recoveries in time $[t, t + \tau)$, and $k_3 \sim \text{Poisson}(h_3(X_t)\tau)$ represents the number of people that become susceptible to infection in time $[t, t + \tau)$. We make the assumption that the time-varying force of infection, $\alpha(t)$, remains constant each day. We define daily time intervals $A_i := [i, i + 1)$ for $i \in \{t_0, t_0 + 1, \ldots, t_n - 1\}$, and $\alpha(t) = \alpha_{A_i}$ for $t \in A_i$. Using $\tau \le 1$ day, our rates now remain constant within each tau jump. See Appendix A for details regarding the selection of $\tau$.
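A single tau-leap update for this model might look like the sketch below. Several things here are assumptions rather than the paper's procedure: the event counts are clipped so that no compartment becomes negative (Appendix A may treat this differently), $\tau$ is fixed at an equal fraction of a day instead of being selected adaptively, and a simple stdlib Poisson sampler (Knuth's multiplication method, adequate only for means up to a few hundred) is used to keep the example dependency-free.

```python
import math
import random

def poisson(lam, rng):
    """Knuth's multiplication sampler; suitable for modest means only."""
    if lam <= 0:
        return 0
    limit = math.exp(-lam)
    k = 0
    p = rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def tau_leap_step(S, I, N, beta, gamma, mu, alpha_day, tau, rng):
    """Advance (S, I) by tau days with the force of infection held at alpha_day."""
    R = N - S - I
    # Event counts over [t, t + tau); clipping is an assumption of this sketch.
    k1 = min(poisson((beta * I + alpha_day) * S * tau, rng), S)  # infections
    k2 = min(poisson(gamma * I * tau, rng), I)                   # recoveries
    k3 = min(poisson(mu * R * tau, rng), R)                      # immunity losses
    return S - k1 + k3, I + k1 - k2

def simulate_day(S, I, N, beta, gamma, mu, alpha_day, steps_per_day, rng):
    """Cover one daily interval A_i with steps_per_day equal leaps."""
    tau = 1.0 / steps_per_day
    for _ in range(steps_per_day):
        S, I = tau_leap_step(S, I, N, beta, gamma, mu, alpha_day, tau, rng)
    return S, I
```

A routine like `simulate_day`, called once per day between consecutive observation times, is one way to provide the propagation step that the particle filter needs.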

4.3. Metropolis-Hastings proposal for model parameters. Our implementation of the PMMH algorithm consists of a pilot run with a burn-in and a post-burn-in period, and a final run. Both the pilot burn-in and the post-burn-in periods use independent normal random walk proposal distributions for the parameters. The pilot burn-in period is thrown away; from the pilot run post-burn-in period, we calculate the approximate posterior covariance of the parameters and scale it by a factor to construct the covariance of the multivariate normal random walk proposal distribution in the final run of the PMMH algorithm. In all runs, parameters are proposed and updated jointly. Proposal covariance matrices are scaled such that acceptance rates for all final runs are between 15% and 20%.
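The construction of the final-run proposal covariance can be sketched as follows. The function name and the `scale` argument are hypothetical; the source does not state the scaling factor, so in practice it would be tuned until the acceptance rate falls in the desired range.

```python
def scaled_covariance(samples, scale):
    """Sample covariance of pilot draws (rows = iterations), times `scale`.

    `scale` is a tuning constant chosen by the user; no value is given
    in the text, so none is assumed here.
    """
    n = len(samples)
    d = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    return [
        [scale * sum((row[a] - means[a]) * (row[b] - means[b]) for row in samples) / (n - 1)
         for b in range(d)]
        for a in range(d)
    ]
```

Using the pilot covariance lets the final-run proposal follow the correlations between parameters, which independent per-parameter proposals cannot do.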


4.4. Prediction. One of the main goals of this analysis is to be able to predict cholera outbreaks in advance using environmental predictors. To assess the predictive ability of our model, we estimate the parameters of the model using a training set of data and then predict future behavior of the epidemic process. We examine the posterior predictive distributions of cholera counts by simulating data forward in time under the time-varying SIRS model using the accepted parameter values explored by the particle MCMC algorithm and the accepted values of the hidden states $S_T$ and $I_T$ at the final observation time, $t = T$, of the training data. Under each set of parameters, we generate possible future hidden states and observed data, and we compare the posterior predictive distribution of observed cholera cases to the test data. In the analyses below, the PMMH output is always thinned to 1000 iterations for prediction purposes by saving only every $k$th iteration, where $k$ depends on the total number of iterations.
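The thinning step can be illustrated in a few lines. The helper name `thin` and the rule used to pick the step (integer division of the chain length by the target) are assumptions; the text only says that the step depends on the total number of iterations.

```python
def thin(chain, target=1000):
    """Keep every step-th draw so that about `target` draws remain."""
    step = max(1, len(chain) // target)
    # Start at index step - 1 so the kept draws are the step-th, 2*step-th, ...
    return chain[step - 1::step][:target]
```

For a stored chain of 40,000 draws this keeps draws 40, 80, ..., 40,000. Each retained draw would supply one parameter set and one pair of final hidden states from which to simulate forward.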

5. Simulation results. To test the PMMH algorithm on simulated infectious disease data, we generate data from a hidden SIRS model with a time-varying environmental force of infection. We then use our Bayesian framework to estimate the parameters of the simulated model and compare the posterior distributions of the parameters with the true values. To simulate endemic cholera where many people have been previously infected, we start with a population size of $N = 10{,}000$ and assume independent Poisson initial distributions for $S_{t_0}$ and $I_{t_0}$, with means $\phi_S = 2900$ and $\phi_I = 84$. The other parameters are set at $\beta = 5 \times 10^{-5}$, $\gamma = 0.12$, and $\mu = 0.0018$. All rates are measured in the number of events per day. The average length of the infectious period, $1/\gamma$, is set to be 8 days, and the average length of immunity, $1/\mu$, is set to be about 1.5 years. Parameter values are chosen such that the simulated data are similar to the data collected from Mathbaria, Bangladesh. We use the daily time intervals $A_i := [i, i + 1)$ for $i \in \{t_0, t_0 + 1, \ldots, t_n - 1\}$, as in Section 4.2, and define $\alpha(t) = \alpha_{A_i}$ for $t \in A_i$, where $\alpha_{A_i} = \exp[\alpha_0 + \alpha_1 C(i)]$. Here $C(i) = a(z) \sin(2\pi i / 365)$ for $365(z - 1) < i \le 365z$ and $z \in \{1, 2, 3, 4, 5\}$, where $a(1) = 2.1$, $a(2) = 1.8$, $a(3) = 2$, $a(4) = 2.2$, and $a(5) = 2$. The intercept $\alpha_0$ and the amplitude $\alpha_1$ are parameters to be estimated. The frequency of the sine function is set to mimic the annual peak seen in the environmental data collected from Bangladesh. For the simulations we set $\alpha_0 = -6$ and $\alpha_1 = 2$. Using the modified Gillespie algorithm described in Appendix A, we simulate the $(S_t, I_t)$ chain given in the left plot of Figure 2. The observed number of infections is $y_t \sim \text{Binomial}(I_t, \rho)$, where $\rho = 0.009$ is treated as an unknown parameter.
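The simulated covariate and the resulting daily force of infection can be written out directly from these definitions. In the sketch below the function names are assumptions, and day 0 is assigned to year $z = 1$ as a convention of the sketch (the sine term is zero there, so the choice does not affect the value).

```python
import math

# Yearly amplitudes a(1), ..., a(5) as given in the text.
AMPLITUDES = {1: 2.1, 2: 1.8, 3: 2.0, 4: 2.2, 5: 2.0}

def covariate(i):
    """C(i) = a(z) * sin(2*pi*i/365), with 365(z - 1) < i <= 365z."""
    z = max(1, (i + 364) // 365)  # integer ceiling of i / 365
    return AMPLITUDES[z] * math.sin(2.0 * math.pi * i / 365.0)

def alpha_day(i, alpha0=-6.0, alpha1=2.0):
    """Environmental force of infection on day i, held constant over A_i."""
    return math.exp(alpha0 + alpha1 * covariate(i))
```

With $\alpha_0 = -6$ and $\alpha_1 = 2$, the force of infection swings between roughly $\exp(-6 - 2a(z))$ and $\exp(-6 + 2a(z))$ within year $z$, which is what produces one sharp seasonal peak per year.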

We simulate four years of training data. There is not enough information in the data to estimate the means of the Poisson initial distributions, $\phi_S$ and $\phi_I$, since estimation of these parameters is only informed by the very beginning of the observed data. We set these parameters to different values and compare parameter estimation and prediction between models with parameter assumptions which differ from the truth. We also assume that we know the population size, $N = 10{,}000$.


FIG. 2. Plots of simulated hidden states (counts of susceptible, $S_t$, and infected, $I_t$, individuals) and the observed data [number of observed infections $= y_t \sim \text{Binomial}(I_t, \rho)$] plotted over time, $t$. Simulation with seasonally varying $\alpha(t)$ generates data with seasonal epidemic peaks. The dashed vertical black line represents the first cutoff between the training sets and the test data. Data before the line are used to estimate parameters, and we use those estimates to predict the data after the line. Other data cutoffs are shown in Figure 3.

We assume independent normal prior distributions for the elements of $\theta$, with means and standard deviations chosen such that the mass of each prior distribution is not centered at the true value of the parameter in this simulation setting. We use relatively uninformative, diffuse priors for $\log(\beta)$, $\alpha_0$ and $\alpha_1$, centered at $\log(1.25 \times 10^{-4})$, $-8$ and $0$, respectively, and with standard deviations of 5. The prior distribution for $\text{logit}(\rho)$ is centered at $\text{logit}(0.03)$ and has a standard deviation of 2. For $\log(\gamma)$, the prior is centered at $\log(0.1)$ with a relatively small standard deviation of 0.09, since this value is well studied for cholera. The prior for $\log(\mu)$ is centered at $\log(0.0009)$ with a standard deviation of 0.3. Thus, a priori, $1/\gamma$ falls between 8.4 and 11.9 days with probability 0.95, and $1/\mu$ falls between 1.7 and 5.5 years with the same probability.
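The quoted prior intervals follow from the normal priors on the log scale: a 95% central interval for $\log(\gamma)$ is $\log(0.1) \pm 1.96 \times 0.09$, and exponentiating and inverting gives the range for $1/\gamma$. A short check, with helper names that are assumptions and a 365-day year assumed for the conversion:

```python
import math

Z = 1.959964  # 97.5% quantile of the standard normal distribution

def central_interval(center, sd, z=Z):
    """95% interval on the natural scale when log(param) ~ Normal(log(center), sd)."""
    return center * math.exp(-z * sd), center * math.exp(z * sd)

gamma_lo, gamma_hi = central_interval(0.1, 0.09)
infectious_days = (1.0 / gamma_hi, 1.0 / gamma_lo)        # range for 1/gamma, in days

mu_lo, mu_hi = central_interval(0.0009, 0.3)
immunity_years = (1.0 / mu_hi / 365.0, 1.0 / mu_lo / 365.0)  # range for 1/mu, in years
```

Rounded to one decimal place, these reproduce the 8.4 to 11.9 day and 1.7 to 5.5 year ranges stated above.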

Using these data, we run the PMMH algorithm with a pilot run (a burn-in period of 30,000 iterations followed by a post-burn-in period of 20,000 iterations) and a final run of 400,000 iterations. To thin the chains, we save only every 10th iteration. We use K = 100 particles in the SMC algorithm. We compare results from models with different assumptions on the values of φS and φI: assumed φS/N and φI/N are above the true values (0.39 and 0.0168), at the true values (0.29 and 0.0084), below the true values (0.19 and 0.0042), or further below the true values (0.095 and 0.0021). Marginal posterior distributions for the parameters of the SIRS model from the final runs of these PMMH algorithms are in Appendix B. The posterior distributions are similar, regardless of assumed values for φS and φI. Trace plots, auto-correlation plots, bivariate scatterplots and effective sample sizes for the posterior samples under the situation in which the true values of φS and φI were assumed are also given in Appendix B. We report β × N and ρ × N, since in sensitivity analyses we found these to be robust to assumptions about the total population size N. From the posterior distributions, it is clear that the algorithm


FIG. 3. Summary of prediction results for simulated data. We run PMMH algorithms on training sets of the data, which are cut off at each of the dashed black lines in the bottom plot. Cutoff times occur at days 1456, 1484, 1512, 1540, 1568 and 1596. Future cases are then predicted until the next cutoff. The top plot compares the posterior probability of the predicted counts to the test data (diamonds connected by straight purple lines). The coloring of the bars is determined by the frequency of each set of counts in the predicted data for each time point. The bottom plot shows how the trajectory of the predicted hidden states changes over the course of the epidemic. The gray area and the dot-and-dash line denote the 95% quantiles and median, respectively, of the predictive distribution for the fraction of susceptibles. The short dashed lines and the long dashed line denote the 95% quantiles and median, respectively, of the predictive distribution for the fraction of infected individuals. The solid blue and red lines denote the true simulated fraction of susceptible and infected individuals.

is providing good estimates of the true parameter values, though estimates of the parameters μ and ρ × N differ slightly from the truth when φS and φI are not set at the true values.

5.1. Prediction results. To test the predictive ability of the model, we use multiple cutoff times to separate our simulated data into staggered training sets and test sets. The simulated observed data are shown in the right plot of Figure 2. For each cutoff time, parameters were drawn from the posterior distribution based on the training data. These parameter values were then used to simulate possible realizations of reported infections after the training data until the next cutoff, 28 days later. The distributions of these predicted reported cases are shown in the top plot of Figure 3. The test data are denoted by the purple diamonds, connected by straight lines to help visualize ups and downs in the case counts. Case counts are observed once every 14 days. On each observation day, the colored bar represents the distribution of predicted counts for that day. The distributions of predicted counts on the cutoff days come from the accepted values of the hidden states S_T and I_T at the final observation time, t = T, of the training data. As desired, the posterior predictive distribution shifts its mass as time progresses to follow the case counts in the test data. The plot of the predicted hidden states in the bottom row of Figure 3 also shows that our model is capturing the formation and decline of the epidemic peak well, as seen in the trajectory of the predicted fraction of infected individuals. This plot shows that the predictive distributions are capturing the true simulated fraction of susceptible and infected individuals and illustrates the interplay of the hidden states of the underlying compartmental model. During an epidemic, the fraction of susceptibles decreases while the fraction of infected individuals quickly increases. Afterward, the fraction of infected individuals drops and the pool of susceptibles slowly begins to increase as both immunity is lost and more susceptible individuals are born.
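The prediction procedure described above can be sketched in Python as follows. This is a minimal illustration under strong simplifying assumptions, not the authors' implementation: it replaces the modified Gillespie algorithm with a daily chain-binomial approximation, holds α constant over the forecast window, uses a small population so that the pure-Python binomial draws stay fast, and feeds in made-up "posterior draws". All function names and numerical values are hypothetical.

```python
import math
import random
from collections import Counter


def binom_draw(n, p, rng):
    """Binomial(n, p) draw by summing Bernoulli trials (fine for small n)."""
    return sum(rng.random() < p for _ in range(n))


def simulate_forward(s, i, n_pop, theta, days, rng):
    """Advance (S, I) by `days` daily chain-binomial SIRS steps."""
    beta, gamma, mu, alpha, _rho = theta
    for _ in range(days):
        r = n_pop - s - i
        new_inf = binom_draw(s, 1 - math.exp(-(beta * i + alpha)), rng)
        new_rec = binom_draw(i, 1 - math.exp(-gamma), rng)
        new_sus = binom_draw(r, 1 - math.exp(-mu), rng)  # loss of immunity
        s, i = s - new_inf + new_sus, i + new_inf - new_rec
    return s, i


def predictive_counts(posterior_draws, n_pop, obs_gap, n_obs, rng):
    """Return one predicted case-count path per posterior draw.

    Each draw is (theta, S_T, I_T): parameters plus the hidden states at the
    final observation time of the training data.
    """
    paths = []
    for theta, s, i in posterior_draws:
        rho = theta[4]
        path = []
        for _ in range(n_obs):
            s, i = simulate_forward(s, i, n_pop, theta, obs_gap, rng)
            path.append(binom_draw(i, rho, rng))  # y ~ Binomial(I, rho)
        paths.append(path)
    return paths


rng = random.Random(42)
# Hypothetical posterior draws; theta = (beta, gamma, mu, alpha, rho).
draws = [((0.0004, 0.1, 0.0009, 0.02, 0.1), 200, 10) for _ in range(100)]
paths = predictive_counts(draws, n_pop=500, obs_gap=14, n_obs=2, rng=rng)

# Frequency of each predicted count on the first observation day; this is
# the kind of distribution that the colored bars in Figure 3 summarize.
freq = Counter(path[0] for path in paths)
print(sorted(freq.items()))
```

In a real analysis each draw would carry its own parameter vector and hidden states from the PMMH output, so the spread of the predicted counts would reflect parameter uncertainty as well as the stochasticity of the epidemic process.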

These predictions were made under the assumption that φS and φI are set to the true values. To test sensitivity to these assumptions, we compare predictions made from models that assume other values; these are shown in Appendix D. Predicted distributions are similar for all values of φS and φI. In addition, we study the effects of misspecification of the data-generating process on estimation and prediction. We simulate data from a more biologically realistic model for cholera transmission, and fit the parameters of the SIRS model to this simulated data. Despite the model misspecification, we find that we can still predict outbreaks well. See Appendix G for details.

6. Using cholera incidence data and covariates from Mathbaria, Bangladesh. Huq et al. (2005) found that water temperature (WT) and water depth (WD) in some water bodies had a significant lagged relationship with cholera incidence. Therefore, we use these covariates and cholera incidence data from Mathbaria, Bangladesh, collected from April 2004 to September 2007 and again from October 2010 to July 2013. Between April 2004 and September 2007, physicians made bimonthly visits to the thana health complex in Mathbaria and counted the number of cholera cases that were observed on each day during a three-day period. Environmental data were also collected approximately every two weeks from multiple water bodies in the area. Water bodies include rivers, lakes and ponds. From October 2010 to July 2013, physicians made three-day visits weekly during periods in which seasonal outbreaks were expected and made monthly three-day visits when few cases were expected. Environmental data were collected from multiple


FIG. 4. Barplot of cholera case counts in Mathbaria, Bangladesh, and the standardized covariate measurements over time. The covariates are shown with a lag of three weeks. No data were collected from October 2007 through November 2010. The ranges of the unstandardized smoothed covariates are 1.4 to 2.8 meters for water depth and 21.6 to 33.1°C for water temperature.

water bodies on the same schedule, approximately once a week during seasonal outbreaks and monthly between outbreaks. Further details of the clinical and environmental surveillance are given by Sack et al. (2003) and Huq et al. (2005). We use data from five to six water bodies in each phase of data collection. To get a smooth summary of the covariates using data from the water bodies, we fit a cubic spline to the covariate values. We then slightly modify our environmental force of infection to allow for a lagged covariate effect. Let $\kappa$ denote the length of the lag. We consider the daily time intervals $A_i := [i, i + 1)$ for $i \in \{t_0, t_0 + 1, \ldots, t_n - 1\}$ and define the environmental force of infection $\alpha(t) = \alpha_{A_i}$ for $t \in A_i$ and $t \geq \kappa$, where $\alpha_{A_i} = \exp[\alpha_0 + \alpha_1 C_{WD}(i - \kappa) + \alpha_2 C_{WT}(i - \kappa)]$. Here the covariates are the smoothed standardized daily values $C_{WT}(i) = (WT(i) - \overline{WT})/s_{WT}$ and $C_{WD}(i) = (WD(i) - \overline{WD})/s_{WD}$, where $\overline{X}$ is the mean of the measurements for all $i$ and $s_X$ is the sample standard deviation. We consider and compare results from models assuming three different lags: $\kappa = 14$, $\kappa = 18$, and $\kappa = 21$. Predictions from all three models are similar, so we report only results from the model assuming $\kappa = 21$ in order to receive the earliest warning of upcoming epidemics; see Appendix C for details and prediction comparisons. The smoothed, standardized, 21-day lagged covariates and cholera incidence data are shown in Figure 4.
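The lagged, standardized covariate construction can be illustrated with a short Python sketch. The daily water depth and water temperature series below are made up (simple sinusoids standing in for the spline-smoothed measurements), the coefficient values are the posterior medians reported in Table 1 used purely as plug-in numbers, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def standardize(values):
    """Center by the mean and scale by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def env_force_of_infection(day, c_wd, c_wt, alpha0, alpha1, alpha2, lag):
    """alpha_{A_i} = exp[alpha0 + alpha1*C_WD(i - lag) + alpha2*C_WT(i - lag)]."""
    if day < lag:
        raise ValueError("alpha(t) is only defined for t >= lag")
    j = day - lag
    return math.exp(alpha0 + alpha1 * c_wd[j] + alpha2 * c_wt[j])


# Made-up smoothed daily series (meters, degrees C), indexed from day 0.
wd = [2.0 + 0.6 * math.cos(2 * math.pi * d / 365) for d in range(400)]
wt = [27.0 + 5.0 * math.sin(2 * math.pi * d / 365) for d in range(400)]
c_wd, c_wt = standardize(wd), standardize(wt)

# Force of infection on day 100 driven by covariates observed on day 79.
alpha_day_100 = env_force_of_infection(100, c_wd, c_wt, -6.49, -1.94, 2.35, 21)
print(alpha_day_100)
```

Because the force of infection on day i depends only on covariates observed on day i − κ, covariates measured today determine α(t) up to κ days ahead, which is what makes real-time prediction over that horizon possible.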

The population size N, which quantifies the size of the catchment area for the medical center, is assumed to be 10,000 for computational convenience. We do not know the true value of N, but 10,000 is a reasonable estimate and is small enough that simulations run quickly. We studied sensitivity to this assumption by setting N to different values, obtaining similar results. We also again set φS and φI to various values and the results were insensitive. See Appendix D for details.


In these analyses, we use relatively uninformative, diffuse normal prior distributions on the coefficients α1 and α2 of the time-varying environmental covariates, centered at 0 and with standard deviations of 5. The diffuse normal prior distributions on the transformed parameter values log(β) and α0 are centered at log(1.25 × 10^−7) and −8, respectively, with standard deviations of 5. We know that the average infectious period for cholera, 1/γ, should be between 8 and 12 days. Thus, the transformed parameter log(γ) is given a normal prior distribution with mean log(0.1) and standard deviation 0.09 to give 0.95 prior probability of 1/γ falling within the interval (8, 12). In addition, we know that a reasonable length of immunity under this model should be between one and six years. We use a normal prior distribution for log(μ) centered at log(0.0009) with a standard deviation of 0.3, giving a 0.95 prior probability of 1/μ falling between 1.7 and 5.5 years. We also know that ρ should be very close to zero, since only a small proportion of cholera infections are symptomatic and a smaller proportion will be treated at the health complex [Sack et al. (2003)]. Thus, the transformed parameter logit(ρ) is given a normal prior distribution with mean logit(0.0008) and standard deviation equal to 2, to give 0.95 prior probability of ρ falling within the interval (1.6 × 10^−5, 0.04).
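The interval quoted for ρ can be checked by back-transforming the central 95% interval of the normal prior on the logit scale. The Python snippet below is a worked illustration, not part of the original analysis; the helper names are hypothetical.

```python
import math
from statistics import NormalDist


def logit(p):
    """Log-odds transform."""
    return math.log(p / (1 - p))


def expit(x):
    """Inverse of the logit transform."""
    return 1 / (1 + math.exp(-x))


z = NormalDist().inv_cdf(0.975)      # about 1.96
center, sd = logit(0.0008), 2.0      # prior mean and sd on the logit scale
rho_lo = expit(center - z * sd)
rho_hi = expit(center + z * sd)
print(f"{rho_lo:.1e} to {rho_hi:.2f}")
```

The resulting interval spans more than three orders of magnitude, which shows how weakly the prior constrains the reporting probability despite being centered near zero.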

We run the PMMH algorithm with a pilot run with a burn-in period of 30,000 iterations, a pilot post-burn-in period of 20,000 iterations and a final run of 400,000 iterations. We again save only every 10th iteration and use K = 100 particles in the SMC algorithm. We check sensitivity to the number of particles, using 1000 particles instead of 100 and obtaining similar results; see Appendix D for details. Posterior medians and 95% Bayesian credible intervals for the parameters β × N, γ, μ, α0, α1, α2 and ρ × N generated by the final run of the PMMH algorithm are given in Table 1. We report β × N and ρ × N since we found these parameter estimates to be robust to changes in the population size N during sensitivity analyses. For more details, see Appendix D. The credible intervals for α1 and α2 do not include zero, so both water depth and water temperature have a significant

TABLE 1
Posterior medians and 95% equitailed credible intervals (CIs) for the parameters of the SIRS model estimated using clinical and environmental data sampled from Mathbaria, Bangladesh

Coefficient    Estimate    95% CI
β × N          0.47        (0.33, 0.65)
γ              0.12        (0.1, 0.14)
μ              0.002       (0.001, 0.002)
(β × N)/γ      3.92        (2.84, 5.19)
α0             −6.49       (−7.39, −5.41)
α1             −1.94       (−2.49, −1.37)
α2             2.35        (1.85, 2.98)
ρ × N          37.1        (27.2, 50.2)


relationship with the force of infection. Decreasing water depth increases the force of infection, likely due to the higher concentration and resulting proliferation of V. cholerae in the environment; increasing water temperature increases the force of infection [Huq et al. (2005)].
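Because the covariates enter the environmental force of infection through an exponential, each coefficient can be read as a multiplicative effect per standard deviation of the covariate. The Python sketch below works this out from the values in Table 1. It is an interpretive aid rather than a result reported by the authors, and it assumes the other covariate is held fixed; the function name is hypothetical.

```python
import math


def fold_change_summary(median, lower, upper, delta):
    """Multiplicative change in alpha(t) when a standardized covariate shifts by delta.

    exp(coef * delta) is monotone in coef, so posterior quantiles map onto
    quantiles of the fold change; sorting handles the case delta < 0, where
    the lower and upper limits swap.
    """
    values = sorted(math.exp(c * delta) for c in (lower, median, upper))
    return {"lower": values[0], "median": values[1], "upper": values[2]}


# Water depth (alpha1): effect of a one-standard-deviation decrease.
depth_drop = fold_change_summary(-1.94, -2.49, -1.37, delta=-1.0)
# Water temperature (alpha2): effect of a one-standard-deviation increase.
temp_rise = fold_change_summary(2.35, 1.85, 2.98, delta=+1.0)

print({k: round(v, 1) for k, v in depth_drop.items()})
print({k: round(v, 1) for k, v in temp_rise.items()})
```

Under this reading, a one-standard-deviation drop in water depth multiplies the environmental force of infection roughly sevenfold, and a one-standard-deviation rise in water temperature multiplies it roughly tenfold, at the posterior medians.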

The basic reproductive number, R0, is the average number of secondary cases caused by a typical infected individual in a completely susceptible population [Diekmann, Heesterbeek and Metz (1990)]. In Table 1, we report (β × N)/γ, the part of the reproductive number that is related to the number of infected individuals in the population under our model assumptions. Our estimate of 3.92 is fairly large; it is very similar to the reproductive number of 5 (sd = 3.3) estimated by Longini et al. (2007) using data from Matlab, Bangladesh. However, posterior median values for α(t) range from 0.000003 to 0.27, while posterior median values for βI_t only range from 0 to 0.04, suggesting that the epidemic peaks in our model are driven mostly by the environmental force of infection. During epidemic peaks, when I_t is largest, the posterior median for α(t) is larger than the posterior median for βI_t. See Appendix F for more details. Nevertheless, the infectious contact rate is not zero and is not negligible compared to the environmental force of infection.

6.1. Prediction results. For the data collected from Mathbaria, we begin prediction at multiple points around the time of the two epidemic peaks that occur in 2012 and 2013. Figure 4 shows the full cholera data with smoothed and standardized covariates. Figure 5 shows the posterior predictive distribution of observed cholera cases (top row) and hidden states from the time-varying SIRS model (bottom row). Parameters used to simulate the SIRS model forward in time have been sampled using the PMMH algorithm applied to the training data, with data being cut off at different points during the 2012 and 2013 epidemic peaks. From each of these cutoffs, parameter values are then used to simulate possible realizations of the test data. Predictions are run until the next cutoff point, with cutoff points chosen based on the length of the lag κ. Realistically, at time t we have covariate information to use for prediction only until time t + κ, where κ is the covariate lag. Since the smallest lag considered is 14 days, we make only 14-day-ahead predictions where possible to mimic a realistic prediction setup. Due to the sparse sampling between epidemic peaks (June 2012 to February 2013), we use longer prediction intervals for these cutoffs than would be possible in real-time data analysis in order to evaluate our model predictions.

In the top row of Figure 5, the coloring of the bars again represents the distribution of predicted cases. Between the two peaks of case counts (June 2012 to February 2013), the frequency of predicted zero counts is very high, so we conclude that the model is doing well with respect to predicting the lack of an epidemic. During the epidemics, the distribution of the counts shifts its mass away from zero. The plot in the bottom row of Figure 5 again illustrates the periodic nature and interplay of the hidden states of the underlying compartmental model. When the fraction of infected individuals quickly increases during an epidemic, the fraction


FIG. 5. Summary of prediction results for the second to last and last epidemic peaks in the Bangladesh data. We again run PMMH algorithms on training sets of the data, which are cut off at each of the dashed black lines in the bottom plot, and future cases are predicted until the next cutoff. The top plot compares the posterior probability of the predicted counts to the test data (purple diamonds and line), and the bottom plot shows how the trajectory of the predicted hidden states changes over the course of the epidemic. See the caption of Figure 3 for more details. The black horizontal bar at the top of the bottom plot marks where our model predicts an increase in the fraction of infected individuals, warning of the upcoming epidemic. This increase is predicted weeks before an increase can be seen in the case counts.

of susceptibles decreases. Afterward, the fraction of infected individuals drops to almost zero and the pool of susceptibles is slowly replenished. When the fraction of infected individuals is low, there is more uncertainty in the prediction for the fraction of susceptibles (September 2012 to March 2013). The fraction of infected individuals increases to a slightly higher epidemic peak in 2013 (March 2013 to May 2013) than in 2012 (March 2012 to May 2012), as observed in the test data for those years. The predicted fraction of infected people in the population increases before an increase can be seen in the case counts, which could allow for early warning of an epidemic.


We also use a quasi-Poisson regression model similar to the one used by Huq et al. (2005) to predict the mean number of cholera cases (Appendix E). Although the quasi-Poisson model predicts the timing of epidemic peaks reasonably well, it appears to overestimate the duration of the outbreaks. The predicted means under both the quasi-Poisson and SIRS models most likely underestimate the true mean of the observed counts, with the quasi-Poisson model performing slightly better. However, the SIRS predicted fraction of infected individuals (a hidden variable in the SIRS model) provides a more detailed picture of how cholera affects a population. By providing not only accurate prediction of the time of epidemic peaks but also the predicted fraction of the population that is infected, the SIRS model predictions could be used for efficient resource allocation to treat infected individuals. See Appendix E for additional details.

7. Discussion. We use a Bayesian framework to fit a nonlinear dynamic model for cholera transmission in Bangladesh which incorporates environmental covariate effects. We demonstrate these techniques on simulated data from a hidden SIRS model with a time-varying environmental force of infection, and the results show that we are recovering the true parameter values well. We also estimate the effect of two environmental covariates on cholera case counts in Mathbaria, Bangladesh, while accounting for infectious disease dynamics, and we test the predictive ability of our model. Overall, the prediction results look promising. Based on data collected, the predicted hidden states show a noticeable increase in the fraction of infected individuals weeks before the observed number of cholera cases increases, which could allow for early notification of an epidemic and timely allocation of resources. The predicted hidden states show that the fraction of infected individuals in the population decreases greatly between epidemics, supporting the hypothesis that the environmental force of infection triggers outbreaks. Estimates of βI_t are low, but not negligible, compared to estimates of α(t), suggesting that most of the transmission is coming from environmental sources.

Computational efficiency is an important factor in determining the usefulness of this approach in the field. We have written an R package which implements the PMMH algorithm for our hidden SIRS model, available at https://github.com/vnminin/bayessir. The computationally expensive portions of the PMMH code are primarily written in C++ to optimize performance, using Rcpp to integrate C++ and R [Eddelbuettel (2013), Eddelbuettel and François (2011)]; however, there is still room for improvement. Running 400,000 iterations of the PMMH algorithm on the six years of data from Mathbaria takes 3 days on a 4.3 GHz i7 processor. Since we can predict three weeks into the future using a 21-day covariate lag, we do not think timing is a big limitation for using our model predictions in practice.

Plots of residuals over time, shown in Appendix C, show that we are modeling case counts between the epidemic peaks well but not the epidemic peaks themselves, either due to missing the timing of the epidemic peak or the latent states not being modeled accurately. This possible model misspecification might be fixed by including more covariates, using different lags or modifying the SIRS model. Also, we assume a constant reporting rate, ρ, rather than using a time-varying ρ_t [Finkenstädt and Grenfell (2000)]. With better quality data we might be able to allow for a reporting rate that varies over time; we will try to address these model refinements in future analyses. Another related assumption of this model is that individuals act independently. However, these data come from a carefully conducted observational study of cholera over many years with strict protocol across time and locations. Thus, it is unlikely that there would be systematic differences in reporting or treatment-seeking behaviors of people. We tested this using a basic split of the data, altering our SIRS model to include a separate reporting rate for each phase of data collection, and found no significant difference between the two reporting rates. Also, there is no evidence that people do not act independently, at least with respect to their treatment-seeking behavior. Surveys currently in the field will allow us to test these assumptions in future work.

In the future, we will extend this analysis to allow for variable selection over a large number of covariates. This will allow us to include many covariates at many different lags and incorporate information from all of the water bodies in a way that does not involve averaging. In the current PMMH framework, choosing an optimal proposal distribution to explore a much larger parameter space would be difficult. We want to include a way of automatically selecting covariates or shrinking irrelevant covariate effects to zero with sparsity-inducing priors. The particle Gibbs sampler, introduced by Andrieu, Doucet and Holenstein (2010), would allow for such extensions. Approximate Bayesian computation is also an option for further model development [McKinley, Cook and Deardon (2009)]. In addition, the available data consist of observations from multiple thanas during the same time period. Future analyses will look into sharing information across space and time and accounting for correlations between thanas. Another challenging future direction involves exploring models which incorporate a feedback loop from infected individuals back into the environment to capture the effect of infected individuals excreting V. cholerae into the environment. To accomplish this, we could add a water compartment to our SIRS model that quantifies the concentration of V. cholerae in the environment, similar to the model of Tien and Earn (2010). However, adding an additional latent state leads to identifiability problems, even with fully observed data [Eisenberg, Robertson and Tien (2013)], so such an extension will require rigorous testing and fine tuning.

Acknowledgments. The authors gratefully acknowledge collaborators at the ICDDR,B who collected and processed the data, and the Anonymous Reviewers and Associate Editor for their constructive criticism that greatly improved the manuscript.


SUPPLEMENTARY MATERIAL

Supplementary Materials for “Predictive Modeling of Cholera Outbreaks in Bangladesh” (DOI: 10.1214/16-AOAS908SUPP; .pdf). The file contains appendices with additional details.

REFERENCES

ANDRIEU, C., DOUCET, A. and HOLENSTEIN, R. (2010). Particle Markov chain Monte Carlo methods. J. R. Stat. Soc. Ser. B Stat. Methodol. 72 269–342. MR2758115

ANDRIEU, C. and ROBERTS, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Statist. 37 697–725. MR2502648

BAUM, L. E., PETRIE, T., SOULES, G. and WEISS, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41 164–171. MR0287613

BEAUMONT, M. A. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics 164 1139–1160.

BHADRA, A., IONIDES, E. L., LANERI, K., PASCUAL, M., BOUMA, M. and DHIMAN, R. C. (2011). Malaria in Northwest India: Data analysis via partially observed stochastic differential equation models driven by Lévy noise. J. Amer. Statist. Assoc. 106 440–451. MR2866974

BRETÓ, C., HE, D., IONIDES, E. L. and KING, A. A. (2009). Time series analysis via mechanistic models. Ann. Appl. Stat. 3 319–348. MR2668710

CAO, Y., GILLESPIE, D. T. and PETZOLD, L. R. (2005). Avoiding negative populations in explicit Poisson tau-leaping. J. Chem. Phys. 123 054104.

CAUCHEMEZ, S. and FERGUSON, N. M. (2008). Likelihood-based estimation of continuous-time epidemic models from time-series data: Application to measles transmission in London. J. R. Soc. Interface 5 885–897.

CODEÇO, C. (2001). Endemic and epidemic dynamics of cholera: The role of the aquatic reservoir. BMC Infect. Dis. 1 1.

COLWELL, R. R. and HUQ, A. (1994). Environmental reservoir of Vibrio cholerae: The causative agent of cholera. Ann. N.Y. Acad. Sci. 740 44–54.

DIEKMANN, O., HEESTERBEEK, J. A. P. and METZ, J. A. J. (1990). On the definition and the computation of the basic reproduction ratio R0 in models for infectious diseases in heterogeneous populations. J. Math. Biol. 28 365–382. MR1057044

DOUCET, A., DE FREITAS, N. and GORDON, N., eds. (2001). Sequential Monte Carlo Methods in Practice. Springer, New York. MR1847783

DUKIC, V., LOPES, H. F. and POLSON, N. G. (2012). Tracking epidemics with Google Flu Trends data and a state-space SEIR model. J. Amer. Statist. Assoc. 107 1410–1426. MR3036404

EDDELBUETTEL, D. (2013). Seamless R and C++ Integration with Rcpp. Springer, New York.

EDDELBUETTEL, D. and FRANÇOIS, R. (2011). Rcpp: Seamless R and C++ integration. J. Stat. Softw. 40 1–18.

EISENBERG, M. C., ROBERTSON, S. L. and TIEN, J. H. (2013). Identifiability and estimation of multiple transmission pathways in cholera and waterborne disease. J. Theoret. Biol. 324 84–102. MR3041644

FEARNHEAD, P., GIAGOS, V. and SHERLOCK, C. (2014). Inference for reaction networks using the linear noise approximation. Biometrics 70 457–466. MR3258050

FERM, L., LÖTSTEDT, P. and HELLANDER, A. (2008). A hierarchy of approximations of the master equation scaled by a size parameter. J. Sci. Comput. 34 127–151. MR2373035

FINKENSTÄDT, B. F. and GRENFELL, B. T. (2000). Time series modelling of childhood diseases: A dynamical systems approach. J. Roy. Statist. Soc. Ser. C 49 187–205. MR1821321


GILLESPIE, D. T. (1976). A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J. Comput. Phys. 22 403–434. MR0503370

GILLESPIE, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81 2340–2361.

GILLESPIE, D. T. (2001). Approximate accelerated stochastic simulation of chemically reacting systems. J. Chem. Phys. 115 1716–1733.

GINSBERG, J., MOHEBBI, M. H., PATEL, R. S., BRAMMER, L., SMOLINSKI, M. S. and BRILLIANT, L. (2008). Detecting influenza epidemics using search engine query data. Nature 457 1012–1014.

HASTINGS, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 97–109.

HE, D., IONIDES, E. L. and KING, A. A. (2010). Plug-and-play inference for disease dynamics: Measles in large and small populations as a case study. J. R. Soc. Interface 7 271–283.

HELD, L., HÖHLE, M. and HOFMANN, M. (2005). A statistical framework for the analysis of multivariate infectious disease surveillance counts. Stat. Model. 5 187–199. MR2210732

HUQ, A., COLWELL, R. R., RAHMAN, R., ALI, A., CHOWDHURY, M. A., PARVEEN, S., SACK, D. A. and RUSSEK-COHEN, E. (1990). Detection of Vibrio cholerae O1 in the aquatic environment by fluorescent-monoclonal antibody and culture methods. Appl. Environ. Microbiol. 56 2370–2373.

HUQ, A., SACK, R. B., NIZAM, A., LONGINI, I. M., NAIR, G. B., ALI, A., MORRIS, J. G. JR., KHAN, M. N., SIDDIQUE, A. K., YUNUS, M., ALBERT, M. J., SACK, D. A. and COLWELL, R. R. (2005). Critical factors influencing the occurrence of Vibrio cholerae in the environment of Bangladesh. Appl. Environ. Microbiol. 71 4645–4654.

INTERNATIONAL VACCINE INSTITUTE (2012). Country investment case study on cholera vaccination: Bangladesh. International Vaccine Institute, Seoul.

IONIDES, E. L., BRETÓ, C. and KING, A. A. (2006). Inference for nonlinear dynamical systems. Proc. Natl. Acad. Sci. USA 103 18438–18443.

KEELING, M. J. and ROHANI, P. (2008). Modeling Infectious Diseases in Humans and Animals. Princeton Univ. Press, Princeton, NJ. MR2354763

KEELING, M. J. and ROSS, J. V. (2008). On methods for studying stochastic disease dynamics. J. R. Soc. Interface 5 171–181.

KING, A. A., IONIDES, E. L., PASCUAL, M. and BOUMA, M. J. (2008). Inapparent infections and cholera dynamics. Nature 454 877–880.

KOELLE, K. and PASCUAL, M. (2004). Disentangling extrinsic from intrinsic factors in disease dynamics: A nonlinear time series approach with an application to cholera. Amer. Nat. 163 901–913.

KOELLE, K., RODÓ, X., PASCUAL, M., YUNUS, M. and MOSTAFA, G. (2005). Refractory periods and climate forcing in cholera dynamics. Nature 436 696–700.

KOEPKE, A. A., LONGINI, JR., I. M., HALLORAN, M., WAKEFIELD, J. and MININ, V. N. (2016). Supplement to “Predictive modeling of cholera outbreaks in Bangladesh.” DOI: 10.1214/16-AOAS908SUPP.

KOMOROWSKI, M., FINKENSTÄDT, B., HARPER, C. V. and RAND, D. A. (2009). Bayesian inference of biochemical kinetic parameters using the linear noise approximation. BMC Bioinformatics 10 343.

LONGINI, I. M., YUNUS, M., ZAMAN, K., SIDDIQUE, A. K., SACK, R. B. and NIZAM, A. (2002). Epidemic and endemic cholera trends over a 33-year period in Bangladesh. J. Infect. Dis. 186 246–251.

LONGINI, I. M. JR., NIZAM, A., ALI, M., YUNUS, M., SHENVI, N. and CLEMENS, J. D. (2007). Controlling endemic cholera with oral vaccines. PLoS Med. 4 e336.

MAY, R. and ANDERSON, R. M. (1991). Infectious Diseases of Humans: Dynamics and Control. Oxford Univ. Press, London.


MCKINLEY, T., COOK, A. R. and DEARDON, R. (2009). Inference in epidemic models without likelihoods. Int. J. Biostat. 5 Art. 24, 39. MR2533810

METROPOLIS, N., ROSENBLUTH, A. W., ROSENBLUTH, M. N., TELLER, A. H. and TELLER, E. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys. 21 1087–1092.

RASMUSSEN, D. A., RATMANN, O. and KOELLE, K. (2011). Inference for nonlinear epidemiological models using genealogies and time series. PLoS Comput. Biol. 7 e1002136, 11. MR2845064

RUBIN, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann. Statist. 12 1151–1172. MR0760681

SACK, R. B., SIDDIQUE, A. K., LONGINI, I. M., NIZAM, A., YUNUS, M., ISLAM, S., MORRIS, J. G., ALI, A., HUQ, A., NAIR, G. B., QADRI SHAH, F., FARUQUE, M., SACK, D. A. and COLWELL, R. R. (2003). A 4-year study of the epidemiology of Vibrio cholerae in four rural areas of Bangladesh. J. Infect. Dis. 187 96–101.

TAYLOR, H. M. and KARLIN, S. (1998). An Introduction to Stochastic Modeling, 3rd ed. Academic Press, San Diego, CA. MR1627763

TIEN, J. H. and EARN, D. J. D. (2010). Multiple transmission pathways and disease dynamics in a waterborne pathogen model. Bull. Math. Biol. 72 1506–1533. MR2671584

TONI, T., WELCH, D., STRELKOWA, N., IPSEN, A. and STUMPF, M. (2009). Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J. R. Soc. Interface 6 187–202.

VAN KAMPEN, N. G. (1992). Stochastic Processes in Physics and Chemistry 1. Elsevier, Amsterdam.

A. A. KOEPKE
FRED HUTCHINSON CANCER RESEARCH CENTER
SEATTLE, WASHINGTON 98109
USA
E-MAIL: [email protected]

I. M. LONGINI
DEPARTMENT OF BIOSTATISTICS
AND
EMERGING PATHOGENS INSTITUTE
UNIVERSITY OF FLORIDA
GAINESVILLE, FLORIDA 32611
USA

M. E. HALLORAN
DEPARTMENT OF BIOSTATISTICS
UNIVERSITY OF WASHINGTON
AND
FRED HUTCHINSON CANCER RESEARCH CENTER
SEATTLE, WASHINGTON 98109
USA

J. WAKEFIELD
DEPARTMENT OF STATISTICS
AND
DEPARTMENT OF BIOSTATISTICS
UNIVERSITY OF WASHINGTON
SEATTLE, WASHINGTON 98195
USA

V. N. MININ
DEPARTMENT OF STATISTICS
AND
DEPARTMENT OF BIOLOGY
UNIVERSITY OF WASHINGTON
SEATTLE, WASHINGTON 98195
USA
E-MAIL: [email protected]