# epifast: a fast algorithm for large scale realistic epidemic simulations on distributed memory...


EpiFast: A Fast Algorithm for Large Scale Realistic Epidemic Simulations on Distributed Memory Systems∗

Keith R. [email protected]

Jiangzhuo Chen †

[email protected] Feng

[email protected] Bioinformatics Institute, Virginia Tech

Blacksburg, VA 24061, USA

V.S. Anil [email protected]

Madhav V. [email protected]

Department of Computer Science and Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24061, USA

ABSTRACT

Large scale realistic epidemic simulations have recently become an increasingly important application of high-performance computing. We propose a parallel algorithm, EpiFast, based on a novel interpretation of the stochastic disease propagation in a contact network. We implement it using a master-slave computation model which allows scalability on distributed memory systems.

EpiFast runs extremely fast for realistic simulations that involve: (i) large populations consisting of millions of individuals and their heterogeneous details, (ii) dynamic interactions between the disease propagation, the individual behaviors, and the exogenous interventions, as well as (iii) the large number of replicated runs necessary for statistically sound estimates about the stochastic epidemic evolution. We find that EpiFast runs several orders of magnitude faster than another comparable simulation tool while delivering similar results.

EpiFast has been tested on commodity clusters as well as SGI shared memory machines. For a fixed experiment, if given more computing resources, it scales automatically and runs faster. Finally, EpiFast has been used as the major simulation engine in real studies with rather sophisticated settings to evaluate various dynamic interventions and to provide decision support for public health policy makers.

†Corresponding author. Email: [email protected]. Tel.: +1 703 518 8032. Fax: +1 703 518 5376.

∗We thank our external collaborators and members of the Network Dynamics and Simulation Science Laboratory (NDSSL) for their suggestions and comments. This work has been partially supported by NSF Nets Grant CNS-0626964, NSF HSD Grant SES-0729441, CDC Center of Excellence in Public Health Informatics Grant 2506055-01, NIH-NIGMS MIDAS project 5 U01 GM070694-05, DTRA CNIMS Grant HDTRA1-07-C-0113 and NSF NETS CNS-0831633.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICS’09, June 8–12, 2009, Yorktown Heights, New York, USA. Copyright 2009 ACM 978-1-60558-498-0/09/06 ...$5.00.

Categories and Subject Descriptors

I.6.3 [Simulation and Modeling]: Applications

General Terms

Algorithms, Experimentation, Human Factors, Performance

1. INTRODUCTION

Pandemics such as avian influenza pose a serious threat to global public health security. Pandemic influenza strains have demonstrated their ability to spread worldwide quickly and to cause infections in all age groups. According to the World Health Report 2007, every year human influenza results in an estimated three to five million cases of severe illness and between 250,000 and 500,000 deaths [23]. Although the final number of infections, illnesses, and deaths is unpredictable and could vary tremendously depending on multiple factors, it is certain that without adequate planning and preparations, an influenza pandemic in the 21st century has the potential to cause enough illnesses to overwhelm the public health system at all levels.

Computational epidemiology is the development and use of computer models to understand the spatio-temporal diffusion of disease through populations. An important factor that greatly influences an outbreak of an infectious disease is the structure of the interaction network across which it spreads. Aggregate or collective computational epidemiology models often assume that a population is partitioned into a few subpopulations (e.g. by age) with a regular interaction structure within and between subpopulations. However, this approach does not take the structure of the underlying social network into account. Additionally, the number of different subpopulation types considered is small and parameters such as mixing rate and reproductive number are either unknown or hard to observe. Over the last few years, it has become increasingly clear that an interaction based approach to computational epidemiology is likely to play an increasingly prominent role in developing public health policies [17–19]. Interaction based computational epidemiology seeks to develop detailed representations of the underlying social contact networks, within and across disease progression and transmission models, as well as computational representations of implementable policies. These models are then appropriately composed to produce detailed computer simulations that can inform public health policies.

Computer simulations are an appropriate tool to study large interaction-based systems. However, developing such high resolution simulations tends to be computationally challenging for a number of reasons, including the following.

• Scale and heterogeneity of social contact networks. For example, a social contact network spanning the New York region comprises approximately 20 million individuals. The contact patterns and individual attributes are highly heterogeneous. As a result, inter-process communication patterns and load balancing are non-trivial.

• The contact networks are dynamic and coevolve as a result of actions taken by individuals and interventions by public health officials. This implies that non-adaptive partitioning schemes will most likely not be efficient.

• Use of these simulations in practical settings requires execution of a large number of replicates for a given combination of independent parameters. More importantly, a typical case study involves searching an extremely large parameter space that is best searched using adaptive designs. Factorial designs are practically impossible in such situations.

The above discussion motivates the need for extremely fast HPC-based simulations that are capable of simulating disease dynamics over large dynamic social contact networks. Furthermore, these simulations should be able to represent complex interventions and public policies. This aspect is crucial if such simulations are to be used in practical environments.

Our contributions. In this paper, we present EpiFast, an extremely fast, scalable, high performance simulation tool. EpiFast can be used to study how infectious diseases spread through individual populations (in humans as well as animals) as well as the relative effectiveness of various interventions and public policies. It can also be used for answering general questions pertaining to network science, e.g. which individual nodes are critical or vulnerable in a given population, how the structure of the network affects disease dynamics, and how to quantify the uncertainty of the results.

EpiFast makes several novel contributions, which we highlight below. First, EpiFast represents a novel interpretation of epidemic simulations. It is significantly different from the usual multi-agent based discrete event simulations. By using a simpler (yet realistic) disease model and restricting the set of interventions that can be applied, EpiFast can use a pre-constructed contact network and avoid scanning agent activities in each step while delivering similar results with comparable details.

Second, the EpiFast algorithm is scalable on almost any distributed memory system. EpiFast incurs low communication cost and tolerates high communication latency. Thus it is able to run on almost any distributed memory system, including commodity clusters. To the best of our knowledge, EpiFast is one of the few epidemic simulations (the other being EpiSimdemics [3]) that do not rely on large shared memory machines.

Third, EpiFast runs extremely fast. For example, a 180-day simulation of the greater Los Angeles metropolitan area, whose population is about 16 million, on a cluster using 96 Power5 processors, takes less than 5 minutes (ignoring the network data load cost). Such high performance comes from a combinatorial interpretation of the stochastic process of disease propagation in a population, as well as an elegant yet efficient parallel algorithm that allows most of the computations to be parallelized in a scalable way. The EpiFast code runs several orders of magnitude faster than other tools while delivering similar results.

Fourth, EpiFast supports representation and reasoning about complex interventions and public policies. It can handle sophisticated interventions to allow realistic case studies. This distinguishes EpiFast from classical simulations based on percolation theory [17,21]. See Section 2.4 for details.

Last, but most important, EpiFast demonstrates very good usability in practical applications. We have recently developed a web-based tool called Didactic that is integrated with EpiFast so that epidemiologists and public health analysts can easily use EpiFast. Didactic allows users to specify an experimental design using a web based interface. The detailed dynamic information produced by EpiFast is then analyzed with ease by using preconfigured analytical tools.

Organization of this paper. In Section 2, we formally describe the epidemic simulation problem. After presenting the sequential and parallel versions of the EpiFast algorithm in Section 3, we evaluate the performance of an MPI/C++ based EpiFast implementation in Section 4, followed by overviews of two practical applications of EpiFast in Section 5. Related work is described in Section 6.

2. THE EPIDEMIC SIMULATION PROBLEM

In the epidemic simulation problem, we are given (i) a population with detailed schedules of each person on each day, which describe at any moment where the person is and what he is doing; (ii) a disease and the characteristics of each person with respect to the disease, which describe how likely the person is to be infected, how long he takes to recover once infected, and how likely he is to infect other people; (iii) self interventions and public health interventions during an epidemic, which may change the infectivity of the involved people or their schedules; and (iv) initial conditions, which describe how the disease starts and with which people. With each simulation run, which represents one possible instance of the stochastic disease propagation, we want to know on each day the health state of each person, and if anyone is infected, who possibly transmitted the disease to him.

We assume that we do not care where the disease transmissions occur. The disease transmits through people-to-people contacts. Therefore, we only need to know the people-people contact graph at any moment. We also assume that in the simulation the discrete time increases by day. That is, nothing happens between any day and its next day. Without loss of generality, we assume everything occurs at the beginning of each day. So in the simulation results, we only care that a person is healthy on day t and becomes infected on day t + 1; we do not care exactly when during day t + 1 he is infected. Furthermore, we assume that the schedule of each person is known at the beginning of the day and does not change from day to day, unless interventions are applied to the person. It is now easy to see that knowing the people-people contacts for each day is sufficient to simulate the disease propagation.

2.1 Contact network

The people-people contacts form a network, called the social contact network. A contact network G(V, E, w) is a directed, edge-weighted network. Nodes correspond to individuals in the population, edges represent the contact between two end nodes during the day, and edge weights are contact durations. Edge (u, v) with weight w(u, v) represents that node u has contact of duration w(u, v) with node v on the day, during which the disease may transmit from node u to node v with probability p(w(u, v)). Function p(·) is non-decreasing and depends on the disease infectivity. In most cases the edges are bi-directional with equal weights. For each day t, there is a contact network $G_t$. Since the people-people contacts do not change very often, and the average number of contacts of each person is usually upper-bounded by a constant, we can pre-compute a base network $G = G_0$, and update it when there are interventions.
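The adjacency-list view of G(V, E, w) described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ContactNetwork`, its methods, and the example durations are hypothetical and are not taken from the EpiFast implementation.

```python
from collections import defaultdict


class ContactNetwork:
    """Directed, edge-weighted contact network stored as adjacency lists."""

    def __init__(self):
        # out_edges[u][v] holds the contact duration w(u, v) of edge (u, v).
        self.out_edges = defaultdict(dict)

    def add_contact(self, u, v, duration, bidirectional=True):
        """Add edge (u, v); by default also add (v, u) with equal weight."""
        self.out_edges[u][v] = duration
        if bidirectional:
            self.out_edges[v][u] = duration
        else:
            # Make sure v exists as a node even if it has no outgoing edges.
            self.out_edges.setdefault(v, {})

    def contacts(self, u):
        """Return (neighbor, duration) pairs for the outgoing edges of u."""
        return list(self.out_edges.get(u, {}).items())

    def num_directed_edges(self):
        return sum(len(neighbors) for neighbors in self.out_edges.values())


g = ContactNetwork()
g.add_contact("A", "B", 3.0)                       # symmetric contact, duration 3
g.add_contact("A", "C", 1.5, bidirectional=False)  # one-way edge, for illustration
```

Storing edges by tail node keeps the lookup "all contacts of an infectious node u" cheap, which is the access pattern the simulation needs on each day.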

2.2 SEIR model for within host disease progression

We adopt the standard SEIR disease model, which is widely used in the mathematical epidemiology literature [2, 15, 20]. Each person is in one of the following four health states at any time: susceptible, exposed, infectious, and removed. Let θ(v, t) denote the health state of person v on day t. Associated with each person v are an incubation period $\Delta t_E(v)$ and an infectious period $\Delta t_I(v)$.

A person is in the susceptible state until he becomes exposed. If person v becomes exposed, he remains so for $\Delta t_E(v)$ days, during which he is not infectious. Then he becomes infectious and remains so for $\Delta t_I(v)$ days. Finally he becomes removed (or recovered) and remains so permanently.

The health states of person v during the epidemic period, denoted by θ(v), can be equivalently represented by $\tau(v) = (t_{S \to E}(v), t_{E \to I}(v), t_{I \to R}(v))$, where $t_{S \to E}(v)$ is the day when v becomes exposed, $t_{E \to I}(v) = t_{S \to E}(v) + \Delta t_E(v)$ is the day when v becomes infectious, and $t_{I \to R}(v) = t_{E \to I}(v) + \Delta t_I(v)$ is the day when v becomes removed. We assume that we know $\Delta t_E(v)$ and $\Delta t_I(v)$ for each person v at the beginning of the simulation. The SEIR model is illustrated in Figure 1.
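To make the equivalence between the daily state θ(v, t) and the transition times τ(v) concrete, here is a minimal Python sketch. The class `PersonTimeline` and its field names are hypothetical; only the arithmetic (exposure day plus incubation period, then plus infectious period) comes from the definitions above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonTimeline:
    incubation_days: int              # Delta t_E(v)
    infectious_days: int              # Delta t_I(v)
    t_exposed: Optional[int] = None   # t_{S->E}(v); None if never exposed

    @property
    def t_infectious(self) -> Optional[int]:
        # t_{E->I}(v) = t_{S->E}(v) + Delta t_E(v)
        if self.t_exposed is None:
            return None
        return self.t_exposed + self.incubation_days

    @property
    def t_removed(self) -> Optional[int]:
        # t_{I->R}(v) = t_{E->I}(v) + Delta t_I(v)
        if self.t_exposed is None:
            return None
        return self.t_infectious + self.infectious_days

    def state(self, day: int) -> str:
        """Recover theta(v, day) from the three transition times."""
        if self.t_exposed is None or day < self.t_exposed:
            return "S"
        if day < self.t_infectious:
            return "E"
        if day < self.t_removed:
            return "I"
        return "R"


p = PersonTimeline(incubation_days=2, infectious_days=3, t_exposed=5)
timeline = [p.state(day) for day in range(4, 11)]
```

Because the two durations are fixed at the start of the simulation, a single stored number (the exposure day) determines the whole state history of a person, so the simulation never needs to store a per-day state array.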

2.3 SEIR model for between host disease progression

With the SEIR model, the disease spreads in a population in the following way. It can only be transmitted from an infectious node to a susceptible node. On any day, if node u is infectious and v is susceptible, disease transmission from u to v occurs with probability $p(w(u, v)) = 1 - (1 - r)^{w(u, v)}$, where r is the probability of disease transmission for a contact of one unit time. So the disease propagates probabilistically along the edges of the contact network. Figure 2 shows how the disease spreads via the network and how the SEIR states of the nodes change accordingly.
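The transmission probability formula can be checked numerically with a small Python function. The function name and the example values of r and the duration are illustrative assumptions; the formula itself is the one stated above.

```python
def transmission_probability(duration: float, r: float) -> float:
    """p(w) = 1 - (1 - r)**w for a contact of `duration` time units.

    r is the per-unit-time transmission probability. The contact fails to
    transmit only if every unit of time independently fails to transmit.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a probability in [0, 1]")
    if duration < 0:
        raise ValueError("duration must be non-negative")
    return 1.0 - (1.0 - r) ** duration


# Longer contacts are never less risky: p is non-decreasing in the duration.
probabilities = [transmission_probability(w, r=0.5) for w in (0, 1, 2, 3)]
```

With r = 0.5, a contact of two time units gives 1 - 0.5² = 0.75, and a contact of zero duration gives probability 0, matching the requirement that p(·) be non-decreasing.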

[Figure 1 diagram: timeline Susceptible → Exposed → Infectious → Removed, with transition times t0, tS→E, tE→I, tI→R]

Figure 1: The S → E → I → R disease model of health state transitions. A person in the susceptible state can become exposed. A person in the exposed state will become infectious. A person in the infectious state will become removed. The state transitions are one-way and there are no other possible transitions.

[Figure 2 diagram: the four-node contact network (A, B, C, D) drawn at times 0, 1, 2, 3 and 4, with each node colored by its SEIR state]

Figure 2: Example of SEIR model for between host disease progression. The contact network has four nodes and five contact edges. Suppose each node has $\Delta t_E = 0$, $\Delta t_I = 2$, and each edge represents a transmission probability of 0.5 for each day on which the tail node is infectious (I, red) and the head node is susceptible (S, green). Node A is infectious at the start. On day 0, node A transmits the disease to node B but not node D. On day 1, both nodes A and B are infectious. Node A transmits the disease to node D; node B infects node C but not node D. On day 2, node A is removed; all other nodes are infectious but there are no more susceptible nodes. So the nodes are removed gradually, and on day 4 the system enters a fixed point and stops evolving.

A crucial assumption made in almost all epidemic models is that of independence: we assume that the spread of infection from a node u to node v is completely independent of the infection from a node u′ to node v. Similarly, an infected node u spreads the infection to each neighbor v, independent of the other neighbors of u. This is a central assumption in almost all the epidemic models and the analytical results based on percolation.
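One consequence of the independence assumption is that the chance a susceptible node is infected on a given day is one minus the product of the per-edge escape probabilities. The Python sketch below works this out using the transmission formula of Section 2.3; the function name and the example durations are hypothetical.

```python
import math


def infection_probability(durations, r):
    """Probability that a susceptible node is infected on one day.

    `durations` lists w(u, v) for every currently infectious contact u of v.
    Under independence, v escapes infection only if it escapes every
    infectious contact, each with probability (1 - r)**w(u, v).
    """
    escape = math.prod((1.0 - r) ** w for w in durations)
    return 1.0 - escape


# Two infectious contacts of one time unit each, r = 0.5:
# escape probability is 0.5 * 0.5, so infection probability is 0.75.
two_contacts = infection_probability([1, 1], r=0.5)
```

With this particular form of p, several independent contacts behave like a single contact whose duration is the sum of the individual durations, since the escape probabilities multiply to (1 - r) raised to the total duration.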

2.4 Interventions

Interventions can be broadly classified using the following attributes: (i) pharmaceutical (PI) versus non-pharmaceutical (NPI), (ii) non-adaptive versus adaptive.

PI interventions include administering antivirals, antibiotics and vaccines. NPI interventions refer to actions that effectively change the social network without administering any drugs. These include various kinds of social distancing (closure of facilities such as schools), quarantine, and sequestration. It is now well accepted that NPI measures, especially social distancing, can be extremely beneficial in controlling disease spread; see the recent federal pandemic influenza plan [6] for additional discussion.

Non-adaptive interventions occur before the start of the epidemic. They unrealistically assume that people’s behaviors do not change over the course of the epidemic, and they are limited to studying treatments that have a permanent effect, like vaccination. The adaptive strategies, on the other hand, incorporate changes in the movement of the people, treatments that have only temporary effects (antiviral medications are only effective while being taken), and wholesale changes to the interactions within the population (like school closure). We can differentiate various strategies by how frequently interventions are applied and triggering conditions are checked. This can be viewed as the degree of adaptation.

At the implementation level, an intervention is specified as either changing the infectivity/vulnerability of the intervened person, or changing the behaviors of the intervened person. In the former case, the intervention does not change the people-people interactions, so the disease may transmit through the contact network via the same edges, but the transmissions are hindered at the intervened nodes. This class of interventions includes vaccination and antiviral administration. A vaccinated person still goes to the same places and meets the same people as usual, but now the disease cannot propagate via him. If we administer antivirals to a person, he becomes less likely to get infected, and even if he is infected he is less likely to infect other people. The PI interventions are implemented in EpiFast under this class.

The latter class of interventions does change the contact network, so the disease cannot transmit along the same network edges. Examples include social distancing measures. If we apply a school closure intervention to a person, his contact edges that take place in the school are removed, so the disease cannot transmit to him or from him along these edges. The NPI interventions are implemented in EpiFast under this class.
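The two implementation classes can be contrasted with a short Python sketch. Everything about the data layout here is an assumption made for illustration: the per-person `factors` table, the `"place"` label on each edge, and both function names are hypothetical, and the source does not say how EpiFast stores this information or how the factors enter the transmission probability.

```python
def apply_node_intervention(factors, person, susceptibility_factor, infectivity_factor):
    """PI-style intervention: edges stay, the node's factors are scaled down."""
    susceptibility, infectivity = factors.get(person, (1.0, 1.0))
    factors[person] = (susceptibility * susceptibility_factor,
                       infectivity * infectivity_factor)


def apply_edge_intervention(out_edges, person, place):
    """NPI-style intervention: drop the person's contacts made at `place`."""
    closed = [v for v, edge in out_edges.get(person, {}).items()
              if edge["place"] == place]
    for v in closed:
        del out_edges[person][v]
        # Remove the reverse edge too, so nothing transmits to or from person.
        reverse = out_edges.get(v, {})
        if reverse.get(person, {}).get("place") == place:
            del reverse[person]


# Hypothetical example: v meets x at school and y at home.
out_edges = {
    "v": {"x": {"duration": 6, "place": "school"},
          "y": {"duration": 10, "place": "home"}},
    "x": {"v": {"duration": 6, "place": "school"}},
    "y": {"v": {"duration": 10, "place": "home"}},
}
factors = {}
apply_node_intervention(factors, "v", susceptibility_factor=0.13, infectivity_factor=0.20)
apply_edge_intervention(out_edges, "v", place="school")
```

The node-level change leaves the network untouched, while the edge-level change edits the adjacency lists, which is why only the second class forces an update of G.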

Non-adaptive interventions are implemented as node or edge changes at the beginning of the simulation. Adaptive interventions are implemented using trigger conditions. Generally speaking, a trigger condition is a boolean function of the states of all people on all days from the start of the simulation until the current simulation day. In EpiFast, this function is computed on every simulation day; if it evaluates to true, the associated intervention is executed. Note that this is a stochastic function which can take different values in different simulation instances. So the intervention is adaptive to the simulation progression.
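A trigger condition of this kind can be sketched as a boolean function over the state history. The representation below (a list with one `person -> state` dictionary per simulated day) and the helper `make_fraction_trigger` are assumptions chosen for illustration, not the EpiFast data structures.

```python
def make_fraction_trigger(state, threshold):
    """Build a trigger that fires when more than `threshold` (a fraction)
    of the population is in `state` on the current day."""
    def trigger(history):
        today = history[-1]  # states on the current simulation day
        count = sum(1 for s in today.values() if s == state)
        return count > threshold * len(today)
    return trigger


# Hypothetical 8-person population; fire when more than 25% are infectious.
close_schools = make_fraction_trigger("I", threshold=0.25)

day0 = {p: "S" for p in "abcdefgh"}
day1 = dict(day0, a="I", b="I")            # 2 of 8 infectious: not above 25%
day2 = dict(day1, c="I")                   # 3 of 8 infectious: above 25%

fired = [close_schools([day0][: 1]),
         close_schools([day0, day1]),
         close_schools([day0, day1, day2])]
```

The trigger receives the full history even though this example only inspects the last day; a condition such as "infectious for three consecutive days" would read earlier entries as well. Because the history depends on the random transmissions, the day on which the trigger first fires differs between replicates.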

Our EpiFast implementation can handle interventions that have a predetermined target subpopulation and that change the network edges associated with the intervenees or their infectivity/vulnerability. The interventions can be either non-adaptive, or adaptive so long as the function form of the trigger condition is predetermined.

The following is an example of the kind of interventions EpiFast is used to evaluate in real studies.

When there are more than 100 infectious people in this population, apply antiviral to critical workers for two weeks. If among the school age children, more than 0.1% are infectious, immediately close all public schools. If more than 5% of the population are infected (really major outbreak), 10% people choose to stay home.

Note that the current implementation of EpiFast cannot handle trigger conditions whose function form is not predetermined, or interventions with adaptive targets. For example, it cannot handle a trigger like “20% of the contacts of the infectious people become exposed”, or an intervention target like “all contacts of the currently infectious people”.

Problem statement. Now we formally state the epidemic simulation problem. Let G = (V, E, w) be the base contact network. Let $\Delta\tau = \{\Delta t_E(v), \Delta t_I(v)\}_{v \in V}$ be the incubation and infectious period durations of the people. Let $A = \{A_t\}_{t \in [0..T]}$ be the interventions during the epidemic. The interventions in set $A_t$ can be predetermined ahead of the simulation, or dynamic based on $\{G_0, G_1, \ldots, G_t, \theta_0, \theta_1, \ldots, \theta_t\}$, where $\theta_t$ represents the health state of all people on day t. Let Λ denote the initial conditions, including the initial set of infected individuals. Let $\tau = \{\tau(v)\}_{v \in V}$ be the health states of all people on all days. Let π(v) be the set of people that transmit the disease to person v and $\Pi = \{\pi(v)\}_{v \in V}$. This structure is most naturally represented as a transmission network TN: it is a directed acyclic graph in which the set of initial infections is at level 0 and level i contains the set of nodes (individuals) that become infected on day i. A directed edge from u to v represents the fact that u transmitted the disease to v. A single execution of a discrete event simulation produces a random instance of TN; the probability of occurrence of TN depends on the various input parameters.

Problem 2.1. Given (G, Δτ, A, Λ), the epidemic simulation problem is to compute a random instance of the transmission network TN and the tuple (τ, Π).

It is easy to produce an instance of TN in polynomial time. This is because the disease dynamics always reach a fixed point in which the set of individuals is partitioned into two classes, namely, recovered and susceptible. Our goal is to develop a fast parallel algorithm for the epidemic simulation problem.

3. EPIFAST ALGORITHM

A sequential version of the EpiFast algorithm is described in Algorithm 1. Now we show how to parallelize most of the computation using a master-slave model and present the parallel version of the EpiFast algorithm.
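As a reading aid, the loop structure of Algorithm 1 can be sketched in Python. This is a simplified illustration, not the authors' implementation: interventions are omitted, the network is a plain dictionary `out_edges[u][v] = w(u, v)`, seeds are assumed to be exposed on day 0, and a transmission on day t is assumed to make the target exposed on day t + 1, following the day convention of Section 2. All function and parameter names are hypothetical.

```python
import random


def epifast_sequential(out_edges, incubation, infectious, seeds, num_days, r, rng):
    """Return (tau, pi): transition times and infector sets of infected nodes."""

    def times(t_exposed, v):
        t_infectious = t_exposed + incubation[v]
        return (t_exposed, t_infectious, t_infectious + infectious[v])

    tau = {v: times(0, v) for v in seeds}   # tau[v] = (t_SE, t_EI, t_IR)
    pi = {v: set() for v in seeds}          # seeds have no infector
    for day in range(num_days + 1):
        # Nodes with t_EI <= day < t_IR are infectious today.
        infectious_today = [u for u, (_, t_ei, t_ir) in tau.items()
                            if t_ei <= day < t_ir]
        for u in infectious_today:
            for v, w in out_edges.get(u, {}).items():
                if v in tau and tau[v][0] <= day:
                    continue                # v is not susceptible today
                if rng.random() < 1.0 - (1.0 - r) ** w:
                    tau.setdefault(v, times(day + 1, v))
                    pi.setdefault(v, set()).add(u)
    return tau, pi


# Hypothetical three-person chain A - B - C with unit-duration contacts.
edges = {"A": {"B": 1}, "B": {"A": 1, "C": 1}, "C": {"B": 1}}
incubation = {"A": 1, "B": 1, "C": 1}
infectious = {"A": 2, "B": 2, "C": 2}
tau, pi = epifast_sequential(edges, incubation, infectious, seeds=["A"],
                             num_days=10, r=1.0, rng=random.Random(0))
```

With r = 1.0 every contact transmits, so the run is deterministic and the infection walks down the chain; with a smaller r, each call with a different random generator yields a different instance of the transmission network. The work per day is proportional to the outgoing edges of the currently infectious nodes, since only those edges are scanned.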

To parallelize the EpiFast algorithm, we need to divide the data and computations. The contact network is the major data, so we first consider partitioning the network. There are several natural ways to divide a contact network. One treats the network as a list of undirected edges and divides the edges into groups. With this approach, however, it is difficult to keep track of the status of each person. For example, if person v is infected, this information has to be synchronized in all groups since we do not know which groups contain edges incident on v. Therefore, any event needs to be synchronized among all processes, which leads to large communication costs. Another approach is to divide the people according to their geographic locations. This requires that we have their location information, and since the population is never evenly distributed, it is difficult to achieve load balancing without complicated partitioning algorithms.

Algorithm 1: Sequential EpiFast algorithm

Input: (G, Δτ, A, Λ) where G is the contact network, Δτ is the incubation and infectious period durations, A is the interventions, and Λ is the initial conditions.

Output: (τ, Π) where τ is the health state transition times of all people, and Π is the infectors who transmit the disease to the infected people.

    initialization
    for t = 0 to T do
        foreach person v ∈ V do
            update his health state according to τ(v)
        get interventions At and update G accordingly
        foreach infectious node u ∈ V do
            foreach contact v of node u do
                if v is susceptible then
                    let transmission u → v occur with probability p(w(u, v))
                    if u → v does occur then
                        update τ(v) and π(v)
    output (τ, Π)

Our network partitioning method is simple yet efficient. We adopt an adjacency list representation of the network, where the directed edges are grouped by the tail nodes. The network naturally consists of |V| components, each corresponding to a node and containing the node and all outgoing edges from the node. The |V| components are assigned arbitrarily to processors, each processor containing approximately the same number of edges. Figure 3 is an example of our network partitioning approach. Suppose the component corresponding to v is assigned to processor γ. We say that γ is the owner of v and that v is local to γ. It is obvious that each person is local to exactly one processor and has exactly one owner. All information about person v is kept on his owner processor. Note that the head node of an edge stored on a processor may be local or non-local to that processor.
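The text says only that components are assigned arbitrarily while keeping the edge counts roughly equal. One simple way to get that balance, shown here purely as an assumption and not as the method used in EpiFast, is a greedy heuristic: take components in decreasing order of out-degree and give each one to the processor that currently holds the fewest edges. The function name and the example degrees are hypothetical.

```python
import heapq


def partition_by_edges(out_degree, num_processors):
    """Assign each node (with all its outgoing edges) to one processor.

    Returns (owner, loads): owner[v] is the processor that owns v and
    loads[p] is the number of directed edges stored on processor p.
    """
    heap = [(0, p) for p in range(num_processors)]  # (edge count, processor)
    heapq.heapify(heap)
    owner = {}
    for v in sorted(out_degree, key=out_degree.get, reverse=True):
        load, p = heapq.heappop(heap)               # least-loaded processor
        owner[v] = p
        heapq.heappush(heap, (load + out_degree[v], p))
    loads = [0] * num_processors
    for v, p in owner.items():
        loads[p] += out_degree[v]
    return owner, loads


# Hypothetical out-degrees for a 5-node network with 14 directed edges.
degrees = {"A": 4, "B": 3, "C": 3, "D": 2, "E": 2}
owner, loads = partition_by_edges(degrees, num_processors=2)
```

Whatever assignment rule is used, the key property is the one stated above: every node has exactly one owner, and an edge is stored with its tail node even when its head node lives on another processor.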

[Figure 3 diagram: a whole contact network with five nodes (A to E) and weighted edges, shown next to the components stored on Processor 1 and on Processor 2]

Figure 3: Network partitioning example. A 5-node network is partitioned and assigned to two processors. Each node together with its outgoing edges forms a component. Grouping components A, B together and C, D, E together makes the two processors contain the same number of edges (7). This approach achieves approximate load balancing.

Figure 4 depicts the executions and communications in the EpiFast algorithm with a symmetric parallel computation model. The communication complexity is quadratic in the number of processors, so the algorithm does not scale to many processors. Our solution to this problem is a master-slave computation model, where the single master processor knows the owner processor of each node, while each of the slave processors knows for each node whether it is local. The slave processor sends operation requests concerning any non-local nodes to the master processor. The master processor does no computation other than looking up the destination of each request and forwarding the requests to their owner processors. The master-slave model is described in Figure 5.

[Figure 4 diagram: four slave processors (Slave1 to Slave4) exchanging messages directly with one another, annotated with operations a, b, c, f]

Figure 4: A symmetric model for parallelization of EpiFast algorithm. Sequence of operations:
a: compute probabilistic transmissions from local nodes
b: update τ(v) of local nodes
c: send transmission (u → v) where v is non-local to owner of v
f: update τ(v) of received transmission (u → v)

[Figure 5 diagram: one master processor connected to four slave processors (Slave1 to Slave4), annotated with operations a to f]

Figure 5: A master-slave model for parallelization of EpiFast algorithm. Sequence of operations:
a: slave computes probabilistic transmissions from local nodes
b: slave updates τ(v) of local nodes
c: slave sends transmission (u → v) where v is non-local to master
d: master looks up owners of received messages
e: master sends transmission (u → v) to owner of v
f: slave updates τ(v) of received transmission (u → v)

Based on Figure 5 we parallelize the EpiFast algorithm as in Algorithm 2. Its computation complexity is linear in the number of infected nodes, and depends on the interventions. Regarding the communication cost, we have the following observation.

Proposition 3.1. The communication cost of Algorithm 2 is upper-bounded by a linear function in the number of infected nodes which does not depend on the number of processors.

Proof. Suppose the number of infected nodes during the epidemic is k. Each infected node is infected by a node that has either the same owner or a different owner. The total number of transmissions that need to be forwarded is upper-bounded by k. Therefore the total communication cost is at most ck, where c is the constant cost for forwarding one transmission.

Our parallel EpiFast algorithm is implemented in C++/MPI. The code is portable and tested on commodity clusters consisting of x86 processors or PowerPC processors, as well as SGI Altix systems.


Algorithm 2: Parallel EpiFast algorithm

Input/Output: same as in Algorithm 1.

    partition G
    initialization
    for t = 0 to T do
        if slave then
            foreach local node v do
                update his health state according to τ(v)
            get interventions At and update local partition of G accordingly
            foreach local infectious node u do
                foreach contact v of node u do
                    a: let transmission u → v occur with probability p(w(u, v))
                    if u → v does occur then
                        if v is local then
                            b: update τ(v) and π(v)
                        else
                            c: send transmission (u → v) to master
        if master then
            receive transmissions from slaves
            d: look up the destination of the transmissions
            e: forward the transmissions to owner slaves
        if slave then
            f: update τ(v) and π(v) for each received (u → v)
    output (τ, Π)
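The message flow of steps b to f in one simulated day can be imitated in a single Python process, without MPI. This is a didactic sketch only: the function `route_day`, the `owner` table, and the example transmissions are hypothetical, and the real implementation exchanges MPI messages between separate processes rather than appending to in-memory lists.

```python
def route_day(transmissions_by_slave, owner):
    """Route one day's successful transmissions through a single master.

    transmissions_by_slave[s] lists the (u, v) transmissions computed on
    slave s (step a). Returns (applied, forwarded): applied[s] lists the
    transmissions that slave s applies to its own nodes, and forwarded is
    the number of transmissions that had to pass through the master.
    """
    applied = {s: [] for s in transmissions_by_slave}
    to_master = []
    for s, transmissions in transmissions_by_slave.items():
        for u, v in transmissions:
            # A slave only needs to know whether v is local to it.
            if owner[v] == s:
                applied[s].append((u, v))        # step b: local update
            else:
                to_master.append((u, v))         # step c: send to master
    for u, v in to_master:
        destination = owner[v]                   # step d: master looks up owner
        # Steps e and f: forward to the owner slave, which applies the update.
        applied.setdefault(destination, []).append((u, v))
    return applied, len(to_master)


owner = {"A": 0, "B": 0, "C": 1, "D": 1}
day_transmissions = {0: [("A", "B"), ("A", "C")], 1: [("D", "A")]}
applied, forwarded = route_day(day_transmissions, owner)
```

Each slave talks only to the master, so the number of communication partners per slave stays at one regardless of how many slaves there are; the price is that every non-local transmission takes two hops instead of one.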

4. EXPERIMENTAL RESULTS

In this section, we describe detailed performance profiling and analysis of our EpiFast implementation on a terascale system. This system consists of 1000+ nodes, and each node has two 2.3 GHz PowerPC 970FX processors. The system uses an InfiniBand interconnect. We insert timing functions into the EpiFast source code to measure the running time of each computation phase on all participating processors.

The effects of interventions. Practical studies using EpiFast usually consist of a set of simulations, each of which simulates one or multiple intervention strategies. So it is useful to know the computational cost of interventions. In our performance analysis, we measure the running time without and with interventions. For the scenario without intervention, we use the following settings: the incubation period and infectious period of each person are sampled from distributions such that the former has a mean of 1.9 days and the latter has a mean of 4.1 days. Twenty people are randomly chosen as the seeds to initiate the epidemic. Under these settings, the epidemic has an attack rate¹ of about 42%. For the scenario with interventions, we consider the following two interventions.

1. Intervention AV. This is a non-adaptive antiviral intervention. From day 20 to day 39, 10% of the people are given antivirals for 20 consecutive days. The antiviral drug makes a person 87% less likely to be infected or, if he is already infected, 80% less likely to infect others. After day 39 the antiviral drug is no longer effective.

2. Intervention GSD. This is an adaptive social distancing intervention. When more than 0.1% of the population are infectious, 10% of the people avoid unnecessary activities, so they only go to work or school and otherwise stay home.

¹The attack rate of an epidemic is the fraction of the population that ever gets infected during the epidemic.

Figure 6 shows the running time of complete simulations (with 25 replicates and 180 days) for the above three scenarios for the Los Angeles area network. The LA network has 16 million people and nearly 900 million people-people contacts. This figure shows that the setting with interventions requires about 75% more computation time than that without interventions. However, the scaling trends are the same for both cases, with or without interventions. This figure indicates that EpiFast can finish a complete simulation in about three hours for either scenario, or in less than 15 minutes for a simulation with only one replicate. This experimentally confirms that EpiFast is a very fast algorithm.

[Figure 6 plot: "Execution Time for Los Angeles Network"; x-axis: number of processors (32, 96, 160, 224); y-axis: execution time (seconds), 0 to 30000; series: no intervention, antiviral, generic social distancing]

Figure 6: The execution time of EpiFast with and without interventions. There are three scenarios: (i) no intervention, (ii) non-adaptive antiviral intervention, (iii) adaptive social distancing intervention. The experimental settings are the same in all cases, except for the interventions.

Table 1: Experimental results under weak scaling. The Size column is the number of nodes (in millions) in the network; the CPUs column is the number of processors; the Time column is the average running time (in seconds) per simulation day.

Network   Size   CPUs   Time
Miami     2.09     32   0.47
Boston    4.15     64   0.54
Chicago   9.05    128   0.54

The strong scaling of EpiFast. As described in Section 3, by splitting the network nodes across processors, EpiFast reduces the memory footprint of each processor and decreases the execution time of simulations. Figure 7 shows the execution times and relative speedups of EpiFast for four different networks: Miami, Boston, Chicago, and Los Angeles. The simulations were configured to use the same incubation and infectious period distributions as described above, plus the same kind of generic social distancing intervention. As the simulations require large memory footprints that cannot fit into a single compute node, we compute the relative speedup using the execution time on 32 processors as the base. Figure 7 indicates that EpiFast scales reasonably well for a fixed problem size, achieving a 3.1 to 4.1 times speedup when using 6x more processors (224 versus 32). The diminishing performance gains from using more processors are largely caused by the initialization phase, which reads the network data from disk and then distributes it to all processors. Our profiling shows that the initialization phase accounts for more than 30% of the total execution time of a single replicate and that this cost does not scale well with the number of processors.
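As a small worked example of the strong-scaling measure, the following sketch computes relative speedup against the 32-processor base and the corresponding relative parallel efficiency. The function names are illustrative; the only numbers used are the 3.1x and 4.1x speedups and the processor counts 32 and 224 quoted in the text.

```python
def relative_speedup(base_time, time):
    """Speedup relative to the base run (here, the run on 32 processors)."""
    return base_time / time


def relative_efficiency(speedup, base_procs, procs):
    """Speedup divided by the factor by which the processor count grew."""
    return speedup / (procs / base_procs)


if __name__ == "__main__":
    # Speedups quoted in the text when going from 32 to 224 processors.
    for speedup in (3.1, 4.1):
        eff = relative_efficiency(speedup, base_procs=32, procs=224)
        print(f"speedup {speedup:.1f} -> relative efficiency {eff:.2f}")
```

Going from 32 to 224 processors multiplies the processor count by 7, so the quoted speedups correspond to a relative efficiency of roughly 0.44 to 0.59. That is consistent with a non-scaling initialization phase taking a sizable share of the run.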

The weak scaling of EpiFast. As it is difficult to obtain weak scaling data using realistic networks, we discuss it here empirically. Table 1 shows the experimental results of EpiFast for three networks under the weak scaling rule. As the network sizes are roughly doubled, we double the number of processors to keep the execution time constant. As shown in the table, the per-day simulation times for the three networks are indeed close. This confirms that the EpiFast algorithm has very good weak scaling properties and can thus feasibly simulate very large networks just by increasing the number of processors.
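The weak-scaling data in Table 1 can be summarized with a short calculation. The sketch below derives the per-processor load and a weak-scaling efficiency, taken here as the Miami time divided by each network's time. That efficiency definition is a common convention adopted for illustration, not one stated in the paper.

```python
# Rows of Table 1: (network, nodes in millions, processors, seconds per simulated day).
TABLE_1 = [
    ("Miami", 2.09, 32, 0.47),
    ("Boston", 4.15, 64, 0.54),
    ("Chicago", 9.05, 128, 0.54),
]


def weak_scaling_report(rows):
    """Return (network, nodes per processor, efficiency relative to the first row)."""
    base_time = rows[0][3]
    report = []
    for name, size_millions, cpus, seconds in rows:
        nodes_per_cpu = size_millions * 1e6 / cpus
        efficiency = base_time / seconds  # 1.0 means perfectly flat time per day
        report.append((name, nodes_per_cpu, efficiency))
    return report


if __name__ == "__main__":
    for name, load, eff in weak_scaling_report(TABLE_1):
        print(f"{name:8s} {load:10.0f} nodes/processor  efficiency {eff:.2f}")
```

With these numbers the per-processor load stays between roughly 65,000 and 71,000 nodes, and the efficiency relative to Miami is about 0.87 for both Boston and Chicago.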

Better graph partitioning algorithm. EpiFast uses a simple and scalable algorithm to partition the network. Nodes are assigned in the order of their ids to the processors in the order of their ids. That is, there exist n_0 = −∞, n_1, n_2, ..., n_J, where J is the number of slave processors, such that the ids of the nodes assigned to processor j all lie in the interval (n_{j−1}, n_j]. Each processor only needs to store {n_1, n_2, ..., n_J} to know the owner of every node. This scheme appears adequate for the networks we are currently working with.
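The owner lookup implied by this scheme amounts to a binary search over the stored boundaries. The sketch below is illustrative only: the function names are hypothetical, and splitting the sorted ids into equally sized chunks is an assumption, since the text does not say how the boundaries n_1, ..., n_J are chosen.

```python
import bisect


def make_boundaries(sorted_ids, num_parts):
    """Choose n_1..n_J so that each part holds a contiguous run of ids.

    Assumption: parts are balanced by node count; ids are distinct and sorted.
    """
    n = len(sorted_ids)
    if not 1 <= num_parts <= n:
        raise ValueError("need between 1 and len(sorted_ids) parts")
    # The j-th boundary is the largest id placed in part j.
    return [sorted_ids[(j * n) // num_parts - 1] for j in range(1, num_parts + 1)]


def owner(boundaries, node_id):
    """Return the 1-based processor j with n_{j-1} < node_id <= n_j (n_0 = -infinity)."""
    j = bisect.bisect_left(boundaries, node_id)
    if j == len(boundaries):
        raise ValueError("node id exceeds the last boundary")
    return j + 1


if __name__ == "__main__":
    ids = [3, 5, 8, 13, 21, 34, 55, 89]
    bounds = make_boundaries(ids, 4)
    print(bounds)
    print([owner(bounds, i) for i in ids])
```

Each processor stores only the J boundaries, so a lookup costs O(log J) time and needs no communication.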

More sophisticated partitioners like Metis [13] can improve the communication complexity. The improvement, however, is limited because the running time of EpiFast is dominated by local computations. We use the k-way partitioning routine in Metis version 5.0 to partition the Miami contact network into 2^i parts, i = 3, 4, 5, 6, and measure the average running time of EpiFast for simulating the non-intervention case, based on this partitioning. We do the same with our simple partitioning algorithm and compare the two methods. Note that Metis partitioning is done in a preprocessing step and its execution time is not counted in the simulation time, while our simple partitioning is done online during the simulation. The Miami population size is about 2 million. The running time is per replicate, averaged over 20 replicates. The simulation is run on an IBM PowerPC-64 (Power4+) cluster, each node of which has eight 1.7 GHz Power5 processors and 16 GB of memory. The running times under the two partitioning schemes are plotted in Figure 8.
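One way to see what a structure-aware partitioner can buy is to count the edges that cross parts, since each such edge implies communication between processors. The toy graph below is made up for illustration, and the "structure-aware" assignment is chosen by hand, not produced by Metis. Using the edge cut as a proxy for communication cost is an assumption of this sketch.

```python
def edge_cut(edges, part_of):
    """Number of edges whose endpoints lie in different parts."""
    return sum(1 for u, v in edges if part_of[u] != part_of[v])


# Toy graph: two triangles {0, 2, 4} and {1, 3, 5} joined by the edge (4, 5).
EDGES = [(0, 2), (2, 4), (0, 4), (1, 3), (3, 5), (1, 5), (4, 5)]

# Simple partitioning by id order: ids 0..2 go to part 0, ids 3..5 to part 1.
BY_ID = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

# Hand-picked structure-aware partitioning: one triangle per part.
BY_STRUCTURE = {0: 0, 2: 0, 4: 0, 1: 1, 3: 1, 5: 1}

if __name__ == "__main__":
    print("id-order cut:", edge_cut(EDGES, BY_ID))
    print("structure-aware cut:", edge_cut(EDGES, BY_STRUCTURE))
```

In this toy case the id-order split cuts four edges while the structure-aware split cuts one. A smaller cut only helps running time to the extent that communication, not local computation, is the bottleneck.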

[Figure 8 plot: "Execution Time with Different Partitioning Algorithms". x-axis: number of slave processors (network parts), 0 to 70; y-axis: execution time (seconds), 0 to 180; series: simple partitioning, Metis partitioning.]

Figure 8: Metis graph partitioner slightly improves performance of EpiFast.

5. APPLICATIONS OF EPIFAST

The high performance of the EpiFast algorithm does not sacrifice its usefulness. Epidemiologists have started to apply our EpiFast implementation to various simulations that compute interesting measures that were previously computationally infeasible, as well as to real studies that aid public health policy makers in evaluations and decisions. In Section 5.1 we show a measure that probably only EpiFast can excel in computing. In Section 5.2, we present a real case study.

5.1 Variation in Temporal Vulnerability

We now discuss a setting which illustrates the need for a fast simulation tool such as EpiFast for computing epidemic properties. The vulnerability f(v) of a node v is defined as the probability that node v gets infected, starting from a random initial infection. The temporal vulnerability f_t(v) is defined as the probability that node v gets infected during the first t time steps, starting from a random initial infection. The vulnerability measure is a basic property of disease propagation on a social network and its properties are therefore of interest from a public health perspective. It is related to classical bond percolation [21], which helps in fast estimation, but the temporal vulnerability cannot be computed directly by bond percolation techniques. We find that the temporal vulnerability can be estimated by Monte Carlo simulations using EpiFast, illustrating its power and applicability.
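The Monte Carlo estimate of f_t(v) is simply the fraction of replicates in which v is infected by time step t. The sketch below uses a deliberately simplified toy epidemic, assuming a single random seed node, one day of infectiousness, and a fixed per-contact probability p. It stands in for EpiFast only to make the estimator concrete, and all names are hypothetical.

```python
import random


def run_replicate(adj, p, horizon, rng):
    """Toy epidemic: return {node: day of infection} for one replicate.

    Assumptions: one uniformly random seed node infected on day 0; each newly
    infected node is infectious for exactly one day and infects each still
    susceptible neighbor independently with probability p.
    """
    seed_node = rng.choice(sorted(adj))
    infected_on = {seed_node: 0}
    frontier = [seed_node]
    for day in range(1, horizon + 1):
        newly_infected = []
        for u in frontier:
            for v in adj[u]:
                if v not in infected_on and rng.random() < p:
                    infected_on[v] = day
                    newly_infected.append(v)
        frontier = newly_infected
    return infected_on


def temporal_vulnerability(runs, nodes, t):
    """Estimate f_t(v): fraction of replicates with v infected on or before day t."""
    return {
        v: sum(1 for r in runs if v in r and r[v] <= t) / len(runs)
        for v in nodes
    }


if __name__ == "__main__":
    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}  # a path of four nodes
    rng = random.Random(0)
    runs = [run_replicate(adj, 0.5, 10, rng) for _ in range(500)]
    for t in (1, 3, 10):
        print(t, temporal_vulnerability(runs, adj, t))
```

Because every replicate records infection days up to the horizon, a single batch of replicates yields f_t(v) for all t at once. The estimates are non-decreasing in t by construction.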

In this experiment we use the contact network representing the Chicago population, whose size is about 8.8 million. Regarding interventions, we study the following four cases: (i) base case, no intervention; (ii) work closure on day 40 with 50% compliance; (iii) school closure on day 50 with 60% compliance; (iv) work closure on day 40 with 50% compliance and school closure on day 50 with 60% compliance. The attack rate is about 70% in case (i). The variations in the vulnerability distributions after 60 and 80 days are shown in Figures 9 and 10, respectively. The figures show that social distancing resulting from work or school closure has a very significant impact on the disease dynamics.

We run the experiment with EpiFast on an IBM PowerPC-64 (Power4+) cluster. We use four nodes, each with eight 1.7 GHz Power5 processors and 16 GB of memory. In each case the simulation runs for 500 replicates. The average running time per replicate is 2.35 minutes in case (i), 2.60 minutes in case (ii), 1.52 minutes in case (iii), and 2.03 minutes in case (iv). The difference in running time is mainly due to the different interventions and the different resulting attack rates: more interventions take more time, while a smaller attack rate results in less running time.


[Figure 7 plots. First panel: "Execution Time with Increasing Number of Processors"; x-axis: number of processors (32, 96, 160, 224); y-axis: execution time (seconds), 0 to 25000. Second panel: "Relative Speedup with Increasing Number of Processors"; x-axis: number of processors (32, 96, 160, 224); y-axis: relative speedup, 1 to 4.5. Series in both panels: Miami, Boston, Chicago, Los Angeles.]

Figure 7: The strong scaling of EpiFast for four different networks. Each simulation includes 25 replicates of 180 days epidemic duration. All times are reported as wall clock time of the entire simulation runs.

[Figure 9 plot: "Histograms of Vulnerability up to Day 60". x-axis: number of times being infected, 0 to 500; y-axis: number of nodes, 0 to 600000; series: Base, WorkClosure, SchoolClosure, WorkClosure + SchoolClosure.]

Figure 9: Vulnerability distribution after 60 days.

[Figure 10 plot: "Histograms of Vulnerability up to Day 80". x-axis: number of times being infected, 0 to 500; y-axis: number of nodes, 0 to 450000; series: Base, WorkClosure, SchoolClosure, WorkClosure + SchoolClosure.]

Figure 10: Vulnerability distribution after 80 days.

5.2 Evaluating the effectiveness of interventions

A typical case study is a full experimental design comparing the efficacy of different combinations of interventions. Each combination, called a cell of the experimental design, is run many times to quantify the random variability inherent in a stochastic simulation.

In the following simple study, four different interventions are combined. Each intervention has two levels, off and on, giving a total of sixteen combinations. Each cell is run 25 times with a different random number seed each time. The study was run on a social network representing the greater Los Angeles metropolitan region, specifically the counties of Imperial, Los Angeles, Orange, Riverside, San Bernardino, and Ventura, for a total population of 16 million people.
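The full factorial design described above can be enumerated mechanically. In the sketch below, the intervention labels follow the abbreviations used in the Figure 11 caption (v, av, sdg, sds). The seed numbering is a hypothetical scheme chosen only so that every run gets a distinct seed.

```python
from itertools import product

INTERVENTIONS = ("v", "av", "sdg", "sds")  # vaccination, antiviral, generic SD, school closure
REPLICATES_PER_CELL = 25


def design_cells(interventions=INTERVENTIONS):
    """All off/on combinations: one dict per cell, mapping intervention -> bool."""
    return [
        dict(zip(interventions, levels))
        for levels in product((False, True), repeat=len(interventions))
    ]


def simulation_runs(cells, replicates=REPLICATES_PER_CELL):
    """One (cell index, replicate index, seed) triple per simulation run."""
    runs = []
    for cell_index in range(len(cells)):
        for rep in range(replicates):
            seed = cell_index * replicates + rep  # hypothetical seeding scheme
            runs.append((cell_index, rep, seed))
    return runs


if __name__ == "__main__":
    cells = design_cells()
    runs = simulation_runs(cells)
    print(len(cells), "cells,", len(runs), "runs")
    print(cells[0], cells[-1])
```

Sixteen cells with 25 replicates each give 400 independent simulation runs.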

The interventions used are: a vaccine that has a 30% efficacy and a 40% compliance rate, administered on day 0 of the simulation; an antiviral with an 87% efficacy against infection, an 80% efficacy against transmission, and a 60% compliance rate, administered when 1% of the population becomes infected; generic social distancing with a 50% compliance rate, put into effect when 0.5% of the population is infected; and school closure, with a 60% compliance rate, put into effect when 0.5% of the population becomes infected.

Figure 11 shows the results of the sixteen cells of the experimental design. Each point represents the proportion of the population infected for a particular cell, averaged over all 25 replicates of that cell. The left half of the figure shows the cells without school closure (the most effective intervention), and the right half shows the cells with school closure. Within each half, the left side shows the cells without vaccination (the second most effective intervention), and the right side shows the cells with vaccination. This pattern repeats for generic social distancing and antivirals (the least effective intervention). The simulations were run on 50 nodes (200 cores) of a commodity Linux cluster with 2.2 GHz dual-socket, dual-core processors, 8 GB of main memory per node, and a Myrinet interconnect.

[Figure 11 plot: y-axis: proportion infected, 0.2 to 0.6; x-axis: the sixteen cells of the design, labeled by four rows (av, sdg, vc, sds) marked X or O for each cell.]

Figure 11: Summary of the results of an experimental design with 4 interventions: vaccination (v), antiviral (av), school closure (sds), and generic social distancing (sdg). An X indicates that the corresponding intervention was in effect for that cell; an O indicates that the intervention was not in effect. Each point represents the average infection rate across 25 replicates.

6. RELATED WORK

Mathematical models have been extensively used by epidemiologists to study the evolution of epidemics. Among them, the compartmental models (e.g., the classic Kermack-McKendrick model [1, 14, 20]) divide the population into compartments according to people's health state, and apply differential equations to study the evolution of epidemics in the system. They rely on an assumption of perfect mixing that is unrealistic. Contact network epidemiology [17–19] models epidemics as a random propagation process in a heterogeneous contact network and derives analytical results. The analyses in this area, however, often rely on assumptions about the network structure. They are unable to handle arbitrary, irregular contact networks. In addition, physicists have used a percolation-based approach to simulate the spread of infectious diseases [21]. Such an approach is quite efficient, but, like the previous approach, it yields only the final outbreak size; the time-varying information about disease dynamics (information about transients) cannot be obtained by such methods.

Recently, individual-based network models have become a preferred approach for epidemic and social simulations [3, 5, 7, 8, 11, 12, 22]. Among existing simulators, EpiFast is closely related to EpiSims [8, 9] and EpiSimdemics [3]. All three simulators are built upon the Simdemics framework [4] and provide a high-fidelity environment for modeling disease spread across large realistic contact networks. Both EpiSims and EpiSimdemics aim to study very general contagion processes using a probabilistic timed transition system (PTTS) model [3]. They also use a person-location network representation that includes detailed individual activities. These two simulators can be used to study very general disease models but require significant computing resources. By incorporating disease semantics and the structure of social networks into algorithm design and implementation, EpiSimdemics gains much higher scalability than EpiSims, which is based on the classical PDES approach. In contrast to these two simulators, EpiFast significantly speeds up the simulation process by using a simpler (yet realistic) disease model and restricting the set of applicable interventions. Our experimental results indicate that EpiFast usually runs 10 times faster than EpiSimdemics for equivalent networks, though we note that these two simulators are based on different simulation methodologies and therefore differ in modeling capability and flexibility.

EpiFast is also related to other parallel simulation systems developed by Longini et al. [16], Parker [22], and Ferguson et al. [10, 11]. The system implemented by Ferguson et al. runs on a shared memory platform, so the problem size it can handle is constrained by the amount of available shared memory. The work of Longini et al. is indeed a parallel simulation like EpiSimdemics, but the locations in its underlying contact networks are not real; they are surrogates for simple location types such as school, home, etc. This results in a structured social contact network that is more amenable to efficient parallel computation, but which, arguably, is less representative of real-world social networks. Parker [22] has developed a parallel shared memory simulation that can scale to 300 million individuals. As in the work of Longini et al. and Ferguson et al., the underlying social contact networks are highly structured and randomly distributed.

7. CONCLUSIONS

This paper describes EpiFast, a novel parallel algorithm and implementation for simulating disease spread on large realistic social contact networks. By integrating the semantics of the SEIR disease models and using a pre-constructed social contact network, EpiFast reduces the SEIR epidemic simulation problem to a sequence of graph operations, and thereby significantly decreases the simulation cost. By formally incorporating interventions into the simulation problem itself, EpiFast has a wide range of practical applications in public health and social behavior studies. Though in this paper we focus our discussion on disease spread, it is worth pointing out that EpiFast is a general fast simulation algorithm that can be applied to other stochastic diffusion processes as well, such as virus propagation in computer networks and knowledge spreading in social networks.

The EpiFast algorithm is scalable on both shared memory systems and distributed memory systems. The algorithm incurs low communication costs and is tolerant of large communication overhead. Thus, it not only runs extremely fast for typical contact networks with populations of several million but also makes simulating extremely large networks a reality. In this paper, we describe the parallel performance of EpiFast based on a single simulation. For practical applications, however, we can further exploit two higher levels of parallelism, replicate-level parallelism and cell-level parallelism, to make full use of the computational capability provided by emerging petascale systems.

Our current EpiFast implementation is based on the MPI programming model. In the near future, we plan to investigate EpiFast implementations based on other programming models such as UPC and MPI+OpenMP. Considering the promising performance benefits of accelerator-based HPC architectures, we will also explore the feasibility of porting EpiFast to GPGPU or Cell-based clusters.

8. REFERENCES

[1] R. M. Anderson and R. M. May. Population biology of infectious diseases: Part I. Nature, 280:361–367, Aug. 1979.

[2] N. Bailey. The Mathematical Theory of Infectious Diseases and Its Applications. Hafner Press, New York, 1975.

[3] C. L. Barrett, K. R. Bisset, S. Eubank, X. Feng, and M. V. Marathe. Episimdemics: an efficient algorithm for simulating the spread of infectious disease over large realistic social networks. In Proceedings of the ACM/IEEE Conference on High Performance Computing (SC), page 37, 2008.

[4] C. L. Barrett, S. Eubank, and M. Marathe. An interaction based approach to computational epidemics. In AAAI '08: Proceedings of the Annual Conference of AAAI, Chicago USA, 2008. AAAI Press.

[5] K. M. Carley, D. B. Fridsma, E. Casman, A. Yahja, N. Altman, L.-C. Chen, B. Kaminsky, and D. Nave. BioWar: scalable agent-based model of bioattacks. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 36(2):252–265, 2006.

[6] Dept. of Health and Human Services. HHS pandemic influenza plan, 2007.

[7] P. S. Dodds and D. J. Watts. A generalized model of social and biological contagion. Journal of Theoretical Biology, 232:587–604, 2005.

[8] S. Eubank. Scalable, efficient epidemiological simulation. In SAC '02: Proceedings of the 2002 ACM Symposium on Applied Computing, pages 139–145, New York, NY, USA, 2002. ACM.

[9] S. Eubank, H. Guclu, V. S. A. Kumar, M. V. Marathe, A. Srinivasan, Z. Toroczkai, and N. Wang. Modelling disease outbreaks in realistic urban social networks. Nature, 429(6988):180–184, May 2004.

[10] N. Ferguson, D. Cummings, S. Cauchemez, C. Fraser, S. Riley, A. Meeyai, S. Iamsirithaworn, and D. Burke. Strategies for containing an emerging influenza pandemic in Southeast Asia. Nature, 437:209–214, 2005.

[11] N. Ferguson et al. Strategies for mitigating an influenza pandemic. Nature, 442:448–452, 2006.

[12] T. C. Germann, K. Kadau, I. M. Longini Jr., and C. A. Macken. Mitigation strategies for pandemic influenza in the United States. Proc. of National Academy of Sciences, 103(15):5935–5940, April 11 2006.

[13] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1999.

[14] W. O. Kermack and A. G. McKendrick. A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 115(772):700–721, Aug. 1927.

[15] Y. Kuznetsov and C. Piccardi. Bifurcation analysis of periodic SEIR and SIR epidemic models. Journal of Mathematical Biology, 32:109–121, 1994.

[16] I. M. Longini, A. Nizam, S. Xu, K. Ungchusak, W. Hanshaoworakul, D. A. T. Cummings, and M. E. Halloran. Containing pandemic influenza at the source. Science, 309(5737):1083–1087, 2005.

[17] L. A. Meyers. Contact network epidemiology: Bond percolation applied to infectious disease prediction and control. Bulletin of The American Mathematical Society, 44:63–86, 2007.

[18] L. A. Meyers, M. E. J. Newman, and B. Pourbohloul. Predicting epidemics on directed contact networks. Journal of Theoretical Biology, 240(3):400–418, 2006.

[19] L. A. Meyers, B. Pourbohloul, M. E. J. Newman, D. M. Skowronski, and R. C. Brunham. Network theory and SARS: predicting outbreak diversity. Journal of Theoretical Biology, 232(1):71–81, 2005.

[20] J. D. Murray. Mathematical Biology I. An Introduction, volume 19 of Biomathematics. Springer-Verlag, 3rd edition, 2002.

[21] M. Newman. The structure and function of complex networks. SIAM Review, 45(2):167–256, 2003.

[22] J. Parker. A flexible, large-scale, distributed agent based epidemic model. In Winter Simulation Conference, 2007.

[23] World Health Organization. World health report 2007: A safe future: global public health security in the 21st century, 2007.
