
Parallel Simulation for a Fish Schooling Model on a General-Purpose Graphics Processing Unit

Hong Li
Department of Computer Science
University of California, Santa Barbara, CA 93106
[email protected]

Allison M. Kolpas
Department of Mathematics
University of California, Santa Barbara, CA 93106
[email protected]

Linda Petzold
Department of Computer Science
University of California, Santa Barbara, CA 93106
[email protected]

Jeff Moehlis
Department of Mechanical Engineering
University of California, Santa Barbara, CA 93106
[email protected]

Abstract

We consider an individual-based model for fish schooling which incorporates a tendency for each fish to align its position and orientation with an appropriate average of its neighbors' positions and orientations, plus a tendency for each fish to avoid collisions. To accurately determine statistical properties of the collective motion of fish whose dynamics are described by such a model, many realizations are typically required. This carries a very high computational cost. The current generation of graphics processing units is well-suited to this task. We describe our implementation, and present computational experiments illustrating the power of this technology for this important and challenging class of problems.

1. Introduction

Many organisms display ordered collective motion [1], such as geese flying in a Chevron-shaped formation [5], wildebeests herding on the Serengeti plains of Africa [15], locusts swarming in sub-Saharan Africa [16], and fish schooling [13]. The number of individuals involved in such collective motion can be huge, from several hundred thousand wildebeests to millions of Atlantic cod off the Newfoundland coast. Despite these large numbers, the group can seem to move as a single organism, with all individuals responding very rapidly to their neighbors to maintain the collective motion. Advantages of collective motion include a reduction in predation [10, 4], increased harvesting efficiency [8, 13], and improved aerodynamic efficiency [17, 9].

The mathematical study of biological collective motion has proceeded, broadly speaking, on two fronts. First, one can model each organism individually, with rules specifying its dynamics and interactions with other organisms. Such "individual-based models" can incorporate experimental observations of the behavior of the organisms, thereby giving biologically realistic models. On the other hand, one can define "continuum models" to describe the dynamics of the population; for example, one might study a model for the evolution of the density of the organisms. However, available continuum models are often qualitative caricatures that cannot capture all of the details which can be included in individual-based models, thereby compromising biological

realism.

In this paper, we consider an individual-based model for fish schooling. This model incorporates a tendency for each fish to align its position and orientation with an appropriate average of its neighbors' positions and orientations, plus a tendency for each fish to avoid collisions. Furthermore, stochasticity is included in the model to account for imperfections in the gathering of information and in acting on this information. To accurately determine statistical properties of the collective motion of fish whose dynamics are described by such an individual-based model, many realizations are typically required; see, e.g., [3]. This can entail a very large amount of computation. After describing the model in more detail, we will describe how a Graphics Processor Unit (GPU) can be used to very efficiently carry out parallel simulations of this model.

2 Fish Schooling Model

We consider a two-dimensional individual-based model for schooling with local behavioral interactions. This model is similar to that of [2], but without an informed leader, and with different weights of orientation and attraction response. Groups are composed of N individuals with positions c_i(t), unit directions v̂_i(t), constant speed s, and maximum turning rate θ. At every time step of size τ, individuals simultaneously determine a new direction of travel by considering neighbors within two behavioral zones. The first zone, often referred to as the "zone of repulsion" [3], is represented by a circle of radius r_r about the individual. Individuals repel from neighbors that are within this zone, which typically has a radius of one body length. The second zone is represented by an annulus of inner radius r_r and outer radius r_p about the individual. This zone also includes a blind area, defined as a circular sector with central angle (2π − α), for which neighbors within the zone are undetectable. Individuals orient with and are attracted towards neighbors within this second zone.

These zones are used to define behavioral rules of motion. First, if individual i finds agents within its zone of repulsion, then it orients its direction away from the average relative directions of those within its zone of repulsion. Its desired direction of travel in the next time step is given by

v_i(t + τ) = − Σ_{j≠i} (c_j(t) − c_i(t)) / |c_j(t) − c_i(t)|.   (1)

This vector is normalized as v̂_i(t + τ) = v_i(t + τ) / |v_i(t + τ)|, assuming v_i(t + τ) ≠ 0. In the case that v_i(t + τ) = 0, agent i maintains its previous direction of travel as its desired direction of travel, giving v̂_i(t + τ) = v̂_i(t).

If agents are not found within individual i's zone of repulsion, then it will align with (by averaging the directions of travel of itself and its neighbors) and feel an attraction towards (by orienting itself towards the average relative directions of) agents within the second zone. The desired direction of agent i is given by the weighted sum of these two terms:

v_i(t + τ) = ω_a Σ_{j≠i} (c_j(t) − c_i(t)) / |c_j(t) − c_i(t)| + ω_o Σ_j v_j(t) / |v_j(t)|,   (2)

where ω_a and ω_o are the weights of attraction and alignment, respectively. This vector is then normalized as v̂_i(t + τ) = v_i(t + τ) / |v_i(t + τ)|, assuming v_i(t + τ) ≠ 0. As before, if v_i(t + τ) = 0, then agent i maintains its previous direction of travel.

We denote r = ω_o/ω_a as the ratio of orientation to attraction tendencies. When r = 0 (ω_o = 0), individuals have no desire to orient with their neighbors. As r approaches 1, individuals balance their orientation and attraction preferences. For r > 1, individuals are more interested in orientation with their neighbors than attraction towards them.

Stochastic effects are incorporated into the model by rotating agent i's desired direction v̂_i(t + τ) by an angle drawn from a circularly wrapped normal distribution with mean 0 and standard deviation σ. Also, since individuals can only turn θτ radians in one time step, if the angle between v̂_i(t) and v̂_i(t + τ) is greater than θτ, individuals do not achieve their desired direction, and instead rotate θτ towards it. Finally, each agent's position is updated simultaneously as

c_i(t + τ) = c_i(t) + s v̂_i(t + τ) τ.   (3)

3 The Graphics Processor Unit - A New Data-Parallel Computing Device

3.1 Modern Graphics Processor Unit

The general-purpose Graphics Processing Unit (GPU) is becoming a viable option for many parallel programming applications. The GPU has a highly parallel structure with high memory bandwidth and more transistors devoted to data processing than to data caching and flow control, compared with a CPU architecture, as shown in Figure 1 [11]. Problems that can be implemented with stream processing and using limited memory are well-suited to the GPU architecture. Single Instruction Multiple Data (SIMD) computation, in which a large number of totally independent records are processed by the same sequence of operations simultaneously, is ideal for GPU application. We will show how our computation can be structured to fit within the constraints of the GPU architecture.

Figure 1. CPU vs. GPU architecture. There are more transistors devoted to data processing rather than to data caching and flow control for the GPU, compared with the CPU.

3.2 NVIDIA 8 Series GeForce-Based GPU Architecture

The NVIDIA 8800 GTX chip, released at the end of 2006, has 768 MB RAM and 681 million transistors on a 480 mm² surface area. There are 128 stream processors on a GeForce 8800 GTX chip, divided into 16 multiprocessors with 8 streaming processors each, as shown in Figure 2 [11]. The 8 processors in each multiprocessor share 16 KB of shared memory, which brings data closer to the Arithmetic Logic Unit (ALU). The peak computation rate accessible for general-purpose applications is (16 multiprocessors × 8 processors/multiprocessor) × (2 flops/MAD, where a MAD is a multiply-add) × (1 MAD/processor-cycle) × 1.35 GHz = 345.6 GFLOP/s, since the processors are clocked at 1.35 GHz with dual processing of scalar operations supported. The maximum observed bandwidth between system and device memory is about 2 GB/second.

Unfortunately, current GPU chips support only single-precision floating point, while current CPUs support 64-bit technology. Additionally, there is only 16 KB of fast read/write on-chip memory shared between the eight processors on each multiprocessor.

3.3 CUDA: Compute Unified Device Architecture

The Compute Unified Device Architecture (CUDA) Software Development Kit (SDK) supplies general-purpose functionality for non-graphics applications on the NVIDIA GPU. CUDA provides an essential high-level development environment based on standard C, which results in a minimal learning curve for beginners to access the low-level hardware. For development flexibility, CUDA provides both scatter and gather memory operations. It also supports fast read/write shared memory [11].

Following the data-parallel model, the structure of CUDA computation allows each of the processors to execute the same instruction sequence on different sets of data in parallel.

Figure 2. Hardware model: a set of SIMD multiprocessors with on-chip shared memory.

The data can be broken into a 1D or 2D grid of thread blocks. Each block can be specified as a 1D, 2D, or 3D array of threads which collaborate via the shared memory. Up to 512 threads per block can be active.
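As a concrete illustration of this execution model, the following minimal CUDA program launches a kernel over a 1D grid of blocks, each a 1D array of threads within the 512-thread limit. The kernel and variable names are ours, not from our simulation code.

    #include <cuda_runtime.h>

    /* Each thread handles one array element; idx is its global thread id. */
    __global__ void scale(float *d_data, float factor, int n)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            d_data[idx] *= factor;
    }

    int main(void)
    {
        const int n = 1 << 16;
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));

        dim3 threads(128);                            /* threads per block */
        dim3 blocks((n + threads.x - 1) / threads.x); /* 1D grid of blocks */
        scale<<<blocks, threads>>>(d_data, 2.0f, n);
        cudaThreadSynchronize();                      /* wait for the kernel */

        cudaFree(d_data);
        return 0;
    }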

Currently the CPU and GPU cannot run in parallel. It is also not possible to execute multiple kernels at the same time through CUDA, or to download data and run a kernel in parallel. Users can branch in the kernel based on the thread id to achieve multiple tasks in a single kernel, but such branches slow the simulation down. Individual GPU program launches can run for at most 5 seconds on a GPU with a display attached.

4. Parallel Simulation

To accurately determine statistical properties of the collective motion of fish, many realizations are typically required for a given set of parameters. This can be very computationally intensive. There are essentially two ways to improve the performance: parallelize the simulation across the realizations, and parallelize the simulation within one realization. We do both.

4.1 Random Number Generation

To generate the initial conditions and to add noise to our calculations at each time step, we need many uniformly distributed random numbers. So that our statistical results can be trusted, it is particularly important that these random numbers are independent. Generating independent sequences of random numbers is one of the most important issues in implementing simulations of ensembles of fish schooling models in parallel.

Originally we considered pre-generating a sequence of uniform random numbers on the CPU. Since the CPU and GPU cannot run in parallel, we could pre-generate a huge number of random numbers, store them in the shared memory, and swap back to the CPU to generate more when they are used up. Alternatively, we could pre-generate a huge number of random numbers and store them in the global memory. These methods spend too much time on data access. Furthermore, the Scalable Parallel Random Number Generators Library (SPRNG) [6, 7], which might otherwise have been an ideal choice because of its excellent statistical properties, cannot be moved to the GPU efficiently due to its complicated data structure. The only practical solution appears to be to employ a simple random number generator on the GPU. Experts suggest using a mature random number generator instead of inventing a new one, since it requires great care and extensive testing to evaluate a random number generator [14]. Thus we chose the Mersenne Twister [12], which has passed many statistical randomness tests, including the stringent Diehard tests. In our implementation, we modified Eric Mills' multithreaded C implementation [12]. Since many random numbers are required by this model, we use the shared memory for random number generation to minimize the data launching and accessing time.
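The essential pattern is one generator state per thread, held in fast shared memory. The sketch below illustrates the pattern with a simple linear congruential generator standing in for the Mersenne Twister we actually used; the names and the seeding scheme are illustrative only.

    #define THREADS_PER_BLOCK 128

    /* 32-bit LCG (Numerical Recipes constants); returns a uniform in [0,1).
       A stand-in for the Mersenne Twister, for illustration only. */
    __device__ float lcg_uniform(unsigned int *state)
    {
        *state = *state * 1664525u + 1013904223u;
        return (float)(*state >> 8) / 16777216.0f;
    }

    __global__ void noisy_kernel(const unsigned int *d_seeds, float *d_out)
    {
        /* One generator state per thread, kept in shared memory. */
        __shared__ unsigned int rng[THREADS_PER_BLOCK];
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        rng[threadIdx.x] = d_seeds[gid];   /* independent seed per thread */

        /* ... inside the simulation loop, draw noise as needed ... */
        d_out[gid] = lcg_uniform(&rng[threadIdx.x]);
    }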

4.2 Parallelizing Across the Realizations

Parallelizing the independent realizations is an effective way to improve the performance. Ensembles of fish schooling runs are very well-suited for implementation on the GPU through CUDA.

The whole fish schooling simulation can be put into a single kernel running in parallel on a large set of system state vectors Px_i(j), Py_i(j), Vx_i(j), Vy_i(j), respectively representing the x and y components of the positions of the fish and the x and y components of the velocities of the fish, where i is the block id and j is the id of the fish. The initial conditions Px_i(0), Py_i(0), Vx_i(0), Vy_i(0) are randomly generated by different processors within each multiprocessor. The large set of final state vectors Px_i(t_f), Py_i(t_f), Vx_i(t_f), Vy_i(t_f) will contain the desired results. We minimize the transfer between the host and the device by using an intermediate data structure on the device, and batch a few small transfers into one big transfer to reduce the per-transfer overhead.
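A host-side sketch of this organization follows: one block per realization, the state stored as separate x/y position and direction arrays, and a single batched copy of each final state array back to the host. The array and kernel names are ours; the kernel body stands for the per-realization simulation described in Section 4.3.

    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define N_FISH 100    /* fish per school */
    #define N_REAL 1120   /* independent realizations (one block each) */

    __global__ void school_kernel(float *px, float *py,
                                  float *vx, float *vy, int nsteps)
    {
        /* Initialize this block's school with random positions/directions,
           then run nsteps of the tiled update sketched in Section 4.3. */
    }

    int main(void)
    {
        size_t sz = (size_t)N_REAL * N_FISH * sizeof(float);
        float *d_px, *d_py, *d_vx, *d_vy;
        cudaMalloc((void **)&d_px, sz); cudaMalloc((void **)&d_py, sz);
        cudaMalloc((void **)&d_vx, sz); cudaMalloc((void **)&d_vy, sz);

        /* Block i simulates realization i on slice [i*N_FISH, (i+1)*N_FISH). */
        school_kernel<<<N_REAL, N_FISH>>>(d_px, d_py, d_vx, d_vy, 3000);

        /* One batched device-to-host transfer per state array. */
        float *h_px = (float *)malloc(sz);
        cudaMemcpy(h_px, d_px, sz, cudaMemcpyDeviceToHost);
        /* ... likewise for py, vx, vy; then compute ensemble statistics ... */
        return 0;
    }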

4.3 Parallelizing Within One Realization

Before discussing parallelism within one realization, we outline the sequential algorithm. Individual fish are initialized in a bounded region with randomized positions and directions of travel. Then the simulation is executed for 3000 steps to reach a steady state. At each step, we calculate the influence on each fish from all the other fish. To calculate the influence on a fish, we compute the distance from this "goal fish" to all the other fish and record the influence coming from the different zones in a few variables. Next we compute the net influence on this "goal fish", including the noise term, based on the above data, and save it to an influence array. After computing the influences for all the fish, we update the position and direction of each fish based on the influence array for this step.
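In outline, the sequential driver has the following shape (a sketch reusing the hypothetical Fish and step_fish from the Section 2 sketch; the second buffer plays the role of the influence array):

    /* Sequential driver sketch. The "next" buffer holds the updated states
     * so that all fish are updated simultaneously from the old state. */
    void run_realization(Fish *state, Fish *next, int N, int nsteps)
    {
        for (int step = 0; step < nsteps; ++step) {      /* e.g. nsteps = 3000 */
            for (int i = 0; i < N; ++i)
                step_fish(i, state, next, N);            /* influences + update */
            Fish *tmp = state; state = next; next = tmp; /* swap buffers */
        }
    }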

For the GPU, there are some restrictions on the number of threads in each block and on the number of fish that can be stored in the shared memory. The device has very limited shared memory in each multiprocessor but relatively large global memory. The global memory adjacent to the GPU chip has much higher latency and lower bandwidth than the on-chip shared memory: it takes about 400-600 clock cycles to read or write the global memory vs. 4 clock cycles to access the shared memory. To use the GPU effectively, our simulation makes as much use of on-chip shared memory as possible. Although it is better to have at least 128 threads per block for the best efficiency, we have to take the limited shared memory size into account.

The main goal of parallelizing within one realization is to decompose the problem domain into smaller subdomains. In our implementation, within each block i, the system state vector Px_i(j), Py_i(j), Vx_i(j), Vy_i(j) is divided among threads. Assume that we have N fish and use n threads for the simulation. Then we need on the order of m = N/n time to calculate the influences on each of the fish. In the parallel implementation, we initialize all the fish with random positions and directions. At each step we load n fish, one fish per thread. The system states loaded into shared memory at this stage are the "goal fish". Each thread holds only one "goal fish" at a time and calculates the influences on it. Thus during one step, each thread processes one m-element subvector. To calculate the distance from the "goal fish" to all the other fish, we need the positions and directions of all the other fish. Using a temporary array of size n in shared memory, we load n fish at a time, with each thread loading one fish. The program keeps loading until all the required data have been loaded and used in the calculation for the goal fish. Then each thread computes the influence information for its goal fish. At the end of the simulation step, each thread writes the results into its own influence subvector. After all the influence calculations are finished, each thread updates


Figure 3. Memory layout of the fish schooling model. There are n threads simulating one realization, with thread id tid identifying the thread. In process 1, each thread loads one fish as its "goal fish" from the device memory to shared memory; in process 2, each thread loads a fish into another array in shared memory; in process 3, each thread uses the full data from process 2 to compute the influences on its own "goal fish"; processes 2 and 3 continue until all of the influences on the "goal fish" have been computed; in process 4, each thread saves its influence to the influence record array (reusing the previous array to save shared memory); in process 5, each thread updates its "goal fish" in the system state vector on the device.

its system state subvector using its influence subvector. The memory layout of the whole process is shown in Figure 3. Our implementation does not make use of concurrent updates. During each calculation, each thread reads the full system state vector; concurrent reads of the same data are supported by the hardware.
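The kernel fragment below sketches this tiling for the simple case of one goal fish per thread (n = N, as in the N = 100 schools of Section 5). Only the attraction accumulation is shown; the names and the influence representation are illustrative, not our actual code.

    #define N_FISH 100

    __global__ void school_step(const float *px, const float *py,
                                float *infl_x, float *infl_y)
    {
        __shared__ float sx[N_FISH], sy[N_FISH]; /* tile of neighbor positions */
        int base = blockIdx.x * N_FISH;          /* this block's realization */
        int tid  = threadIdx.x;

        /* Process 1: each thread keeps its own goal fish in registers. */
        float gx = px[base + tid], gy = py[base + tid];
        float ax = 0.0f, ay = 0.0f;              /* accumulated influence */

        /* Process 2: cooperatively load the school into shared memory
           (one tile here; a loop over tiles when n < N). */
        sx[tid] = gx;  sy[tid] = gy;
        __syncthreads();

        /* Process 3: each thread scans the tile for its goal fish. */
        for (int j = 0; j < N_FISH; ++j) {
            if (j == tid) continue;
            float dx = sx[j] - gx, dy = sy[j] - gy;
            float d = sqrtf(dx * dx + dy * dy);
            if (d > 0.0f && d < 7.0f) {          /* attraction zone, r_p = 7 */
                ax += dx / d;  ay += dy / d;
            }
        }

        /* Process 4: write this thread's influence record. */
        infl_x[base + tid] = ax;
        infl_y[base + tid] = ay;
        /* Process 5, the simultaneous state update, follows once all
           influences for the step are complete. */
    }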

Within each realization, most calculations use the sharedmemory, for which the memory access latency is small. Inaddition to the parallelization within one realization, wehave many blocks doing independent realizations. When-ever a thread is waiting to access the device memory, an-other thread will be running on the ALUs. Hence the mem-ory access latency is not a problem.

5 Results

Our simulations were run on an NVIDIA GeForce 8800 GTX installed in a personal computer with an Intel Pentium 3.00 GHz CPU and 3.50 GB of RAM with Physical Address Extension.

To begin a simulation, individuals are placed in a bounded region (so that each agent initially interacts with at least one other agent), with random positions and directions of travel. The parameters were fixed to be r_r = 1, r_p = 7, α = 350° (≈ 6.1 radians), s = 1, τ = 0.2, σ = 0.01, and θ = 115° (≈ 2 radians). Simulations were run for approximately 3000 steps until they reached a steady state. Groups of size N = 100 were explored for a range of ratios r of orientation to attraction weightings.

Two observables were used to measure the structure of the schools: elongation and polarization [2]. Elongation measures the ratio of the length of the axis aligned with group motion to the length of the axis perpendicular to group motion, and polarization

P(t) = (1/N) | Σ_{i=1}^{N} v̂_i(t) |

measures the degree of group alignment; it takes values between 0 and 1. To obtain statistics regarding the group structure for a given size N and ratio r, 1120 realizations (with different initial conditions) were run. The average group elongation and polarization were recorded, as well as the probability of group fragmentation.
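As a post-processing illustration, the polarization of one school can be computed from its unit direction vectors as follows (a hypothetical host-side helper, not from our code):

    #include <math.h>

    /* P(t) = (1/N) |sum_i v_i(t)| for unit direction components vx, vy. */
    float polarization(const float *vx, const float *vy, int N)
    {
        float sx = 0.0f, sy = 0.0f;
        for (int i = 0; i < N; ++i) {
            sx += vx[i];
            sy += vy[i];
        }
        return sqrtf(sx * sx + sy * sy) / (float)N;
    }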

We were able to nicely resolve the group statistics as a function of the ratio of orientation to attraction. We find that for r close to zero, groups exhibit swarm behavior, with small polarization and elongation near 1. As r is increased, groups become increasingly more aligned, forming elongated dynamically parallel groups and then highly parallel groups as r is further increased; see Figure 4. Thus, by changing the proportion of attraction to alignment tendencies, groups can shift between different collective behaviors. We also see that the probability of fragmentation first increases as the group elongation increases (for 1 < r < 2.5), then decreases as elongation decreases (for 3 < r < 8), and then increases again as schools become more highly aligned; see Figure 5.

The simulation performance in generating these results was extraordinary. The parallel (GPU) simulation is about 230-240 times faster than the corresponding sequential simulation, which was already optimized on the host computer.

6 Conclusions

We considered an individual-based model for fish schooling which incorporates a tendency for each fish to align its position and orientation with an appropriate average of its neighbors' positions and orientations, plus a tendency for each fish to avoid collisions. Stochasticity is included in the model to account for imperfections in the gathering of information and in acting on this information. For the problem of calculating statistical measures of collective motion from many realizations, we observed speedups of 230-240 times for our parallelized code running on a GPU over the corresponding sequential simulation on the host workstation. With this impressive performance improvement, in one day we can generate data which would otherwise require more than six months of computation with the sequential code.

The usefulness of individual-based models of biological systems, which can incorporate detailed experimental observations of the behavior of the organisms, is limited by the ability to simulate them in a reasonable amount of time. By exploiting the power of general-purpose GPUs as illustrated in this paper, we expect that it will be possible to simulate more detailed models with more agents for longer times and over more realizations. This GPU computational paradigm has the potential to revolutionize our ability to understand how collective behavior at the macroscopic, population level arises from the interaction of individuals with each other and their environment.

7 Acknowledgments

We thank Iain Couzin for helpful discussions related to the model, and Brent Oster for introducing us to the NVIDIA GPU.

This work was supported in part by the U.S. Department of Energy under DOE award No. DE-FG02-04ER25621, by NIH Grant EB007511, by the Institute for Collaborative Biotechnologies through grant DAAD19-03-D004 from the U.S. Army Research Office, and by National Science Foundation Grant NSF-0434328. J.M. was also supported by an Alfred P. Sloan Research Fellowship in Mathematics.

References

[1] S. Camazine, J. L. Deneubourg, N. R. Franks, J. Sneyd, G. Theraulaz, and E. Bonabeau. Self-Organization in Biological Systems. Princeton University Press, Princeton, 2003.

[2] I. D. Couzin, J. Krause, N. R. Franks, and S. A. Levin. Effective leadership and decision making in animal groups on the move. Nature, 433:513-516, 2005.

[3] I. D. Couzin, J. Krause, R. James, G. D. Ruxton, and N. R. Franks. Collective memory and spatial sorting in animal groups. J. Theor. Biol., 218:1-11, 2002.

[4] P. M. Driver and D. A. Humphries. Protean Behavior: The Biology of Unpredictability. Oxford University Press, Oxford, 1988.

[5] P. B. S. Lissaman and C. A. Shollenberger. Formation flight of birds. Science, 168:1003-1005, 1970.

[6] M. Mascagni. SPRNG: A scalable library for pseudorandom number generation. In Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, San Antonio, Texas, 1999.

[7] M. Mascagni and A. Srinivasan. SPRNG: A scalable library for pseudorandom number generation. ACM Transactions on Mathematical Software, 26:436-461, 2000.

[8] D. R. Martinez and E. Klinghammer. The behavior of the whale Orcinus orca: a review of the literature. Z. Tierpsychol., 27:828-839, 1970.

[9] R. M. May. Flight formations in geese and other birds. Nature, 282:778-780, 1979.

[10] S. R. St. J. Neill and J. M. Cullen. Experiments on whether schooling by their prey affects the hunting behavior of cephalopods and fish predators. J. Zoology, 172:549-569, 1974.

[11] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, 2007. http://developer.download.nvidia.com.

[12] NVIDIA Forums members. NVIDIA forums, 2007. http://forums.nvidia.com.

[13] B. L. Partridge. The structure and function of fish schools. Sci. Am., 245:90-99, 1982.

[14] R. P. Brent. Uniform random number generators for vector and parallel computers. Report TR-CS-92-02, 1992.

[15] A. R. E. Sinclair and M. Norton-Griffiths. Serengeti: Dynamics of an Ecosystem. University of Chicago Press, Chicago, 1979.

[16] B. P. Uvarov. Grasshoppers and Locusts. Imperial Bureau of Entomology, London, 1928.

[17] D. Weihs. Hydrodynamics of fish schooling. Nature, 241:290-291, 1973.

Figure 4. Swarm (r = 0.25), dynamic parallel (r = 2), and highly parallel (r = 16) collective motion for N = 100 member schools.

Figure 5. Statistics of group elongation (only including non-fragmented schools), group polarization (only including non-fragmented schools), and probability of group fragmentation, for N = 100 member schools. To obtain these statistics, for each value of r, 1120 simulations with different initial conditions were run for 3000 time steps, each using the GPU.