
Simulating and Visualizing Real-Time Crowds on GPU Clusters

Benjamín Hernández¹, Hugo Pérez¹,², Isaac Rudomin¹, Sergio Ruiz³, Oriam de Gyves⁴, and Leonel Toledo⁴

¹ Barcelona Supercomputing Center, Barcelona, Spain

² Universitat Politècnica de Catalunya, Barcelona, Spain

³ Tecnológico de Monterrey, Campus Ciudad de México, Mexico

⁴ Tecnológico de Monterrey, Campus Estado de México, Mexico

{benjamin.hernandez, hugo.perez, isaac.rudomin}@bsc.es, {sergio.ruiz, odegyves, ltoledo}@itesm.mx

Abstract. We present a set of algorithms for simulating and visualizing real-time crowds in GPU (Graphics Processing Unit) clusters. First, we present crowd simulation and rendering techniques that take advantage of single-GPU machines. Then, using a wandering crowd behavior simulation algorithm as an example, we explain how this kind of algorithm can be extended for use in GPU cluster environments. We also present a visualization architecture that renders the simulation results using detailed 3D virtual characters. This architecture is adaptable in order to support the Barcelona Supercomputing Center (BSC) infrastructure. The results show that our algorithms are scalable on different hardware platforms, including embedded systems, desktop GPUs, and GPU clusters, in particular the BSC’s Minotauro cluster.

Keywords. Crowd simulation, visualization, HPC, GPU-clusters, real-time, embedded systems.

1 Introduction

Crowd simulations allow safe application of the scientific method to certain subsets of crowd phenomena; they may aid in the analysis of different events related to, for example, disease propagation, building evacuations, traffic modeling, or social evolution. The use of computational models that describe the behavior of people is becoming a necessity as the population increases in urban areas. Prediction before, during, and after daily crowd events may reduce associated logistic costs.

Large-scale crowd simulations demand computational power and memory resources commonly available in HPC (High Performance Computing) platforms. They particularly benefit from in situ visualization, i.e., the concurrent execution of simulation and visualization, which itself requires significant computational power. Coupling visualization with the simulation while it is running reduces bottlenecks associated with storing, retrieving, and post-processing data in disk storage.

The aim of this paper is to show how an algorithm implementing a wandering crowd behavior (a simple behavior that is commonly used in crowd simulations) can be extended for use in GPU (Graphics Processing Unit) cluster environments. In addition, we present an adaptable visualization architecture that renders the simulation results in 3D and supports three configurations: streaming, in situ, and web.

2 Related Work

Reynolds [11] proposed the first known simulation solution for large groups of entities with emergent behavior, as an extension of particle systems. It is based on three basic rules: separation, alignment, and cohesion. These rules keep a group of boids, or bird-like objects, together while moving in a given direction and free of collisions. Such a model is an example of an individual-based or agent-based model (ABM), and is used to simulate the global behavior of a large number of interacting autonomous individuals. ABMs can be applied to different fields including biology, ecology, and economics.

Computación y Sistemas, Vol. 18, No. 4, 2014, pp. 651–664, doi: 10.13053/CyS-18-4-2060

ISSN 2007-9737

ABMs applied to crowd simulation require the processing of each individual separately; thus, they are good candidates for data-parallel processing. In recent years, simulation systems have exploited the computing power offered by the GPU, freeing the CPU from tasks that are highly parallel and based on a single instruction, multiple data (SIMD) model. Before the existence of programming models such as OpenCL or CUDA, GPUs were already used for general-purpose computing. For example, Rudomin et al. [14] used finite state machines to determine the behavior of agents, implemented using fragment shaders in GLSL (OpenGL Shading Language).

CUDA and OpenCL allowed developers to take advantage of the GPU resources using a more traditional C-language syntax. In addition, the architecture of the GPU has evolved to meet the demands of general computing as well as those of graphics. In this sense, Yilmaz et al. [21] proposed a fuzzy logic method implemented in CUDA for the simulation of crowds.

There is a large body of work in simulation and visualization of crowds, some of which uses the GPU. Since much of this is beyond the scope of this article, we refer the reader to Rudomin et al. [13] and Kapadia et al. [5] for a more comprehensive and recent discussion. Both articles discuss several advanced methods for simulating, generating, animating, and rendering crowds.

Computer clusters have also been used for crowd simulation and visualization. For example, Vigueras et al. [19] describe an architecture in which virtual world regions are distributed between different machines called Action Servers (AS), each of which controls the actions of the agents located in its region. Agents are assigned to a Client Process (CP) and maintain constant communication with their AS. Each server also maintains communication with the AS of adjacent regions to query the state of the world in the border areas. This system is complemented with Visual Client Processes (VCP) for visualization. There can be several VCPs in the system to cover different regions or subregions of the virtual world; an AS sends only the information of the agents that will be displayed by the VCP, which may receive information from one or more AS. The initial algorithms were implemented on the CPU; however, in recent versions of the architecture, the collision detection process has been migrated to the GPU, accelerating the process considerably [18].

For crowd simulations that take advantage of the GPU or an HPC cluster without programming from scratch, there are a couple of frameworks available in which a user can define the characteristics of the simulation and agents through configuration files, usually in Extensible Markup Language (XML) format. The most popular is FLAME GPU [12], which maps formal descriptions of agents into simulation code that runs on the GPU. Another example is Pandora [20], an open-source framework designed to create and execute large-scale social simulations in HPC environments. Pandora automatically deals with the details of distributing the computational load of the simulation, so the researcher does not need any additional knowledge about parallel or distributed systems.

3 Single GPU Crowd Simulation and Rendering

Interactive simulation and visualization of large and varied crowds for single-GPU systems is a very active research topic. Current trends are focused on semi-automatic crowd authoring (modeling and animation), microscopic and macroscopic behavior, and rendering. As solutions to these challenges, we have developed methods for simulating crowds of varied appearance and diverse behaviors.

In this section we present solutions for real-time crowd simulation and rendering that take advantage of single-GPU systems.



3.1 Accurate Psycho-physical Characteristics

Crowds are heterogeneous groups of people, in which each person has individual and social properties and interacts with others through different means. It is not realistic to have a crowd composed of only assertive male pedestrians around 30 years old; nevertheless, simulations do not typically include the characteristics that make pedestrians unique. We focus on the navigation of virtual agents based on accurate psycho-physical characteristics of a population. Individual properties include physical and psychological characteristics; social interaction properties include communication and group formation. Participants in a perception study were able to identify the psychological characteristics of the agents in a simulation, and experimental results show that agents with these characteristics also change the behavior of a crowd.

We define physical characteristics as the body-defining traits of a person that are visually perceptible and can generally be measured. Age, gender, height, weight, and fitness are examples of these characteristics. These physical characteristics are based on studies of real pedestrians [6, 2, 7, 10] and are used to compute the maximum speed at which a virtual agent is able to walk at any given time.

We use the term psychological characteristics to refer to the personality of a pedestrian. Personality can be understood as a set of internal motivations that make two people behave in different manners, even if they share similar physical features. We use the Eysenck 3-factor model, which categorizes the personality of an individual according to three main factors: Psychoticism, Extraversion, and Neuroticism. An individual may tend to exhibit one of these factors because of testosterone, serotonin, and dopamine, respectively. Durupinar et al. [1] and Guy et al. [3] studied similar approaches regarding the implications of adding psychological characteristics to virtual agents. Figure 1 shows a comparison of the three personalities that are achieved using this method.

Fig. 1. Psychological comparison. The agent moves through five groups in order to get to its destination. From top to bottom: aggressive agent, assertive agent, and shy agent

3.2 Goal-Oriented Behavior

We now focus on a particular technique for solving the navigation problem involving embodied agents within virtual environments. Our first observation is that an agent, while moving through an environment, solves a sequential decision problem to find a path that goes from its current configuration to a goal, constructing a set of additive rewards as it gets closer to its goal. This behavior can be described by a Markov Decision Process [15].

Markov Decision Processes (MDPs) may be solved in stages by a single GPU in real time, provided that the partition of the navigable space is coarse, so that the MDP represents general paths for groups of agents [15]. We assign multiple rewards and penalties to cells in a partitioned navigable space to control agent navigation. A finer grid partition (Figure 2) allows another algorithm to handle Local Collision Avoidance (LCA).

The LCA algorithm displaces agents toward their goal in the direction established by the MDP optimal policy ($\Pi^*$). This technique presents three simulation advantages:

Evacuation scenarios. Several exits and obstacles can be modeled, matching real scenarios for evacuation.

Scenario zonal hierarchy. Zones of different importance can be modeled, simulating rough terrain or agent preference to navigate.

Interaction. Since the MDP solution is segmented, obstacles or exits can be introduced or removed without the need to stop the simulation.
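To make the role of rewards, penalties, and the optimal policy more concrete, the following is a minimal CPU sketch in Python of value iteration over a coarse grid. It is not the GPU implementation of [15]: the function name `solve_mdp`, the deterministic four-neighbor moves, the discount factor, and the reward values are all assumptions chosen for illustration.

```python
# Hypothetical coarse-grid MDP solved by value iteration (illustrative only).
# Assumptions: deterministic moves, 4-connected cells, discount factor gamma.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def solve_mdp(rewards, exits, gamma=0.9, sweeps=100):
    """Return (value, policy) grids for a grid of per-cell rewards."""
    rows, cols = len(rewards), len(rewards[0])
    value = [[0.0] * cols for _ in range(rows)]
    policy = [[None] * cols for _ in range(rows)]
    for _ in range(sweeps):
        new_value = [row[:] for row in value]
        for r in range(rows):
            for c in range(cols):
                if (r, c) in exits:
                    # Exit cells are terminal: they keep their own reward.
                    new_value[r][c] = rewards[r][c]
                    continue
                best_move, best_q = None, float("-inf")
                for name, (dr, dc) in MOVES.items():
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < rows and 0 <= nc < cols):
                        continue  # moves leaving the grid are not allowed
                    q = rewards[r][c] + gamma * value[nr][nc]
                    if q > best_q:
                        best_move, best_q = name, q
                new_value[r][c] = best_q
                policy[r][c] = best_move
        value = new_value
    return value, policy

# A 3x3 space with one exit (reward 10) and one penalized cell (-10).
rewards = [[-1.0, -1.0, -1.0],
           [-1.0, -10.0, -1.0],
           [-1.0, -1.0, 10.0]]
value, policy = solve_mdp(rewards, exits={(2, 2)})
```

In this toy grid the policy steers agents around the penalized center cell toward the exit; adding a second exit or changing a penalty only requires editing `rewards` and `exits` and solving again.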

3.3 Hierarchical Structures for Level of Detail for Varied Animated Crowds

We introduce a level of detail system for varied animated crowds, which is capable of handling several thousand varied animated characters at interactive frame rates. This is accomplished by using two complementary structures to reduce memory consumption and optimize the rendering stage. The first structure is a skeleton with associated octrees, which are used for computing geometry and animation level of detail. The second structure is a tiling mechanism with a quad-tree built on top of it, which is used for further level of detail optimization, allowing us to combine geometry from different characters in parts of the scene that are far away from the camera. The combination of these structures allows us to render several thousand varied characters within a crowd; for instance, crowds of up to a quarter million characters are achievable at interactive frame rates. Further details on the technique can be found in Toledo et al. [17].

4 Cluster Crowd Simulation

4.1 Algorithm Overview

In this section we present a simplified algorithm to simulate reactive behavior, i.e., crowds of agents wandering in a virtual environment while avoiding collisions with each other. This is a conceptually simple behavior; however, in non-optimal conditions, its complexity is $O(N^2)$ (where $N$ is the number of agents in the crowd), i.e., each agent position must be compared with all the remaining positions to detect collisions. Current techniques reduce the complexity of the algorithm using geometrical or hierarchical approaches [5, 13]. We reduce the complexity of the algorithm by partitioning the navigable space into a grid, which cuts down the search space. In addition, defining a search radius for each agent reduces this search space even more. This approach follows [11]; however, we increase the performance of the algorithm by adopting a parallel programming model based on GPUs and multiple nodes (Sec. 4.2).

Fig. 2. Navigation and LCA. Top: MDP partition with one exit. A coarse partition provides navigation for agents. Navigation is modeled with an MDP. Middle: MDP partition with two exits. Several rewards and penalties can be modeled. Bottom: LCA partition. Another parallel algorithm handles LCA using the MDP policy

The algorithm was implemented in CUDA and uses three data arrays that store id, agent, and world information. The id array stores a unique identifier for each agent. The agent array stores the x, y position in the world, the direction of motion (stored as an angle), and the speed at which the agent is moving. The world array stores zeros for unoccupied cells and the agent id for occupied cells.
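The three arrays can be pictured with a small host-side sketch in Python. This is only an illustration of the layout described above, not the CUDA code: the function name `init_world`, the use of nested lists, the restriction of headings to multiples of 45 degrees, and the speed range are assumptions.

```python
import random

def init_world(width, height, n_agents, seed=0):
    """Build hypothetical id, agent, and world arrays with random placement."""
    rng = random.Random(seed)
    ids = list(range(1, n_agents + 1))            # ids start at 1; 0 marks a free cell
    world = [[0] * width for _ in range(height)]  # world[y][x] holds 0 or an agent id
    agents = []                                   # one [x, y, angle, speed] row per agent
    all_cells = [(x, y) for y in range(height) for x in range(width)]
    for agent_id, (x, y) in zip(ids, rng.sample(all_cells, n_agents)):
        angle = rng.choice(range(0, 360, 45))     # assumed: one of eight headings
        speed = rng.randint(1, 3)                 # assumed: cells moved per step
        agents.append([x, y, angle, speed])
        world[y][x] = agent_id
    return ids, agents, world

ids, agents, world = init_world(width=16, height=16, n_agents=20)
```

Keeping the occupancy grid separate from the per-agent records means a collision test is a single lookup in `world` instead of a scan over all agents.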

The agents are initially placed in the world at random locations with random direction and speed. Then the simulation works as follows. Each agent is moved in the direction and with the speed stored in its representation, unless this would take the agent to a cell that is occupied. Collision avoidance is calculated by evaluating a radius of five positions around the agent’s current position; the agent then moves in the direction with the fewest agents (i.e., the most unoccupied cells) within that radius. The evaluation starts with the agent’s original direction and proceeds counter-clockwise in 45-degree steps, covering a total of eight directions (Figure 3, left).

Fig. 3. Left: Evaluation radius and directions for collision avoidance. Right: Path evaluation example

Notice that if there is another agent in the evaluated path with the same direction as the current agent, its cell is considered a free cell, since there would be no collision. For example, Figure 3 (right) shows the case of the agent at position (8,6). The first direction evaluated is upwards because it is the agent’s original direction; after four cells there is another agent, but it is going in the same direction and, therefore, the cell is considered free. There are other directions where we would also find empty cells within a radius of five positions, for example, in the down direction, but we give priority to the agent’s original direction. In the current implementation, agents choose the best cell at every iteration, but the original angle of the agent is not changed, so in the next iteration agents will try to move in the original direction. When an agent reaches the limit of the world, it rotates 180 degrees; for example, the agent at position (15,9) would return along the same path.
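The direction search described above can be sketched sequentially in Python as follows. This is a simplified illustration, not the CUDA kernel: the names `blockers_along` and `choose_direction`, the `headings` lookup table, the convention that y grows upwards, and the choice to stop a ray at the world boundary are assumptions.

```python
# Unit steps for the eight headings, counter-clockwise from "right" (0 degrees).
DIRS = {0: (1, 0), 45: (1, 1), 90: (0, 1), 135: (-1, 1),
        180: (-1, 0), 225: (-1, -1), 270: (0, -1), 315: (1, -1)}
RADIUS = 5  # search radius in cells, as in the text

def blockers_along(world, headings, x, y, angle, own_angle):
    """Count agents within RADIUS cells along `angle` that could collide."""
    dx, dy = DIRS[angle]
    count = 0
    for step in range(1, RADIUS + 1):
        cx, cy = x + dx * step, y + dy * step
        if not (0 <= cy < len(world) and 0 <= cx < len(world[0])):
            break  # assumed: cells outside the world are simply not inspected
        other = world[cy][cx]
        # An agent heading the same way as the current agent counts as free.
        if other != 0 and headings[other] != own_angle:
            count += 1
    return count

def choose_direction(world, headings, x, y, own_angle):
    """Pick the heading with the fewest blockers, preferring the original one."""
    best_angle, best_count = None, None
    for turn in range(8):  # original heading first, then counter-clockwise
        angle = (own_angle + 45 * turn) % 360
        count = blockers_along(world, headings, x, y, angle, own_angle)
        if best_count is None or count < best_count:
            best_angle, best_count = angle, count
    return best_angle

# Agent 1 at (8, 6) heading up; agent 2 four cells ahead, also heading up.
world = [[0] * 16 for _ in range(16)]
world[6][8], world[10][8] = 1, 2
headings = {1: 90, 2: 90}
chosen = choose_direction(world, headings, 8, 6, 90)
```

Because ties are resolved in evaluation order, the original heading wins whenever it is no worse than the alternatives, which mirrors the priority rule stated in the text.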

The main computation is updating the status of the world and the agents, and this task is performed on the GPU using CUDA. Notice that the data arrays are uploaded to the GPU memory at the beginning of the simulation. In addition, by defining a search radius and a set of search directions for each agent, our algorithm reduces the original complexity of collision avoidance from $O(N^2)$ to $O(N \times r \times d)$ in the worst case, where $N$ is the number of agents, $r \ll N$ is the search radius, and $d \ll N$ is the number of search directions. The best case occurs when the agent finds another agent just in front of it, i.e., $r = 1$ and $d = 1$, which yields $O(N)$ complexity, and the worst case occurs when the agent is surrounded by other agents.
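As a rough worked example of this bound, take the radius of five cells and the eight directions used in this section, together with an assumed crowd of $N = 1.6 \times 10^7$ agents (a value chosen only for illustration):

```latex
\begin{align*}
N^2 &= (1.6 \times 10^{7})^2 = 2.56 \times 10^{14} \\
N \times r \times d &= 1.6 \times 10^{7} \times 5 \times 8 = 6.4 \times 10^{8} \\
\frac{N^2}{N \times r \times d} &= \frac{N}{r \times d} = 4 \times 10^{5}
\end{align*}
```

Under these assumptions, the worst-case number of cell inspections is about five orders of magnitude smaller than the number of pairwise comparisons.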

4.1.1 Results

Three experiments were designed to verify the performance of the algorithm on different platforms, consisting of an embedded system and four different GPUs. The embedded system was a Jetson TK1 development kit with 192 CUDA cores and 1.8 GB VRAM, and the GPUs were a GT 540M with 96 CUDA cores and 2 GB VRAM, a Tesla M2090 with 512 CUDA cores and 6 GB VRAM, a GTX TITAN Black with 2880 CUDA cores and 6 GB VRAM, and a Tesla K40c with 2880 CUDA cores and 12 GB VRAM.

All tests were performed running 30 iterations and using a world size of $16384 \times 16384$. The reported time corresponds to the average time per simulation step, given in seconds.

The first experiment consisted of determining the speedup of the simulation when GPU processing is present. The sequential version of the algorithm, running on one CPU core of a six-core Intel Xeon E5649 at 2.53 GHz with 24 GB RAM, is used as a reference. Results of this test are reported in Figure 4.

Fig. 4. Speedup obtained in the CUDA version of the algorithm

The second test consisted of evaluating the algorithm’s performance on the different hardware platforms described earlier. Figure 5 shows the results of this test. Taking 30 ms as the threshold response time for interactive systems, the results show that the Jetson TK1 cannot respond in time even with the minimum number of agents, which was $512^2$ in this test; the GT 540M can simulate around 500 thousand agents; and the Tesla M2090, GTX TITAN Black, and Tesla K40c can simulate almost a million agents.

Notice that in these experiments we simulate up to 16 million agents because most of the GPUs were able to allocate that amount of data, except the Jetson TK1, which supported up to eight million agents.

The third experiment allowed us to establish the performance of each platform when reaching its peak load (Figure 6).

Fig. 5. Performance comparison in different GPUs

Fig. 6. Peak load of the embedded system and GPUs used in the experiments

4.2 Simulating Crowds in Minotauro Cluster using MPI and CUDA

As found in the experiments in Section 4.1.1, the number of agents to compute in one node is limited by the memory available in the GPU. In particular, this limit is 6 GB for the Minotauro cluster (Nvidia Tesla M2090 GPU); thus, to tackle larger problems we need to use multiple nodes.

Multiple-node programming requires memory management, communication, and synchronization techniques to avoid communication overheads; the system architecture becomes complex because of the processes inherent to the programming model.

In order to extend the algorithm from Section 4.1 to multiple nodes, we implemented a tiling technique, i.e., the world is tiled, and each tile is computed by a different node. In this case, CUDA is used for computing the agents’ new positions, and MPI (Message Passing Interface) is used for data interchange between nodes.

We designed the system to use a number of nodes that is a perfect square, which allows us to divide the world into an equal number of rows and columns (Figure 7). The minimum number of nodes that our program can use to execute the simulation is four. The world can be any size, but large sizes are to be expected given the large number of agents that can be simulated.

Fig. 7. Example of a simulation requiring 9 nodes: the world is tiled into 9 zones, i.e., 3 rows and 3 columns

Regarding the data structures, we modify the id array to store the zone in which each agent is located. Notice that, to save memory, each node stores just the tile of the world that corresponds to it, using an offset along the x and y axes to calculate the global coordinates.
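The relation between zones, local tile coordinates, and global coordinates can be sketched in Python as follows. The function names, the row-by-row zone numbering starting at 1, and the requirement that the world size be divisible by the number of tiles per side are assumptions made for this illustration.

```python
import math

def tile_layout(world_size, n_nodes):
    """Return (tiles per side, tile size) for a square world and n_nodes tiles."""
    side = math.isqrt(n_nodes)
    if side * side != n_nodes:
        raise ValueError("number of nodes must be a perfect square")
    if world_size % side != 0:
        raise ValueError("assumed: world size divisible by tiles per side")
    return side, world_size // side

def zone_of(gx, gy, world_size, n_nodes):
    """Map a global cell to (zone, local x, local y)."""
    side, tile = tile_layout(world_size, n_nodes)
    col, row = gx // tile, gy // tile
    zone = row * side + col + 1          # zones numbered 1..n, row by row (assumed)
    return zone, gx - col * tile, gy - row * tile

def to_global(zone, lx, ly, world_size, n_nodes):
    """Recover global coordinates from a zone and local tile coordinates."""
    side, tile = tile_layout(world_size, n_nodes)
    row, col = divmod(zone - 1, side)
    return col * tile + lx, row * tile + ly   # tile offset plus local position

zone, lx, ly = zone_of(10, 9, world_size=18, n_nodes=9)
```

With an 18-cell-wide world split among nine nodes, the cell (10, 9) falls in the central zone; `to_global` applies the x and y offsets in the opposite direction.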

On the other hand, the collision avoidance algorithm is modified when the agents are on the borders of the tile. We explain this with an example. Considering a vision radius of one cell, Figure 8 shows that the agent located at coordinates (10, 9) in zone 5 needs to know if the cell at coordinates (11, 9) is empty: it is necessary to exchange the information of the agents near the borders in order to determine the next position of such agents.

Fig. 8. Interchange of agent information in the area bordering two nodes

Therefore, the area considered as the tile’s border is the radius times the height of the tile for left and right neighbors, and the radius times the width of the tile for up and down neighbors.

Communications are made more efficient by exchanging only the cells that are occupied at the borders, using a two-dimensional integer array that stores the coordinates of a cell in the world and the identifier of the agent occupying the cell. On the other hand, as computation is performed by several nodes, we reduce the complexity of the algorithm presented in Section 4.1 to $O\left(\frac{N \times r \times d}{n}\right)$, where $N$, $r$, and $d$ remain as described in Section 4.1, and $n$ is the number of compute nodes. However, this complexity only takes into account the ideal case in which nodes do not interchange data. Considering that we use point-to-point communication between $n$ nodes and send messages of size $m$, the complexity is $O\left(\frac{N \times r \times d}{n}\right) + O(nm)$ or $O\left(\frac{N \times r \times d}{n} + nm\right)$, with $n \ll N$ and $m \ll N$.
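The border exchange described above can be pictured with a short Python sketch that collects only the occupied cells of one border strip. The function name `pack_border`, the side labels, and the `offset` parameter are hypothetical; the actual implementation sends such data between nodes with MPI, which is omitted here.

```python
def pack_border(tile, radius, side, offset=(0, 0)):
    """Return [x, y, agent_id] rows for occupied cells within `radius` of one edge.

    `tile[y][x]` holds 0 for a free cell or an agent id; `offset` is the tile's
    position in the world, so the returned coordinates are global.
    """
    height, width = len(tile), len(tile[0])
    if side == "left":      # strip of radius x height cells
        cells = [(x, y) for y in range(height) for x in range(radius)]
    elif side == "right":
        cells = [(x, y) for y in range(height) for x in range(width - radius, width)]
    elif side == "up":      # strip of radius x width cells
        cells = [(x, y) for y in range(radius) for x in range(width)]
    elif side == "down":
        cells = [(x, y) for y in range(height - radius, height) for x in range(width)]
    else:
        raise ValueError(f"unknown side: {side}")
    ox, oy = offset
    return [[x + ox, y + oy, tile[y][x]] for x, y in cells if tile[y][x] != 0]

tile = [[0, 9, 0, 0],
        [0, 0, 0, 7],
        [4, 0, 0, 0],
        [0, 0, 0, 0]]
message = pack_border(tile, radius=1, side="right", offset=(4, 0))
```

The length of the returned list plays the role of the message size $m$ in the complexity expression: an empty border produces an empty message.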

A worst case for the communication complexity may occur when a node communicates with its four neighbors, e.g., node five in Figure 7; an example of a best case is node three, which only interchanges data with nodes six and two (Figure 7).
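The neighbor relation behind these two cases can be checked with a few lines of Python. The function name `neighbors` and the row-by-row numbering of nodes starting at 1 are assumptions, chosen to agree with the node numbers mentioned above.

```python
import math

def neighbors(node, n_nodes):
    """List the up, left, right, and down neighbors of a node in a square grid."""
    side = math.isqrt(n_nodes)             # nodes per row; n_nodes assumed a perfect square
    row, col = divmod(node - 1, side)
    result = []
    for dr, dc in ((-1, 0), (0, -1), (0, 1), (1, 0)):
        r, c = row + dr, col + dc
        if 0 <= r < side and 0 <= c < side:
            result.append(r * side + c + 1)
    return result

center = neighbors(5, 9)   # four neighbors: the worst case
corner = neighbors(3, 9)   # two neighbors: a best case
```

Edge nodes that are not corners have three neighbors, so their communication cost falls between these two cases.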

4.2.1 Results

Two experiments were performed in order to verify the efficiency of the algorithm using multiple nodes. In both experiments, the world size was $27852^2$ and we ran three iterations of the algorithm.

The first experiment consisted of a performance comparison between the CPU and GPU (CUDA) versions of the algorithm using 4, 9, and 16 nodes (Figure 9). Although the GPU version is better than the CPU version, as expected, the difference is not as large as in the single-node case. The execution time of the GPU version increases dramatically due to the numerous internode communications and, in particular, because in order to transfer data between GPUs we must copy data from the GPU memory to the CPU memory, then transfer data between nodes, and then copy from the CPU memory to the GPU memory. The execution time of the CPU version has improved compared to the single-node sequential version.

Fig. 9. Performance comparison between CPU and GPU versions using 4, 9, and 16 nodes

The second test was to determine simulation scaling using multiple nodes (Figure 10). In the current implementation it is possible to simulate up to 289 million agents with 4 nodes and 529 million agents with 9 nodes.

The response time for the CUDA multiple-node version increases up to 5 seconds for 64 million agents, which is acceptable considering the number of agents. On the other hand, the CPU multiple-node version improves considerably with respect to the single-node version, but the GPU multiple-node version is still at least 3 times faster.

Fig. 10. Scaling

5 GPU Cluster Crowd Visualization

Recent advances in HPC hardware and software technologies have allowed the simulation of large-scale problems. Computational solutions produce a vast amount of data, which requires further processing to offer insights about the results. Such information is usually simplified before analysis; however, simplifications may hide relevant details.

On the other hand, visualization, which transforms these results into graphical and color representations, can improve human cognition when analyzing data or results. In addition, in situ visualization reduces I/O operations [8, 22], i.e., it avoids a full data transfer of results; it also saves time, allowing the researcher to inspect partial simulation results and adapt the simulation as required.

The selection of a visualization mechanism depends on the simulation and the available HPC infrastructure; our approach for visualization can render detailed 3D crowds. The term ‘detailed’ refers to the fact that the crowd is rendered using animated characters with different geometrical and visual appearance. In addition, we have developed three visualization modes, described in the following paragraphs, that are suitable for the particular infrastructure of the BSC.

Figure 11 shows the crowd engine’s modules designed to visualize detailed 3D crowds. The GoD, or generation of diversity, stage runs as a pre-process. This stage generates and animates 3D characters semi-automatically. At run time, simulation results are stored in an array of ids and agent positions (Section 4), which are used to render unique animated characters. Details of this stage are reported in [16]. Discrete level of detail (LoD) and view frustum culling (VFC) are implemented on the GPU to visualize several hundreds of characters in real time [4]. As a result, our engine generates frames that are processed or transmitted according to different engine configurations.

Fig. 11. General diagram of our Crowd Engine. GoD (generation of diversity) generates and animates geometrically and visually diverse characters, which are then rendered using level of detail (LoD) and view frustum culling (VFC) according to the simulation results

5.1 Streaming Mode

As mentioned earlier, visualization mechanisms depend on the simulation and the available HPC infrastructure. In the case of the Barcelona Supercomputing Center, the Minotauro cluster has a fast network connection within the BSC’s LAN, which allows visualization at interactive rates. The default set-up uses VirtualGL¹ for image transport. The bandwidth between the user and the cluster can help or severely impair an interactive user experience. Unfortunately, external bandwidth into the BSC is restricted, resulting in a visualization frame rate of about 1-2 fps at a resolution of $512^2$ pixels using VirtualGL’s JPEG compression.

Accessing simulation results for visualization from any computer is one of our goals. It is desirable to enable researchers to access their results for visualization and analysis at a distance, and such access may occur during or after simulation. A second goal is to couple our crowd engine to the existing BSC Pandora framework [20] without major modifications to either platform.

¹ http://www.virtualgl.org/

Fig. 12. Our crowd engine can be configured in the streaming mode to visualize simulations or results being calculated or stored in remote resources

In order to achieve both goals, we decided to execute the simulation (Section 4) or access the simulation data from a remote source and stream out the results to a visualization server or a workstation. Figure 12 shows a general diagram of this approach. A remote resource sends crowd simulation results (previously generated or calculated at run time) to a visualization server, which does the crowd rendering.

For the second case, which requires remote resource access, an HDF5 file format parser was designed to retrieve the data. For test purposes, the OSC protocol is used to stream results in both cases.

Notice that the bandwidth required by this mode is lower than that needed for transferring images: it depends on the number of agents and not on the size or quality of the image. This mode handles fewer agents, but it works over slower networks and gives good-quality results.
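A back-of-the-envelope comparison in Python illustrates why streaming agent data can be cheaper than streaming images. The payload layout (a 4-byte id plus two 4-byte coordinates per agent) and the uncompressed RGB frame are assumptions; the actual OSC messages and JPEG-compressed frames will have different sizes.

```python
def agent_stream_bytes(n_agents, bytes_per_agent=12):
    """Bytes per update, assuming a 4-byte id and two 4-byte coordinates per agent."""
    return n_agents * bytes_per_agent

def raw_frame_bytes(width, height, bytes_per_pixel=3):
    """Bytes per frame, assuming an uncompressed RGB image."""
    return width * height * bytes_per_pixel

positions = agent_stream_bytes(4096)   # 49152 bytes per update
frame = raw_frame_bytes(512, 512)      # 786432 bytes per frame
ratio = frame / positions              # 16.0
```

The agent payload grows with the crowd size, whereas the frame size grows with the resolution, which is why this mode favors moderate crowds and slower networks.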

5.2 In Situ Mode

The in situ visualization configuration uses the algorithm described in Section 4.2 for simulation on the Minotauro GPU cluster. Each node executes both the simulation and the visualization of one tile of the world (Figure 13). The crowd engine was modified to support MPI communications and off-screen rendering.




Fig. 13. In situ configuration allows crowd simulation and visualization within different nodes. Each participating node sends its rendered frames to the master node. The master node composes each received frame to generate the final image

MPI communications enable the engine to assign simulation/rendering work to each participating node, share simulation results between nodes (Section 4), and transmit partial renderings for final composition. Off-screen rendering enables each node to capture a visualization frame with color and depth information encoded in an RGBA array, where the depth component is stored in the alpha channel. This information is then downloaded to RAM and transferred to the master node.

Once the master node receives the color plus depth images from each node, it generates the final composite. This image is generated using a sort-last depth compositing algorithm [9] implemented in GLSL. Finally, the composite image can be sent to the client through VirtualGL.
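The idea of depth compositing can be illustrated on the CPU with a few lines of Python: for every pixel, the color with the smallest depth among all partial frames is kept. This is a simplified sketch and not the GLSL implementation; the function name `composite`, the nested-list image layout, and the convention that smaller depth values are closer to the camera are assumptions.

```python
def composite(frames):
    """Merge partial frames pixel by pixel, keeping the nearest fragment.

    Each frame is a list of rows of (r, g, b, depth) tuples, with the depth
    stored where the alpha channel would be; smaller depth means closer.
    """
    height, width = len(frames[0]), len(frames[0][0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            nearest = min((frame[y][x] for frame in frames), key=lambda p: p[3])
            row.append(nearest)
        out.append(row)
    return out

# Two 1x2 partial renderings from different nodes.
frame_a = [[(255, 0, 0, 0.2), (0, 0, 0, 1.0)]]
frame_b = [[(0, 255, 0, 0.5), (0, 0, 255, 0.7)]]
final = composite([frame_a, frame_b])
```

Since the result of the per-pixel minimum does not depend on the order of the frames, the master node can compose them in whatever order they arrive.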

Notice that, unlike in the streaming mode, the client does not require advanced rendering resources, but it needs to be connected to the BSC’s LAN.

5.3 Web Mode

A particular requirement of our streaming architecture (Section 5.1) is the use of a workstation or visualization server with a latest-generation GPU. On the other hand, in situ visualization requires a direct connection between the user and the cluster.

We are designing a third option in order to fill the gap between the streaming and in situ modes. The web mode, currently in the prototype stage, is designed to allow the researcher to access visualization results on any device and away from the desktop.

Fig. 14. Web mode allows visualization results to be displayed in web browsers

Inspired by cloud gaming technology, this configuration captures the results from visualization and streams them out to a web browser (Figure 14). It follows the client/server architecture: the server executes the simulation, visualization, and streaming, and the client displays the visualization frames and captures user interaction events, which are sent back to the server.

5.4 Results

Streaming mode tests were performed using two options. In the first option, a remote resource streams a simulation of a wandering crowd behavior previously generated in the Pandora framework. The remote resource uses a peer-to-peer connection to a workstation with an Nvidia TITAN graphics card. In this case we obtained a consistent frame rate of 60 fps for a crowd of 4096 characters.

The second test consisted of setting up a reverse tunneling connection between the Minotauro cluster and the workstation. The simulation is executed in the cluster and results are streamed out to the workstation. In this case we were also able to render 4096 characters at 60 fps.

It is important to mention that our streaming algorithm does not implement any buffering technique. However, our crowd engine interpolates the received character positions between frames to avoid artifacts due to possible data loss.
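One simple way to picture this interpolation is a linear blend between the last two received snapshots, sketched below in Python. The text does not specify the interpolation scheme, so the linear blend, the function name `interpolate_positions`, and the dictionary-based snapshots are assumptions.

```python
def interpolate_positions(prev, curr, t):
    """Linearly blend two position snapshots keyed by agent id; t is in [0, 1]."""
    blended = {}
    for agent_id, (x1, y1) in curr.items():
        # Agents missing from the previous snapshot start at their current position.
        x0, y0 = prev.get(agent_id, (x1, y1))
        blended[agent_id] = (x0 + (x1 - x0) * t, y0 + (y1 - y0) * t)
    return blended

prev = {1: (0.0, 0.0)}
curr = {1: (2.0, 4.0), 2: (5.0, 5.0)}
halfway = interpolate_positions(prev, curr, 0.5)
```

Rendering intermediate frames with increasing values of `t` lets characters move smoothly even when an update arrives late or is lost.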

In the case of in situ visualization, the test was performed on the Minotauro cluster using five MPI processes. Four processes each simulated and rendered 4096 characters, and the fifth one performed image composition (Figure 15a). Figure 15b shows a close-up view of the simulation. We obtained frame rates between 15 and 20 fps for a simulation of 16K characters.

Fig. 15. In situ mode results: a) far distance view; b) close-up view. Four nodes are used; each renders the crowd according to its assigned world tile

Further tests, which consisted of running the crowd engine on one node (Nvidia Tesla M2090), showed that it can render up to 8K characters at 10-12 fps, while the workstation with an Nvidia TITAN graphics card can render up to 32K characters at 16-20 fps.

Finally, to test the web prototype, we designed two experiments. The first one consisted of running the prototype on the same machine using different web browsers to detect potential compatibility problems (Figure 16). Our results showed that the Chrome, Firefox, and Opera browsers were able to display the visualization successfully. In addition, no noticeable lag was found between the user interaction and the simulation.

Fig. 16. Results of the web mode prototype. Left: OpenGL/glut window. Right: Visualization is displayed in Chrome browser

In the second experiment, we made a peer-to-peer connection between the workstation and a laptop, and between the workstation and an Android-based mobile device. Some lag between the simulation and the visualization in the browser was noticeable. This is because the current prototype does not yet implement any compression mechanism; however, this configuration allows us to inspect visualization results on any device.

6 Conclusions

In the embedded and single-GPU systems, our algorithm scales almost linearly according to its algorithmic complexity. It reaches up to 302 million agents in a single-GPU system. A result that attracted our attention was the performance of the Jetson TK1. Despite having 192 CUDA cores, the Jetson TK1 performed worse than the GT 540M, which has 96 CUDA cores. This is because the Jetson TK1 has a clock rate of 852 MHz and a memory bus width of 64 bits, while the GT 540M has a clock rate of 1344 MHz and a memory bus width of 128 bits. Nevertheless, the Jetson TK1 performance is up to three times better than that of the sequential version. It is also important to notice that the architecture of the GPUs allows better speedups when processing larger numbers of agents.
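The figures quoted above can be compared directly with a little arithmetic. The sketch below uses only the core counts, clock rates, and bus widths given in the text. The product of cores and clock is a crude proxy for peak throughput, not a measured quantity, and the variable names are illustrative.

```python
# Specifications as quoted in the text.
jetson_tk1 = {"cores": 192, "clock_mhz": 852, "bus_bits": 64}
gt_540m = {"cores": 96, "clock_mhz": 1344, "bus_bits": 128}


def ratio(key):
    """Ratio GT 540M / Jetson TK1 for one specification."""
    return gt_540m[key] / jetson_tk1[key]


clock_ratio = ratio("clock_mhz")  # about 1.58x higher clock on the GT 540M
bus_ratio = ratio("bus_bits")     # 2x wider memory bus on the GT 540M
core_ratio = ratio("cores")       # 0.5x: half as many cores

# Crude proxy: cores x clock (ignores memory traffic entirely).
compute_proxy_ratio = core_ratio * clock_ratio  # about 0.79
```

On this proxy the higher clock alone does not offset the smaller core count of the GT 540M, so the wider memory bus plausibly matters most for a memory-bound kernel. This is an inference from the quoted figures rather than a measurement.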

The performance of the other GPUs was as expected.

On the other hand, the cluster version of the algorithm scales up to 529 million agents with 9 nodes. It is important to mention that communication overhead is reduced by overlapping computation with communication and by exchanging only the occupied positions in the borders between neighbors. We also found that scalability is worse for the GPU-accelerated application than for the CPU application, because the benefit of GPU acceleration was quickly dominated by the communication time: in each iteration we exchange agents and border areas with neighbor nodes. We are working to improve the communication processes.
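The idea of exchanging only occupied border positions can be illustrated with a small sketch. It assumes that the world tile owned by a node is a 2D grid in which 0 marks an empty cell and any other value is an agent id. The function `pack_border` and the message layout are hypothetical, and the actual MPI transfer and its overlap with computation are omitted.

```python
def pack_border(tile, side):
    """Return (index, agent_id) pairs for the occupied cells on one border.

    tile: list of rows; 0 marks an empty cell, any other value is an agent id.
    side: "left", "right", "top", or "bottom".
    """
    if side == "top":
        border = tile[0]
    elif side == "bottom":
        border = tile[-1]
    elif side == "left":
        border = [row[0] for row in tile]
    elif side == "right":
        border = [row[-1] for row in tile]
    else:
        raise ValueError(f"unknown side: {side}")
    # Only occupied cells are sent, so a sparse border yields a short message.
    return [(i, agent) for i, agent in enumerate(border) if agent != 0]


tile = [
    [0, 0, 7, 0],
    [3, 0, 0, 0],
    [0, 0, 0, 9],
    [0, 5, 0, 0],
]
message_right = pack_border(tile, "right")  # what the right-hand neighbor needs
message_top = pack_border(tile, "top")
```

With this layout the message size depends on how many agents stand on the border rather than on the border length, which is the saving the paragraph above describes.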

Regarding visualization, we have designed a configurable crowd engine that supports three basic modes: streaming, in situ, and web. The streaming mechanism supports crowd visualization on machines outside the BSC's LAN; in addition, the web mode allows us to display large-scale simulations on any device. Both configurations offer flexibility and may reduce infrastructure costs when advanced visualization systems such as the CAVE or Tiled Display Walls are remotely available.

Streaming mode also avoids the difficulty of coupling already existing simulation tools, and the web mode can be used to take advantage of sensors available in mobile devices. On the other hand, our in situ visualization approach for crowd rendering is designed to take advantage of the GPUs available in the Minotauro supercomputer. Additional performance is expected when two GPUs are available per node, so that the simulation runs on one GPU and crowd rendering on the other.

7 Future Work

As future work, we plan to use recent communication techniques such as CUDA-aware MPI or GPUDirect, which we expect to improve the communication process. On the other hand, the dynamic nature of crowd simulation makes it quickly prone to load imbalance, so we are adopting OmpSs, a programming model developed at the Barcelona Supercomputing Center, to tackle this problem.

The visualization architecture also needs some improvements. For example, the streaming mode can be used to couple the rendering engine with additional crowd simulation frameworks; the web mode requires mechanisms for adaptive compression and buffering. In addition, an interesting opportunity that the web configuration offers is the use of sensors available in mobile devices (e.g., touch screen, accelerometers, and gyroscopes) for user interaction. A mobile device can also be used as a remote control in cases where visualization is displayed in CAVE systems or Tiled Display Walls.

Finally, detailed rendering demands more resources than simulation. In the current in situ architecture, the number of simulated agents depends on how many of them can be rendered; in other words, we do not simulate more agents than those that can be visualized per node. Alternative architectures for in situ visualization will use some nodes for simulation and others for visualization. Another important modification that we are considering is the reduction of communications between slave-master nodes: instead of performing the final composition in the master, partial compositions can be done in nodes selected based on the virtual camera. Then each partial composite can be transmitted to the nodes near the camera to perform additional compositions and, finally, the master will do the global composition of the remaining partial composites. This modification will be necessary for exascale systems, which would be composed of thousands of compute nodes.
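The hierarchical composition proposed above can be sketched as a tree reduction. This illustration rests on several assumptions: `combine` stands in for whatever compositing operator is used, the grouping follows list order rather than distance to the virtual camera, and `fan_in` is a hypothetical parameter.

```python
def tree_composite(partials, combine, fan_in=2):
    """Reduce partial composites level by level.

    Returns the final composite and the number of inputs that the last
    (master) step had to merge.
    """
    if not partials:
        raise ValueError("no partial composites to merge")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(partials)
    while len(level) > fan_in:
        # Each group is merged on an intermediate node, not on the master.
        level = [
            combine(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return combine(level), len(level)


# Stand-in operator: a "composite" is the set of tile ids it covers.
def union(parts):
    merged = set()
    for p in parts:
        merged |= p
    return merged


tiles = [{i} for i in range(8)]  # eight hypothetical render nodes
image, master_inputs = tree_composite(tiles, union)
```

With a direct scheme the master would merge all eight images; in this sketch it merges only two, and the remaining merges are spread across intermediate nodes.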

Acknowledgements

The authors would like to thank NVIDIA through the CUDA Center of Excellence program for partial support of this project. This work has also been partially funded by CONACyT-BSC postdoctoral fellowship, CONACyT SNI-54067, CONACyT CVU-375247, and CONACyT CVU-311633.




References

1. Durupinar, F., Allbeck, J., Pelechano, N., &Badler, N. I. (2008). Creating Crowd Variation withthe OCEAN Personality Model. International jointconference on Autonomous agents and multiagentsystems, AAMAS, pp. 1217–1220.

2. Fugger, T., Randles, B., Stein, A., Whiting, W.,& Gallagher, B. (2000). Analysis of PedestrianGait and Perception-Reaction at Signal-ControlledCrosswalk Intersections. Transportation ResearchRecord, Vol. 1705, No. 1, pp. 20–25.

3. Guy, S. J., Kim, S., Lin, M. C., & Manocha, D.(2011). Simulating heterogeneous crowd behaviorsusing personality trait theory. Symposium on Com-puter Animation, pp. 43.

4. Hernandez, B. & Rudomin, I. (2011). A renderingpipeline for real time crowds. GPU PRO 2. AKPeters.

5. Kappadia, M., Pelechano, N., Guy, S., Allbeck, J.,& Chrysanthou, Y. (2014). Simulating heteroge-neous crowds with interactive behaviors. EG 2014 -Tutorials.

6. Knoblauch, R., Pietrucha, M., & Nitzburg, M.(1996). Field Studies of Pedestrian Walking Speedand Start-Up Time. Transportation ResearchRecord, Vol. 1538, No. 1, pp. 27–38.

7. Laplante, J. N. & Kaeser, T. P. (2004). The Con-tinuing Evolution of Pedestrian Walking Speed As-sumptions. ITE Journal, Vol. 74, No. 9, pp. 32–40.

8. Ma, K.-L. (2009). In situ visualization at extremescale: Challenges and opportunities. ComputerGraphics and Applications, IEEE, Vol. 29, No. 6,pp. 14–19.

9. Molnar, S., Cox, M., Ellsworth, D., & Fuchs, H.(1994). A sorting classification of parallel rendering.IEEE Comput. Graph. Appl., Vol. 14, No. 4, pp. 23–32.

10. Rahman, K., Ghani, N. A., Kamil, A. A., &Mustafa, A. (2012). Analysis of Pedestrian FreeFlow Walking Speed in a Least Developing Country: A Factorial Design Study. Research Journal ofApplied Sciences, Engineering & Technology, Vol. 4,No. 21, pp. 4299–4304.

11. Reynolds, C. W. (1987). Flocks, herds and schools:A distributed behavioral model. Proceedings of the14th Annual Conference on Computer Graphics andInteractive Techniques, SIGGRAPH ’87, ACM, NewYork, NY, USA, pp. 25–34.

12. Richmond, P. & Romano, D. (2011). Template-Driven Agent-Based Modeling and Simulation withCUDA. In GPU Computing Gems Emerald Edition,Applications of GPU Computing Series, chapter 21.Morgan Kaufmann, 1 edition, 313–324.

13. Rudomin, I., Hernandez, B., de Gyves, O.,Toledo, L., Rivalcoba, I., & Ruiz, S. (2013). Gpugeneration of large varied animated crowds. Com-putacion y Sistemas (CyS) special issue on Super-computing: Applications and Technologies, Vol. 17,No. 3.

14. Rudomın, I., Millan, E., & Hernandez, B. (2005).Fragment shaders for agent animation using finitestate machines. Simulation Modelling Practice andTheory, Vol. 13, No. 8, pp. 741–751.

15. Ruiz, S. & Hernandez, B. (2014). Markov decisionprocess and micro scenarios for crowd navigationand collision avoidance. Research in ComputingScience, Vol. 74, pp. 103–116.

16. Ruiz, S., Hernandez, B., Alvarado, A., &Rudomın, I. (2013). Reducing memory require-ments for diverse animated crowds. Proceedings ofMotion on Games, MIG ’13, ACM, New York, NY,USA, pp. 55:77–55:86.

17. Toledo, L., De Gyves, O., & Rudomın, I. (2014).Hierarchical level of detail for varied animatedcrowds. The Visual Computer, Vol. 30, No. 6-8,pp. 949–961.

18. Vigueras, G., Orduna, J. M., Lozano, M., & Ce-cilia, J. M. (2014). Accelerating collision detectionfor large-scale crowd simulation on multi-core andmany-core architectures. Int. J. High Perform. Com-put. Appl., Vol. 28, No. 1, pp. 33–49.

19. Vigueras, G., Orduna, J. M., Lozano, M., & Jegou,Y. (2013). A scalable multiagent system architecturefor interactive applications. Sci. Comput. Program.,Vol. 78, No. 6, pp. 715–724.

20. Wittek, P. & Rubio-Campillo, X. (2012). Scalableagent-based modelling with cloud hpc resourcesfor social simulations. 2013 IEEE 5th InternationalConference on Cloud Computing Technology andScience, pp. 355–362.

21. Yilmaz, E., Isler, V., & Cetin, Y. Y. (2009). Thevirtual marathon: Parallel computing supports crowdsimulations. IEEE Computer Graphics and Applica-tions, Vol. 29, No. 4, pp. 26–33.

22. Yu, H., Wang, C., Grout, R., Chen, J., & Ma,K.-L. (2010). In situ visualization for large scalecombustion simulations. Computer Graphics andApplications, Vol. 30, No. 3, pp. 45–57.




Benjamín Hernandez is a postdoctoral researcher at the Barcelona Supercomputing Center in Spain. Dr. Hernandez focuses his research interests on the intersection of real-time crowd simulation, human-computer interaction, and visualization of inhabited virtual environments using high performance computing. He has advised postgraduate theses in this field and co-directed the NVIDIA CUDA Teaching Center Initiative at Tecnologico de Monterrey. He currently holds the National System of Researchers Fellowship (SNI-C) from the Mexican National Council for Science and Technology (CONACYT), Mexico.

Hugo Perez is a Ph.D. student specializing in Computer Science at Universitat Politecnica de Catalunya, Barcelona, Spain. His research interests are real-time crowd simulation, high-performance computing, parallel programming models, and computer graphics.

Isaac Rudomin received his Ph.D. from the University of Pennsylvania. He is currently a senior researcher at the Barcelona Supercomputing Center, Spain. His research interests are human and crowd generation, simulation, animation and visualization, human-computer interaction, and high performance computing.

Sergio Ruiz is a software engineer at the Monterrey Institute of Technology, Mexico City campus (Tecnologico de Monterrey, campus Ciudad de Mexico). Currently, he is a Ph.D. student at the same institution; his research area is path planning for simulated crowds, and he is proposing a thesis entitled “A Hybrid Method for Macro and Micro Simulation of Crowd Behavior”. He is currently on a research interchange visit working on crowd simulation at the Barcelona Supercomputing Center, Spain.

Oriam de Gyves is a Ph.D. specializing in Computer Science, and particularly in Computer Graphics, at Tecnologico de Monterrey, campus Estado de Mexico. He researches behaviors for crowd simulation using general-purpose computation on graphics processing units. His research interests include simulation, behavior, and visualization of real-time crowds, as well as parallel computing on GPUs.

Leonel Toledo is a Ph.D. student at Tecnologico de Monterrey, Campus Estado de Mexico. He made a research interchange visit working on crowd simulation at the Barcelona Supercomputing Center, Spain. He has been a half-time professor at Tecnologico de Monterrey, Campus Estado de Mexico, since 2011, and his research interests include crowd simulation, animation, visualization, and rendering.

Article received on 23/06/2014; accepted on 28/09/2014.


