X10-based Massive Parallel Large-Scale Traffic Flow Simulation

Toyotaro Suzumura (1, 2), Sei Kato (1), Takashi Imamichi (1), Mikio Takeuchi (1), Hiroki Kanezashi (2), Tsuyoshi Ide (1), and Tamiya Onodera (1)

(1) IBM Research – Tokyo, (2) Tokyo Institute of Technology

This research was partly supported by the Japan Science and Technology Agency (JST) Core Research for Evolutional Science and Technology (CREST).


X10-based Ultra-Large Scale Agent Simulation on the 2 Petaflops Supercomputer

Goal: To build a scalable, large-scale agent simulation platform based on X10 that runs on a supercomputer with tens of thousands of CPU cores and dual 40 Gbps InfiniBand links

Status: Completed the multi-node version and verified the scalable performance with the Hiroshima road network.

[Figure: Megaffic on X10, running on TSUBAME (2 Petaflops Supercomputer)]

Simulation Data: Hiroshima
# of trips: 10000 (1/100 of real trips)
# of simulation steps: 1000 (1/100 of real steps: 24 hours)

[Figure: Simulation time. x-axis: Places (1, 2, 4, 8, 16); y-axis: Time (s), 0 to 250. Annotations: 12 cores, 196 cores.]

Outline

§ Motivation

§ XAXIS Overview and Architecture

§ Design for Highly Scalable Platform

§ Performance Evaluation

§ Discussion

§ Related Work

§ Concluding Remarks and Future Work

§ Other Activities

Background: Large-scale Simulation is Everywhere

§ We have entered an era in which proactive response is needed

§ High-performance, large-scale simulation is required for timely decisions.


http://mark.buchanan.pagesperso-orange.fr/nature_economic_modelling.pdf

How can we design and develop a highly distributed agent simulation platform?

§ How can we design and implement a platform that handles millions of agents and multiple simulations concurrently?

§ How can we handle large-scale graphs consisting of millions of vertices and tens of millions of edges, such as the whole Japanese road network?


X10-based Large Scale Agent Simulation on the 2 Petaflops Super Computer

§ Goal: To build a scalable, large-scale agent simulation platform that runs on a supercomputer with thousands of cores and dual 40 Gbps InfiniBand links

§ Technical challenge towards high scalability: How can we concurrently process multiple agents in a scalable manner?

– How can we divide an extremely large graph into a set of subgraphs and allocate each subgraph to a compute node on a supercomputer, so as to find the allocation pattern that best balances communication and computation cost based on profiling data collected at runtime?

→ Prior work tackles similar problems, but a different underlying environment and application call for a different optimization scheme


XAXIS: X10-based Agents eXecutive Infrastructure for Simulation

§ X10-based Distributed Agent Simulation Platform

– X10 is a state-of-the-art PGAS (Partitioned Global Address Space) language that brings high productivity when implementing highly parallel and distributed applications on post-petascale or exascale machines

• X10 can integrate seamlessly with legacy applications written in Java or C++.

§ Programming Model

– The agent programming model of XAXIS is derived from our ZASE simulation platform [Yamamoto, AAMAS2007]

– XAXIS provides a ZASE-compatible API to developers.

Gaku Yamamoto et al., “A Platform for Massive Agent-based Simulation and its Evaluation”, AAMAS 2007

XAXIS Software Stack

§ The following diagram illustrates the software stack of XAXIS and its applications.

§ XAXIS in X10 can execute existing ZASE applications written in Java with slight modification

[Figure: XAXIS software stack. Components shown: Agent Simulation (Java) (e.g. Traffic, CO2 Emission, Auction, Marketing); ZASE Simulation Runtime (Java); ZASE API; ZASE-XAXIS-Bridge (Java); ZASE-XAXIS-Bridge (X10); Agent Simulation (X10); XAXIS: X10-Based Simulation Runtime; X10 (Java, C++).]

XAXIS Architecture: X10-Based Agent Simulator

[Figure: XAXIS architecture. Place 0 hosts the XAXIS Server (global data, simulation cycle management). Place P and Place Q each host user agent code, an Agent Manager, an Agent Directory, and an Agent Object Repository. Message flow from agent A1 at Place P to agent A2 at Place Q:
– A1 calls sendMessage from its execute method, passing Msg (agent id)
– The Agent Directory identifies a place id
– The message Msg (place id, agent id) is sent with “async at”
– At Place Q, receiveMessage retrieves an agent object with an agent id from the Agent Object Repository
– The onHandleMessage method of the A2 object is invoked]
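The message flow in the architecture figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the XAXIS implementation: the class names `Place`, `Agent`, and `AgentManager` and the method names are hypothetical, the directory is a plain dictionary shared by the manager, and the remote hop that XAXIS performs with X10's "async at" is replaced here by an ordinary local method call.

```python
from dataclasses import dataclass, field


class Agent:
    """Hypothetical agent that records every message it handles."""

    def __init__(self, agent_id):
        self.agent_id = agent_id
        self.inbox = []

    def on_handle_message(self, payload):
        # Stand-in for the user-defined onHandleMessage callback.
        self.inbox.append(payload)
        return len(self.inbox)


@dataclass
class Place:
    """Stand-in for one X10 place with its own agent object repository."""

    place_id: int
    repository: dict = field(default_factory=dict)  # agent id -> Agent

    def receive_message(self, agent_id, payload):
        # Retrieve the agent object by agent id, then invoke its handler.
        agent = self.repository[agent_id]
        return agent.on_handle_message(payload)


class AgentManager:
    """Routes a message to the place that owns the target agent."""

    def __init__(self, places, directory):
        self.places = places        # place id -> Place
        self.directory = directory  # agent id -> place id (agent directory)

    def send_message(self, agent_id, payload):
        # Identify the place id, then deliver. In XAXIS this hop would be
        # an asynchronous remote call; here it is a local call (assumption).
        place_id = self.directory[agent_id]
        return self.places[place_id].receive_message(agent_id, payload)


if __name__ == "__main__":
    places = {0: Place(0), 1: Place(1)}
    a2 = Agent("A2")
    places[1].repository["A2"] = a2
    manager = AgentManager(places, {"A2": 1})
    manager.send_message("A2", "hello from A1")
    print(a2.inbox)
```

The point of the directory lookup is that the sender only needs an agent id; which place holds the agent object is resolved by the platform.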

XAXIS-based Large Scale Traffic Simulator

[Figure: XAXIS-based traffic simulator. Place 0 hosts the Simulation Manager; a Graph Server (Java) holds the graph. Place P and Place Q each hold a SubGraph of Roads and CrossPoints (X10 activities), including Origin and Destination cross points (X10 activities), together with Vehicle Proxy/Vehicle objects. Each place runs the simulation execution at time T.]



Mapping Megaffic Components to X10

[Figure: mapping of Megaffic components onto XAXIS (Megaffic on X10; Megaffic on XAXIS on the XAXIS Runtime). Labels in the figure, in order: GraphServer, Driver.StaticDriver; roadnetwork.Road, simulator.Place; roadnetwork.Area, simulator.Region; roadnetwork.CrossPoint, simulator.Driver; TrafficEnv; Service, citizen.Service; TrafficSim; Launcher, Simulator.Launcher/RegionLauncher; ShortestPath/Dijkstra; Vehicle, simulator.Citizen; Vehicle Proxy, simulator.CitizenProxy.]

Component Diagram

[Figure: component diagram. Labels: Area (Zase Region); Road (Zase Place); CrossPoint (ZaseDriver); VehicleProxy; Vehicle; Driver; Graph (road network); in and out arrows on roads.]



Each Place (X10) manages a different set of CrossPoints (XAXIS)

[Figure: the road network graph is divided into SubGraphs (Road Network), one per X10 place. Each place shows its own CrossPoints (ZaseDriver), Roads with in and out arrows, VehicleProxy, Vehicle, and Driver objects.]

Design for Vehicle Migration among Different X10 Places

[Figure: vehicle migration between X10 Place (P) and X10 Place (Q). Place 0 hosts the ZASE-X Server (global data). Each place holds a Cross Point Manager, a Cross Point Directory, a Migration Vehicle Repository, and its part of the road network (Road Network (P), Road Network (Q)). Flow from cross point A1 at Place P to cross point A2 at Place Q:
– A1 executes and sends the migrating vehicles
– The Cross Point Directory identifies a place id from the vehicle's migration cross point id
– The vehicle object message is sent with “async at”
– Place Q receives the migrated vehicle object and deserializes it
– (3) Retrieves an agent object with an agent id
– The onHandleMessage method of the A2 object is invoked]

Design for Vehicle Migration Among X10 Places

[Figure: cross points CP0, CP1, CP2, and CP3 connected by Road 0, Road 1, and Road 2, distributed over X10 Place 0, X10 Place 1, and X10 Place 2.]

• A road object holds the identifiers of its origin and destination cross points. From a cross point identifier, it is possible to tell which X10 place the cross point is located at, since the identifier is assigned by the X10 DistArray construct.

• A road object also exists at the place where its destination cross point is located. For instance, if a trip takes CP1 as its origin and CP3 as its destination, the graph server returns CP1, CP2, and CP3 as the shortest path. When a vehicle first enters a road, it checks whether that road exists at the same X10 place. If not, a manager migrates the vehicle to the X10 place where the next road exists.

CP: Cross Point
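The migration rule described above can be illustrated with a small Python sketch. Several things here are assumptions made for the example and are not taken from the slides: cross point ids are mapped to places with a simple contiguous block layout (the real assignment is done by X10's DistArray and may differ), each road is looked up only at the place of its destination cross point, and the function names are hypothetical.

```python
def place_of_cross_point(cp_id, num_cross_points, num_places):
    """Map a cross point id to a place id using a contiguous block layout.

    Assumption: ids 0..num_cross_points-1 are split into equal-sized
    blocks, one block per place.
    """
    block = -(-num_cross_points // num_places)  # ceiling division
    return cp_id // block


def migrations_along_path(path, num_cross_points, num_places):
    """Return the roads on which a vehicle has to change place.

    `path` is the list of cross point ids returned by the shortest-path
    search. Each result is (origin_cp, destination_cp, from_place, to_place).
    """
    current = place_of_cross_point(path[0], num_cross_points, num_places)
    hops = []
    for origin, destination in zip(path, path[1:]):
        # Simplification: the road is taken to live where its
        # destination cross point lives.
        road_place = place_of_cross_point(destination, num_cross_points, num_places)
        if road_place != current:
            hops.append((origin, destination, current, road_place))
            current = road_place
    return hops


if __name__ == "__main__":
    # Four cross points on two places: CP0, CP1 on place 0; CP2, CP3 on place 1.
    print(migrations_along_path([1, 2, 3], num_cross_points=4, num_places=2))
```

In this toy layout the trip CP1, CP2, CP3 migrates once, when the vehicle enters the road from CP1 to CP2; the road from CP2 to CP3 is already on the same place.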








TSUBAME 2.0 Supercomputer

TSUBAME 2.0 System Configuration

Item: Specification
CPU: Intel Westmere EP (Xeon X5670, L2 cache: 256 KB, L3: 12 MB) 2.93 GHz processors, 2 sockets per node, 12 CPU cores per node (24 with Hyper-Threading)
RAM: 54 GB
OS: SUSE Linux Enterprise 11 (Linux kernel: 2.6.32)
# of total nodes: 1466 nodes (we only used up to 1366 nodes)
Network topology: Full-bisection fat-tree topology
Network: Voltaire / Mellanox dual-rail QDR InfiniBand (40 Gbps x 2 = 80 Gbps)
GPGPU: Three NVIDIA Fermi M2050 GPUs (not used for this work)
GCC and OpenMP: GCC 4.3.4 (-O3 option), OpenMP 3.0
OpenMPI: OpenMPI 1.5.3, MVAPICH 1.6.1
Java Virtual Machine: IBM Java 1.6.0 (GC policy: gencon)
X10: X10 2.1.1.1

Performance Evaluation – Single Node

[Figure: Performance characteristics of XAXIS (# of trips: 115000 (1/10), road network: Hiroshima). x-axis: # of threads (1, 2, 4, 6, 8, 10, 12); y-axis: speedup (1 to 5.5); series: # of simulations 100, 1000, 10000, 100000.]

Objective
• To see the speedup ratio while varying the number of threads and the overall number of simulation steps

Experimental Setting:
• Road network: Hiroshima
• # of trips: 115000 (1/10)
• # of simulations: 100, 1000, 10000, 100000

Findings:
- The result revealed that the real traffic data could not fully utilize the CPU.
- To evaluate the full capability of XAXIS+MegafficCUI, we need to create artificial data
- The expected time for 1 day is 355 seconds * 10 = 3550 seconds, or about 1 hour

[Figure: Elapsed time (sec) vs. # of threads (1, 2, 4, 6, 8, 10, 12); y-axis: 0 to 1400; # of simulations: 10000 (real data is 86400).]

Environment: S72hs22-4.trl.ibm.com (Intel Xeon 6 cores x 2 sockets, Hyper-Threading: off, 8 GB RAM, RHEL5), IBM J9 VM 1.6.0 (GC policy: gencon, -Xms4096m, -Xmx6144m)

Performance Evaluation – Multiple Nodes (100 trips per 1 step)

# of places:         1        2        4        8        16
# of migrations:     0        3214     8282     14836    13261
Execution time (s):  221.172  132.977  81.267   72.229   49.619

# of trips: 100000, # of steps: 1000, RI=true
# of threads per node: 12, heap memory = 32 GB
The origin and destination of each trip are in the same place, but some trips move to other places depending on the shortest path.

[Figure: The number of migrations vs. places (nodes, 12 CPU cores per node); x-axis: 1, 2, 4, 8, 16; y-axis: 0 to 16000.]

[Figure: Simulation time (s) vs. places (nodes, 12 CPU cores per node); x-axis: 1, 2, 4, 8, 16; y-axis: 0 to 250.]

Performance Evaluation – Multiple Nodes (100 trips per 1 step, More migrations)

# of places:          1        2        4        8        16
# of migrations:      0        12523    26889    41326    63747
Simulation time (s):  223.125  139.870  97.121   90.276   64.080

# of trips: 100000, # of steps: 1000, RI=true
# of threads per node: 12, heap memory = 32 GB
The origin and destination of each trip are located at different places.

[Figure: Simulation time (s) vs. places (nodes, 12 CPU cores per node); x-axis: 1, 2, 4, 8, 16; y-axis: 0 to 250.]

[Figure: The number of migrations vs. places; x-axis: 1, 2, 4, 8, 16; y-axis: 0 to 70000.]
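To make the scaling in the two multi-node experiments easier to compare, the short Python sketch below derives speedup and parallel efficiency from the reported times. The times are copied from the two experiments above; the variable and function names are only illustrative, and efficiency is computed per place (node), not per core.

```python
# Execution times in seconds for 1, 2, 4, 8, and 16 places,
# copied from the two multi-node experiments.
PLACES = [1, 2, 4, 8, 16]
FEW_MIGRATIONS = [221.172, 132.977, 81.267, 72.229, 49.619]
MORE_MIGRATIONS = [223.125, 139.870, 97.121, 90.276, 64.080]


def speedups(times):
    """Speedup of each run relative to the single-place run."""
    return [times[0] / t for t in times]


def efficiencies(times, places=PLACES):
    """Parallel efficiency: speedup divided by the number of places."""
    return [s / p for s, p in zip(speedups(times), places)]


if __name__ == "__main__":
    runs = (("few migrations", FEW_MIGRATIONS), ("more migrations", MORE_MIGRATIONS))
    for label, times in runs:
        for p, s, e in zip(PLACES, speedups(times), efficiencies(times)):
            print(f"{label}: {p:2d} places  speedup {s:5.2f}  efficiency {e:6.1%}")
```

With these numbers, 16 places give a speedup of roughly 4.5x in the first experiment and roughly 3.5x in the second, so per-place efficiency is well below 50% at 16 places in both cases.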

Discussion

§ CPU usage drops substantially as more nodes are employed

§ This is because the number of cross points per node decreases

§ We need heavier computation or more trips per step for better scalability

[Figure: CPU usage (%) vs. places (1, 2, 4, 8, 16); y-axis: 0 to 45; series: CPU_us, CPU_sy.]

Hiroshima with 16 Places

Rio De Janeiro

Singapore

Large-scale traffic simulation with the whole Japanese road network consisting of 1 million cross points and 10 million vehicles

TSUBAME: 2 Petaflops Supercomputer

Performance Analysis on TSUBAME

2011 IEEE International Symposium on Workload Characterization

The synchronization overhead greatly affects performance when hundreds of threads are involved and scattered across a distributed system. As shown in the following graph, when we loosen the synchronization, we obtain nearly linear performance scalability.

Towards more scalability

§ Problem

– As shown in the previous chart, the synchronization overhead greatly affects performance when hundreds of threads are involved and scattered across a distributed system

§ Possible Solutions

– Looser synchronization without losing simulation precision

– A better parallelization approach

– Hierarchical synchronization, etc.








Related Work

§ Yamamoto et al., A Platform for Massive Agent-based Simulation and its Evaluation, 2007

§ David et al., Distributed Platform for Large-Scale Agent-Based Simulations, 2009

§ Gorgious et al., Large Scale Distributed Simulation on the Grid, 2006

§ Nayer et al., Large-Scale Multi-Agent-Based Simulation using Exemplars

§ Dan Chen et al., Large scale agent-based simulation on the grid, Journal Future Generation Computer Systems

§ Yi Zhang et al., Grid-aware Large Scale Distributed Simulation of Agent-based Systems, 2005

§ A flexible, large-scale, distributed agent based epidemic model, WSC 2007

§ Comparison of agent-based modeling software, http://en.wikipedia.org/wiki/Comparison_of_agent-based_modeling_software


Demonstration

§ Rio de Janeiro

§ Beijing

Concluding Remarks and Future Work

§ Summary

– We designed and developed an X10-based agent simulation platform and verified its scalable performance on the TSUBAME 2.0 supercomputer

§ Future Work

– More performance optimization

• Agent migration overhead, graph partitioning, time decomposition, and more drastic ones …

• Other agent simulations

• Experiments with BlueGene/P and later models, or the RIKEN K supercomputer
