X10-based Massive Parallel Large-Scale Traffic Flow Simulation
Toyotaro Suzumura1,2, Sei Kato1, Takashi Imamichi1, Mikio Takeuchi1, Hiroki Kanezashi2, Tsuyoshi Ide1, and Tamiya Onodera1
IBM Research – Tokyo1, Tokyo Institute of Technology2
This research was partly supported by the Japan Science and Technology Agency (JST) Core Research of Evolutionary Science and Technology (CREST)
X10-based Ultra-Large Scale Agent Simulation on the 2 Petaflops Supercomputer
Goal: To build a scalable large-scale agent simulation platform based on X10 that runs on a supercomputer with tens of thousands of CPU cores and dual links of 40 Gbps InfiniBand network
Status: Completed the multi-node version and verified the scalable performance with the Hiroshima road network.
[Figure: Megaffic and X10 on TSUBAME, a 2 Petaflops supercomputer]
Simulation Data: Hiroshima, # of trips: 10000 (1/100 of real trips), # of simulation steps: 1000 (1/100 of real steps: 24 hours)
[Figure: Simulation time. x-axis: Places (1, 2, 4, 8, 16); y-axis: Time (s), 0 to 250; annotations: 12 cores, 196 cores]
Outline
§ Motivation
§ XAXIS Overview and Architecture
§ Design for Highly Scalable Platform
§ Performance Evaluation
§ Discussion
§ Related Work
§ Concluding Remarks and Future Work
§ Other Activities
Background: Large-scale Simulation is Everywhere
§ We have entered an era in which proactive response is needed
§ High-performance large-scale simulation is required for timely decisions.
http://mark.buchanan.pagesperso-orange.fr/nature_economic_modelling.pdf
How can we design and develop a highly distributed agent simulation platform?
§ How can we design and implement a platform that handles millions of agents and multiple simulations concurrently?
§ How can we handle large-scale graphs consisting of millions of vertices and tens of millions of edges, such as the whole Japanese road network?
X10-based Large Scale Agent Simulation on the 2 Petaflops Supercomputer
§ Goal: To build a scalable large-scale agent simulation platform that runs on a supercomputer with thousands of cores and dual links of 40 Gbps InfiniBand network
§ Technical Challenge towards High Scalability: How can we concurrently process multiple agents in a scalable manner?
– How can we divide an extremely large graph into a set of subgraphs and allocate each subgraph to a compute node on a supercomputer, so as to find the allocation pattern that best balances communication and computational cost based on profiling data at runtime?
→ Prior art tackles similar problems, but a different underlying environment and application need a different optimization scheme
[Figure labels: Megaffic, X10]
XAXIS: X10-based Agents eXecutive Infrastructure for Simulation
§ X10-based Distributed Agent Simulation Platform
– X10 is a state-of-the-art PGAS (Partitioned Global Address Space) language that brings high productivity when implementing highly parallel and distributed applications on post-peta or exascale machines
• X10 can integrate seamlessly with legacy applications written in Java or C++.
§ Programming Model
– The agent programming model of XAXIS is derived from our ZASE simulation platform [Yamamoto, AAMAS2007]
– XAXIS provides developers with an API compatible with ZASE.
Gaku Yamamoto et al., "A Platform for Massive Agent-based Simulation and its Evaluation", AAMAS 2007
XAXIS Software Stack
§ The following diagram illustrates the software stack of XAXIS and its applications.
§ XAXIS in X10 can execute existing ZASE applications written in Java with slight modifications
[Figure: XAXIS software stack. Components: Agent Simulation (Java) (e.g. Traffic, CO2 Emission, Auction, Marketing); ZASE API; ZASE Simulation Runtime (Java); ZASE-XAXIS-Bridge (Java); ZASE-XAXIS-Bridge (X10); Agent Simulation (X10); XAXIS: X10-Based Simulation Runtime; X10 (Java, C++)]
XAXIS Architecture: X10-Based Agent Simulator
[Figure: message delivery from agent A1 at Place P to agent A2 at Place Q]
Place P (User Agent Code, Agent Manager, Agent Directory, Agent Object Repository)
– A1: execute, sendMessage with Msg (agent id)
– Agent Directory: identify a place id
– Send message with "async at": Msg (place id, agent id)
Place Q (User Agent Code, Agent Manager, Agent Directory, Agent Object Repository)
– receiveMessage
– (3) Retrieves an agent object with an agent id
– Invoke an onHandleMessage method of the A2 object
Place 0 (XAXIS Server)
– Global Data
– Simulation Cycle Management
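The message flow in this architecture can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `Agent`, `Place`, and `AgentManager` and their methods are hypothetical stand-ins, not the XAXIS or ZASE API, and the cross-place hop that X10 would perform with "async at" is replaced by an ordinary local call.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical user agent; mirrors the onHandleMessage callback."""
    agent_id: str
    inbox: list = field(default_factory=list)

    def on_handle_message(self, payload):
        self.inbox.append(payload)


class Place:
    """Stands in for one X10 place holding an agent object repository."""

    def __init__(self, place_id):
        self.place_id = place_id
        self.repository = {}  # agent id -> agent object

    def receive_message(self, agent_id, payload):
        agent = self.repository[agent_id]   # retrieve the agent by its id
        agent.on_handle_message(payload)    # invoke its handler


class AgentManager:
    """Routes messages using an agent directory (agent id -> place id)."""

    def __init__(self, places):
        self.places = {p.place_id: p for p in places}
        self.directory = {}

    def register(self, agent, place_id):
        self.directory[agent.agent_id] = place_id
        self.places[place_id].repository[agent.agent_id] = agent

    def send_message(self, agent_id, payload):
        place_id = self.directory[agent_id]  # identify a place id
        # In X10 this hop would be an "async at" to the target place;
        # here it is a plain local call (assumption for illustration).
        self.places[place_id].receive_message(agent_id, payload)
        return place_id


place_p, place_q = Place("P"), Place("Q")
manager = AgentManager([place_p, place_q])
a1, a2 = Agent("A1"), Agent("A2")
manager.register(a1, "P")
manager.register(a2, "Q")
delivered_to = manager.send_message("A2", "hello from A1")
```

The directory lookup is the only step that needs global knowledge; everything after it happens at the place that owns the receiving agent.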
XAXIS-based Large Scale Traffic Simulator
[Figure: Place 0 hosts the Simulation Manager and Simulation Cycle Management. Places P and Q each hold a SubGraph with Roads, CrossPoints (X10 Activity), Origin (X10 Activity), Destination (X10 Activity), and Vehicle Proxy/Vehicle objects, and run the simulation execution at time T. The Graph Server (Java) holds the Graph.]
Mapping Megaffic Components to X10
[Figure labels: Megaffic, X10, Megaffic on XAXIS, XAXIS Runtime]
Megaffic component | XAXIS runtime class
GraphServer | Driver.StaticDriver
roadnetwork.Road | simulator.Place
roadnetwork.Area | simulator.Region
roadnetwork.CrossPoint | simulator.Driver
TrafficEnv Service | citizen.Service
TrafficSim Launcher | Simulator.Launcher/RegionLauncher
Vehicle | simulator.Citizen
Vehicle Proxy | simulator.CitizenProxy
ShortestPath/Dijkstra
Component Diagram
[Figure: component diagram. Labels: Graph (road network); Area (Zase Region); Road (Zase Place); CrossPoint (ZaseDriver); VehicleProxy; Vehicle; Driver; in and out links]
Each X10 Place manages a different set of CrossPoints (XAXIS)
[Figure: the road network graph is split into SubGraphs (Road Network), one per X10 place. Each SubGraph contains its own CrossPoints (ZaseDriver), Roads with in and out links, and VehicleProxy, Vehicle, and Driver objects.]
Design for Vehicle Migration among Different X10 Places
[Figure: migration of a vehicle from cross point A1 at X10 Place (P) to cross point A2 at X10 Place (Q)]
X10 Place (P) (Cross Point Manager, Cross Point Directory, Road Network (P))
– Cross Point A1: execute and send migrating vehicles
– Cross Point Directory: identify a place id
– Send vehicle with "async at": vehicle object message, Vehicle (migration cross point id)
X10 Place (Q) (Cross Point Manager, Cross Point Directory, Migration Vehicle Repository, Road Network (Q))
– Receive migrated vehicle object
– Deserialize migrated vehicle
– (3) Retrieves an agent object with an agent id
– Invoke an onHandleMessage method of the A2 object
Place 0 (ZASE-X Server)
– Global Data
– Simulation Cycle Management
Design for Vehicle Migration Among X10 Places
[Figure: cross points CP0, CP1, CP2, CP3 and roads Road 0, Road 1, Road 2 distributed over X10 Place 0, X10 Place 1, and X10 Place 2]
• Each road object stores the identifiers of its origin and destination cross points. From a cross point identifier, one can determine the X10 place where that cross point is located, because the identifier is assigned by the X10 DistArray construct.
• A road object also exists at the place where its destination cross point is located. For instance, if a trip takes CP1 as origin and CP3 as destination, the graph server returns CP1, CP2, and CP3 as the shortest path. When a vehicle first enters a road, it checks whether that road exists at the same X10 place. If not, a manager migrates the vehicle to the next road, which exists at a different X10 place.
CP: Cross Point
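The migration rule described in the two bullets can be sketched as follows. This is a minimal illustration, not the XAXIS implementation: the mapping `CROSS_POINT_PLACE` is a hypothetical layout (the slides do not state which cross point lives at which place), and the helper names are invented for this example.

```python
# Hypothetical layout: which X10 place owns each cross point.
# In XAXIS this would follow from the DistArray-assigned identifier.
CROSS_POINT_PLACE = {"CP0": 0, "CP1": 1, "CP2": 1, "CP3": 2}


def road_place(road):
    """A road is looked up at the place of its destination cross point."""
    _origin, destination = road
    return CROSS_POINT_PLACE[destination]


def plan_migrations(path):
    """Return (road, from_place, to_place) for each hop that changes place."""
    current_place = CROSS_POINT_PLACE[path[0]]  # vehicle starts at its origin
    migrations = []
    for road in zip(path, path[1:]):
        target_place = road_place(road)
        if target_place != current_place:
            # The road is not at the vehicle's current place: migrate.
            migrations.append((road, current_place, target_place))
            current_place = target_place
    return migrations


shortest_path = ["CP1", "CP2", "CP3"]  # as returned by the graph server
migrations = plan_migrations(shortest_path)
```

With this assumed layout, the road from CP1 to CP2 stays within one place, while entering the road from CP2 to CP3 triggers a single migration.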
TSUBAME 2.0 Supercomputer
TSUBAME 2.0 System Configuration
TSUBAME 2.0 Specification
CPU: Intel Westmere EP (Xeon X5670, L2 Cache: 256 KB, L3: 12 MB) 2.93 GHz processors, 12 CPU Cores (24 cores with Hyper-Threading) x 2 sockets per 1 node (24 CPU Cores)
RAM: 54 GB
OS: SUSE Linux Enterprise 11 (Linux kernel: 2.6.32)
# of Total Nodes: 1466 nodes (We only used up to 1366 nodes)
Network Topology: Full-Bisection Fat-Tree Topology
Network: Voltaire / Mellanox Dual-rail QDR InfiniBand (40 Gbps x 2 = 80 Gbps)
GPGPU: Three NVIDIA Fermi M2050 GPUs (*Not used for this work)
GCC and OpenMP: GCC 4.3.4 (-O3 option), OpenMP 3.0
OpenMPI: OpenMPI 1.5.3, MVAPICH 1.6.1
Java Virtual Machine: IBM Java 1.6.0 (GC Policy: gencon)
X10: X10 2.1.1.1
Performance Evaluation – Single Node
[Figure: Performance Characteristics of XAXIS (# of trips: 115000 (1/10), road network: hiroshima). x-axis: # of threads (1, 2, 4, 6, 8, 10, 12); y-axis: speedup (1 to 5.5); series: # of simulations 100, 1000, 10000, 100000]
Objective
• To see the speed-up ratio with a varying number of threads and overall simulation steps
Experimental Setting:
• Road network: hiroshima
• # of trips: 115000 (1/10)
• # of simulations: 100, 1000, 10000, 100000
Findings:
- The real traffic data could not fully utilize the CPUs.
- To evaluate the full capability of XAXIS+MegafficCUI, we need to create artificial data.
- The expected time for 1 day will be 355 seconds x 10 = 3550 seconds, about 1 hour.
[Figure: # of simulations: 10000 (real data is 86400). x-axis: # of threads (1, 2, 4, 6, 8, 10, 12); y-axis: Elapsed Time (sec), 0 to 1400]
Environment: S72hs22-4.trl.ibm.com (Intel Xeon 6 core x 2 sockets, Hyper-Threading: off, 8 GB RAM, RHEL5), IBM J9 VM 1.6.0 (GC policy: gencon, -Xms4096m, -Xmx6144m)
Performance Evaluation – Multiple Nodes (100 trips per step)
# of Places          1        2        4        8        16
# of Migrations      0        3214     8282     14836    13261
Execution Time (s)   221.172  132.977  81.267   72.229   49.619
Settings: # of trips: 100000, # of steps: 1000, RI=true, # of threads per node: 12, heap memory: 32 GB
The origin and destination of each trip are in the same place, but some trips move to other places depending on the shortest path.
[Figure: The number of migrations. x-axis: Places (nodes, 12 CPU cores per node): 1, 2, 4, 8, 16; y-axis: number of migrations, 0 to 16000]
[Figure: Simulation time. x-axis: Places (nodes, 12 CPU cores per node): 1, 2, 4, 8, 16; y-axis: Time (s), 0 to 250]
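To read the execution times above as scalability figures, one can compute speedup and parallel efficiency relative to the single-place run. The short sketch below uses only the reported times; the definitions (speedup = T1 / Tp, efficiency = speedup / p) are the standard ones and are not taken from the slides.

```python
# Execution times reported for 1, 2, 4, 8, and 16 places.
places = [1, 2, 4, 8, 16]
execution_time_s = [221.172, 132.977, 81.267, 72.229, 49.619]

baseline = execution_time_s[0]
# Speedup relative to one place, and efficiency per place.
speedup = [baseline / t for t in execution_time_s]
efficiency = [s / p for s, p in zip(speedup, places)]

for p, s, e in zip(places, speedup, efficiency):
    print(f"{p:>2} places: speedup {s:.2f}, efficiency {e:.2f}")
```

The efficiency falls steadily as places are added, which is consistent with the low CPU usage reported in the Discussion slide.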
Performance Evaluation – Multiple Nodes (100 trips per step, more migrations)
# of Places          1        2        4        8        16
# of Migrations      0        12523    26889    41326    63747
Simulation Time (s)  223.125  139.870  97.121   90.276   64.080
Settings: # of trips: 100000, # of steps: 1000, RI=true, # of threads per node: 12, heap memory: 32 GB
The origin and destination of each trip are located at different places.
[Figure: Simulation time. x-axis: Places (nodes, 12 CPU cores per node): 1, 2, 4, 8, 16; y-axis: Time (s), 0 to 250]
[Figure: The number of migrations. x-axis: Places: 1, 2, 4, 8, 16; y-axis: number of migrations, 0 to 70000]
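A rough way to compare the two multi-node experiments is to look at how much the run with more migrations slows down at each place count. The sketch below uses only the reported numbers; the slowdown ratio is an illustrative metric chosen here, not one used in the slides, and it does not isolate migration cost from other effects.

```python
places = [1, 2, 4, 8, 16]

# Run 1: origin and destination in the same place.
time_same_place_s = [221.172, 132.977, 81.267, 72.229, 49.619]
migrations_same = [0, 3214, 8282, 14836, 13261]

# Run 2: origin and destination in different places.
time_diff_place_s = [223.125, 139.870, 97.121, 90.276, 64.080]
migrations_diff = [0, 12523, 26889, 41326, 63747]

# Relative slowdown of run 2 and the number of additional migrations.
slowdown = [d / s for d, s in zip(time_diff_place_s, time_same_place_s)]
extra_migrations = [d - s for d, s in zip(migrations_diff, migrations_same)]

for p, r, m in zip(places, slowdown, extra_migrations):
    print(f"{p:>2} places: {m:>6} extra migrations, slowdown x{r:.2f}")
```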
Discussion
§ CPU usage drops sharply as we employ more nodes
§ This is because each node handles fewer cross points
§ We need heavier computation or more trips per step for better scalability
[Figure: CPU usage. x-axis: Places (1, 2, 4, 8, 16); y-axis: CPU usage (%), 0 to 45; series: CPU_us, CPU_sy]
[Figures: Hiroshima with 16 Places; Rio de Janeiro; Singapore]
Large-scale traffic simulation with the whole Japanese road network consisting of 1 million cross points and 10 million vehicles
TSUBAME: 2 Petaflops Supercomputer
Performance Analysis on TSUBAME
2011 IEEE International Symposium on Workload Characterization
The synchronization overhead greatly affects performance when hundreds of threads are involved and scattered across a distributed system. As shown in the following graph, when we loosen the synchronization, we obtain nearly linear performance scalability.
Towards more scalability
§ Problem
– As shown in the previous chart, the synchronization overhead greatly affects performance when hundreds of threads are involved and scattered across a distributed system
§ Possible Solutions
– Looser synchronization without losing simulation precision
– A better parallelization approach
– Hierarchical synchronization … etc.
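The idea of looser synchronization can be illustrated with a toy model in which workers meet at a global barrier only every few steps instead of after every step. This is a sketch under assumptions, not the approach used in XAXIS: the function `run` and the parameter `sync_interval` are hypothetical, and Python threads stand in for X10 activities.

```python
import threading


def run(num_workers, num_steps, sync_interval):
    """Advance workers, meeting at a barrier every sync_interval steps."""
    barrier = threading.Barrier(num_workers)
    barrier_waits = [0] * num_workers
    clocks = [0] * num_workers

    def worker(rank):
        for step in range(1, num_steps + 1):
            clocks[rank] = step          # stand-in for one simulation step
            if step % sync_interval == 0:
                barrier.wait()           # global synchronization point
                barrier_waits[rank] += 1

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return clocks, barrier_waits


# Strict: synchronize after every step. Loose: every 10 steps.
strict_clocks, strict_waits = run(num_workers=4, num_steps=100, sync_interval=1)
loose_clocks, loose_waits = run(num_workers=4, num_steps=100, sync_interval=10)
```

The loose variant performs a tenth of the barrier waits, but between barriers workers may drift up to `sync_interval` steps apart, which is where the concern about simulation precision comes from.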
Related Work
§ Yamamoto et al., A Platform for Massive Agent-based Simulation and its Evaluation, 2007
§ David et al., Distributed Platform for Large-Scale Agent-Based Simulations, 2009
§ Gorgious et al., Large Scale Distributed Simulation on the Grid, 2006
§ Nayer et al., Large-Scale Multi-Agent-Based Simulation using Exemplars
§ Dan Chen et al., Large scale agent-based simulation on the grid, Journal Future Generation Computer Systems
§ Yi Zhang et al., Grid-aware Large Scale Distributed Simulation of Agent-based Systems, 2005
§ A flexible, large-scale, distributed agent based epidemic model, WSC 2007
§ Comparison of agent-based modeling software, http://en.wikipedia.org/wiki/Comparison_of_agent-based_modeling_software
Demonstration
§ Rio de Janeiro
§ Beijing
Concluding Remarks and Future Work
§ Summary
– We designed and developed an X10-based agent simulation platform and verified its scalable performance on the TSUBAME 2.0 supercomputer
§ Future Work
– More Performance Optimization
• Agent Migration Overhead, Graph Partitioning, Time Decomposition, and more drastic ones …
• Other agent simulations
• Experiments with BlueGene/P and later models, or the RIKEN K supercomputer