2006/1/23 Yutaka Ishikawa, The University of Tokyo 1
An Introduction to GridMPI
Yutaka Ishikawa (1,2) and Motohiko Matsuda (2)
(1) University of Tokyo
(2) Grid Technology Research Center, AIST (National Institute of Advanced Industrial Science and Technology)
This work is partially supported by the NAREGI project.
http://www.gridmpi.org/
2006/1/23 Yutaka Ishikawa, The University of Tokyo 2
Motivation
• MPI (Message Passing Interface) has been widely used to program parallel applications.
• Users want to run such applications over a Grid environment without any modification of the program.
• However, the performance of existing MPI implementations does not scale up in a Grid environment.
[Figure: a single (monolithic) MPI application running over the Grid environment, spanning computing resource sites A and B connected by a wide-area network]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 3
Motivation
• Focus on a metropolitan-area, high-bandwidth environment: 10 Gbps, 500 miles (less than 10 ms one-way latency).
  – We have already demonstrated, using an emulated WAN environment, that the performance of the NAS Parallel Benchmark programs scales up if the one-way latency is smaller than 10 ms.
Motohiko Matsuda, Yutaka Ishikawa, and Tomohiro Kudoh, "Evaluation of MPI Implementations on Grid-connected Clusters using an Emulated WAN Environment," CCGRID 2003.
2006/1/23 Yutaka Ishikawa, The University of Tokyo 4
Issues
• High-performance communication facilities for MPI on long and fat networks (see the comparison and sketch below)
  – TCP vs. MPI communication patterns
  – Network topology
    • Latency and bandwidth
• Interoperability
  – There are many MPI library implementations, and most of them use their own network protocols.
• Fault tolerance and migration
  – To survive a site failure
• Security
TCP: designed for streams.
MPI: burst traffic; the application repeats computation and communication phases, and the traffic changes with the communication pattern.
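To make the mismatch concrete, here is a minimal sketch (not from the talk) of the compute/communicate cycle typical of MPI codes: the network sits idle during computation, then every rank transmits at once, producing exactly the burst traffic that TCP's stream-oriented design does not expect.

```c
/* Minimal sketch of the phase-structured traffic pattern of MPI codes. */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double *local = malloc(N * sizeof(double));
    double *sum   = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++)
        local[i] = (double)i;

    for (int iter = 0; iter < 100; iter++) {
        /* Computation phase: the network is idle. */
        for (int i = 0; i < N; i++)
            local[i] = local[i] * 0.5 + iter;

        /* Communication phase: all ranks transmit at once,
         * dumping a burst of traffic onto the network. */
        MPI_Allreduce(local, sum, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    free(local);
    free(sum);
    MPI_Finalize();
    return 0;
}
```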
2006/1/23 Yutaka Ishikawa, The University of Tokyo 5
[Figure: the same issue list as the previous slide, now with four computing sites connected via the Internet, each using a different vendor's MPI library (Vendors A, B, C, and D), illustrating the interoperability problem]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 6
GridMPI Features
• MPI-2 implementation
• IMPI (Interoperable MPI) protocol, with extensions for the Grid
  – MPI-2
  – New collective protocols
  – Checkpointing
• Integration of vendor MPIs
  – IBM, Solaris, Fujitsu, and MPICH2
• High-performance TCP/IP implementation on long and fat networks
  – Paces the transmission rate so that burst transmission is controlled according to the MPI communication pattern (see the sketch after this slide)
• Checkpointing
[Figure: Cluster X running a vendor MPI and Cluster Y running YAMPII, interconnected via the IMPI protocol]
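The pacing idea can be pictured with a minimal sketch, assuming a plain TCP socket: instead of handing the kernel one large burst, the sender emits fixed-size chunks and sleeps out each chunk's time budget, so the average rate stays at or below a target. This is only an illustration; GridMPI's actual mechanism (the precise software pacing cited in the concluding remarks) controls inter-packet gaps far more accurately than coarse sleeps can.

```c
/* Illustrative software-pacing sketch (not GridMPI's actual code):
 * send len bytes over fd at no more than rate_bps bits per second
 * by sleeping after each fixed-size chunk. */
#include <time.h>
#include <sys/types.h>
#include <sys/socket.h>

#define CHUNK 8192  /* bytes per send() call */

void paced_send(int fd, const char *buf, size_t len, double rate_bps)
{
    /* Time budget for one full chunk at the target rate, in ns. */
    long gap_ns = (long)((double)CHUNK * 8.0 / rate_bps * 1e9);
    struct timespec gap = { gap_ns / 1000000000L, gap_ns % 1000000000L };

    while (len > 0) {
        size_t n = len < CHUNK ? len : CHUNK;
        ssize_t sent = send(fd, buf, n, 0);
        if (sent <= 0)
            return;            /* error handling elided */
        buf += sent;
        len -= (size_t)sent;
        nanosleep(&gap, NULL); /* cap the average transmission rate */
    }
}
```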
2006/1/23 Yutaka Ishikawa, The University of Tokyo 7
Evaluation
• It is almost impossible to reproduce the communication performance behavior of a real wide-area network from run to run.
• A WAN emulator, GtrcNET-1, is therefore used to examine implementations, protocols, communication algorithms, etc., under controlled, repeatable conditions.

GtrcNET-1 (developed at AIST, http://www.gtrc.aist.go.jp/gnet/):
• injection of delay, jitter, errors, …
• traffic monitoring and frame capture
• four 1000Base-SX ports, one USB port for the host PC, FPGA (XC2V6000)
2006/1/23 Yutaka Ishikawa, The University of Tokyo 8
Experimental Environment
8 PCs per cluster:
• CPU: Pentium 4 / 2.4 GHz, Memory: DDR400 512 MB
• NIC: Intel PRO/1000 (82547EI)
• OS: Linux 2.6.9-1.6 (Fedora Core 2)
• Socket buffer size: 20 MB (see the note after this slide)
[Figure: two 8-node clusters (Node0–Node7 and Node8–Node15), each behind a Catalyst 3750 switch, connected through the GtrcNET-1 WAN emulator; bandwidth 1 Gbps, delay 0 ms – 10 ms]
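A note on the 20 MB socket buffer (my arithmetic, not from the talk): at 1 Gbps with up to 10 ms one-way delay (20 ms RTT), the bandwidth-delay product is about 2.5 MB, so 20 MB leaves ample headroom for TCP's window. A minimal sketch of how such a buffer would be requested on a socket (the talk does not say how GridMPI sets it):

```c
/* Sketch: request 20 MB socket buffers so TCP can fill a long, fat
 * pipe. On Linux the kernel caps these via net.core.rmem_max and
 * net.core.wmem_max, which must also be raised for this to take
 * effect. */
#include <sys/socket.h>

int set_socket_buffers(int fd)
{
    int size = 20 * 1024 * 1024;  /* 20 MB, as in the experiments */

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
        return -1;
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
}
```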
2006/1/23 Yutaka Ishikawa, The University of Tokyo 9
GridMPI vs. MPICH-G2 (1/4)
FT (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes
[Graph: relative performance (y-axis, 0–1.2) vs. one-way delay (x-axis, 0–12 msec) for FT(GridMPI) and FT(MPICH-G2)]
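The deck never defines "relative performance" on these graphs; presumably (my reading, not stated in the talk) each point is the benchmark's performance at a given one-way delay normalized to a zero-delay baseline:

$$\mathrm{RelPerf}(d) = \frac{\text{performance at one-way delay } d}{\text{performance at } d = 0}$$

so a curve that stays near 1.0 out to 10 ms means the implementation tolerates the added latency.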
2006/1/23 Yutaka Ishikawa, The University of Tokyo 10
GridMPI vs. MPICH-G2 (2/4)
IS (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes
[Graph: relative performance (0–1.2) vs. one-way delay (0–12 msec) for IS(GridMPI) and IS(MPICH-G2)]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 11
GridMPI vs. MPICH-G2 (3/4)
LU (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes
[Graph: relative performance (0–1.2) vs. one-way delay (0–12 msec) for LU(GridMPI) and LU(MPICH-G2)]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 12
GridMPI vs. MPICH-G2 (4/4)
NAS Parallel Benchmarks 3.2, Class B, on 8 x 8 processes
[Graph: relative performance (0–1.2) vs. one-way delay (0–12 msec) for SP, BT, MG, and CG under GridMPI and MPICH-G2]
No parameters were tuned in GridMPI.
2006/1/23 Yutaka Ishikawa, The University of Tokyo 13
GridMPI on an Actual Network
• NAS Parallel Benchmarks run on an 8-node (2.4 GHz) cluster at Tsukuba plus an 8-node (2.8 GHz) cluster at Akihabara, i.e., 16 nodes in total.
• The sites are 60 km (40 mi.) apart, connected by the JGN2 network: 10 Gbps bandwidth, 1.5 msec RTT.
• The performance is compared with
  – the result on a homogeneous 16-node (2.4 GHz) cluster, and
  – the result on a homogeneous 16-node (2.8 GHz) cluster.
[Figure: Pentium 4 2.4 GHz x 8, connected by 1G Ethernet, at Tsukuba; Pentium 4 2.8 GHz x 8, connected by 1G Ethernet, at Akihabara; linked via JGN2]
[Graph: relative performance (0–1.2) of BT, CG, EP, FT, IS, LU, MG, and SP against the 2.4 GHz and 2.8 GHz baselines]
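As a sanity check on those link figures (my arithmetic, not from the talk): light in fiber travels at roughly $2 \times 10^8$ m/s, so the 120 km round trip accounts for about $\frac{1.2 \times 10^5\ \text{m}}{2 \times 10^8\ \text{m/s}} = 0.6$ ms of the 1.5 msec RTT; the remainder is presumably switching and routing overhead along the JGN2 path.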
2006/1/23 Yutaka Ishikawa, The University of Tokyo 14
Demonstration
• Easy installation
  – Download the source
  – Build it and set up the configuration files
• Easy use
  – Compile your MPI application
  – Run it! (see the sketch below)
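For concreteness, here is a minimal MPI program of the kind that runs unmodified under any MPI implementation, GridMPI included. The build and launch commands below it assume the conventional mpicc/mpirun front-ends rather than anything GridMPI-specific; consult the GridMPI documentation for its exact invocation.

```c
/* hello.c: a minimal MPI program; runs unmodified under any
 * MPI-1 implementation, GridMPI included. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```

With the usual front-ends this builds and launches as "mpicc hello.c -o hello" and "mpirun -np 16 ./hello"; mapping the 16 ranks onto the Tsukuba and Akihabara clusters is handled by the configuration files mentioned above.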
2006/1/23 Yutaka Ishikawa, The University of Tokyo 15
NAREGI Software Stack (Beta Ver. 2006)
[Figure: the NAREGI software stack. Grid-Enabled Nano-Applications on top; below them Grid PSE, Grid Programming (Grid RPC, GridMPI), Grid Visualization, and Grid Workflow; then the Super Scheduler, Grid VM, Distributed Information Service, and Data services; all over Globus, Condor, and UNICORE (OGSA/WSRF) and High-Performance & Secure Grid Networking]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 16
GridMPI Current Status
• GridMPI version 0.9 has been released
  – MPI-1.2 features are fully supported
  – MPI-2.0 features are supported, except for MPI-IO and the one-sided communication primitives
  – Conformance test results (fails/tests):
    • MPICH Test Suite: 0/142
    • Intel Test Suite: 0/493
• GridMPI version 1.0 will be released this spring
  – MPI-2.0 will be fully supported
http://www.gridmpi.org/
2006/1/23 Yutaka Ishikawa, The University of Tokyo 17
Concluding Remarks
• GridMPI is integrated into the NAREGI package.
• GridMPI is not only for production use but is also our research vehicle for Grid environments, in the sense that new Grid ideas are implemented and tested in it.
• We are currently studying high-performance communication mechanisms for long and fat networks:
  – Modification of TCP behavior
    • M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "TCP Adaptation for MPI on Long-and-Fat Networks," IEEE Cluster 2005, 2005.
  – Precise software pacing
    • R. Takano, T. Kudoh, Y. Kodama, M. Matsuda, H. Tezuka, and Y. Ishikawa, "Design and Evaluation of Precise Software Pacing Mechanisms for Fast Long-Distance Networks," PFLDnet 2005, 2005.
  – Collective communication algorithms that account for network latency and bandwidth
2006/1/23 Yutaka Ishikawa, The University of Tokyo 18
BACKUP
2006/1/23 Yutaka Ishikawa, The University of Tokyo 19
GridMPI Version 1.0
– YAMPII, developed at the University of Tokyo, is used as the core implementation
– Intra-cluster communication by YAMPII (TCP/IP, SCore)
– Inter-cluster communication by IMPI (TCP/IP)
[Figure: the GridMPI protocol stack. The MPI API sits on the Request Layer (Request Interface); collectives are handled by the LACT Layer with IMPI; point-to-point transports behind the P2P Interface include TCP/IP, PMv2, MX, O2G, Vendor MPI, and IMPI; process startup behind the RPIM Interface uses ssh, rsh, SCore, or Globus]
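The split between YAMPII for intra-cluster traffic and IMPI for inter-cluster traffic can be pictured with a schematic sketch. Everything here is hypothetical; rank_table, yampii_send, and impi_send are illustrative names, not GridMPI source:

```c
/* Hypothetical dispatch sketch, not GridMPI source: route traffic
 * to the local transport (YAMPII) inside a cluster and to the IMPI
 * wire protocol between clusters. */
typedef struct {
    int site_id;                 /* which cluster the rank lives in */
} rank_info_t;

extern rank_info_t rank_table[]; /* illustrative: filled at startup */
extern int local_site_id;        /* illustrative                    */

extern int yampii_send(int rank, const void *buf, int len); /* illustrative */
extern int impi_send(int rank, const void *buf, int len);   /* illustrative */

int p2p_send(int dest_rank, const void *buf, int len)
{
    if (rank_table[dest_rank].site_id == local_site_id)
        return yampii_send(dest_rank, buf, len); /* intra: TCP/IP, SCore */
    return impi_send(dest_rank, buf, len);       /* inter: IMPI over TCP/IP */
}
```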
2006/1/23 Yutaka Ishikawa, The University of Tokyo 20
GridMPI vs. Others (1/2)
NAS Parallel Benchmarks 3.2, Class B, on 8 x 8 processes
[Graph: relative performance (0–1.2) vs. one-way delay (0–12 msec) for FT, IS, LU, SP, BT, MG, and CG under GridMPI and MPICH-G2]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 21
GridMPI vs. Others (1/2)
NAS Parallel Benchmarks 3.2, Class B, on 8 x 8 processes
[Graph: relative performance (0.00–2.00) of BT, CG, LU, MG, and SP at 0 ms, 5 ms, and 10 ms one-way delay for GridMPI, GridMPI (with PSP), MPICH, LAM/MPI, YAMPII, MPICH2, and MPICH-G2]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 22
GridMPI vs. Others (2/2)
NAS Parallel Benchmarks 3.2, Class B, on 8 x 8 processes
[Graph: relative performance (0.00–2.00) of FT and IS at 0 ms, 5 ms, and 10 ms one-way delay for GridMPI, GridMPI (with PSP), MPICH, LAM/MPI, YAMPII, MPICH2, and MPICH-G2]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 23
GridMPI vs. Others
NAS Parallel Benchmarks 3.2 on 16 x 16 processes
[Graph: relative performance (0.00–2.50) of FT and IS at 0 ms, 2 ms, 5 ms, and 10 ms one-way delay for GridMPI, GridMPI (with PSP), MPICH, LAM/MPI, YAMPII, MPICH2, and MPICH-G2]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 24
GridMPI vs. Others
NAS Parallel Benchmarks 3.2
[Graph: relative performance (0.00–2.00) of BT, CG, LU, MG, and SP at 0 ms, 5 ms, and 10 ms one-way delay for GridMPI, GridMPI (with PSP), MPICH, LAM/MPI, YAMPII, MPICH2, and MPICH-G2]