2006/1/23 Yutaka Ishikawa, The University of Tokyo 1
An Introduction to GridMPI
Yutaka Ishikawa (1,2) and Motohiko Matsuda (2)
(1) University of Tokyo
(2) Grid Technology Research Center, AIST (National Institute of Advanced Industrial Science and Technology)
This work is partially supported by the NAREGI project.
http://www.gridmpi.org/
2006/1/23 Yutaka Ishikawa, The University of Tokyo 2
Motivation
• MPI (Message Passing Interface) has been widely used to program parallel applications.
• Users want to run such applications over a Grid environment without any modification of the program.
• However, the performance of existing MPI implementations does not scale up in a Grid environment.
[Figure: a single (monolithic) MPI application running over the Grid environment, spanning computing resource sites A and B connected by a wide-area network]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 3
Motivation
• Focus on a metropolitan-area, high-bandwidth environment: 10 Gbps, 500 miles (less than 10 ms one-way latency).
  – We have already demonstrated, using an emulated WAN environment, that the performance of the NAS Parallel Benchmark programs scales up if the one-way latency is smaller than 10 ms.
Motohiko Matsuda, Yutaka Ishikawa, and Tomohiro Kudoh, "Evaluation of MPI Implementations on Grid-connected Clusters using an Emulated WAN Environment," CCGRID 2003.
2006/1/23 Yutaka Ishikawa, The University of Tokyo 4
Issues
• High-performance communication facilities for MPI on long and fat networks (see the comparison and sketch below)
  – TCP vs. MPI communication patterns
  – Network topology
    • Latency and bandwidth
• Interoperability
  – There are many MPI library implementations, and most of them use their own network protocols.
• Fault tolerance and migration
  – To survive a site failure
• Security
TCP: designed for streams.
MPI: burst traffic; the application repeats computation and communication phases, and the traffic changes with the communication pattern.
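To make the mismatch concrete, here is a minimal sketch (not from the talk) of the compute/communicate cycle typical of MPI codes: the network sits idle during computation, then every rank transmits at once, producing exactly the burst traffic that TCP's stream-oriented design does not expect.

```c
/* Minimal sketch of the phase-structured traffic pattern of MPI codes. */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double *local = malloc(N * sizeof(double));
    double *sum   = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++)
        local[i] = (double)i;

    for (int iter = 0; iter < 100; iter++) {
        /* Computation phase: the network is idle. */
        for (int i = 0; i < N; i++)
            local[i] = local[i] * 0.5 + iter;

        /* Communication phase: all ranks transmit at once,
         * dumping a burst of traffic onto the network. */
        MPI_Allreduce(local, sum, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    free(local);
    free(sum);
    MPI_Finalize();
    return 0;
}
```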
2006/1/23 Yutaka Ishikawa, The University of Tokyo 5
[Figure: the same issue list as the previous slide, now with four computing sites connected via the Internet, each using a different vendor's MPI library (Vendors A, B, C, and D), illustrating the interoperability problem]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 6
GridMPI Features
• MPI-2 implementation
• IMPI (Interoperable MPI) protocol, with extensions for the Grid
  – MPI-2
  – New collective protocols
  – Checkpointing
• Integration of vendor MPIs
  – IBM, Solaris, Fujitsu, and MPICH2
• High-performance TCP/IP implementation on long and fat networks
  – Paces the transmission rate so that burst transmission is controlled according to the MPI communication pattern (see the sketch after this slide)
• Checkpointing
[Figure: Cluster X running a vendor MPI and Cluster Y running YAMPII, interconnected via the IMPI protocol]
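The pacing idea can be pictured with a minimal sketch, assuming a plain TCP socket: instead of handing the kernel one large burst, the sender emits fixed-size chunks and sleeps out each chunk's time budget, so the average rate stays at or below a target. This is only an illustration; GridMPI's actual mechanism (the precise software pacing cited in the concluding remarks) controls inter-packet gaps far more accurately than coarse sleeps can.

```c
/* Illustrative software-pacing sketch (not GridMPI's actual code):
 * send len bytes over fd at no more than rate_bps bits per second
 * by sleeping after each fixed-size chunk. */
#include <time.h>
#include <sys/types.h>
#include <sys/socket.h>

#define CHUNK 8192  /* bytes per send() call */

void paced_send(int fd, const char *buf, size_t len, double rate_bps)
{
    /* Time budget for one full chunk at the target rate, in ns. */
    long gap_ns = (long)((double)CHUNK * 8.0 / rate_bps * 1e9);
    struct timespec gap = { gap_ns / 1000000000L, gap_ns % 1000000000L };

    while (len > 0) {
        size_t n = len < CHUNK ? len : CHUNK;
        ssize_t sent = send(fd, buf, n, 0);
        if (sent <= 0)
            return;            /* error handling elided */
        buf += sent;
        len -= (size_t)sent;
        nanosleep(&gap, NULL); /* cap the average transmission rate */
    }
}
```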
2006/1/23 Yutaka Ishikawa, The University of Tokyo 7
Evaluation
• It is almost impossible to reproduce the communication performance behavior of a real wide-area network from run to run.
• A WAN emulator, GtrcNET-1, is therefore used to examine implementations, protocols, communication algorithms, etc., under controlled, repeatable conditions.

GtrcNET-1 (developed at AIST, http://www.gtrc.aist.go.jp/gnet/):
• injection of delay, jitter, errors, …
• traffic monitoring and frame capture
• four 1000Base-SX ports, one USB port for the host PC, FPGA (XC2V6000)
2006/1/23 Yutaka Ishikawa, The University of Tokyo 8
Experimental Environment
8 PCs per cluster:
• CPU: Pentium 4 / 2.4 GHz, Memory: DDR400 512 MB
• NIC: Intel PRO/1000 (82547EI)
• OS: Linux 2.6.9-1.6 (Fedora Core 2)
• Socket buffer size: 20 MB (see the note after this slide)
[Figure: two 8-node clusters (Node0–Node7 and Node8–Node15), each behind a Catalyst 3750 switch, connected through the GtrcNET-1 WAN emulator; bandwidth 1 Gbps, delay 0 ms – 10 ms]
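A note on the 20 MB socket buffer (my arithmetic, not from the talk): at 1 Gbps with up to 10 ms one-way delay (20 ms RTT), the bandwidth-delay product is about 2.5 MB, so 20 MB leaves ample headroom for TCP's window. A minimal sketch of how such a buffer would be requested on a socket (the talk does not say how GridMPI sets it):

```c
/* Sketch: request 20 MB socket buffers so TCP can fill a long, fat
 * pipe. On Linux the kernel caps these via net.core.rmem_max and
 * net.core.wmem_max, which must also be raised for this to take
 * effect. */
#include <sys/socket.h>

int set_socket_buffers(int fd)
{
    int size = 20 * 1024 * 1024;  /* 20 MB, as in the experiments */

    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
        return -1;
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
}
```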
2006/1/23 Yutaka Ishikawa, The University of Tokyo 9
GridMPI vs. MPICH-G2 (1/4)
FT (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes
[Graph: relative performance (y-axis, 0–1.2) vs. one-way delay (x-axis, 0–12 msec) for FT(GridMPI) and FT(MPICH-G2)]
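The deck never defines "relative performance" on these graphs; presumably (my reading, not stated in the talk) each point is the benchmark's performance at a given one-way delay normalized to a zero-delay baseline:

$$\mathrm{RelPerf}(d) = \frac{\text{performance at one-way delay } d}{\text{performance at } d = 0}$$

so a curve that stays near 1.0 out to 10 ms means the implementation tolerates the added latency.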
2006/1/23 Yutaka Ishikawa, The University of Tokyo 10
GridMPI vs. MPICH-G2 (2/4)
IS (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes
[Graph: relative performance (0–1.2) vs. one-way delay (0–12 msec) for IS(GridMPI) and IS(MPICH-G2)]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 11
GridMPI vs. MPICH-G2 (3/4)
LU (Class B) of NAS Parallel Benchmarks 3.2 on 8 x 8 processes
[Graph: relative performance (0–1.2) vs. one-way delay (0–12 msec) for LU(GridMPI) and LU(MPICH-G2)]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 12
GridMPI vs. MPICH-G2 (4/4)
NAS Parallel Benchmarks 3.2, Class B, on 8 x 8 processes
[Graph: relative performance (0–1.2) vs. one-way delay (0–12 msec) for SP, BT, MG, and CG under GridMPI and MPICH-G2]
No parameters were tuned in GridMPI.
2006/1/23 Yutaka Ishikawa, The University of Tokyo 13
GridMPI on an Actual Network
• NAS Parallel Benchmarks run on an 8-node (2.4 GHz) cluster at Tsukuba plus an 8-node (2.8 GHz) cluster at Akihabara, i.e., 16 nodes in total.
• The sites are 60 km (40 mi.) apart, connected by the JGN2 network: 10 Gbps bandwidth, 1.5 msec RTT.
• The performance is compared with
  – the result on a homogeneous 16-node (2.4 GHz) cluster, and
  – the result on a homogeneous 16-node (2.8 GHz) cluster.
[Figure: Pentium 4 2.4 GHz x 8, connected by 1G Ethernet, at Tsukuba; Pentium 4 2.8 GHz x 8, connected by 1G Ethernet, at Akihabara; linked via JGN2]
[Graph: relative performance (0–1.2) of BT, CG, EP, FT, IS, LU, MG, and SP against the 2.4 GHz and 2.8 GHz baselines]
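As a sanity check on those link figures (my arithmetic, not from the talk): light in fiber travels at roughly $2 \times 10^8$ m/s, so the 120 km round trip accounts for about $\frac{1.2 \times 10^5\ \text{m}}{2 \times 10^8\ \text{m/s}} = 0.6$ ms of the 1.5 msec RTT; the remainder is presumably switching and routing overhead along the JGN2 path.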
2006/1/23 Yutaka Ishikawa, The University of Tokyo 14
Demonstration
• Easy installation
  – Download the source
  – Build it and set up the configuration files
• Easy use
  – Compile your MPI application
  – Run it! (see the sketch below)
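For concreteness, here is a minimal MPI program of the kind that runs unmodified under any MPI implementation, GridMPI included. The build and launch commands below it assume the conventional mpicc/mpirun front-ends rather than anything GridMPI-specific; consult the GridMPI documentation for its exact invocation.

```c
/* hello.c: a minimal MPI program; runs unmodified under any
 * MPI-1 implementation, GridMPI included. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```

With the usual front-ends this builds and launches as "mpicc hello.c -o hello" and "mpirun -np 16 ./hello"; mapping the 16 ranks onto the Tsukuba and Akihabara clusters is handled by the configuration files mentioned above.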
2006/1/23 Yutaka Ishikawa, The University of Tokyo 15
NAREGI Software Stack (Beta Ver. 2006)
[Figure: the NAREGI software stack. Grid-Enabled Nano-Applications on top; below them Grid PSE, Grid Programming (Grid RPC, GridMPI), Grid Visualization, and Grid Workflow; then the Super Scheduler, Grid VM, Distributed Information Service, and Data services; all over Globus, Condor, and UNICORE (OGSA/WSRF) and High-Performance & Secure Grid Networking]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 16
GridMPI Current Status
• GridMPI version 0.9 has been released
  – MPI-1.2 features are fully supported
  – MPI-2.0 features are supported, except for MPI-IO and the one-sided communication primitives
  – Conformance test results (fails/tests):
    • MPICH Test Suite: 0/142
    • Intel Test Suite: 0/493
• GridMPI version 1.0 will be released this spring
  – MPI-2.0 will be fully supported
http://www.gridmpi.org/
2006/1/23 Yutaka Ishikawa, The University of Tokyo 17
Concluding Remarks
• GridMPI is integrated into the NAREGI package.
• GridMPI is not only for production use but is also our research vehicle for Grid environments, in the sense that new Grid ideas are implemented and tested in it.
• We are currently studying high-performance communication mechanisms for long and fat networks:
  – Modification of TCP behavior
    • M. Matsuda, T. Kudoh, Y. Kodama, R. Takano, and Y. Ishikawa, "TCP Adaptation for MPI on Long-and-Fat Networks," IEEE Cluster 2005, 2005.
  – Precise software pacing
    • R. Takano, T. Kudoh, Y. Kodama, M. Matsuda, H. Tezuka, and Y. Ishikawa, "Design and Evaluation of Precise Software Pacing Mechanisms for Fast Long-Distance Networks," PFLDnet 2005, 2005.
  – Collective communication algorithms that account for network latency and bandwidth
2006/1/23 Yutaka Ishikawa, The University of Tokyo 18
BACKUP
2006/1/23 Yutaka Ishikawa, The University of Tokyo 19
GridMPI Version 1.0
– YAMPII, developed at the University of Tokyo, is used as the core implementation
– Intra-cluster communication by YAMPII (TCP/IP, SCore)
– Inter-cluster communication by IMPI (TCP/IP)
[Figure: the GridMPI protocol stack. The MPI API sits on the Request Layer (Request Interface); collectives are handled by the LACT Layer with IMPI; point-to-point transports behind the P2P Interface include TCP/IP, PMv2, MX, O2G, Vendor MPI, and IMPI; process startup behind the RPIM Interface uses ssh, rsh, SCore, or Globus]
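The split between YAMPII for intra-cluster traffic and IMPI for inter-cluster traffic can be pictured with a schematic sketch. Everything here is hypothetical; rank_table, yampii_send, and impi_send are illustrative names, not GridMPI source:

```c
/* Hypothetical dispatch sketch, not GridMPI source: route traffic
 * to the local transport (YAMPII) inside a cluster and to the IMPI
 * wire protocol between clusters. */
typedef struct {
    int site_id;                 /* which cluster the rank lives in */
} rank_info_t;

extern rank_info_t rank_table[]; /* illustrative: filled at startup */
extern int local_site_id;        /* illustrative                    */

extern int yampii_send(int rank, const void *buf, int len); /* illustrative */
extern int impi_send(int rank, const void *buf, int len);   /* illustrative */

int p2p_send(int dest_rank, const void *buf, int len)
{
    if (rank_table[dest_rank].site_id == local_site_id)
        return yampii_send(dest_rank, buf, len); /* intra: TCP/IP, SCore */
    return impi_send(dest_rank, buf, len);       /* inter: IMPI over TCP/IP */
}
```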
2006/1/23 Yutaka Ishikawa, The University of Tokyo 20
GridMPI vs. Others (1/2)
NAS Parallel Benchmarks 3.2, Class B, on 8 x 8 processes
[Graph: relative performance (0–1.2) vs. one-way delay (0–12 msec) for FT, IS, LU, SP, BT, MG, and CG under GridMPI and MPICH-G2]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 21
GridMPI vs. Others (1/2)
NAS Parallel Benchmarks 3.2, Class B, on 8 x 8 processes
[Graph: relative performance (0.00–2.00) of BT, CG, LU, MG, and SP at 0 ms, 5 ms, and 10 ms one-way delay for GridMPI, GridMPI (with PSP), MPICH, LAM/MPI, YAMPII, MPICH2, and MPICH-G2]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 22
GridMPI vs. Others (2/2)
NAS Parallel Benchmarks 3.2, Class B, on 8 x 8 processes
[Graph: relative performance (0.00–2.00) of FT and IS at 0 ms, 5 ms, and 10 ms one-way delay for GridMPI, GridMPI (with PSP), MPICH, LAM/MPI, YAMPII, MPICH2, and MPICH-G2]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 23
GridMPI vs. Others
NAS Parallel Benchmarks 3.2 on 16 x 16 processes
[Graph: relative performance (0.00–2.50) of FT and IS at 0 ms, 2 ms, 5 ms, and 10 ms one-way delay for GridMPI, GridMPI (with PSP), MPICH, LAM/MPI, YAMPII, MPICH2, and MPICH-G2]
2006/1/23 Yutaka Ishikawa, The University of Tokyo 24
GridMPI vs. Others
NAS Parallel Benchmarks 3.2
[Graph: relative performance (0.00–2.00) of BT, CG, LU, MG, and SP at 0 ms, 5 ms, and 10 ms one-way delay for GridMPI, GridMPI (with PSP), MPICH, LAM/MPI, YAMPII, MPICH2, and MPICH-G2]