TRANSCRIPT
1
MPI and MPICH on Clusters
Rusty Lusk
Mathematics and Computer Science Division
Argonne National Laboratory
2
Outline
MPI implementations on clusters
– Unix
– NT
Cluster-related activities at Argonne
– new cluster
– MPICH news and plans
– MPICH on NT
– Other stuff
  – scalable performance visualization
  – managing jobs and processes
  – parallel I/O
3
MPI Implementations
MPI’s design for portability + performance has inspired a wide variety of implementations
Vendor implementations for their own machines
Implementations from software vendors
Freely available public implementations for a number of environments
Experimental implementations to explore research ideas
4
MPI Implementations from Vendors
IBM for SP and RS/6000 workstations, OS/390
Sun for Solaris systems, clusters of them
SGI for SGI Origin, Power Challenge, Cray T3E and C90
HP for Exemplar, HP workstations
Compaq for (Digital) parallel machines
MPI Software Technology, for Microsoft Windows NT, Linux, Mac
Fujitsu for VPP (Pallas)
NEC
Hitachi
Genias (for NT)
5
From the Public for the Public
MPICH
– for many architectures, including clusters
– http://www.mcs.anl.gov/mpi/mpich
– new: http://www.mcs.anl.gov/mpi/mpich/mpich-nt
LAM
– for clusters
– http://lam.nd.edu
MPICH-based NT implementations
– Aachen
– Portugal
– Argonne
6
Experimental Implementations
Real-time (Hughes)
Special networks and protocols (not TCP)
– MPI-FM (U. of I., UCSD)
– MPI-BIP (Lyon)
– MPI-AM (Berkeley)
– MPI over VIA (Berkeley, Parma)
– MPI-MBCF (Tokyo)
– MPI over SCI (Oslo)
Wide-area networks
– Globus (MPICH-G, Argonne)
– Legion (U. of Virginia)
– MetaMPI (Germany)
Ames Lab (highly optimized subset)
More implementations at http://www.mpi.nd.edu/lam
7
Status of MPI-2 Implementations
Fujitsu: complete (from PALLAS)
NEC has I/O, complete early 2000, for SX
MPICH & LAM (C++ bindings and most of I/O)
LAM has parts of dynamic and one-sided
HP has part of one-sided
HP and SGI have most of I/O
Sun, Compaq are working on parts of MPI-2
IBM has I/O, soon will have one-sided
EPCC did one-sided for Cray T3E
Experimental implementations (esp. one-sided; see the sketch after this list)
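For readers less familiar with the terminology, "one-sided" refers to the MPI-2 remote memory access operations. The following minimal C sketch uses the standard MPI-2 calls (MPI_Win_create, MPI_Win_fence, MPI_Put); it is illustrative only and not drawn from any of the implementations above.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal MPI-2 one-sided sketch: rank 0 puts a value directly into a
       memory window exposed by rank 1, with no matching receive. */
    int main(int argc, char *argv[])
    {
        int rank, value = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each process exposes one int as remotely accessible memory. */
        MPI_Win_create(&value, (MPI_Aint) sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank == 0) {
            int data = 42;
            MPI_Put(&data, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);

        if (rank == 1)
            printf("rank 1 sees value = %d\n", value);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }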
8
Cluster Activities at Argonne
New cluster - Chiba City
MPICH
Other software
9
Chiba City
[System overview diagram]
8 Computing Towns: 256 dual Pentium III systems
1 Visualization Town: 32 Pentium III systems with Matrox G400 cards
1 Storage Town: 8 Xeon systems with 300G disk each
Cluster Management: 12 PIII Mayor systems, 4 PIII front-end systems, 2 Xeon file servers, 3.4 TB disk
High Performance Net: 64-bit Myrinet
Management Net: Gigabit and Fast Ethernet
Gigabit external link
10
Chiba City System Details
Purpose:
– Scalable CS research
– Prototype application support
System - 314 computers:
– 256 computing nodes: PIII 500 MHz, 512M, 9G local disk
– 32 visualization nodes: PIII 500 MHz, 512M, Matrox G200
– 8 storage nodes: 500 MHz Xeon, 512M, 300GB disk (2.4TB total)
– 10 town mayors, 1 city mayor, other management systems: PIII 500 MHz, 512M, 3TB disk
Communications:
– 64-bit Myrinet computing net
– Switched fast/gigabit ethernet management net
– Serial control network
Software Environment:
– Linux (based on RH 6.0), plus “install your own” OS support
– Compilers: GNU g++, PGI, etc.
– Libraries and Tools: PETSc, MPICH, Globus, ROMIO, SUMMA3d, Jumpshot, Visualization, PVFS, HPSS, ADSM, PBS + Maui Scheduler
11
Software Research on Clusters at ANL
Scalable Systems Management
– Chiba City Management Model (w/LANL, LBNL)
MPI and Communications Software
– GigaNet, Myrinet, ServerNet II
Data Management and Grid Services
– Globus Services on Linux (w/LBNL, ISI)
Visualization and Collaboration Tools
– Parallel OpenGL server (w/Princeton, UIUC)
– VTK and CAVE Software for Linux Clusters
– Scalable Media Server (FL Voyager Server on Linux Cluster)
Scalable Display Environment and Tools
– Virtual Frame Buffer Software (w/Princeton)
– VNC (ATT) modifications for ActiveMural
Parallel I/O
– MPI-IO and Parallel Filesystems Developments (w/Clemson, PVFS)
12
MPICH
Goals
Misconceptions about MPICH
MPICH architecture
– the Abstract Device Interface (ADI)
Current work at Argonne on MPICH
– Work above the ADI
– Work below the ADI
– A new ADI
13
Goals of MPICH
As a research project:
– to explore tradeoffs between performance and portability in the context of the MPI standard
– to study algorithms applicable to MPI implementation
– to investigate interfaces between MPI and tools
As a software project:
– to provide a portable, freely available MPI to everyone
– to give vendors and others a running start in the development of specialized MPI implementations
– to provide a testbed for other research groups working on particular aspects of message passing
14
Misconceptions About MPICH
It is pronounced (by its authors, at least) as “em-pee-eye-see-aitch”, not “em-pitch”.
It runs on networks of heterogeneous machines.
It runs MIMD parallel programs, not just SPMD.
It can use TCP, shared memory, or both at the same time (for networks of SMP’s).
It runs over native communication on machines like the IBM SP and Cray T3E (not just TCP).
It is not for Unix only (new NT version).
It doesn’t necessarily poll (depends on device).
15
MPICH Architecture
[Architecture diagram] MPI Routines call ADI Routines; the Abstract Device Interface separates code above the device from code below the device.
Above the device: MPI Routines and ADI Routines.
Below the device: the channel device (ch_p4 over sockets or sockets+shmem, ch_shmem, ch_NT over sockets with shmem planned) and other devices (ch_eui, t3e, Globus).
16
Recent Work Above the Device
Complete 1.2 compliance (in MPICH-1.2.0)
– including even MPI_Cancel for sends
Better MPI derived datatype packing (see the sketch after this list)
– can also be done below the device
MPI-2 C++ bindings
– thanks to Notre Dame group
MPI-2 Fortran-90 module
– permits use mpi instead of include ‘mpif.h’
– extends work of Michael Hennecke
MPI-2 I/O
– the ROMIO project
– layers MPI I/O on any MPI implementation, file system
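To make the derived-datatype item above concrete, the sketch below shows the user-level calls whose packing is being optimized: sending every other element of an array via MPI_Type_vector. This is a generic MPI example, not MPICH internals.

    #include <mpi.h>

    /* Send every other element of a 10-element array using a derived
       datatype; the MPI library packs the strided data for transmission. */
    int main(int argc, char *argv[])
    {
        int rank, i, data[10], recv[5];
        MPI_Datatype stride_type;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* 5 blocks of 1 int each, separated by a stride of 2 ints. */
        MPI_Type_vector(5, 1, 2, MPI_INT, &stride_type);
        MPI_Type_commit(&stride_type);

        if (rank == 0) {
            for (i = 0; i < 10; i++) data[i] = i;
            MPI_Send(data, 1, stride_type, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(recv, 5, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }

        MPI_Type_free(&stride_type);
        MPI_Finalize();
        return 0;
    }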
17
Above the Device (continued)
Error message architecture
– Instance-specific error reporting (see the sketch below)
  – “rank 789 invalid” rather than “Invalid rank”
– Internationalization
  – German
  – Thai
Thread safety
Globus/NGI/collective
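A small C sketch of how an application sees such instance-specific error text, using only the MPI-1 calls MPI_Errhandler_set and MPI_Error_string (newer MPI versions use MPI_Comm_set_errhandler instead); the exact wording of the message is up to the implementation.

    #include <mpi.h>
    #include <stdio.h>

    /* Ask MPI to return error codes instead of aborting, then translate a
       code into the implementation's message text. */
    int main(int argc, char *argv[])
    {
        char msg[MPI_MAX_ERROR_STRING];
        int err, len, buf = 0;

        MPI_Init(&argc, &argv);
        MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        /* Deliberately use an invalid destination rank to provoke an error. */
        err = MPI_Send(&buf, 1, MPI_INT, 789, 0, MPI_COMM_WORLD);
        if (err != MPI_SUCCESS) {
            MPI_Error_string(err, msg, &len);
            /* With instance-specific reporting, this can name the bad rank. */
            printf("MPI_Send failed: %s\n", msg);
        }

        MPI_Finalize();
        return 0;
    }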
18
Below the Device (continued)
Flow control for socket-based devices as defined by IMPI
Better multi-protocol for Linux SMP’s
– mmap solution portable among other Unixes, not usable on Linux
  – can’t use MAP_ANONYMOUS with MAP_SHARED!
– SYSV solution works but can cause problems (race condition in deallocation); see the sketch after this list
– works better than before in MPICH-1.2.0
New NT implementation of the channel device
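For reference, the "SYSV solution" above uses System V shared memory. The sketch below shows the basic calls involved (shmget, shmat, shmdt, shmctl); it is illustrative only, not MPICH's actual device code.

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    /* Create, attach, use, and remove a System V shared-memory segment;
       the removal step is where the deallocation race mentioned above lives. */
    int main(void)
    {
        int shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
        char *buf;

        if (shmid < 0) { perror("shmget"); return 1; }

        buf = (char *) shmat(shmid, NULL, 0);
        if (buf == (char *) -1) { perror("shmat"); return 1; }

        strcpy(buf, "hello from shared memory");
        printf("%s\n", buf);

        shmdt(buf);
        shmctl(shmid, IPC_RMID, NULL);   /* segment must be removed explicitly */
        return 0;
    }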
19
MPICH for NT
open source: build with MS Visual C++ 6.0 and Digital Visual Fortran 6.0, or download dll
complete MPI 1.2, shares above-device code with Unix MPICH
correct implementation, passes test suites from ANL, IBM, Intel
not yet fully optimized
only socket device in current release
DCOM-based job launcher
working on other devices, ADI-3 implementation
http://www.mcs.anl.gov/~ashton/mpichbeta.html
20
Preliminary Experiments with Shared Memory on NT
[Chart: smp ping. Throughput vs. message size (1 to 100,000,000 bytes, log scale) for four approaches: shmem, shmem stream 20k, shprocess, and shprocess fixed.]
21
ADI-3: A New Abstract Device
Motivated by:
The fact that requirements of MPI-2 are inconsistent with ADI-2 design
– thread safety
– dynamic process management
– one-sided operations
New capabilities of user-accessible hardware
– LAPI
– VIA/SIO (NGIO, FIO)
– other network interfaces (Myrinet)
Desire for peak efficiency
– top-to-bottom overhaul of MPICH
– ADI-1: speed of implementation; ADI-2: portability
22
Runtime Environment Research
Fast startup of MPICH jobs via the mpd
Experiment with process manager, job manager, scheduler interface for parallel jobs
[Diagram: mpirun, scheduler, job, and process interactions]
23
Scalable Logfiles: SLOG
From IBM via AIX tracing
Using MPI profiling mechanism (see the sketch below)
Both automatic and user-defined states
Can support large logfiles, yet find and display sections quickly
Freely-available API for reading/writing SLOG files
Current format read by Jumpshot
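The "MPI profiling mechanism" is the standard PMPI interface: any MPI routine can be intercepted by a wrapper that forwards to its PMPI_ counterpart. Below is a minimal, hypothetical logging wrapper in C (MPI-1/2 signature); the real SLOG instrumentation writes logfile records rather than printing.

    #include <mpi.h>
    #include <stdio.h>

    /* User code still calls MPI_Send; this definition intercepts the call,
       times it, and forwards to PMPI_Send.  A tracing library links in
       wrappers like this for every routine it wants to log. */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        int err = PMPI_Send(buf, count, datatype, dest, tag, comm);
        double t1 = MPI_Wtime();

        /* Stand-in for writing a logfile event record. */
        fprintf(stderr, "MPI_Send to %d: %.6f s\n", dest, t1 - t0);
        return err;
    }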
24
Jumpshot
25
Parallel I/O for Clusters
ROMIO is an implementation of (almost all of) the I/O part of the MPI standard.
It can utilize multiple file systems and MPI implementations.
Included in MPICH, LAM, and MPI from SGI, HP, and NEC.
A combination for Clusters: Linux, MPICH, and PVFS (Parallel Virtual File System from Clemson).
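As a concrete example of the MPI-IO calls that ROMIO implements, the C sketch below has each process write its rank into a shared file at its own offset; the file name is made up for the example.

    #include <mpi.h>

    /* Each process writes one int into a shared file at a per-rank offset;
       ROMIO maps these calls onto the underlying file system (e.g. PVFS). */
    int main(int argc, char *argv[])
    {
        int rank;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "testfile",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Independent write at a rank-dependent byte offset. */
        MPI_File_write_at(fh, (MPI_Offset) (rank * sizeof(int)),
                          &rank, 1, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }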
26
Conclusion
There are many MPI implementations for clusters; MPICH is one.
MPI implementation, particularly for fast networks, remains an active research area.
Argonne National Laboratory, all of whose software is open source, has a number of ongoing and new cluster-related activities
– New cluster
– MPICH
– Tools
27
Available in November