CS160 – Lecture 3
Clusters. Introduction to PVM and MPI
Introduction to PC Clusters
• What are PC Clusters?
• How are they put together?
• Examining the lowest level messaging pipeline
• Relative application performance
• Starting with PVM and MPI
Clusters, Beowulfs, and more
• How do you put a “Pile-of-PCs” into a room and make them do real work?
– Interconnection technologies
– Programming them
– Monitoring
– Starting and running applications
– Running at scale
Beowulf Cluster
• Current working definition: a collection of commodity PCs running an open-source operating system with a commodity interconnection network
– Dual Intel PIIIs with fast ethernet, Linux
– Single Alpha PCs running Linux
• Program with PVM, MPI, …
Beowulf Clusters cont’d
• Interconnection network is usually fast ethernet running TCP/IP
– (Relatively) slow network
– Programming model is message passing
• Most people now associate the name “Beowulf” with any cluster of PCs
– Beowulfs are differentiated from high-performance clusters by the network
• www.beowulf.org has lots of information
High-Performance Clusters
• Killer micros: low-cost gigaflop processors are here for a few kilo-$$ per processor
• Killer networks: gigabit network hardware, high-performance software (e.g. Fast Messages), soon at 100’s of $$$ per connection
• Leverage HW, commodity SW (*nix/Windows NT), build key technologies => high-performance computing in a RICH software environment
• Gigabit networks: Myrinet, SCI, FC-AL, Giganet, GigE, ATM
Cluster Research Groups
• Many other cluster groups have had impact
– Active Messages / Network of Workstations (NOW), UC Berkeley
– Basic Interface for Parallelism (BIP) Univ. of Lyon
– Fast Messages(FM)/High Performance Virtual Machines(HPVM) (UIUC/UCSD)
– Real World Computing Partnership (Japan)
– (SHRIMP) Scalable High-performance Really Inexpensive Multi-Processor (Princeton)
Clusters are Different
• A pile of PCs is not a large-scale SMP server
– Why? Performance and programming model
• A cluster’s closest cousin is an MPP
– What’s the major difference? Clusters run N copies of the OS, MPPs usually run one.
Ideal Model: HPVM’s
• HPVM = High Performance Virtual Machine
• Provides a simple uniform programming model, abstracts and encapsulates underlying resource complexity
• Simplifies use of complex resources
[Diagram: application program running on top of a uniform “Virtual Machine Interface” that hides the actual system configuration]
Virtualization of Machines
• Want the illusion that a collection of machines is a single machine
– Start, stop, monitor distributed programs
– Programming and debugging should work seamlessly
– PVM (Parallel Virtual Machine) was the first widely-adopted virtualization for parallel computing
• This illusion is only partially complete in any software system. Some issues:
– Node heterogeneity
– Real network topology can lead to contention
• Unrelated – What is a Java Virtual Machine?
High-Performance Communication
• Level of network interface support + NIC/network router latency
– Overhead and latency of communication => deliverable bandwidth
• High-performance communication => programmability!
– Low-latency, low-overhead, high-bandwidth cluster communication
– … much more is needed …
• Usability issues, I/O, reliability, availability
• Remote process debugging/monitoring
[Diagram: switched 100 Mbit networks with OS-mediated access vs. switched multi-gigabit networks with user-level access]
Putting a cluster together
• (16, 32, 64, … X) individual nodes
– E.g. dual-processor Pentium III/733, 1 GB memory, ethernet
• Scalable high-speed network
– Myrinet, Giganet, ServerNet, Gigabit Ethernet
• Message-passing libraries
– TCP, MPI, PVM, VIA
• Multiprocessor job launch
– Portable Batch System
– Load Sharing Facility
– PVM spawn, mpirun, rsh
• Techniques for system management
– VA Linux Cluster Manager (VACM)
– High Performance Technologies Inc (HPTI)
Communication style is message passing
[Diagram: a message packetized into fragments 4 3 2 1, travelling from machine A to machine B]
• How do we efficiently get a message from Machine A to Machine B?
• How do we efficiently break a large message into packets and reassemble at receiver?
• How does the receiver differentiate among message fragments (packets) from different senders? (a sketch follows below)
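A minimal sketch (not from the course materials) of how a messaging layer might tag and fragment a message so the receiver can reassemble it and tell senders apart; the header layout and MTU_PAYLOAD size are illustrative assumptions:

/* Fragment a large message into packets tagged so the receiver can
 * reassemble them and demultiplex among senders. */
#include <stdint.h>
#include <string.h>
#include <stddef.h>

#define MTU_PAYLOAD 1024            /* assumed max payload per packet */

typedef struct {
    uint32_t sender_id;             /* lets the receiver demultiplex   */
    uint32_t message_id;            /* groups fragments of one message */
    uint16_t seq;                   /* fragment index: 0, 1, 2, ...    */
    uint16_t total;                 /* total fragments in the message  */
    uint16_t length;                /* payload bytes in this packet    */
} pkt_header_t;

typedef struct {
    pkt_header_t hdr;
    uint8_t payload[MTU_PAYLOAD];
} packet_t;

/* Break `msg` into packets and hand each one to `send_packet`
 * (a placeholder for whatever the NIC/driver layer provides). */
void send_message(uint32_t sender, uint32_t msg_id,
                  const uint8_t *msg, size_t len,
                  void (*send_packet)(const packet_t *))
{
    uint16_t total = (uint16_t)((len + MTU_PAYLOAD - 1) / MTU_PAYLOAD);
    for (uint16_t seq = 0; seq < total; seq++) {
        packet_t p;
        size_t off = (size_t)seq * MTU_PAYLOAD;
        size_t chunk = (len - off < MTU_PAYLOAD) ? (len - off) : MTU_PAYLOAD;
        p.hdr = (pkt_header_t){ sender, msg_id, seq, total, (uint16_t)chunk };
        memcpy(p.payload, msg + off, chunk);
        send_packet(&p);            /* receiver uses (sender_id, message_id, seq)
                                       to place each fragment in the right buffer */
    }
}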
Will use the details of FM to illustrate some communication engineering
FM on Commodity PC’s
• Host Library: API presentation, flow control, segmentation/reassembly, multithreading
• Device driver: protection, memory mapping, scheduling monitors
• NIC firmware: link management, incoming buffer management, routing, multiplexing/demultiplexing
[Diagram: FM host library on the Pentium II/III host (~450 MIPS), FM device driver, and FM NIC firmware on the NIC (~33 MIPS, 1280 Mbps link), connected across the P6 bus and PCI]
Fast Messages 2.x Performance
• Latency 8.8 µs, bandwidth 100+ MB/s, n1/2 ~250 bytes
• Fast in absolute terms (compares to MPPs, internal memory BW)
• Delivers a large fraction of hardware performance for short messages
• Technology transferred into emerging cluster standards: Intel/Compaq/Microsoft’s Virtual Interface Architecture
[Plot: FM 2.x bandwidth (MB/s) vs. message size (4 to 65,536 bytes), marking n1/2 and the 100+ MB/s peak]
Comments about Performance
• Latency and bandwidth are the most basic measurements of message-passing machines
– Will discuss performance models in detail, because latency and bandwidth do not tell the entire story (a first-order model is sketched below)
• High-performance clusters exhibit
– 10X the deliverable bandwidth of ethernet
– 20X – 30X improvement in latency
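One common way to make this concrete is the standard first-order latency/bandwidth model; the formulas below are textbook material rather than something taken from the slides, with alpha the per-message latency/overhead and beta the asymptotic bandwidth:

% First-order cost model for an n-byte message:
%   \alpha = per-message latency/overhead, \beta = asymptotic bandwidth
T(n) \approx \alpha + \frac{n}{\beta}, \qquad
BW(n) = \frac{n}{T(n)} = \frac{n\,\beta}{\alpha\,\beta + n}, \qquad
n_{1/2} = \alpha\,\beta
% n_{1/2} is the message size at which delivered bandwidth reaches half the peak.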
How does FM really get speed?
• Protected user-level access to network (OS-bypass)
• Efficient credit-based flow control (sketched below)
– assumes reliable hardware network [only OK for System Area Networks]
– No buffer overruns (stalls sender if no receive space)
• Early de-multiplexing of incoming packets
– multithreading, use of NT user-schedulable threads
• Careful implementation with many tuning cycles
– Overlapping DMAs (recv), programmed I/O send
– No interrupts! Polling only.
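A rough C sketch of the credit-based flow control idea described above (not FM's actual code; INITIAL_CREDITS, nic_transmit, and poll_network are assumed placeholders): the sender spends one credit per packet and polls until the receiver returns credits, so the receive region can never be overrun.

/* Credit-based flow control: each packet consumes one receive slot ("credit")
 * on the peer; the sender stalls (by polling, no interrupts) at zero credits. */
#include <stdint.h>

#define INITIAL_CREDITS 64          /* assumed size of the peer's receive region */

static volatile int credits = INITIAL_CREDITS;

/* Called from the receive path when the peer reports it has drained k slots. */
void credit_return(int k) { credits += k; }

/* Hypothetical low-level primitives provided by the NIC layer. */
extern void nic_transmit(const void *pkt, int len);
extern void poll_network(void);     /* processes incoming packets and credit returns */

void send_with_flow_control(const void *pkt, int len)
{
    while (credits == 0)
        poll_network();             /* stall the sender: polling only, no interrupts */
    credits--;                      /* one receive slot is now spoken for */
    nic_transmit(pkt, len);         /* safe: the peer is guaranteed to have space,
                                       so no buffer overruns and no retransmission */
}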
OS-Bypass Background
• Suppose you want to perform a sendto on a standard IP socket
– The operating system mediates access to the network device
• Must trap into the kernel to ensure authorization on each and every message (very time consuming)
• Message is copied from the user program to kernel packet buffers
• Protocol information about each packet is generated by the OS and attached to a packet buffer
• Message is finally sent out onto the physical device (ethernet)
• Receiving does the inverse with a recvfrom
– Packet to kernel buffer, OS strips the header, reassembles the data, mediates authorization, and copies into the user program (a minimal socket sketch follows below)
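For contrast, here is the conventional OS-mediated path using standard POSIX UDP sockets; every sendto/recvfrom traps into the kernel, which copies the data and handles the protocol headers (the address and port are illustrative):

/* Conventional, kernel-mediated messaging with a standard UDP socket. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);             /* trap: create socket */

    struct sockaddr_in peer = {0};
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5000);                      /* illustrative port    */
    inet_pton(AF_INET, "192.168.1.2", &peer.sin_addr);  /* illustrative address */

    const char msg[] = "hello";
    /* Trap into the kernel: it checks permissions, copies msg into a kernel
       buffer, attaches UDP/IP headers, and queues the packet on the device. */
    sendto(s, msg, sizeof msg, 0, (struct sockaddr *)&peer, sizeof peer);

    char buf[1500];
    struct sockaddr_in from;
    socklen_t fl = sizeof from;
    /* Trap again: the kernel strips headers, reassembles, authorizes, and
       copies the payload back out into the user buffer. */
    ssize_t n = recvfrom(s, buf, sizeof buf, 0, (struct sockaddr *)&from, &fl);
    printf("received %zd bytes\n", n);

    close(s);
    return 0;
}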
OS-Bypass
• A user program is given a protected slice of the network interface
– Authorization is done once (not per message)
• Outgoing packets get directly copied or DMAed to the network interface
– Protocol headers added by user-level library
• Incoming packets get routed by the network interface card (NIC) into user-defined receive buffers
– NIC must know how to differentiate incoming packets; this is called early demultiplexing
• Outgoing and incoming message copies are eliminated
• Traps to the OS kernel are eliminated (a hypothetical user-level send sketch follows below)
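A purely hypothetical sketch of what a user-level (OS-bypass) send can look like; none of these names are a real NIC API. After a one-time driver setup maps the NIC's send queue, doorbell register, and a pinned buffer region into the process, each send is just user-space stores, with no system call and no kernel copy:

/* Hypothetical OS-bypass send path: everything below happens in user space. */
#include <stdint.h>
#include <string.h>

typedef struct {                    /* assumed layout of a send descriptor */
    uint32_t dest_node;
    uint32_t handler_id;            /* receiver-side demultiplexing key    */
    uint32_t length;
    uint32_t buffer_offset;         /* payload location in the pinned region */
} send_desc_t;

/* Established once by the device driver (the only kernel involvement):
 * memory-mapped NIC queue and doorbell, plus a pinned DMA-able buffer. */
extern volatile send_desc_t *nic_send_queue;   /* memory-mapped on the NIC   */
extern volatile uint32_t    *nic_doorbell;     /* memory-mapped NIC register */
extern uint8_t              *pinned_buffers;   /* pinned user buffer region  */
extern uint32_t              next_slot;        /* next free queue slot       */

void user_level_send(uint32_t dest, uint32_t handler,
                     const void *payload, uint32_t len)
{
    uint32_t off = next_slot * 2048;                   /* assumed slot size  */
    memcpy(pinned_buffers + off, payload, len);        /* or build in place  */

    nic_send_queue[next_slot] = (send_desc_t){ dest, handler, len, off };
    *nic_doorbell = next_slot;      /* a single store tells the NIC to go:
                                       it DMAs the data and adds the headers */
    next_slot = (next_slot + 1) % 256;                 /* assumed queue depth */
}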
Packet Pathway
[Diagram: packet pathway — user message buffer, programmed I/O into the sending NIC, DMA to/from the network, DMA from the receiving NIC into a pinned receive region, then user-level handlers 1 and 2 deliver data into the user message buffer]
• Concurrency of I/O busses
• Sender specifies receiver handler ID
• Flow control keeps DMA region from being overflowed
Fast Messages 1.x – An example message passing API and library
• API: Berkeley Active Messages
– Key distinctions: guarantees (reliable, in-order, flow control), network-processor decoupling (DMA region)
• Focus on short-packet performance:
– Programmed I/O (PIO) instead of DMA
– Simple buffering and flow control
– Map I/O device to user space (OS bypass)
Sender:   FM_send(NodeID, Handler, Buffer, size);   // handlers are remote procedures
Receiver: FM_extract()
What is an active message?
• Usually, message passing has a send with a corresponding explicit receive at the destination.
• Active messages specify a function to invoke (activate) when the message arrives
– Function is usually called a message handler
The handler gets called when the message arrives, not by the destination doing an explicit receive.
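A sketch of active-message style usage built around the FM 1.x calls quoted above; only FM_send and FM_extract come from the slide, and the handler signature is an assumption for illustration:

/* Active-message sketch: the handler runs at the receiver on arrival. */
#include "fm.h"                      /* assumed FM 1.x header */

/* The "active" part: this function is invoked when the message arrives;
 * there is no explicit matching receive call. */
void sum_handler(void *buf, int size)
{
    int *values = (int *)buf;
    int n = size / (int)sizeof(int), total = 0;
    for (int i = 0; i < n; i++)
        total += values[i];
    /* ... use `total`, e.g. accumulate into application state ... */
}

void sender_side(int dest_node)
{
    int data[4] = {1, 2, 3, 4};
    /* Ship the buffer and name the remote procedure to invoke on arrival. */
    FM_send(dest_node, sum_handler, data, sizeof data);
}

void receiver_side(void)
{
    /* The receiver only lends cycles: FM_extract drains the network and
       calls sum_handler for each message that has arrived. */
    for (;;)
        FM_extract();
}

The point of the sketch is that the destination never posts an explicit receive; it simply gives FM cycles via FM_extract, and the named handler runs when data shows up.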
FM 1.x Performance (6/95)
• Latency 14 µs, peak BW 21.4 MB/s [Pakin, Lauria et al., Supercomputing ’95]
• Hardware limits PIO performance, but n1/2 = 54 bytes
• Delivers 17.5 MB/s @ 128-byte messages (140 Mbps, greater than deliverable OC-3 ATM)
[Plot: bandwidth (MB/s) vs. message size (16–2048 bytes) for FM, compared against 1 Gb Ethernet]
The FM Layering Efficiency Issue
• How good is the FM 1.1 API?
• Test: build a user-level library on top of it and measure the available performance
– MPI chosen as representative user-level library
– Porting of MPICH 1.0 (ANL/MSU) to FM
• Purpose: to study what services are important in layering communication libraries
– Integration issues: what kinds of inefficiencies arise at the interface, and what is needed to reduce them [Lauria & Chien, JPDC 1997]
MPI on FM 1.x - Inefficient Layering of Protocols
• First implementation of MPI on FM was ready in Fall 1995
• Disappointing performance, only a fraction of FM bandwidth available to MPI applications
[Plot: bandwidth (MB/s) vs. message size (16–2048 bytes) for FM and MPI-FM]
MPI-FM Efficiency
• Result: FM is fast, but its interface is not efficient
[Plot: MPI-FM efficiency (%) vs. message size (16–2048 bytes)]
MPI-FM Layering Inefficiencies
[Diagram: MPI-level header and source buffer layered onto FM messages, then copied into the header and destination buffer at the receiver]
• Too many copies due to header attachment/removal, lack of coordination between transport and application layers
Redesign API - FM 2.x
• Sending (usage sketch follows below)
– FM_begin_message(NodeID, Handler, size)
– FM_send_piece(stream, buffer, size) // gather
– FM_end_message()
• Receiving
– FM_receive(buffer, size) // scatter
– FM_extract(total_bytes) // rcvr flow control
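A sketch of how an MPI-like layer might use the gather/scatter calls listed above to move a header and a payload without intermediate copies; the stream type, handler signature, and header struct are assumptions for illustration, and only the five FM_* calls come from the slide:

/* Gather on send, scatter on receive, no staging copies. */
#include "fm.h"                          /* assumed FM 2.x header */

typedef struct { int tag; int count; } mpi_like_header_t;
static char user_recv_buf[65536];        /* where the application wants the data */

void send_with_header(int dest_node, int handler_id,
                      mpi_like_header_t *hdr, void *payload, int payload_len)
{
    /* Gather: header and payload live in different buffers but are streamed
       out as one message, with no intermediate copy. */
    FM_stream *s = FM_begin_message(dest_node, handler_id,
                                    (int)sizeof *hdr + payload_len);
    FM_send_piece(s, hdr, sizeof *hdr);
    FM_send_piece(s, payload, payload_len);
    FM_end_message(s);
}

/* Receive handler: scatter the incoming stream straight to its destinations,
   first the small header, then the bulk data into the user's buffer. */
void recv_handler(int total_bytes)
{
    mpi_like_header_t hdr;
    FM_receive(&hdr, sizeof hdr);
    FM_receive(user_recv_buf, total_bytes - (int)sizeof hdr);
}

void progress_loop(void)
{
    for (;;)
        FM_extract(sizeof user_recv_buf);   /* receiver-side flow control:
                                               bytes we are willing to accept */
}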
MPI-FM 2.x Improved Layering
• Gather-scatter interface + handler multithreading enables efficient layering, data manipulation without copies
[Diagram: with gather-scatter, the MPI-level header and source buffer stream through FM directly into the header and destination buffer, with no intermediate copies]
MPI on FM 2.x
• MPI-FM: 91 MB/s, 13 µs latency, ~4 µs overhead
– Short messages much better than IBM SP2, PCI limited
– Latency ~ SGI O2K
[Plot: bandwidth (MB/s) vs. message size (4 to 65,536 bytes) for FM and MPI-FM]
MPI-FM 2.x Efficiency
• High transfer efficiency, approaches 100% [Lauria, Pakin et al., HPDC-7 ’98]
• Other systems much lower even at 1 KB (100 Mbit: 40%, 1 Gbit: 5%)
[Plot: MPI-FM efficiency (%) vs. message size (4 to 65,536 bytes), approaching 100%]
HPVM III (“NT Supercluster”)
• 256 x Pentium II, April 1998, 77 Gflops
– 3-level fat tree (large switches), scalable bandwidth, modular extensibility
• => 512 x Pentium III (550 MHz), early 2000, 280 GFlops
– Both with the National Center for Supercomputing Applications
Supercomputer Performance Characteristics
• Compute/communicate and compute/latency ratios (a worked example follows the table)
• Clusters can provide programmable characteristics at a dramatically lower system cost
System                   Mflops/Proc   Flops/Byte   Flops/Network RT
Cray T3E                 1200          ~2           ~2,500
SGI Origin2000           500           ~0.5         ~1,000
HPVM NT Supercluster     300           ~3.2         ~6,000
Berkeley NOW II          100           ~3.2         ~2,000
IBM SP2                  550           ~3.7         ~38,000
Beowulf (100 Mbit)       300           ~25          ~200,000
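A rough sanity check of the Flops/Byte column, assuming it is simply the per-processor flop rate divided by the per-processor deliverable network bandwidth (the 12.5 MB/s figure is just 100 Mbit/s expressed in bytes):

% Assumed definition, applied to the 100 Mbit Beowulf row of the table:
\text{Flops/Byte} \approx \frac{\text{Mflops per processor}}{\text{deliverable MB/s per processor}}, \qquad
\frac{300\ \text{Mflops}}{12.5\ \text{MB/s}} = 24 \approx 25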
[Plot: delivered Gigaflops vs. number of processors for Origin-DSM, Origin-MPI, NT-MPI, SP2-MPI, T3E-MPI, and SPP2000-DSM]
Solving 2D Navier-Stokes Kernel - Performance of Scalable Systems
Preconditioned Conjugate Gradient Method With Multi-level Additive Schwarz Richardson Pre-conditioner (2D 1024x1024)
Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)
Is the detail important? Is there something easier?
• Detail of a particular high-performance interface illustrates some of the complexity for these systems
– Performance and scaling are very important
– Sometimes the underlying structure needs to be understood to reason about applications
• Class will focus on distributed computing algorithms and interfaces at a higher level (message passing)
How do we program/run such machines?
• PVM (Parallel Virtual Machine) provides
– Simple message passing API
– Construction of a virtual machine with a software console
– Ability to spawn (start), kill (stop), and monitor jobs
• XPVM is a graphical console and performance monitor (a minimal PVM sketch follows below)
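A minimal PVM sketch of the spawn/send/receive model described above; the worker executable name ("worker") and the message tags are assumptions, while the pvm_* calls themselves are standard PVM 3 routines:

/* Master task: spawn workers, send each an integer, collect the replies.
 * The worker program (not shown) would pvm_recv the number, compute,
 * and pvm_send its result back with TAG_DONE. */
#include <stdio.h>
#include "pvm3.h"

#define NWORKERS 4
#define TAG_WORK 1
#define TAG_DONE 2

int main(void)
{
    int tids[NWORKERS];
    int n = pvm_spawn("worker", NULL, PvmTaskDefault, "", NWORKERS, tids);

    for (int i = 0; i < n; i++) {            /* hand each worker a number    */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&i, 1, 1);
        pvm_send(tids[i], TAG_WORK);
    }
    for (int i = 0; i < n; i++) {            /* collect one reply per worker */
        int result;
        pvm_recv(-1, TAG_DONE);              /* -1: accept from any task     */
        pvm_upkint(&result, 1, 1);
        printf("got %d\n", result);
    }
    pvm_exit();                              /* leave the virtual machine    */
    return 0;
}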
• MPI (Message Passing Interface)
– Complex and complete message passing API
– De facto, community-defined standard
– No defined method for job management
• mpirun is provided as a tool with the MPICH distribution (a minimal MPI sketch follows below)
– Commercial and non-commercial tools for monitoring and debugging
• Jumpshot, VaMPIr, …
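A minimal MPI sketch using only standard MPI-1 calls; under MPICH it would be launched with something like "mpirun -np 4 a.out":

/* Each rank learns its identity; rank 1 sends an integer to rank 0. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* who am I?            */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many ranks total */

    if (rank == 1) {
        int payload = 42;
        MPI_Send(&payload, 1, MPI_INT, 0, /*tag=*/7, MPI_COMM_WORLD);
    } else if (rank == 0 && size > 1) {
        int payload;
        MPI_Status st;
        MPI_Recv(&payload, 1, MPI_INT, 1, 7, MPI_COMM_WORLD, &st);
        printf("rank 0 received %d from rank 1 (of %d ranks)\n", payload, size);
    }

    MPI_Finalize();
    return 0;
}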
Next Time …
• Parallel programming paradigms
– Shared memory
– Message passing