U.C. Berkeley and LBNL
• Global address space languages support
• Global pointers and distributed arrays
• User controls layout of data across nodes
• Direct read and write to remote memory
• Single Program Multiple Data (SPMD) control
• Similar to using threads, but with remote accesses
• Global synchronization, barriers
• Languages: UPC, Co-Array Fortran, Titanium
• GASNet - A common communication system tailored for global address space languages
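One common way to give the user control over data layout is a block-cyclic distribution, which is how UPC lays out shared arrays. The sketch below is a small Python model, not code from any of the listed languages or from GASNet; `GlobalPtr`, `locate`, and `layout` are hypothetical names chosen for illustration. It shows how a global index can be resolved to an owning node and a local offset, which is the information a global pointer has to carry.

```python
from typing import List, NamedTuple


class GlobalPtr(NamedTuple):
    """Hypothetical global pointer: which node owns an element, and where."""
    node: int    # rank of the owning node
    offset: int  # element index within that node's local storage


def locate(i: int, block: int, nodes: int) -> GlobalPtr:
    """Map global index i of a block-cyclic array to its owner and local offset."""
    owner = (i // block) % nodes        # blocks are dealt out round-robin
    local_block = i // (block * nodes)  # full rounds completed before this block
    return GlobalPtr(owner, local_block * block + i % block)


def layout(n: int, block: int, nodes: int) -> List[List[int]]:
    """Return, for each node, the global indices it stores, in local order."""
    per_node: List[List[int]] = [[] for _ in range(nodes)]
    for i in range(n):
        per_node[locate(i, block, nodes).node].append(i)
    return per_node
```

With 8 elements, a block size of 2, and 3 nodes, `layout(8, 2, 3)` places elements 0, 1, 6, 7 on node 0, elements 2, 3 on node 1, and elements 4, 5 on node 2. A read of element 7 from any node then becomes a remote access to offset 3 on node 0.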
Distributed Data Structures
Latency Performance
GASNet Goals
• Language-independence: Compatibility with several global address space languages and compilers
• UPC, Titanium, Co-Array Fortran, possibly others
• Hide language- or compiler-specific details, such as shared-pointer representation
• Hardware-independence: variety of parallel architectures & OSes
• SMP: Linux/UNIX SMPs, Origin 2000, etc.
• Clusters of uniprocessors or SMPs: IBM SP, Compaq AlphaServer, Linux/UNIX clusters, etc.
• Support many high-performance networks: InfiniBand, Myrinet/GM, Quadrics/elan, IBM/LAPI, Dolphin, MPI
• Ease of implementation on new hardware
• Allow quick prototype implementations
• Implementations can leverage performance features of hardware
• Provide both portability & performance
GASNet Core API
Global Address Space Languages
GASNet Extended API
Bandwidth Performance
Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Kathy Yelick
Layers (top to bottom): Compiler-generated code, Compiler-specific runtime system, GASNet Extended API, GASNet Core API, Network Hardware
• Wider interface that includes more complicated operations
• We provide a reference implementation of the extended API in terms of the core API
• Implementors can choose to directly implement any subset for performance - leverage hardware support for higher-level operations
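The relationship between a reference implementation and a selectively optimized one can be pictured as ordinary method overriding. The following is a toy Python model under that assumption, not GASNet's C interface; `ToyCore`, `ReferenceExtended`, and `RdmaExtended` are hypothetical names. The reference class expresses put and get purely through a core-layer message call, and the hypothetical conduit replaces only put with a direct memory write while inheriting get unchanged.

```python
class ToyCore:
    """Stand-in for a core layer: delivers a message by calling a handler."""

    def __init__(self, nodes, words_per_node):
        self.memory = [[0] * words_per_node for _ in range(nodes)]
        self.messages = 0  # number of core-layer messages sent so far

    def am_request(self, node, handler, *args):
        # Run the handler against the target node's memory and count the message.
        self.messages += 1
        return handler(self.memory[node], *args)


class ReferenceExtended:
    """Every operation is expressed through the core layer only."""

    def __init__(self, core):
        self.core = core

    def put(self, node, addr, values):
        def store(mem, addr, values):  # runs "on" the target node
            mem[addr:addr + len(values)] = values
        self.core.am_request(node, store, addr, list(values))

    def get(self, node, addr, count):
        def load(mem, addr, count):    # runs "on" the target node
            return mem[addr:addr + count]
        return self.core.am_request(node, load, addr, count)


class RdmaExtended(ReferenceExtended):
    """Hypothetical conduit that overrides put only; get is inherited."""

    def put(self, node, addr, values):
        # Models a hardware RDMA write: no handler runs on the target node.
        self.core.memory[node][addr:addr + len(values)] = list(values)
```

In this model a put through `ReferenceExtended` increments `core.messages`, while a put through `RdmaExtended` does not. That is the kind of saving a direct implementation of one operation is meant to capture, while the remaining operations keep working through the fallback.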
• Most basic required network primitives
• Implemented directly on each platform
• Minimal set of network functions needed to support a working implementation
• General enough to implement everything else
• Based heavily on active messages paradigm
• Provides powerful extensibility mechanism
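The active messages idea behind these bullets is that a message names a handler to run on the destination, and the handler may send a reply. The sketch below models that in Python with in-process queues; it is an illustration under simplifying assumptions, not the GASNet Core API. `Node`, `make_network`, the handler indices `PUT_REQ` and `PUT_ACK`, and the `token` argument are all hypothetical.

```python
from collections import deque


class Node:
    """One endpoint with a handler table and an inbox of pending messages."""

    def __init__(self, rank, network):
        self.rank = rank
        self.network = network   # shared list of all nodes
        self.handlers = {}       # handler index -> function
        self.inbox = deque()     # messages waiting for poll()
        self.memory = {}         # this node's data
        self.completed = set()   # tokens of acknowledged operations

    def register(self, index, fn):
        self.handlers[index] = fn

    def request(self, dest, index, *args):
        """Send a request: name a handler on the destination, plus arguments."""
        self.network[dest].inbox.append((self.rank, index, args))

    def reply(self, src, index, *args):
        """Called from inside a request handler to answer the requester."""
        self.network[src].inbox.append((self.rank, index, args))

    def poll(self):
        """Run the handler for every message that has arrived so far."""
        while self.inbox:
            src, index, args = self.inbox.popleft()
            self.handlers[index](self, src, *args)


PUT_REQ, PUT_ACK = 0, 1


def put_request(node, src, key, value, token):
    node.memory[key] = value         # perform the remote write
    node.reply(src, PUT_ACK, token)  # tell the requester it is done


def put_ack(node, src, token):
    node.completed.add(token)        # mark the operation complete


def make_network(size):
    """Create nodes that all register the same two handlers."""
    network = []
    for rank in range(size):
        node = Node(rank, network)
        node.register(PUT_REQ, put_request)
        node.register(PUT_ACK, put_ack)
        network.append(node)
    return network
```

A remote write here is `net[0].request(1, PUT_REQ, "x", 42, 7)` followed by `net[1].poll()` and `net[0].poll()`, after which token 7 appears in `net[0].completed`. Registering further handler pairs is how other operations could be layered on the same small mechanism.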
http://upc.nersc.gov [email protected]@lbl.gov
[Figure: Bandwidth of GASNet vapi-conduit and MVAPICH 0.9.1 MPI. X-axis: Message Size (bytes), 1 to 1MB. Y-axis: Bandwidth (MB/sec), 0 to 800. Series: MVAPICH MPI; gasnet_put_nb_bulk (source pre-pinned); gasnet_put_nb_bulk (source not pinned).]
• Using Mellanox VAPI interface
• Vendor implementation of the InfiniBand Verbs
• Useful minor extensions beyond Verbs
• GASNet Core API: Active Messages
• Based on Send/Recv operations
• Simple flow control
• Uses an additional thread for improved responsiveness
• GASNet Extended API: RDMA
• Thin layer over InfiniBand RDMA (puts and gets)
• Simple record attached to each CQE for completion
• No dynamic memory registration yet (bounce-buffers used for out-of-segment references)
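The bounce-buffer approach in the last bullet can be sketched as follows. This is a toy Python model, not VAPI or GASNet code: `ToyHca`, `put`, and the use of `id()` to track registered buffers are assumptions made for illustration. Data that already lives in registered memory is sent directly, and anything else is copied in chunks through a small pre-registered staging buffer.

```python
class ToyHca:
    """Stand-in for an adapter that can only send from registered memory."""

    def __init__(self):
        self.registered = set()  # ids of pinned (registered) buffers
        self.sent = []           # (buffer id, byte count) per RDMA write

    def register(self, buf):
        self.registered.add(id(buf))

    def rdma_write(self, buf, length, remote, remote_off):
        if id(buf) not in self.registered:
            raise RuntimeError("source buffer is not registered")
        remote[remote_off:remote_off + length] = buf[:length]
        self.sent.append((id(buf), length))


def put(hca, src, remote, remote_off, bounce):
    """Write src into remote, staging through bounce if src is not pinned."""
    if id(src) in hca.registered:
        hca.rdma_write(src, len(src), remote, remote_off)  # direct path
        return
    for start in range(0, len(src), len(bounce)):
        chunk = src[start:start + len(bounce)]
        bounce[:len(chunk)] = chunk                        # extra local copy
        hca.rdma_write(bounce, len(chunk), remote, remote_off + start)
```

In this model a 10-byte source outside the registered segment, sent through a 4-byte bounce buffer, produces three writes of 4, 4, and 2 bytes, each preceded by a local copy. A source inside the segment goes out as a single write with no copy.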
Future Work
Implementing GASNet on InfiniBand
[Figure: Roundtrip Latency of GASNet vapi-conduit and MVAPICH 0.9.1 MPI. X-axis: Message Size (bytes), 1 to 2KB. Y-axis: Roundtrip Latency (us), 0 to 30. Series: MPI_Send/MPI_Recv ping-pong; gasnet_put_nb + sync.]
System Description
• 4-node cluster
• Dual-CPU 2.8 GHz Pentium 4 Xeon, 4 GB memory
• Red Hat 8.0 w/ Linux 2.4.18-14
• Mellanox InfiniHost (Cougar) IB-4X HCAs
– Firmware 3.0, SDK 3.0.1
• DivergeNet 8-port IB-4X switch
Firehose Algorithm for distributed management of DMA registration
VAPI conduit:
• Dynamic memory registration (work in progress)
• Extension of “firehose” algorithm for region-oriented registration
• C. Bell and D. Bonachea. “A New DMA Registration Strategy for Pinning-Based High Performance Networks,” CAC 2003
• RDMA-based barrier
• General performance tuning
General GASNet:
• Collective Communication
• Point-to-point Scatter/Gather and Strided put/get operations
• New GASNet conduit implementations:
• Florida State University: SCI/Dolphin
• UC Berkeley: shmem (Cray, SGI and Quadrics), and Cray X-1
[Figure: GASNet Put/Get Roundtrip Latency (min over msg sz). Y-axis: Roundtrip Latency (microseconds), 0 to 65. Series: put_nb, get_nb. Conduits shown: mpi, elan, gm, lapi, lapi-poll, vapi. Systems: Quadrics falcon (Alpha), Quadrics colt (Alpha), Quadrics opus (IA64), Quadrics lemieux (Alpha), Myrinet alvarez (x86), Colony/GX seaborg (PowerPC), InfiniBand pcp (x86 PCI-X), Myrinet citris (IA64), Myrinet pcp (x86 PCI-X).]
[Figure: GASNet Put/Get Bulk Flood Bandwidth (max over msg sz). Y-axis: Bandwidth (MB/sec), 0 to 750. Series: put_nb_bulk, get_nb_bulk. Conduits shown: mpi, elan, gm, lapi, lapi-poll, vapi. Systems: Quadrics falcon (Alpha), Quadrics colt (Alpha), Quadrics opus (IA64), Quadrics lemieux (Alpha), Myrinet alvarez (x86), Colony/GX seaborg (PowerPC), InfiniBand pcp (x86 PCI-X), Myrinet citris (IA64), Myrinet pcp (x86 PCI-X).]
[Figure: Myrinet-IA64 Roundtrip ping-pong latency. X-axis: Message Size (bytes), 1 to 10000. Y-axis: Roundtrip latency (us), 0 to 35. Series: put, get, put_nbi, get_nbi, put_nb, get_nb.]
[Figure: Quadrics-AlphaServer Flood Bandwidth (bulk). X-axis: Message Size (bytes), 0 to 70000. Y-axis: Bandwidth (MB/sec), 0 to 300. Series: put_bulk, get_bulk, put_nbi_bulk, get_nbi_bulk, put_nb_bulk, get_nb_bulk.]
[Figure: Quadrics-AlphaServer Roundtrip ping-pong latency. X-axis: Message Size (bytes), 1 to 10000. Y-axis: Roundtrip latency (us), 0 to 20. Series: put, get, put_nbi, get_nbi, put_nb, get_nb.]
[Figure: Myrinet-IA64 Flood Bandwidth (bulk). X-axis: Message Size (bytes), 0 to 70000. Y-axis: Bandwidth (MB/sec), 0 to 250. Series: put_bulk, get_bulk, put_nbi_bulk, get_nbi_bulk, put_nb_bulk, get_nb_bulk.]