TRANSCRIPT
Acceleration of an Asynchronous Message Driven Programming Paradigm on IBM Blue Gene/Q

Sameer Kumar, IBM T. J. Watson Research Center, Yorktown Heights, NY
Yanhua Sun and Laxmikant Kale, Department of Computer Science, University of Illinois at Urbana-Champaign

IPDPS, Boston, May 22, 2013
Overview
• Charm++ programming model
• Blue Gene/Q machine
  – Programming models and messaging libraries on Blue Gene/Q
• Optimization of Charm++ on BG/Q
• Performance results
• Summary
Charm++ Programming Model
• Asynchronous message-driven programming (a minimal sketch follows this list)
  – Users decompose the problem (over-decomposition)
  – Intelligent runtime: task-to-processor mapping, communication load balancing, fault tolerance
  – Overlap of computation and communication via asynchronous communication
  – Execution driven by available message data
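A minimal sketch of a message-driven Charm++ object, assuming a hypothetical module named worker with a matching worker.ci interface file; the class, entry-method, and proxy names are illustrative and not taken from the paper.

// worker.C -- a hypothetical Charm++ module. The matching worker.ci file would declare:
//   module worker { chare Worker { entry Worker(); entry void recvBlock(int n, double data[n]); }; };
// Each entry-method invocation travels as a message; the scheduler runs the method
// only once its data has arrived, so callers never block on it.
#include "worker.decl.h"   // generated by the Charm++ interface translator from worker.ci

class Worker : public CBase_Worker {
public:
  Worker() {}

  // Entry method: execution is driven by the arrival of a recvBlock message.
  // Another chare would invoke it as workerProxy.recvBlock(n, data); that call
  // returns immediately, letting the sender overlap its own computation.
  void recvBlock(int n, double *data) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += data[i];
    CkPrintf("block of %d values consumed, partial sum %f\n", n, sum);
  }
};

#include "worker.def.h"    // generated definitions for this module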
Charm++ Runtime System
• Non-SMP mode
  – One process per hardware thread
  – Each process has a separate Charm++ scheduler
• SMP mode
  – Single or a few processes per network node
  – Multiple threads executing Charm++ schedulers in the same address space
  – Lower space overheads, as read-only data structures are not replicated
  – Communication threads can drive network progress
  – Communication within the node via pointer exchange
Blue Gene/Q
Packaging hierarchy:
1. BG/Q chip: 17 PowerPC cores
2. Single-chip module
3. Compute card (node): chip module, 16 GB DDR3 memory
4. Node board: 32 compute nodes, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards, 8 PCIe Gen2 x8 slots
6. Rack: 2 midplanes, 1, 2 or 4 I/O drawers
7. System: up to 96 racks or more, 20+ petaflops
Blue Gene/Q Architecture
• Integrated scalable 5D torus
  – Virtual cut-through routing
  – Hardware assists for collective and barrier functions
  – FP addition support in the network
  – RDMA
• Integrated on-chip Message Unit
  – 272 concurrent endpoints
  – 2 GB/s raw bandwidth on all 10 links, each direction (i.e. 4 GB/s bidirectional)
  – 1.8 GB/s user bandwidth after protocol overhead
  – 5D nearest-neighbor exchange measured at 1.76 GB/s per link (98% efficiency)
• Processor architecture
  – Implements the 64-bit Power ISA v2.06
  – 1.6 GHz at 0.8 V
  – 4-way simultaneous multithreading
  – Quad FPU
  – 2-way concurrent issue
  – In-order execution with dynamic branch prediction
• Node architecture
  – Large multi-core SMPs with 64 threads/node
  – Relatively small amount of memory per thread: 16 GB per node shared by 64 threads
New Hardware Features
• Scalable L2 atomics (a portable sketch follows this list)
  – Atomic operations can be invoked on 64-bit words in DDR
  – Several operations supported, including load-increment, store-add, store-XOR, ...
  – Bounded atomics supported
• Wait on pin
  – A thread can arm a wakeup unit and go to wait
  – Core resources such as load/store pipeline slots and arithmetic units are not used
  – The thread is awakened by:
    • a network packet
    • a store to a memory location that results in an L2 invalidate
    • an inter-processor interrupt (IPI)
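The bounded load-increment is the primitive behind the lockless queues described later. As a portable illustration only (BG/Q performs this in the L2 cache hardware, directly on 64-bit words in DDR), here is a C++ sketch of its semantics using std::atomic; the sentinel convention is an assumption.

#include <atomic>
#include <cstdint>

constexpr uint64_t kBoundReached = ~0ULL;    // sentinel when the bound is hit (assumption)

// Atomically fetch the current value and increment it, unless the value has
// already reached 'bound'; in that case return the sentinel and leave the word
// unchanged, so the caller can fall back (e.g. to an overflow queue).
uint64_t bounded_load_increment(std::atomic<uint64_t>& word, uint64_t bound) {
  uint64_t cur = word.load(std::memory_order_relaxed);
  while (cur < bound) {
    if (word.compare_exchange_weak(cur, cur + 1, std::memory_order_acq_rel))
      return cur;                            // success: old value returned, word incremented
  }
  return kBoundReached;                      // bound reached: no modification performed
}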
PAMI Messaging Library on BG/Q

[Software-stack diagram] Applications and middleware (MPICH2 2.x with its PAMI ADI, IBM MPI 2.x with MPCI, CHARM++, GASNet, GA/ARMCI, and the APGAS runtimes for X10, UPC, and CAF) run over the common PAMI API; beneath it, platform-specific messaging implementations target BG/Q (MU SPI), PERCS (HAL API), and Intel x86.

PAMI: Parallel Active Messaging Interface
Point-to-point Operations
• Active messages (see the sketch below)
  – A registered handler is called on the remote node
  – PAMI_Send_immediate for short transfers
  – PAMI_Send for longer transfers
• One-sided remote DMA
  – PAMI_Get, PAMI_Put: the application initiates RDMA with a remote virtual address
  – PAMI_Rget, PAMI_Rput: the application first exchanges memory regions before starting the RDMA transfer
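A minimal sketch of a short active-message send through PAMI_Send_immediate. It assumes a dispatch id already registered on every task with PAMI_Dispatch_set and an endpoint obtained from PAMI_Endpoint_create; the field names follow pami.h as commonly used, but should be verified against the installed headers.

#include <pami.h>
#include <string.h>

// Send a small payload as an active message; the handler registered for
// 'dispatch' runs on the destination when the packet arrives.
void send_short(pami_context_t context, pami_endpoint_t dest,
                size_t dispatch, void *payload, size_t bytes)
{
  unsigned app_header = 0;                    // hypothetical application header
  pami_send_immediate_t parm;
  memset(&parm, 0, sizeof(parm));             // leave hints at their defaults
  parm.dispatch        = dispatch;            // handler id registered on all tasks
  parm.dest            = dest;                // target network endpoint
  parm.header.iov_base = &app_header;
  parm.header.iov_len  = sizeof(app_header);
  parm.data.iov_base   = payload;
  parm.data.iov_len    = bytes;               // must fit the immediate-send limit
  PAMI_Send_immediate(context, &parm);        // payload is consumed before returning
}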
Multi-threading in PAMI
• Multi-context communication
  – Enables several threads in a multi-core architecture concurrent access to the network
  – Eliminates contention for shared resources
  – Enables parallel send and receive operations on different contexts via different BG/Q injection and reception FIFOs
• Endpoint addressing scheme
  – Communication is between network endpoints, not processes, threads, or tasks
• Multiple contexts progressed by multiple communication threads
• Communication threads on BG/Q wait on pin
  – L2 writes or network packets can awaken communication threads with very low overhead
• Post work to PAMI contexts via PAMI_Context_post (sketched below)
  – Work is posted to a concurrent L2 atomic queue
  – Work functions are advanced by the main or communication threads
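A minimal sketch of posting work to a context, assuming the standard PAMI_Context_post signature (context, work storage, work function, cookie); everything outside the PAMI_* names is hypothetical.

#include <pami.h>

typedef struct {
  pami_work_t work;      /* storage for the posted work descriptor */
  void       *msg;       /* message the work function will send */
} post_req_t;

/* Runs in whichever thread advances 'context' (e.g. a communication thread),
 * with exclusive access to that context. */
static pami_result_t send_work_fn(pami_context_t context, void *cookie)
{
  post_req_t *req = (post_req_t *)cookie;
  /* ... issue PAMI_Send / PAMI_Send_immediate on 'context' for req->msg ... */
  return PAMI_SUCCESS;   /* PAMI_SUCCESS: the work item is done and is dequeued */
}

void post_send(pami_context_t context, post_req_t *req)
{
  /* Enqueues onto the context's concurrent (L2-atomic) work queue and returns;
   * the calling worker thread does not need to hold the context lock. */
  PAMI_Context_post(context, &req->work, send_work_fn, req);
}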
Charm++ Port over PAMI on BG/Q
Charm++ Port and Optimizations
• Ported the Converse machine interface to make PAMI API calls
• Explored various optimizations
  – Lockless queues
  – Scalable memory allocation
  – Concurrent communication
    • Allocate multiple PAMI contexts
    • Multiple communication threads driving multiple PAMI contexts
  – Optimize short messages
    • Many-to-many interface
Lockless Queues
• Concurrent producer-consumer, array-based queues built on L2 atomic increments
• An overflow queue is used when the L2 queue is full
• Threads in the same process can send messages via concurrent enqueues (see the sketch below)
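A portable C++ sketch of such a queue, with std::atomic operations standing in for BG/Q's L2 atomics and a mutex-protected list standing in for the real overflow queue; capacity handling is simplified and strict FIFO order across the overflow path is not preserved.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>

template <typename T, std::size_t N>     // N: ring capacity (illustrative)
class LocklessQueue {
  std::atomic<T*>       slots_[N];       // message pointers; nullptr = not yet written
  std::atomic<uint64_t> head_{0};        // next slot to consume; advanced only by the owner
  std::atomic<uint64_t> tail_{0};        // producers reserve slots here
  std::deque<T*>        overflow_;       // spill area when the ring is full
  std::mutex            overflow_lock_;

public:
  LocklessQueue() { for (auto& s : slots_) s.store(nullptr, std::memory_order_relaxed); }

  // Any thread in the process may enqueue concurrently.
  void enqueue(T* msg) {
    uint64_t tail = tail_.load(std::memory_order_relaxed);
    for (;;) {
      if (tail - head_.load(std::memory_order_acquire) >= N) {   // ring full
        std::lock_guard<std::mutex> g(overflow_lock_);
        overflow_.push_back(msg);
        return;
      }
      if (tail_.compare_exchange_weak(tail, tail + 1, std::memory_order_acq_rel))
        break;                                                   // slot 'tail' reserved
    }
    slots_[tail % N].store(msg, std::memory_order_release);
  }

  // Only the owning scheduler thread dequeues; returns nullptr if nothing is ready.
  T* dequeue() {
    uint64_t head = head_.load(std::memory_order_relaxed);
    if (head != tail_.load(std::memory_order_acquire)) {
      T* msg = slots_[head % N].exchange(nullptr, std::memory_order_acquire);
      if (msg) { head_.store(head + 1, std::memory_order_release); return msg; }
      return nullptr;          // slot reserved but the producer has not written it yet
    }
    std::lock_guard<std::mutex> g(overflow_lock_);
    if (overflow_.empty()) return nullptr;
    T* msg = overflow_.front();
    overflow_.pop_front();
    return msg;
  }
};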
Scalable Memory Allocation
• System software on BG/Q uses the glibc shared-arena allocator
  – Malloc
    • Find an available arena and lock it
    • Allocate and return the memory buffer
    • Release the lock
  – Free
    • Find the arena the buffer was allocated from
    • Lock the arena, free the buffer in that arena, and unlock
  – Free results in thread contention
  – Can slow down the short malloc/free calls typically used in Charm++ applications such as NAMD
Scalable Memory Allocation (2)
• Optimize via memory pools of short buffers (see the sketch below)
• L2 atomic queues for fast concurrent access by threads
• Allocate
  – Dequeue from the Charm++ thread's local memory pool if a buffer is available
  – If the pool is empty, allocate via glibc malloc
• Deallocate
  – Enqueue to the owner thread's pool via a lockless enqueue
  – Release via glibc free if the owner thread's pool is full
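A self-contained C++ sketch of the per-thread pool. A lock-free multi-producer stack built on std::atomic stands in for the L2 atomic queue, and the buffer size and pool cap are illustrative assumptions.

#include <atomic>
#include <cstdlib>

struct PoolBlock {                  // header prepended to every pooled buffer
  PoolBlock*   next;
  struct Pool* owner;               // pool the buffer should be returned to
};

struct Pool {
  std::atomic<PoolBlock*> free_list{nullptr};  // multi-producer push, owner-only pop
  std::atomic<int>        count{0};
  static constexpr int    kMaxPooled = 1024;   // cap per pool (assumption)
  static constexpr size_t kShortSize = 1024;   // pooled buffer payload size (assumption)
};

thread_local Pool my_pool;          // one pool per Charm++ worker thread (sketch)

void* pool_malloc() {
  PoolBlock* head = my_pool.free_list.load(std::memory_order_acquire);
  // Owner-only pop: no other thread pops, so a CAS loop against concurrent pushes suffices.
  while (head &&
         !my_pool.free_list.compare_exchange_weak(head, head->next,
                                                  std::memory_order_acq_rel)) {}
  if (head) { my_pool.count.fetch_sub(1, std::memory_order_relaxed); return head + 1; }
  // Pool empty: fall back to glibc malloc and remember the owning pool.
  head = static_cast<PoolBlock*>(std::malloc(sizeof(PoolBlock) + Pool::kShortSize));
  head->owner = &my_pool;
  return head + 1;
}

void pool_free(void* p) {
  PoolBlock* blk   = static_cast<PoolBlock*>(p) - 1;
  Pool*      owner = blk->owner;
  if (owner->count.load(std::memory_order_relaxed) >= Pool::kMaxPooled) {
    std::free(blk);                 // owner's pool is full: give it back to glibc
    return;
  }
  // Lockless enqueue to the owner thread's pool, callable from any thread.
  blk->next = owner->free_list.load(std::memory_order_relaxed);
  while (!owner->free_list.compare_exchange_weak(blk->next, blk,
                                                 std::memory_order_release,
                                                 std::memory_order_relaxed)) {}
  owner->count.fetch_add(1, std::memory_order_relaxed);
}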
Multiple Contexts and Communication Threads
• Maximize concurrency in sends and receives
• Charm++ SMP mode creates multiple PAMI contexts
  – Subgroups of Charm++ worker threads are associated with a PAMI context
  – For example, at 64 threads/node we use 16 PAMI contexts
    • Subgroups of 4 threads access a PAMI context
    • PAMI library calls are protected via critical sections
    • Worker threads advance PAMI contexts when idle (see the sketch below)
  – This mode is suitable for compute-bound applications
• SMP mode with communication threads
  – Each PAMI context is advanced by a different communication thread
  – Charm++ worker threads post work via PAMI_Context_post
  – Charm++ worker threads do not advance PAMI contexts
  – This mode is suitable for communication-bound applications
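A sketch of the subgroup-to-context mapping and the idle-time advance in the mode without communication threads. It assumes the example configuration of 64 worker threads and 16 contexts from the slide, and it uses PAMI_Context_trylock / PAMI_Context_advance / PAMI_Context_unlock for the critical section; verify those names against pami.h.

#include <pami.h>

enum { NUM_WORKERS = 64, NUM_CONTEXTS = 16 };      // example configuration from the slide
static pami_context_t contexts[NUM_CONTEXTS];      // created once with PAMI_Context_createv

// Each worker is statically mapped to one context shared by its subgroup of 4 threads.
static inline pami_context_t my_context(int worker_id) {
  return contexts[worker_id / (NUM_WORKERS / NUM_CONTEXTS)];
}

// Called by an idle worker thread: take the context's critical section and make
// progress on the network, then release it so other workers in the subgroup can.
void advance_when_idle(int worker_id) {
  pami_context_t ctx = my_context(worker_id);
  if (PAMI_Context_trylock(ctx) == PAMI_SUCCESS) { // skip if another worker holds it
    PAMI_Context_advance(ctx, 1 /* poll iterations */);
    PAMI_Context_unlock(ctx);
  }
}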
Optimize Short Messages
• CmiDirectManytomany
  – Charm++ interface to optimize a burst of short messages
  – Message buffer addresses and sizes are registered ahead of time
  – Communication operations are kicked off via a start call
  – A completion callback notifies the Charm++ scheduler when data has been fully sent and received
  – Charm++ scheduling and header overheads are eliminated
• We parallelize burst sends of several short messages by posting work to multiple communication threads (see the sketch below)
  – Worker threads call PAMI_Context_post with a work function
  – Work functions execute PAMI_Send_immediate to move data on the network
  – On the receiver, data is moved directly into the registered destination buffers
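A sketch of the parallel burst-send path: the burst of registered short messages is split into slices, and one work function per slice is posted to a different PAMI context so the communication threads inject packets in parallel. All names except the PAMI_* calls are hypothetical, and the struct fields simply mirror the registration step described above.

#include <pami.h>
#include <string.h>

typedef struct {
  pami_work_t      work;         /* storage for the posted work descriptor */
  pami_context_t   context;      /* context (and comm thread) this slice uses */
  pami_endpoint_t *dests;        /* registered destination endpoints */
  void           **bufs;         /* registered send buffers */
  size_t          *sizes;        /* registered send sizes */
  size_t           first, last;  /* slice of the burst handled by this context */
  size_t           dispatch;     /* dispatch id whose handler writes into the
                                    registered receive buffer on the target */
} burst_slice_t;

static pami_result_t burst_send_fn(pami_context_t context, void *cookie) {
  burst_slice_t *s = (burst_slice_t *)cookie;
  for (size_t i = s->first; i < s->last; ++i) {
    pami_send_immediate_t parm;
    memset(&parm, 0, sizeof(parm));     /* no application header, default hints */
    parm.dispatch      = s->dispatch;
    parm.dest          = s->dests[i];
    parm.data.iov_base = s->bufs[i];
    parm.data.iov_len  = s->sizes[i];
    PAMI_Send_immediate(context, &parm);
  }
  return PAMI_SUCCESS;
}

/* Start call: one work function per context; the communication threads that
 * advance those contexts then drive the slices concurrently. */
void burst_start(burst_slice_t *slices, size_t ncontexts) {
  for (size_t c = 0; c < ncontexts; ++c)
    PAMI_Context_post(slices[c].context, &slices[c].work, burst_send_fn, &slices[c]);
}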
Performance Results
Converse Internode Ping Pong Latency
Converse Intranode Ping Pong Latency
Scalable Memory Allocation
64 threads on a node allocate and free 100 buffers in each iteration
Performance Impact of L2 Atomic Queues
[Plot: NAMD APoA1 benchmark; L2 atomic queues give speedups of 2.7x and 1.5x]
NAMD Application on 512 Nodes
Time Profile with 32 Worker Threads and 8 Communication Threads per node
Time Profile with 48 Worker Threads and No Communication Threads per node
PME Optimization with CmiDirectManytomany (1024 nodes)
3D Complex-to-Complex FFT
Forward + backward 3D FFT time in microseconds (p2p = point-to-point, m2m = CmiDirectManytomany)

         128x128x128       64x64x64         32x32x32
Nodes    p2p      m2m      p2p      m2m     p2p      m2m
64       3030     1826     787      507     457      142
128      2019     1426     731      459     398      127
256      1930     944      625      268     379      110
512      1785     677      625      229     376      93
1024     1560     583      621      208     377      74
NAMD APoA1 Benchmark Performance Results
BG/Q time step: 0.68 ms/step
Summary
• Presented several optimizations for the Charm++ runtime on the Blue Gene/Q machine
• SMP mode outperforms non-SMP mode
  – Best performance on BG/Q with 1 to 4 processes per node and 16 to 64 threads per process
• Best time step of 0.68 ms/step for the NAMD application with the APoA1 benchmark
Thank You
Questions?