Source: web.cs.iastate.edu/~cs425/presentations/miller-presentation.pdf
4/15/05 1 of 37
Implementing Optimized Collective Communication Routines on the IBM BlueGene/L Supercomputer
CS 425 term project
By Sam Miller
[email protected]
April 18, 2005
Outline
• What is BlueGene/L? (5 slides)
• Hardware (3 slides)
• Communication Networks (2 slides)
• Software (2 slides)
• MPI and MPICH (1 slide)
• Collective Algorithms (5 slides)
• Better Collective Algorithms! (12 slides)
• Performance
• Conclusion
Abbreviations Today
• BGL = BlueGene/L
• CNK = Compute Node Kernel
• MPI = Message Passing Interface
• MPICH2 = MPICH 2.0 from Argonne Labs
• ASIC = Application Specific Integrated Circuit
• ALU = Arithmetic Logic Unit
• IBM = International Biscuit Makers (duh)
What is BGL 1/2
• Massively parallel distributed memory cluster of embedded processors
• 65,536 nodes! 131,072 processors!
• Low power requirements
• Relatively small, compared to predecessors
• Half system installed at LLNL
• Other systems going online too
What is BGL 2/2
• BlueGene/L at LLNL (360 Tflops)
  – 2,500 square feet, half a tennis court
• Earth Simulator (40 Tflops)
  – 35,000 square feet, requires an entire building
Hardware 1/3
• CPU is PowerPC 440
  – Designed for embedded applications
  – Low power, low clock frequency (700 MHz)
  – 32 bit :-(
• FPU is custom 64-bit
  – Each PPC 440 core has two of these
  – The two FPUs operate in parallel
  – @ 700 MHz this is 2.8 Gflops per PPC 440 core
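The 2.8 Gflops figure above can be checked directly. The factor of 2 flops per FPU per cycle assumes each FPU issues one fused multiply-add per cycle, which is my reading of the double-FPU design rather than something stated on this slide:

```python
# Peak floating-point rate per PPC 440 core on BGL (sketch).
clock_hz = 700e6        # 700 MHz clock
fpus_per_core = 2       # two parallel FPUs per core
flops_per_cycle = 2     # assumed: fused multiply-add = 2 flops/cycle per FPU

peak_gflops = clock_hz * fpus_per_core * flops_per_cycle / 1e9
print(peak_gflops)  # 2.8
```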
Hardware 2/3
• BGL ASIC
  – Two PPC 440 cores, four FPUs
  – L1, L2, L3 caches
  – DDR memory controller
  – Logic for 5 separate communication networks
  – This forms one compute node
Hardware 3/3
• To build the entire 65,536 node system:
  – Two ASICs with 256 or 512 MB DDR RAM form a compute card
  – Sixteen compute cards form a node board
  – Sixteen node boards form a midplane
  – Two midplanes form a rack
  – Sixty four racks brings us to:
  – 2 x 16 x 16 x 2 x 64 = 65,536!
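The packaging arithmetic above multiplies out exactly as claimed:

```python
# BGL packaging hierarchy: multiply the levels out to the full system size.
asics_per_compute_card = 2
compute_cards_per_node_board = 16
node_boards_per_midplane = 16
midplanes_per_rack = 2
racks = 64

nodes = (asics_per_compute_card * compute_cards_per_node_board *
         node_boards_per_midplane * midplanes_per_rack * racks)
print(nodes)  # 65536
```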
Communication Networks 1/2
• Five different networks
  – 3D torus
    • Primary network for the MPI library
  – Global tree
    • Used for collectives on MPI_COMM_WORLD
    • Used by compute nodes to communicate with I/O nodes
  – Global interrupt
    • 1.5 usec latency over the entire 65k node system!
  – JTAG
    • Used for node bootup and servicing
  – Gigabit Ethernet
    • Used by I/O nodes
Communication Networks 2/2
• Torus
  – 6 neighbors have bi-directional links at 154 MB/sec
  – Guarantees reliable, deadlock-free delivery
  – Chosen due to high-bandwidth nearest-neighbor connectivity
  – Used in prior supercomputers, such as the Cray T3E
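The six-neighbor connectivity follows from the 3D torus topology: each node at coordinates (x, y, z) links to the node one step away in each dimension, with wraparound at the edges. A minimal sketch (the dimension sizes are illustrative, not an actual BGL partition shape):

```python
def torus_neighbors(x, y, z, dims):
    """Return the 6 neighbors of (x, y, z) on a 3D torus with wraparound."""
    X, Y, Z = dims
    return [
        ((x + 1) % X, y, z), ((x - 1) % X, y, z),  # +/- x
        (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),  # +/- y
        (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),  # +/- z
    ]

# A corner node still has 6 neighbors thanks to the wraparound links.
print(torus_neighbors(0, 0, 0, (8, 8, 8)))
```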
Software 1/2
• Compute nodes run a stripped-down kernel called CNK
  – Two threads, 1 per CPU
  – No context switching, no VM
  – Standard glibc interface, easy to port
  – 5,000 lines of C++
• I/O nodes run standard PPC Linux
  – They have disk access
  – Run a daemon called the console I/O daemon (ciod)
Software 2/2
• Network software has 3 layers
  – Topmost is the MPI library
  – Middle is the message layer
    • Allows transmission of arbitrary buffer sizes
  – Bottom is the packet layer
    • Very simple
    • Stateless interface to torus, tree, and GI hardware
    • Facilitates sending & receiving packets
MPICH
• Developed by Argonne National Labs
• Open source, freely available, standards-compliant MPI implementation
• Used by many vendors
• Chosen by IBM due to its use of an Abstract Device Interface (ADI) and its design for scalability
Collective Algorithms 1/5
• Collectives can be implemented with basic sends and receives
  – Better algorithms exist
• Default MPICH2 collectives perform poorly on BGL
  – Assume a crossbar network, poor node mapping
  – Point-to-point messages incur high overhead
  – No knowledge of network-specific features
Collective Algorithms 2/5
• Optimization is tricky
  – Message size and communicator shape are the deciding factors
  – Large messages == optimize bandwidth
  – Short messages == optimize latency
• I will not talk about short-message collectives further today
• If an optimized algorithm isn't available, BGL falls back on the default MPICH2 one
  – It will work, because point-to-point messages work
  – Performance will suck, however
Collective Algorithms 3/5
• Conditions for selecting an optimized collective algorithm are evaluated locally
  – What is wrong with this?
• Example:
      char buf[100], buf2[20000];
      if (rank == 0) MPI_Bcast(buf, 100, …);
      else MPI_Bcast(buf2, 20000, …);
  – Not legal according to the MPI standard, but…
  – What if one node uses the optimized algorithm and the others use the MPICH2 algorithm?
    • Deadlock - or worse
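The hazard can be simulated with a purely local selection rule: if each rank picks an algorithm from its own buffer size, the ranks in the example above disagree about which algorithm is running. The threshold below is hypothetical, chosen only to make the example trip:

```python
# Hypothetical local selection rule: each rank picks an algorithm from
# its *own* element count, with no global agreement (this is the bug).
THRESHOLD = 1000  # illustrative cutoff, not BGL's actual value

def pick_algorithm(local_count):
    return "optimized" if local_count >= THRESHOLD else "mpich2-default"

# Ranks mirror the broadcast example: rank 0 passes 100, others 20000.
choices = {rank: pick_algorithm(100 if rank == 0 else 20000)
           for rank in range(4)}
print(choices)  # rank 0 disagrees with everyone else -> deadlock risk
```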
Collective Algorithms 4/5
• Solution to the previous problem:
  – Make optimization decisions globally
  – This incurs a slight latency hit
  – Thus, only used when the offsetting increase in bandwidth matters, e.g. long-message collectives
Collective Algorithms 5/5
• Remainder of the slides:
  – MPI_Bcast
  – MPI_Reduce, MPI_Allreduce
  – MPI_Alltoall, MPI_Alltoallv
• Using both the tree and torus networks
  – Tree operates only on MPI_COMM_WORLD
    • Has a built-in ALU, but only fixed point :-(
  – Torus has a deposit-bit feature; requires a rectangular communicator shape (for most algorithms)
Broadcast 1/3
• MPICH2
  – Binomial tree for short messages
  – Scatter then allgather for large messages
  – Performs poorly on BGL due to high CPU overhead and lack of topology awareness
• Torus
  – Uses the deposit-bit feature
  – For an n-dimensional mesh, 1/n of the message is sent in each direction concurrently
• Tree
  – Does not use the ALU
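The torus broadcast idea, splitting the message into n pieces for an n-dimensional mesh and pushing each piece along its own direction concurrently, can be sketched as a simple partitioning step. Routing and reassembly are omitted; this shows only the split:

```python
def split_for_dimensions(message, n_dims):
    """Split a buffer into n_dims roughly equal chunks, one per mesh dimension."""
    chunk = (len(message) + n_dims - 1) // n_dims  # ceiling division
    return [message[i * chunk:(i + 1) * chunk] for i in range(n_dims)]

# A 240-byte payload (one torus packet) split three ways for a 3D mesh.
parts = split_for_dimensions(b"x" * 240, 3)
print([len(p) for p in parts])  # [80, 80, 80]
```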
Broadcast 2/3
• Red lines represent one spanning tree carrying half of the message
• Blue lines represent another spanning tree carrying the other half
Broadcast 3/3
Reduce & Allreduce 1/4
• Reduce is essentially a reverse broadcast
• Allreduce is a reduce followed by a broadcast
• Torus
  – Can't use the deposit-bit feature
  – CPU bound; bandwidth is poor
  – Solution: Hamiltonian path, huge latency penalty, but great bandwidth
• Tree
  – Natural choice for reductions on integers!
  – Floating-point performance is bad
Reduce & Allreduce 2/4
• Hamiltonian path for 4x4x4 cube
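One standard way to get a Hamiltonian path through a 4x4x4 cube of nodes is a boustrophedon (snake) ordering: sweep along x, reversing direction on alternating rows and planes so consecutive nodes are always grid neighbors. This is a generic construction, not necessarily the exact path BGL's implementation uses:

```python
def snake_path(X, Y, Z):
    """Hamiltonian path over an X x Y x Z grid via boustrophedon ordering."""
    path = []
    for z in range(Z):
        ys = range(Y) if z % 2 == 0 else range(Y - 1, -1, -1)
        for y in ys:
            # Reverse x on alternating (y + z) parity so steps stay adjacent.
            xs = range(X) if (y + z) % 2 == 0 else range(X - 1, -1, -1)
            for x in xs:
                path.append((x, y, z))
    return path

p = snake_path(4, 4, 4)
print(len(p))  # 64 nodes, each visited exactly once
```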
Reduce & Allreduce 3/4
Reduce & Allreduce 4/4
Alltoall and Alltoallv 1/5
• MPICH2 has 4 algorithms
  – Yes, 4 separate ones
  – BGL performance is bad due to network hot spots and CPU overhead
• Torus
  – No communicator size restriction!
  – Does not use the deposit-bit feature
• Tree
  – No alltoall tree algorithm; it would not make sense
Alltoall and Alltoallv 2/5
• BGL optimized torus algorithm
  – Uses randomized packet injection
  – Each node creates a destination list
  – Each node has the same seed value, but a different offset
• Bad memory performance?
  – Yes!
  – Torus payload is 240 bytes (8 cache lines)
  – Multiple packets in adjacent cache lines to each destination are injected before advancing
    • Measurements showed 2 packets to be optimal
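The same-seed, different-offset idea above can be sketched as follows: every rank shuffles the same destination list with a shared seed, then starts at its own offset, so the aggregate traffic is spread across the torus without two ranks flooding the same destination in lockstep. The exact scheme is not spelled out in this deck, so this is purely illustrative:

```python
import random

def destination_order(rank, nranks, seed=12345):
    """Same-seed shuffle on every rank, rotated by a per-rank offset."""
    dests = list(range(nranks))
    rng = random.Random(seed)   # identical seed -> identical base permutation
    rng.shuffle(dests)
    rotated = dests[rank:] + dests[:rank]      # per-rank starting offset
    return [d for d in rotated if d != rank]   # never send to self

print(destination_order(0, 8))
```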
Alltoall and Alltoallv 3/5
Alltoall and Alltoallv 4/5
Alltoall and Alltoallv 5/5
Conclusion
• Optimized collectives on BGL are off to a good start
  – Superior performance to MPICH2
  – Exploit knowledge of network features
  – Avoid performance penalties like memory copies and network hot spots
• Much work remains
  – Short-message collectives
  – Non-rectangular communicators for the torus network
  – Tree collectives using communicators other than MPI_COMM_WORLD
  – Other collectives: scatter, gather, etc.
Questions?