Optimization of Collective Communication in Intra-Cell MPI


Page 1: Optimization of Collective Communication in Intra-Cell MPI

Optimization of Collective Communication in Intra-Cell MPI

Ashok Srinivasan

Florida State University

[email protected]

Goals

1. Efficient implementation of collectives for intra-Cell MPI

2. Evaluate the impact of different algorithms on the performance

Collaborators: A. Kumar¹, G. Senthilkumar¹, M. Krishna¹, N. Jayam¹, P.K. Baruah¹, R. Sarma¹, S. Kapoor²

¹ Sri Sathya Sai University, Prashanthi Nilayam, India
² IBM, Austin

Acknowledgment: IBM, for providing access to a Cell blade under the VLP program


Page 2: Optimization of Collective Communication in Intra-Cell MPI

Outline

Cell Architecture

Intra-Cell MPI Design Choices

Barrier

Broadcast

Reduce

Conclusions and Future Work

Page 3: Optimization of Collective Communication in Intra-Cell MPI

Cell Architecture

A PowerPC core (PPE), with 8 co-processors (SPEs), each with a 256 KB local store

Shared 512 MB - 2 GB main memory, which the SPEs access through DMA

Peak speeds of 204.8 Gflops in single precision and 14.64 Gflops in double precision for the SPEs

204.8 GB/s EIB bandwidth, 25.6 GB/s to main memory

Two Cell processors can be combined to form a Cell blade with global shared memory

[Figures: DMA put times; memory-to-memory copy times using (i) the SPE local store and (ii) memcpy by the PPE]

Page 4: Optimization of Collective Communication in Intra-Cell MPI

Intra-Cell MPI Design Choices

Cell features
• In-order execution, but DMAs can be out of order
• Over 100 simultaneous DMAs can be in flight

Constraints
• Unconventional, heterogeneous architecture
• SPEs have limited functionality, and can act directly only on their local stores
• SPEs access main memory through DMA
• Use of the PPE should be limited to get good performance

MPI design choices
• Application data in: (i) local store or (ii) main memory
• MPI data in: (i) local store or (ii) main memory
• PPE involvement: (i) active or (ii) only during initialization and finalization
• Collective calls can: (i) synchronize or (ii) not synchronize

Page 5: Optimization of Collective Communication in Intra-Cell MPI

Barrier (1)

OTA List: “Root” receives notification from all others, and then acknowledges through a DMA list

OTA: Like OTA List, but root notifies others through individual non-blocking DMAs

SIG: Like OTA, but others notify root through a signal register in OR mode

Degree-k TREE: Each node has k-1 children. In the first phase, children notify their parents; in the second phase, parents acknowledge their children. (A sketch of this scheme follows below.)
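To make the two phases concrete, here is a minimal sketch of the degree-k tree barrier, assuming shared notification words and C11 atomics in place of the DMA and signal-register notification the slide describes; the names tree_barrier, arrive, and ack are invented for this example.

    /* Sketch of a degree-k tree barrier over shared flags (C11 atomics).
     * The flags stand in for the per-SPE notification words that the real
     * implementation updates via DMA or signal registers. */
    #include <stdatomic.h>

    #define MAX_PE 16

    static atomic_int arrive[MAX_PE];  /* child -> parent: "I have arrived"   */
    static atomic_int ack[MAX_PE];     /* parent -> child: "all have arrived" */

    /* Rank r's parent is (r-1)/(k-1); its children start at r*(k-1)+1. */
    void tree_barrier(int rank, int nprocs, int k)
    {
        int parent = (rank == 0) ? -1 : (rank - 1) / (k - 1);
        int first  = rank * (k - 1) + 1;
        int last   = first + (k - 1);
        if (last > nprocs) last = nprocs;

        /* Phase 1: gather arrivals from the children, then notify the parent. */
        for (int c = first; c < last; c++) {
            while (atomic_load(&arrive[c]) == 0)
                ;                            /* spin until child c arrives */
            atomic_store(&arrive[c], 0);     /* reset for the next barrier */
        }
        if (parent >= 0) {
            atomic_store(&arrive[rank], 1);  /* tell the parent we are here */
            while (atomic_load(&ack[rank]) == 0)
                ;                            /* Phase 2: wait for the release */
            atomic_store(&ack[rank], 0);
        }
        /* Release wave: acknowledge the children (the root starts it). */
        for (int c = first; c < last; c++)
            atomic_store(&ack[c], 1);
    }

With k - 1 set to the number of participants minus one, this roughly recovers the flat OTA/SIG pattern, where the root gathers from and releases everyone directly.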

Page 6: Optimization of Collective Communication in Intra-Cell MPI

Barrier (2)

PE: Consider the SPUs to be a logical hypercube; in each step, each SPU exchanges messages with its neighbor along one dimension

DIS: In step i, SPU j sends to SPU (j + 2^i) mod P and receives from SPU (j - 2^i) mod P (a sketch follows below)
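For comparison, a shared-memory sketch of the DIS pattern, again with C11 atomic counters standing in for the actual DMA notifications; note, MAX_LOG, and dissemination_barrier are names chosen for this example.

    /* Sketch of the DIS barrier: in round i, rank j notifies
     * (j + 2^i) mod P and waits for a notification from (j - 2^i) mod P.
     * Counters (rather than plain flags) tolerate one barrier of overlap
     * between fast and slow ranks. */
    #include <stdatomic.h>

    #define MAX_PE  16
    #define MAX_LOG 5                         /* enough rounds for MAX_PE */

    static atomic_int note[MAX_LOG][MAX_PE];  /* per-round notification slots */

    void dissemination_barrier(int rank, int nprocs)
    {
        int round = 0;
        for (int dist = 1; dist < nprocs; dist *= 2, round++) {
            int to = (rank + dist) % nprocs;

            atomic_fetch_add(&note[round][to], 1);        /* "send"           */
            while (atomic_load(&note[round][rank]) == 0)
                ;                                         /* "receive"        */
            atomic_fetch_sub(&note[round][rank], 1);      /* consume message  */
        }
    }

Every rank finishes after ceil(log2 P) rounds; there is no designated root and no separate release phase.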

Comparison of MPI_Barrier on different hardware (times in microseconds)

P     Cell (PE)   Xeon/Myrinet   NEC SX-8   SGI Altix BX2
8     0.4         10             13         3
16    1.0         14             5          5

Alternatives
• Atomic increments in main memory (several microseconds; a sketch of this counter variant follows below)
• PPE coordinates using its mailbox (tens of microseconds)
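A minimal sketch of the counter alternative, written with C11 atomics and sense reversal so the barrier is reusable; counter_barrier, count, and sense are names invented here, and the real SPE implementation would have to perform the increment with the MFC's atomic commands, which is where the several-microsecond cost comes from.

    /* Sketch of the "atomic increments in main memory" barrier:
     * one shared counter plus a sense flag that flips each epoch. */
    #include <stdatomic.h>

    static atomic_int count;          /* how many ranks have arrived  */
    static atomic_int sense;          /* flips once per barrier epoch */

    void counter_barrier(int nprocs, int *my_sense)
    {
        int s = 1 - *my_sense;                        /* sense for this epoch   */
        if (atomic_fetch_add(&count, 1) == nprocs - 1) {
            atomic_store(&count, 0);                  /* last arrival resets... */
            atomic_store(&sense, s);                  /* ...and releases all    */
        } else {
            while (atomic_load(&sense) != s)
                ;                                     /* spin on the sense flag */
        }
        *my_sense = s;
    }

All P increments target the same word, so the updates serialize at the memory, unlike the distributed flags used by the tree and DIS schemes.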

Page 7: Optimization of Collective Communication in Intra-Cell MPI

Broadcast (1)

OTA: Each SPE copies the data to its own location
• Different shifts are used to avoid hotspots in memory
• Different shifts on larger numbers of SPUs yield results that are close to each other
[Figure: OTA on 4 SPUs]

AG: Each SPE is responsible for a different portion of the data (a sketch of this copy pattern follows below)
• Different minimum sizes are tried
[Figure: AG on 16 SPUs]
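A sketch of the AG data movement, assuming the broadcast buffers live in main memory: each SPE copies only its own chunk of the root's buffer into every other process's buffer. Here memcpy stands in for the DMA-through-local-store transfer, and ag_bcast_copy is a name invented for the example.

    /* Sketch of the AG broadcast: the source buffer is split into nprocs
     * chunks and SPE "rank" copies only chunk "rank" into each non-root
     * destination buffer, spreading the copying work over all SPEs. */
    #include <stddef.h>
    #include <string.h>

    void ag_bcast_copy(void *dst[], const void *src, size_t nbytes,
                       int rank, int nprocs, int root)
    {
        size_t chunk = (nbytes + nprocs - 1) / nprocs;  /* bytes per SPE       */
        size_t off   = (size_t)rank * chunk;
        if (off >= nbytes)
            return;                                     /* nothing left to do  */
        size_t len = (off + chunk > nbytes) ? nbytes - off : chunk;

        for (int p = 0; p < nprocs; p++)
            if (p != root)                              /* root already has it */
                memcpy((char *)dst[p] + off, (const char *)src + off, len);
    }

A barrier such as the ones on the earlier slides must follow the copy loop so that every process sees a complete buffer before MPI_Bcast returns.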

Page 8: Optimization of Collective Communication in Intra-Cell MPI

Broadcast (2)

TREEMM: Tree-structured Send/Recv type implementation
• Data for degrees 2 and 4 are close
• Degree 3 is best, or close to it, for all SPU counts
[Figure: TREEMM on 12 SPUs]

TREE: Pipelined tree-structured communication based on the local stores (an MPI-level sketch of the pipelining idea follows below)
• Results for other SPU counts are similar to this figure
[Figure: TREE on 16 SPUs]
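The pipelining idea can be shown at the MPI point-to-point level. This is only an analogy for the local-store pipeline: the function name, the block count, and the use of MPI_Send/MPI_Recv instead of LS-to-LS DMA are all choices made for this sketch. Each block is forwarded down a (degree-1)-ary tree as soon as it arrives, so different blocks are in flight at different tree levels at the same time.

    /* Sketch of a pipelined tree broadcast using plain MPI point-to-point.
     * The buffer is cut into nblocks blocks; a node forwards block b to its
     * children while later blocks are still travelling down from the root.
     * degree must be >= 2 (each node has up to degree-1 children). */
    #include <mpi.h>

    void pipelined_tree_bcast(char *buf, int nbytes, int nblocks,
                              int degree, int root, MPI_Comm comm)
    {
        int rank, nprocs;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);

        /* Renumber ranks so that the root is 0 in a (degree-1)-ary tree. */
        int me     = (rank - root + nprocs) % nprocs;
        int parent = (me == 0) ? -1 : (root + (me - 1) / (degree - 1)) % nprocs;
        int blk    = (nbytes + nblocks - 1) / nblocks;

        for (int off = 0; off < nbytes; off += blk) {
            int len = (off + blk > nbytes) ? nbytes - off : blk;
            if (parent >= 0)
                MPI_Recv(buf + off, len, MPI_BYTE, parent, 0, comm,
                         MPI_STATUS_IGNORE);
            for (int c = 1; c < degree; c++) {
                int child = me * (degree - 1) + c;
                if (child < nprocs)
                    MPI_Send(buf + off, len, MPI_BYTE,
                             (child + root) % nprocs, 0, comm);
            }
        }
    }

With nblocks = 1 this degenerates into an unpipelined Send/Recv tree of the TREEMM kind; larger block counts let different levels of the tree work on different blocks concurrently.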

Page 9: Optimization of Collective Communication in Intra-Cell MPI

Broadcast (3)

[Figure: Broadcast on 16 SPEs (2 processors)]
• TREE: Pipelined tree-structured communication based on the local stores
• TREEMM: Tree-structured Send/Recv type implementation
• AG: Each SPE is responsible for a different portion of the data
• OTA: Each SPE copies the data to its own location
• G: The root copies all the data

[Figure: Broadcast with a good choice of algorithm for each data size and SPE count; the maximum main memory bandwidth is also shown]

Page 10: Optimization of Collective Communication in Intra-Cell MPI

Broadcast (4)

Each node of the NEC SX-8 has 8 vector processors capable of 16 Gflop/s each, with 64 GB/s bandwidth to memory from each processor; the total memory bandwidth for a node is 512 GB/s. Nodes are connected through a crossbar switch capable of 16 GB/s in each direction.

The Altix is a CC-NUMA system with a global shared memory. Each node contains eight Itanium 2 processors. Nodes are connected using NUMALINK4; the bandwidth between processors on a node is 3.2 GB/s, and between nodes it is 1.6 GB/s.

Comparison of MPI_Bcast on different hardware (times in microseconds; only some entries were reported for the non-Cell systems)

Data Size   Cell (PE), P = 8 / P = 16   Other systems (Infiniband, NEC SX-8, SGI Altix BX2), P = 8 / P = 16
128 B       1.7 / 3.1                   18, 10
1 KB        2.0 / 3.7                   25, 20
32 KB       12.8 / 33.7                 220
1 MB        414 / 653                   100, 215, 2600, 3100

Page 11: Optimization of Collective Communication in Intra-Cell MPI

Reduce

[Figure: Reduce of MPI_INT with MPI_SUM on 16 SPUs]
Similar trends were observed for other SPU counts. (A generic tree-reduce sketch follows below.)
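The slide does not spell out the reduce algorithm, so the sketch below is only a generic binomial-tree reduce for MPI_INT with MPI_SUM, written with MPI point-to-point calls; tree_reduce_sum_int and its temporary buffers are invented for the example, and the Cell implementation additionally streams the vectors through the SPE local stores.

    /* Sketch of a binomial-tree reduce: in round d, ranks whose d-th bit is
     * set send their partial sums to a partner and drop out. */
    #include <mpi.h>
    #include <stdlib.h>

    void tree_reduce_sum_int(const int *sendbuf, int *recvbuf, int count,
                             int root, MPI_Comm comm)
    {
        int rank, nprocs;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);

        int me = (rank - root + nprocs) % nprocs;     /* root becomes rank 0 */
        int *acc = malloc(count * sizeof *acc);       /* running partial sum */
        int *tmp = malloc(count * sizeof *tmp);       /* partner's partial   */
        for (int i = 0; i < count; i++)
            acc[i] = sendbuf[i];

        for (int d = 1; d < nprocs; d <<= 1) {
            if (me & d) {
                MPI_Send(acc, count, MPI_INT,
                         ((me - d) + root) % nprocs, 0, comm);
                break;                                /* this rank is done   */
            } else if (me + d < nprocs) {
                MPI_Recv(tmp, count, MPI_INT,
                         ((me + d) + root) % nprocs, 0, comm,
                         MPI_STATUS_IGNORE);
                for (int i = 0; i < count; i++)
                    acc[i] += tmp[i];                 /* combine with MPI_SUM */
            }
        }
        if (me == 0)                                  /* only the root keeps  */
            for (int i = 0; i < count; i++)           /* the final result     */
                recvbuf[i] = acc[i];
        free(acc);
        free(tmp);
    }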

Comparison of MPI_Reduce on different hardware (times in microseconds; "-" means not reported)

Data Size   Cell (PE)          IBM SP    NEC SX-8          SGI Altix BX2
            P = 8    P = 16    P = 16    P = 8    P = 16   P = 8    P = 16
128 B       3.06     5.69      40        -        -        -        -
1 KB        4.41     8.8       60        -        -        -        -
1 MB        689      1129      13000     230      350      10000    12000

Each node of the IBM SP was a 16-processor SMP

Page 12: Optimization of Collective Communication in Intra-Cell MPI

Conclusions and Future Work

Conclusions
• The Cell processor has good potential for MPI implementations
• The PPE should have a limited role
• High bandwidth and low latency are achieved even with application data in main memory
• The local store should, however, be used effectively, with double buffering to hide DMA latency (a sketch follows below)
• Main memory bandwidth then becomes the bottleneck
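A minimal double-buffering sketch for the point about hiding DMA latency, assuming the Cell SDK's spu_mfcio.h DMA interface (mfc_get/mfc_put and tag-status waits); the block size, buffer names, and the stream_copy function are choices made for this example, and the transfer size is assumed to be a multiple of the block size.

    /* Sketch: an SPE streams data from main memory to main memory through
     * two 16 KB local-store buffers, overlapping the DMA of the next block
     * with the hand-off of the current one to hide DMA latency. */
    #include <stdint.h>
    #include <spu_mfcio.h>

    #define BLK 16384                    /* bytes per DMA (the MFC maximum) */

    static char buf[2][BLK] __attribute__((aligned(128)));

    void stream_copy(uint64_t src_ea, uint64_t dst_ea, uint32_t nbytes)
    {
        uint32_t nblk = nbytes / BLK;    /* assume a multiple of BLK */
        int cur = 0;
        if (nblk == 0)
            return;

        mfc_get(buf[0], src_ea, BLK, 0, 0, 0);             /* prefetch block 0 */
        for (uint32_t b = 0; b < nblk; b++) {
            int nxt = cur ^ 1;
            if (b + 1 < nblk) {
                mfc_write_tag_mask(1 << nxt);              /* old put from buf[nxt] */
                mfc_read_tag_status_all();                 /* must finish first     */
                mfc_get(buf[nxt], src_ea + (uint64_t)(b + 1) * BLK, BLK, nxt, 0, 0);
            }
            mfc_write_tag_mask(1 << cur);                  /* wait for block b      */
            mfc_read_tag_status_all();
            mfc_put(buf[cur], dst_ea + (uint64_t)b * BLK, BLK, cur, 0, 0);
            cur = nxt;
        }
        mfc_write_tag_mask(3);                             /* drain both tag groups */
        mfc_read_tag_status_all();
    }

This is essentially the pattern behind the memory-to-memory copy through the SPE local store mentioned on the Cell architecture slide.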

Current and future work
• Implemented: collective communication operations optimized for contiguous data
• Future work: optimize collectives for derived datatypes with non-contiguous data