non-blocking collectives for mpi -...

Computer Architecture Past & Future Why Non blocking Collectives? LibNBC And Applications? Ongoing Efforts

Non-Blocking Collectives for MPI– overlap at the highest level –

Torsten Höfler

Open Systems LabIndiana University

Bloomington, IN, USA

Institut für Wissenschaftliches RechnenTechnische Universität Dresden

Dresden, Germany

27th May 2008


Outline

1 Computer Architecture Past & Future

2 Why Non blocking Collectives?

3 LibNBC

4 And Applications?

5 Ongoing Efforts


Fundamental Assumptions (I)

We need more powerful machines!

Solutions for real-world scientific problems need huge

processing power (more than available)

Capabilities of single PEs have fundamental limits

The scaling/frequency race is currently stagnating

Moore’s law is still valid (number of transistors/chip)

Instruction level parallelism is limited (pipelining, VLIW,

multi-scalar)

Explicit parallelism seems to be the only solution

Single chips and transistors get cheaper

Implicit transistor use (ILP, branch prediction) have their

limits


Fundamental Assumptions (II)

Parallelism requires communication

Local or even global data-dependencies exist

Off-chip communication becomes necessary

Bridges a physical distance (many PEs)

Communication latency is limited

It’s widely accepted that the speed of light limits

data-transmission

Example: minimal 0-byte latency for 1m ≈ 3.3ns ≈ 13

cycles on a 4GHz PE

Bandwidth can hide latency only partially

Bandwidth is limited (physical constraints)

The problem of “scaling out” (especially iterative solvers)


Assumptions about Parallel Program Optimization

Collective Operations

Collective Operations (COs) are an optimization tool

CO performance influences application performance

optimized implementation and analysis of CO is non-trivial

Hardware Parallelism

More PEs handle more tasks in parallel

Transistors/PEs take over communication processing

Communication and computation could run simultaneously

Overlap of Communication and Computation

Overlap can hide latency

Improves application performance


The LogGP Model

CPU

NetworkL

level

time

Sender Receiver

o s or

g + m*G g + m*G


Interconnect Trends

Technology Change

modern interconnects offload communication to

co-processors (Quadrics, InfiniBand, Myrinet)

TCP/IP is optimized for lower host-overhead (e.g., Gamma)

even legacy Ethernet supports protocol offload

L+ g +m ·G >> o

⇒ we prove our expectations with benchmarks of the user CPU

overhead


LogGP Model Examples - TCP

0

100

200

300

400

500

600

0 10000 20000 30000 40000 50000 60000

Tim

e in

mic

rose

co

nd

s

Datasize in bytes (s)

MPICH2 - G*s+gMPICH2 - o

TCP - G*s+gTCP o


LogGP Model Examples - Myrinet/GM

0

50

100

150

200

250

300

350

0 10000 20000 30000 40000 50000 60000

Tim

e in

mic

rose

co

nd

s


Open MPI - G*s+gOpen MPI - o

Myrinet/GM - G*s+gMyrinet/GM - o


LogGP Model Examples - InfiniBand/OpenIB

0

10

20

30

40

50

60

70

80

90

0 10000 20000 30000 40000 50000 60000

Tim

e in

mic

rose

co

nd

s


Open MPI - G*s+gOpen MPI - o

OpenIB - G*s+gOpenIB - o


Outline



3 LibNBC

4 And Applications?

5 Ongoing Efforts


Isend/Irecv is there - Why Collectives?

Gorlach, ’04: “Send-Receive Considered Harmful”

⇔ Dijkstra, ’68: “Go To Statement Considered Harmful”

point to point

if ( rank == 0) then

call MPI_SEND(...)

else

call MPI_RECV(...)

end if

vs. collective

call MPI_GATHER(...)

cmp. math libraries vs. loops


Sparse Collectives

But my algorithm only needs nearest neighbor communication!?

⇒ this is a collective too, just sparse (cf. sparse BLAS)

sparse communication with neighbors on process

topologies

graph topology makes it generic

many optimization possibilities (process placing, overlap,

message scheduling/forwarding)

easy to implement

not part of MPI but fully implemented in LibNBC and

proposed to the MPI Forum


Why non blocking Collectives

scale typically with O(log2P) sends

wasted CPU time: log2P · (L+Gall)

Fast Ethernet: L = 50-60Gigabit Ethernet: L = 15-20

InfiniBand: L = 2-71µs ≈ 6000 FLOP on a 3GHz Machine

... and many collectives synchronize unneccessarily


Modelling the Benefits

LogGP Model for Allreduce:

tallred = 2 · (2o + L+m ·G) · ⌈log2P⌉ +m · γ · ⌈log2P

0.01

0.1

1

10

100

1000

10 100 1000 10000

Tim

e (

us)

CPU overhead (1kiB)Network time (1kiB)

15us non-blocking overhead


CPU Overhead Benchmarks

Allreduce, LAM/MPI 7.1.2/TCP over GigE

10 20

30 40

50 60Communicator Size 1

10 100

1000 10000

100000

Data Size

0 0.005 0.01

0.015 0.02

0.025 0.03

CPU Usage (share)


Performance Benefits

overlap

leverage hardware parallelism (e.g. InfiniBandTM)

overlap similar to non-blocking point-to-point

pseudo synchronization

avoidance of explicit pseudo synchronization

limit the influence of OS noise


Process Skew

caused by OS interference or unbalanced application

worse if processors are overloaded

multiplies on big systems

can cause dramatic performance decrease

all nodes wait for the last

Example

Petrini et. al. (2003) ”The Case of the Missing Supercomputer

Performance: Achieving Optimal Performance on the 8,192

Processors of ASCI Q”


MPI_Bcast with P0 delayed - Jumpshotp

roce

sses

time

P0

P1

P3

P2


MPI_Ibcast with P0 delayed + overlap - Jumpshotp

roce

sses

time

P0

P1

P3

P2


Outline



3 LibNBC

4 And Applications?

5 Ongoing Efforts


Non-Blocking Collectives - Interface

extension to MPI-2

”mixture” between non-blocking ptp and collectives

uses MPI_Requests and MPI_Test/MPI_Wait

Interface

MPI_Ibcast(buf, count, MPI_INT, 0, comm, &req);

MPI_Wait(&req);

Proposal

Hoefler et. al. (2006): ”Non-Blocking Collective Operations for

MPI-2”


Non-Blocking Collectives - Implementation

implementation available with LibNBC

written in ANSI-C and uses only MPI-1

central element: collective schedule

a coll-algorithm can be represented as a schedule

trivial addition of new algorithms

Example: dissemination barrier, 4 nodes, node 0:

send to 1 recv from 3 end send to 2 recv from 2 end

LibNBC download: http://www.unixer.de/NBC


Overhead Benchmarks - Gather with

InfiniBand/MVAPICH on 64 nodes

0

5000

10000

15000

20000

25000

30000

0 50000 100000 150000 200000 250000 300000

Ru

ntim

e (

s)

Datasize (bytes)

MPI_GatherNBC_Igather


Overhead Benchmarks - Scatter with


0

5000

10000

15000

20000

25000

30000

0 50000 100000 150000 200000 250000 300000

Ru

ntim

e (

s)

Datasize (bytes)

MPI_ScatterNBC_Iscatter


Overhead Benchmarks - Alltoall with


0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

0 50000 100000 150000 200000 250000 300000

Ru

ntim

e (

s)

Datasize (bytes)

MPI_AlltoallNBC_Ialltoall


Overhead Benchmarks - Allreduce with


0

10000

20000

30000

40000

50000

60000

0 50000 100000 150000 200000 250000 300000

Ru

ntim

e (

s)

Datasize (bytes)

MPI_AllreduceNBC_Iallreduce


Outline



3 LibNBC

4 And Applications?

5 Ongoing Efforts


Linear Solvers - Domain Decomposition

First Example

Naturally Independent Computation - 3D Poisson Solver

iterative linear solvers are used in many scientific kernels

often used operation is vector-matrix-multiply

matrix is domain-decomposed (e.g., 3D)

only outer (border) elements need to be communicated

can be overlapped


Domain Decomposition

nearest neighbor communication

can be implemented with MPI_Alltoallv or sparse

collectives

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

Process−local data

Halo−data

2D Domain

P1 P2 P3

P4 P5 P6 P7

P10P9P8 P11

P0


Parallel Speedup (Best Case)

0

20

40

60

80

100

8 16 24 32 40 48 56 64 72 80 88 96

Speedup

Number of CPUs

Eth blockingEth non-blocking

0

20

40

60

80

100

8 16 24 32 40 48 56 64 72 80 88 96S

peedup

Number of CPUs

IB blockingIB non-blocking

Cluster: 128 2 GHz Opteron 246 nodes

Interconnect: Gigabit Ethernet, InfiniBandTM

System size 800x800x800 (1 node ≈ 5300s)


Parallel Data Compression

Second Example

Data Parallel Loops - Parallel Compression

Automatic transformations (C++ templates)

typical loop structure:

for (i=0; i < N/P; i++) {

compute(i);

}

comm(N/P);


Parallel Compression Communication Overhead

0

0.1

0.2

0.3

0.4

0.5

Co

mm

un

ica

tio

n O

ve

rhe

ad

(s)

64 32 16 8

MPI/BLMPI/NBCOF/NBC


Parallel 3d Fast Fourier Transform

Third Example

Specialized Algorithms - A parallel 3d-FFT with overlap

Specialized design to achieve the highest overlap. Less

cache-friendly!


Non-blocking Collectives - 3D-FFT

Derivation from “normal” implementation

distribution identical to “normal” 3D-FFT

first FFT in z direction and index-swap identical

Design Goals to Minimize Communication Overhead

start communication as early as possible

achieve maximum overlap time

Solution

start MPI_Ialltoall as soon as first xz-plane is ready

calculate next xz-plane

start next communication accordingly ...

collect multiple xz-planes (tile factor)


Transformation in z Direction

Data already transformed in y direction

y x

z

1 block = 1 double value (3x3x3 grid)



Transform first xz plane in z direction

y x

z

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

pattern means that data was transformed in y and z direction


Transformation z Direction

start MPI_Ialltoall of first xz plane and transform second plane

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

y x

z

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

cyan color means that data is communicated in the background



start MPI_Ialltoall of second xz plane and transform third plane

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

y x

z

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

data of two planes is not accessible due to communication


Transformation in x Direction

start communication of the third plane and ...

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

y x

z��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

we need the first xz plane to go on ...



... so MPI_Wait for the first MPI_Ialltoall!

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

y x

z��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

and transform first plane (new pattern means xyz transformed)



Wait and transform second xz plane

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

y x

z��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

first plane’s data could be accessed for next operation



wait and transform last xz plane

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

y x

z��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

��

done! → 1 complete 1D-FFT overlaps a communication


10243 3d-FFT over InfiniBand

0

1

2

3

4

5

6

7

FF

T T

ime

(s)

1 ppn 2 ppn

MPI/BLMPI/NB

NBC/NB

P=128, “Coyote”@LANL - 128/64 dual socket 2.6GHz Opteron nodes


10243 3d-FFT on the XT4

0

2

4

6

8

10

12

14

16

18

20

FF

T T

ime

(s)

32 procs 64 procs 128 procs

MPI/BLNBC/NB

“Jaguar”@ORNL - Cray XT4, dual socket dual core 2.6GHz Opteron


10243 3d-FFT on the XT4 (Communication Overhead)

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

Co

mm

un

ica

tio

n O

ve

rhe

ad

(s)

32 procs 64 procs 128 procs

MPI/BLNBC/NB

“Jaguar”@ORNL - Cray XT4, dual socket dual core 2.6GHz Opteron


6403 3d-FFT InfiniBand (Communication Overhead)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

FF

T C

om

mu

nic

atio

n T

ime

(s)

64 32 16 8 4 2

MPI/BLMPI/NBCOF/NBC

“Odin”@IU - dual socket dual core 2.0GHz Opteron InfiniBand


Outline



3 LibNBC

4 And Applications?

5 Ongoing Efforts


Ongoing Work

LibNBC

analysis of multi-threaded implementation

optimized collectives

Collective Communication

optimized collectives for InfiniBandTM (topology-aware)

using special hardware support

Applications

work on more applications

⇒ interested in collaborations (ask me!)


Discussion

THE ENDQuestions?

Thank you for your attention!

non-blocking collectives for mpi -...

Documents