TRANSCRIPT

Page 1:

Abdelhalim Amer*, Huiwei Lu*, Pavan Balaji*, Satoshi Matsuoka+

*Argonne National Laboratory, IL, USA
+Tokyo Institute of Technology, Tokyo, Japan

Characterizing MPI and Hybrid MPI+Threads Applications at Scale:

Case Study with BFS

1

PPMM’15, in conjunction with CCGRID’15, May 4-7, 2015, Shenzhen, Guangdong, China

Page 2:

Systems with massive core counts already in production
– Tianhe-2: 3,120,000 cores
– Mira: 3,145,728 HW threads
Core density is increasing
Other resources do not scale at the same rate
– Memory per core is decreasing
– Network endpoints

[1] Peter Kogge. PIM & Memory: The Need for a Revolution in Architecture. Argonne Training Program on Extreme-Scale Computing (ATPESC), 2013.

Evolution of the memory capacity per core in the Top500 list [1]

2

Evolution of High-End Systems

Page 3:

Problem Domain and Target Architecture
[Diagram: four nodes (Node 0–Node 3), each with four cores (Core0–Core3)]

3

Parallelism with Message Passing


Page 4:

MPI-only = Core Granularity Domain Decomposition

Domain Decomposition with MPI vs. MPI+X

MPI+X = Node Granularity Domain Decomposition

[Diagram: per-core processes with inter-process communication (MPI-only) vs. one process per node with threads (MPI+X)]

Page 5:

MPI-only = Core Granularity Domain Decomposition

[Diagram: MPI-only — processes exchange boundary data (extra memory) via communication (single copy)]

MPI vs. MPI+X

MPI+X = Node Granularity Domain Decomposition

[Diagram: MPI+X — threads within a process operate on shared data]

Page 6:

• The process model has inherent limitations
• Sharing is becoming a requirement
• Using threads requires careful thread-safety implementations

6

Process Model vs. Threading Model with MPI

Processes: Data all private
Threads: Global data all shared

Processes: Sharing requires extra work (e.g., MPI-3 shared memory; see the sketch after this table)
Threads: Sharing is given; consistency is not, and implies protection

Processes: Communication is fine-grained (core-to-core)
Threads: Communication is coarse-grained (typically node-to-node)

Processes: Space overhead is high (buffers, boundary data, MPI runtime, etc.)
Threads: Space overhead is reduced

Processes: Contention only for system resources
Threads: Contention for system resources and shared data

Processes: No thread-safety overheads
Threads: Magnitude of thread-safety overheads depends on the application and the MPI runtime
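The "extra work" needed for sharing under the process model can be illustrated with MPI-3 shared-memory windows. The following is a minimal sketch, not taken from the slides; the window size, variable names, and barrier-based synchronization are illustrative assumptions.

/* Minimal sketch (not from the slides): sharing data among processes on the
   same node with MPI-3 shared-memory windows. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group the processes that share physical memory (same node). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Each process contributes a slice of a node-wide shared window. */
    double *my_slice;
    MPI_Win win;
    MPI_Win_allocate_shared(1024 * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, node_comm, &my_slice, &win);

    /* Obtain a direct pointer to rank 0's slice and access it like memory. */
    MPI_Aint size;
    int disp_unit;
    double *rank0_slice;
    MPI_Win_shared_query(win, 0, &size, &disp_unit, &rank0_slice);

    if (node_rank == 0)
        rank0_slice[0] = 42.0;
    MPI_Barrier(node_comm);   /* crude synchronization, enough for this sketch */
    printf("node rank %d sees %.1f\n", node_rank, rank0_slice[0]);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}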

Page 7:

MPI_THREAD_SINGLE – No additional threads
MPI_THREAD_FUNNELED – Only the master thread communicates
MPI_THREAD_SERIALIZED – Threaded communication is serialized
MPI_THREAD_MULTIPLE – No restrictions

• The more restrictive levels have low thread-safety costs; the more flexible levels have high thread-safety costs

7

MPI + Threads Interoperation by the Standard
– An MPI process is allowed to spawn multiple threads
– Threads share the same rank
– A thread blocking for communication must not block other threads
– Applications can specify the way threads interoperate with MPI
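For concreteness, a minimal sketch (not from the slides) of how an application selects one of these levels at initialization; requesting MPI_THREAD_MULTIPLE and aborting on a lower level are illustrative choices.

/* Minimal sketch: request a thread level and check what the library provides. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        /* The library may grant a lower level; the application must adapt
           (e.g., funnel communication through one thread) or abort. */
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    /* ... threaded computation and communication ... */

    MPI_Finalize();
    return 0;
}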

Page 8:

Search in graph
– Neighbors first
– Solves many problems in graph theory
Graph500 benchmark
– BFS kernel
– Kronecker graph as input
– Communication: two-sided nonblocking (see the sketch below)
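As a rough illustration of the two-sided nonblocking pattern (an assumption about its general shape, not the Graph500 reference code), each process pre-posts a receive for incoming frontier vertices and sends its own without blocking; buffer names, sizes, and the tag are illustrative.

/* Sketch: nonblocking exchange of frontier vertices between two processes. */
#include <mpi.h>
#include <stdint.h>

#define BUF_LEN 1024   /* illustrative buffer capacity */

void exchange_frontier(int64_t *send_buf, int send_count, int dest,
                       int64_t *recv_buf, MPI_Request reqs[2])
{
    /* Pre-post the receive so a matching send can complete without blocking. */
    MPI_Irecv(recv_buf, BUF_LEN, MPI_INT64_T, MPI_ANY_SOURCE, 0,
              MPI_COMM_WORLD, &reqs[0]);

    /* Ship the newly discovered vertices owned by process 'dest'. */
    MPI_Isend(send_buf, send_count, MPI_INT64_T, dest, 0,
              MPI_COMM_WORLD, &reqs[1]);

    /* Completion is checked later (MPI_Test/MPI_Waitall), so computation on
       the current level can overlap the transfers. */
}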

This small, synthetic graph was generated by a method called Kronecker multiplication. Larger versions of this generator, modeling real-world graphs, are used in the Graph500 benchmark. (Courtesy of Jeremiah Willcock, Indiana University) [Sandia National Laboratory]


8

Breadth First Search and Graph500

Page 9:

[Figure: BFS over the example graph (vertices 0–6), with a Sync() step between levels]

9

Breadth First Search Baseline Implementation

while (1) {
    Process_Current_Level();
    Synchronize();
    MPI_Allreduce(QLength);
    if (QLength == 0) break;
}
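For concreteness, a minimal sketch (an assumption, not the reference code) of the termination test behind the MPI_Allreduce(QLength) shorthand: each process contributes its local next-frontier size and every process sees the global sum.

#include <mpi.h>

/* Returns nonzero when no process discovered new vertices this level. */
int bfs_done(long local_queue_len)
{
    long global_queue_len = 0;

    /* Every rank contributes its next-frontier size and gets the global sum. */
    MPI_Allreduce(&local_queue_len, &global_queue_len, 1, MPI_LONG,
                  MPI_SUM, MPI_COMM_WORLD);

    return global_queue_len == 0;
}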


Page 10:

MPI Only vs. Hybrid MPI + OpenMP

Hybrid design:
• MPI_THREAD_MULTIPLE
• Shared read queue
• Private temporary write queues
• Private buffers
• Lock-free/atomic-free

MPI only:

while (1) {
    Process_Current_Level();
    Synchronize();
    MPI_Allreduce(QLength);
    if (QLength == 0) break;
}

Hybrid MPI + OpenMP:

while (1) {
    #pragma omp parallel
    {
        Process_Current_Level();
        Synchronize();
    }
    MPI_Allreduce(QLength);
    if (QLength == 0) break;
}
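A minimal sketch (an assumption, not the authors' implementation) of the hybrid structure: the OpenMP parallel region covers level processing and synchronization, and one global reduction runs per level outside it. Process_Current_Level, Synchronize, and local_next_frontier_len are placeholders.

#include <mpi.h>

extern void Process_Current_Level(void);
extern void Synchronize(void);
extern long local_next_frontier_len;

void bfs_hybrid(void)
{
    while (1) {
        #pragma omp parallel
        {
            /* Each thread processes a disjoint chunk of the frontier and
               drains incoming messages; requires MPI_THREAD_MULTIPLE. */
            Process_Current_Level();
            Synchronize();
        }   /* implicit barrier: all threads have finished this level */

        long global_len = 0;
        MPI_Allreduce(&local_next_frontier_len, &global_len, 1, MPI_LONG,
                      MPI_SUM, MPI_COMM_WORLD);
        if (global_len == 0)
            break;
    }
}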

10

MPI only to Hybrid BFS

Page 11:

[Charts: Communication Volume (GB) and Number of Messages (millions) vs. Number of Cores (1,024–16,384); series: Processes, Threads, Processes_est, Threads_est]

Problem size = 2^26 vertices (SCALE = 26)

11

Communication Characterization

Page 12:

Architecture: Blue Gene/Q
Processor: PowerPC A2
Clock frequency: 1.6 GHz
Cores per node: 16
HW threads per core: 4
Number of nodes: 49,152
Memory per node: 16 GB (1 GB per core)
Interconnect: Proprietary
Topology: 5D torus
Compiler: GCC 4.4.7
MPI library: MPICH 3.1.1
Network driver: BG/Q V1R2M1

12

Target Platform

• Memory per HW thread = 256 MB!
• In the following, we use 1 rank (MPI-only) or 1 thread (hybrid) per core
• MPICH uses a global critical section by default

Page 13:

[Chart: Performance (GTEPS) vs. Number of Cores (128–524,288); series: Processes, Hybrid]

13

Baseline Weak Scaling Performance

Page 14:

14

Main Sources of Overhead

[Charts: breakdown of BFS time vs. Number of Cores (512–16,384). MPI-only: Computation, User Polling, MPI_Test, MPI_Others. MPI+Threads: Compute, OMP_Sync, User Polling, MPI_Test, MPI_Others]

Page 15:

Eager polling for communication progress, O(P) per call:

Make_Progress() {
    MPI_Test(recvreq, flag);
    if (flag) compute();

    for (each process P) {
        MPI_Test(sendreq[P], flag);
        if (flag) buffer_free[P] = 1;
    }
}

Global synchronization with empty messages (2.75G messages for 512K cores):

Synchronize() {
    for (each process P)
        MPI_Isend(buf, 0, P, sendreq[P]);   /* zero-byte message to every process */

    while (!all_procs_done)
        Check_Incom_Msgs();
}

15

Non-Scalable Sub-Routines

O(P²) empty messages

Page 16:

– Use a lazy polling (LP) policy
– Use the MPI-3 nonblocking barrier (IB), as sketched below
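A minimal sketch of the nonblocking-barrier (IB) part of the fix, in the spirit of NBX-style termination (an assumption, not the authors' code): each process enters MPI_Ibarrier once its sends have completed and keeps draining incoming messages until the barrier finishes. Check_Incom_Msgs is the slide's helper; all_sends_complete is a hypothetical placeholder (with synchronous sends, completion implies the message was received).

#include <mpi.h>

extern void Check_Incom_Msgs(void);      /* the slide's receive-draining helper */
extern int  all_sends_complete(void);    /* hypothetical placeholder */

void Synchronize_with_Ibarrier(void)
{
    MPI_Request barrier_req = MPI_REQUEST_NULL;
    int entered = 0, done = 0;

    while (!done) {
        Check_Incom_Msgs();              /* keep receiving while waiting */

        /* Enter the nonblocking barrier only once all local sends are done. */
        if (!entered && all_sends_complete()) {
            MPI_Ibarrier(MPI_COMM_WORLD, &barrier_req);
            entered = 1;
        }
        /* The barrier completes only after every process has entered it,
           i.e., after all processes have finished their sends. */
        if (entered)
            MPI_Test(&barrier_req, &done, MPI_STATUS_IGNORE);
    }
}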

Weak Scaling Results

16

Fixing the Scalability Issues

[Chart: Performance (GTEPS) vs. Number of Cores (128–524,288); series: MPI-Only, Hybrid, MPI-Only-Optimized, Hybrid-Optimized]

Page 17:

[Chart: MPI_Test latency — average MPI_Test time (thousands of cycles) vs. Number of Threads per Node (log–log scale); series: Global-CS, Per-Object-CS]

17

Thread Contention in the MPI Runtime

Default: a global critical section, to avoid extra overheads in uncontended cases
Fine-grained critical sections can be used for highly contended scenarios
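A schematic sketch (not MPICH code) of the difference between the two policies; lock names and the queue type are illustrative.

#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;   /* guards this object's internal state */
    /* ... message-queue state ... */
} msg_queue_t;

static pthread_mutex_t global_cs = PTHREAD_MUTEX_INITIALIZER;

/* Global critical section: every thread entering the runtime serializes on
   one lock, even when the threads touch unrelated communication objects. */
void progress_global_cs(msg_queue_t *q)
{
    pthread_mutex_lock(&global_cs);
    /* ... poll the network, match messages on q ... */
    pthread_mutex_unlock(&global_cs);
}

/* Per-object critical sections: threads working on different objects
   (e.g., different queues) can make progress concurrently. */
void progress_per_object_cs(msg_queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    /* ... poll the network, match messages on q ... */
    pthread_mutex_unlock(&q->lock);
}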

Page 18:

[Charts: (left) Profiling with 1K Nodes — breakdown of BFS time vs. Number of Threads per Node (1–64): Compute, OMP_Sync, User Polling, MPI_Test, MPI_Others; (right) Weak Scaling Performance — Performance (GTEPS) vs. Number of Cores (128–524,288); series: Processes+LP+IB, Hybrid+LP+IB, Hybrid+LP+IB+FG]

18

Performance with Fine-Grained Concurrency

Page 19:

The coarse-grained MPI+X communication model is generally more scalable
In BFS, for example, MPI+X reduced
– the O(P) polling overhead
– the O(P²) empty messages for global synchronization
The model does not fix the root scalability issues
Thread-safety overheads can be a significant source of overhead, but this is not inevitable:
– Various techniques can be used to reduce thread contention and thread-safety overheads
– We are actively working on improving multithreading support in MPICH (MPICH derivatives can benefit from it)
Characterizing MPI+shared-memory vs. MPI+threads models is being considered for a future study

19

Summary