Non-uniformly Communicating Non-contiguous Data: A Case Study with PETSc and MPI
P. Balaji, D. Buntinas, S. Balay, B. Smith, R. Thakur and W. Gropp
Mathematics and Computer Science, Argonne National Laboratory


Page 1:

Non-uniformly Communicating Non-contiguous Data:

A Case Study with PETSc and MPI

P. Balaji, D. Buntinas, S. Balay, B. Smith, R. Thakur and W. Gropp

Mathematics and Computer Science

Argonne National Laboratory

Page 2:

Numerical Libraries in HEC

• Developing parallel applications is a complex task

– Discretizing physical equations to numerical forms

– Representing the domain of interest as data points

• Libraries allow developers to abstract low-level details

– E.g., numerical analysis, communication, I/O

• Numerical libraries (e.g., PETSc, ScaLAPACK, PESSL)

– Parallel data layout and processing

– Tools for distributed data layout (matrix, vector)

– Tools for data processing (SLES, SNES)

Page 3:

Overview of PETSc

• Portable, Extensible Toolkit for Scientific Computing

• Software tools for solving PDEs

– Suite of routines to create vectors, matrices and distributed arrays

– Sequential/parallel data layout

– Linear and nonlinear numerical solvers

• Widely used in nanosimulations, molecular dynamics, etc.

• Uses MPI for communication

[Figure: PETSc software stack by level of abstraction: BLAS, LAPACK and MPI at the bottom; Matrices, Vectors and Index Sets above; KSP (Krylov subspace methods), PC (preconditioners) and Draw; SNES (nonlinear equation solvers), SLES (linear equation solvers) and TS (time stepping); PDE solvers; application codes at the top]

Page 4:

Handling Parallel Data Layouts in PETSc

• Grid layout exposed to the application

– Structured or unstructured (1D, 2D, 3D)

– Internally managed as a single vector of data elements

– Representation often suited to optimize its operations

• Impact on communication:

– Data representation and communication pattern might not be ideal for MPI communication operations

– Non-uniformity and non-contiguity in communication are the primary culprits

Page 5:

Presentation Layout

• Introduction

• Impact of PETSc Data Layout and Processing on MPI

• MPI Enhancements and Optimizations

• Experimental Evaluation

• Concluding Remarks and Future Work

Page 6:

Data Layout and Processing in PETSc

• Grid layouts: data is divided among processes

– Ghost data points are shared (see the sketch after the figure below)

• Non-contiguous data communication

– 2nd dimension of the grid

• Non-uniform communication

– Structure of the grid

– Stencil type used

– Sides larger than corners

[Figure: 2D grid split between Proc 0 and Proc 1, marking local data points, ghost data points and the process boundary, for a box-type stencil and a star-type stencil]

Page 7:

Non-contiguous Communication in MPI

• MPI derived datatypes

– Application describes the noncontiguous data layout to MPI

– Data is either packed into contiguous buffers and pipelined (sparse layouts) or sent individually (dense layouts); see the packing sketch below

• Good for simple algorithms, but very restrictive

– Looking ahead at upcoming content to pre-decide which algorithm to use

– Multiple parses on the datatype lose the context!

[Figure: noncontiguous data layout packed into a packing buffer; the context must be saved before each "send data" step]
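As an illustration of the pack-and-pipeline strategy for sparse layouts, here is a minimal sender-side sketch in plain MPI. It packs a strided layout (8 blocks of 3 doubles with block starts 8 doubles apart, matching the backup-slide example) a few blocks at a time and overlaps sending one chunk with packing the next. The chunk size, tag, double-buffering scheme and helper name are illustrative choices, not the MPICH2 internals described in the talk.

    #include <mpi.h>
    #include <stdlib.h>

    /* Sender side only: pack a strided layout (nblocks blocks of blocklen
     * doubles, block starts 'stride' doubles apart) into staging buffers,
     * 'chunk' blocks at a time, and overlap sending with packing. */
    static void send_strided_pipelined(const double *data, int nblocks,
                                       int blocklen, int stride, int chunk,
                                       int dest, MPI_Comm comm)
    {
        int bufsize = chunk * blocklen * (int)sizeof(double);
        double *buf[2] = { malloc(bufsize), malloc(bufsize) }; /* double buffering */
        MPI_Request req[2] = { MPI_REQUEST_NULL, MPI_REQUEST_NULL };

        for (int b = 0, which = 0; b < nblocks; b += chunk, which ^= 1) {
            int nb = (nblocks - b < chunk) ? (nblocks - b) : chunk;

            /* Datatype describing just this chunk of the layout */
            MPI_Datatype chunktype;
            MPI_Type_vector(nb, blocklen, stride, MPI_DOUBLE, &chunktype);
            MPI_Type_commit(&chunktype);

            /* Make sure this buffer's previous send completed before reuse */
            MPI_Wait(&req[which], MPI_STATUS_IGNORE);

            int pos = 0;
            MPI_Pack(data + (size_t)b * stride, 1, chunktype,
                     buf[which], bufsize, &pos, comm);
            MPI_Type_free(&chunktype);

            /* Ship this chunk while the next one is being packed */
            MPI_Isend(buf[which], pos, MPI_PACKED, dest, 0, comm, &req[which]);
        }
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        free(buf[0]);
        free(buf[1]);
    }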

Page 8:

Issues with Lost Datatype Context

• Rollback of the context is not possible

– Datatypes can be recursive

• Duplication of the context is not possible

– Context information might be large

– When datatype elements are small, the context can be larger than the datatype itself

• Searching for the context is possible, but very expensive

– Search time grows quadratically with datatype size

– This is the currently used mechanism! (see the schematic sketch below)
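A schematic sketch of why the re-search is quadratic. This is illustrative C, not MPICH2's actual dataloop code: if the pack engine forgets where it stopped, resuming at byte offset off means re-walking the flattened datatype from the beginning, so packing n chunks touches on the order of n squared descriptor entries.

    #include <stddef.h>

    /* Flattened view of a datatype: a list of (offset, length) segments. */
    typedef struct { size_t offset, length; } segment_t;

    /* Hypothetical resume-by-search: find the segment containing byte 'off'
     * by scanning from the start.  Called once per packed chunk, this scan
     * makes the total packing cost quadratic in the number of segments. */
    static size_t find_segment(const segment_t *seg, size_t nseg, size_t off)
    {
        size_t consumed = 0;
        for (size_t i = 0; i < nseg; i++) {       /* linear re-scan every time */
            if (off < consumed + seg[i].length)
                return i;
            consumed += seg[i].length;
        }
        return nseg;                              /* off is past the end */
    }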

Page 9:

Non-uniform Collective Communication

• Algorithms for non-uniform communication are optimized for the "uniform" case

• Case studies

– Allgatherv uses a ring algorithm (see the example call after the figure below)

• Causes idleness if data volumes are very different

– Alltoallw sends data to nodes in a round-robin manner

• MPI processing is sequential

[Figure: ring Allgatherv among processes 0-6 in which one process contributes a large message and the rest contribute small messages]
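For reference, this is what a non-uniform collective looks like at the MPI level: an MPI_Allgatherv in which each process contributes a different amount of data. The count pattern (rank 0 contributing far more than the others) is an illustrative stand-in for the skewed volumes in the figure.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Skewed contributions: rank 0 sends 4096 doubles, all others send 16. */
        int *counts = malloc(size * sizeof(int));
        int *displs = malloc(size * sizeof(int));
        int total = 0;
        for (int r = 0; r < size; r++) {
            counts[r] = (r == 0) ? 4096 : 16;
            displs[r] = total;
            total += counts[r];
        }

        double *sendbuf = calloc(counts[rank], sizeof(double));
        double *recvbuf = malloc(total * sizeof(double));

        /* Non-uniform collective: the library's internal algorithm (ring,
           recursive doubling, ...) decides how these unequal pieces move. */
        MPI_Allgatherv(sendbuf, counts[rank], MPI_DOUBLE,
                       recvbuf, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

        free(sendbuf); free(recvbuf); free(counts); free(displs);
        MPI_Finalize();
        return 0;
    }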

Page 10:

Presentation Layout

• Introduction

• Impact of PETSc Data Layout and Processing on MPI

• MPI Enhancements and Optimizations

• Experimental Evaluation

• Concluding Remarks and Future Work

Page 11:

Dual-context Approach for Non-contiguous Communication

• Previous approaches are inefficient in complex designs

– E.g., if a look-ahead is performed to understand the structure of the upcoming data, the saved context is lost

• The dual-context approach retains the data context

– Look-aheads are performed using a separate context

– Completely eliminates the search time (see the sketch below)

[Figure: noncontiguous data layout with two contexts, one saved for packing into the packing buffer and sending, one advanced independently for the look-ahead]
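A minimal sketch of the idea, continuing the segment-list view from the earlier sketch. This is hypothetical code, not the MPICH2 dataloop implementation: the traversal state is a small cursor, so the engine can keep one cursor for packing and clone a second, independent cursor for the look-ahead; neither invalidates the other, so no re-search is ever needed.

    #include <stddef.h>

    typedef struct { size_t offset, length; } segment_t;

    /* Cheap, copyable traversal state over the flattened datatype. */
    typedef struct {
        size_t seg;      /* current segment index            */
        size_t within;   /* bytes already consumed inside it */
    } cursor_t;

    /* Advance a cursor by 'nbytes' without touching any other cursor. */
    static void cursor_advance(cursor_t *c, const segment_t *seg, size_t nseg,
                               size_t nbytes)
    {
        while (nbytes > 0 && c->seg < nseg) {
            size_t left = seg[c->seg].length - c->within;
            if (nbytes < left) { c->within += nbytes; return; }
            nbytes -= left;
            c->seg++;
            c->within = 0;
        }
    }

    /* Look ahead: decide whether the next 'window' bytes are "dense" without
     * disturbing the packing cursor, simply by working on a private copy. */
    static int lookahead_is_dense(const cursor_t *pack_cursor,
                                  const segment_t *seg, size_t nseg,
                                  size_t window)
    {
        cursor_t peek = *pack_cursor;        /* independent look-ahead context */
        size_t segments_crossed = 0, seen = 0;
        while (seen < window && peek.seg < nseg) {
            size_t left = seg[peek.seg].length - peek.within;
            size_t step = (left < window - seen) ? left : window - seen;
            cursor_advance(&peek, seg, nseg, step);
            seen += step;
            segments_crossed++;
        }
        return segments_crossed <= 2;        /* few, large pieces => dense */
    }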

Page 12:

Non-Uniform Communication: AllGatherv

• A single point of distribution is the primary bottleneck

• Identify whether a small fraction of the messages are very large

– Floyd and Rivest selection algorithm (see the sketch below)

– Linear-time detection of outliers

• Binomial algorithms

– Recursive doubling or dissemination

– Logarithmic time

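A sketch of the decision described here, under stated assumptions: it uses a simple quickselect for the linear-expected-time selection step (the slides cite the Floyd and Rivest algorithm, a more refined selection method), finds the 90th-percentile count, and treats the distribution as skewed if the largest count dwarfs that percentile. The 10x threshold and the 90th percentile are arbitrary illustration values.

    #include <stdlib.h>
    #include <string.h>

    /* Linear-expected-time selection of the k-th smallest element (quickselect).
     * Stands in for the Floyd and Rivest selection mentioned in the slides. */
    static int select_kth(int *a, int n, int k)
    {
        while (n > 1) {
            int pivot = a[rand() % n], lo = 0, hi = n - 1;
            while (lo <= hi) {
                while (a[lo] < pivot) lo++;
                while (a[hi] > pivot) hi--;
                if (lo <= hi) { int t = a[lo]; a[lo++] = a[hi]; a[hi--] = t; }
            }
            if (k <= hi)      n = hi + 1;
            else if (k >= lo) { a += lo; k -= lo; n -= lo; }
            else              return a[k];
        }
        return a[0];
    }

    /* Decide whether the recvcounts of an Allgatherv are skewed enough that a
     * ring algorithm would stall behind a few huge messages. */
    static int counts_are_skewed(const int *recvcounts, int nprocs)
    {
        int *tmp = malloc(nprocs * sizeof(int));
        memcpy(tmp, recvcounts, nprocs * sizeof(int));

        int p90 = select_kth(tmp, nprocs, (9 * nprocs) / 10); /* 90th percentile */
        int max = recvcounts[0];
        for (int i = 1; i < nprocs; i++)
            if (recvcounts[i] > max) max = recvcounts[i];

        free(tmp);
        return max > 10 * p90;   /* a small fraction of very large messages */
    }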

Page 13:

Non-uniform Communication: Alltoallw

• Distributing the messages to be sent into bins (based on message size) allows differential treatment of nodes

• Send out small messages first

– Nodes waiting for small messages wait less

– The relative increase in time for nodes waiting for larger messages is much smaller

– No skew for zero-byte data, and less synchronization

• Most helpful for non-contiguous messages

– MPI processing (e.g., packing) is sequential for non-contiguous messages (a small ordering sketch follows below)
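A minimal sketch of the "small messages first" idea on top of standard point-to-point MPI, assuming an Alltoallw-style argument list (per-destination counts, byte displacements, and datatypes). It simply sorts the destinations by outgoing message size before posting the sends; the helper name and structure are illustrative of the scheduling policy, not the MPICH2 collective code, and the matching receives are assumed to be posted separately.

    #include <mpi.h>
    #include <stdlib.h>

    typedef struct { int dest; long bytes; } sendjob_t;

    static int by_bytes(const void *a, const void *b)
    {
        long x = ((const sendjob_t *)a)->bytes, y = ((const sendjob_t *)b)->bytes;
        return (x > y) - (x < y);
    }

    /* Post the sends of an Alltoallw-style exchange smallest-first, so peers
     * expecting little data are released quickly. */
    static void alltoallw_sends_small_first(const char *sendbuf, const int counts[],
                                            const int displs[],        /* bytes */
                                            const MPI_Datatype types[],
                                            MPI_Request reqs[], int nprocs,
                                            MPI_Comm comm)
    {
        sendjob_t *jobs = malloc(nprocs * sizeof(sendjob_t));
        for (int p = 0; p < nprocs; p++) {
            int tsize;
            MPI_Type_size(types[p], &tsize);
            jobs[p].dest  = p;
            jobs[p].bytes = (long)counts[p] * tsize;
        }
        qsort(jobs, nprocs, sizeof(sendjob_t), by_bytes);  /* smallest first */

        for (int i = 0; i < nprocs; i++) {
            int p = jobs[i].dest;
            MPI_Isend(sendbuf + displs[p], counts[p], types[p],
                      p, 0, comm, &reqs[i]);
        }
        free(jobs);
    }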

Page 14:

Presentation Layout

• Introduction

• Impact of PETSc Data Layout and Processing on MPI

• MPI Enhancements and Optimizations

• Experimental Evaluation

• Concluding Remarks and Future Work

Page 15:

Experimental Testbed

• 64-node cluster

– 32 nodes with dual Intel EM64T 3.6 GHz processors

• 2 MB L2 cache, 2 GB DDR2 400 MHz SDRAM

• Intel E7520 (Lindenhurst) chipset

– 32 nodes with dual Opteron 2.8 GHz processors

• 1 MB L2 cache, 4 GB DDR 400 MHz SDRAM

• NVidia 2200/2050 chipset

• RedHat AS4 with kernel.org kernel 2.6.16

• InfiniBand DDR (16 Gbps) network

– MT25208 adapters connected through a 144-port switch

• MVAPICH2-0.9.6 MPI implementation

Page 16:

Non-uniform Communication Evaluation

[Chart: non-contiguous communication latency (us) vs. grid size (64 to 1024) for MVAPICH2-0.9.6 and MVAPICH2-New]

[Chart: timing breakup at 1024 grid size, percentage of time spent in Search, Pack and Communicate for MVAPICH2-0.9.6 and MVAPICH2-New]

Search time can dominate performance if the working context is lost!

Page 17:

AllGatherv Evaluation

[Chart: AllGatherv latency (us) vs. message size (1 byte to 32 KB) for MVAPICH2-0.9.6 and MVAPICH2-New]

[Chart: AllGatherv latency (us) vs. system size (2 to 64 processes) for MVAPICH2-0.9.6 and MVAPICH2-New]

Page 18:

Alltoallw Evaluation

[Chart: Alltoallw latency (us) vs. number of processes (2 to 128) for MVAPICH2-0.9.6 and MVAPICH2-New]

Our algorithm reduces the skew introduced by the Alltoallw operation by sending out smaller messages first, allowing the corresponding applications to make progress

Page 19:

PETSc Vector Scatter

[Chart: PETSc VecScatter latency (us) vs. number of processes (2 to 128) for MVAPICH2-0.9.5, MVAPICH2-New and a hand-tuned version]

[Chart: relative improvement (%) vs. number of processes (2 to 128), series: MVAPICH2-0.9.5 and Hand-tuned]
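For readers unfamiliar with the benchmarked operation, here is a minimal PETSc VecScatter sketch: gather an arbitrary index set from a parallel vector into a sequential vector. It uses present-day PETSc names (PetscCall and friends), and the vector sizes and strided indices are illustrative; the actual benchmark in the talk measures the scatters generated by PETSc's grid layouts.

    #include <petscvec.h>

    int main(int argc, char **argv)
    {
        PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));

        /* A parallel source vector with 1000 global entries */
        Vec x, y;
        PetscCall(VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, 1000, &x));
        PetscCall(VecSet(x, 1.0));

        /* Every process gathers the same 10 strided global entries locally */
        IS from, to;
        PetscCall(ISCreateStride(PETSC_COMM_SELF, 10, 0, 100, &from)); /* 0,100,...,900 */
        PetscCall(ISCreateStride(PETSC_COMM_SELF, 10, 0, 1, &to));
        PetscCall(VecCreateSeq(PETSC_COMM_SELF, 10, &y));

        /* The vector scatter: this is the operation the benchmark times */
        VecScatter ctx;
        PetscCall(VecScatterCreate(x, from, y, to, &ctx));
        PetscCall(VecScatterBegin(ctx, x, y, INSERT_VALUES, SCATTER_FORWARD));
        PetscCall(VecScatterEnd(ctx, x, y, INSERT_VALUES, SCATTER_FORWARD));

        PetscCall(VecScatterDestroy(&ctx));
        PetscCall(ISDestroy(&from)); PetscCall(ISDestroy(&to));
        PetscCall(VecDestroy(&x));   PetscCall(VecDestroy(&y));
        PetscCall(PetscFinalize());
        return 0;
    }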

Page 20:

3-D Laplacian Multigrid Solver

[Chart: application execution time (s) vs. number of processors (4 to 128) for MVAPICH2-0.9.6, MVAPICH2-New and a hand-tuned version]

[Chart: performance improvement (%) vs. number of processors (4 to 128), series: MVAPICH2-0.9.6 and Hand-Tuned]

Page 21:

Presentation Layout

• Introduction

• Impact of PETSc Data Layout and Processing on MPI

• MPI Enhancements and Optimizations

• Experimental Evaluation

• Concluding Remarks and Future Work

Page 22:

Concluding Remarks and Future Work

• Non-uniform and non-contiguous communication is inherent in several libraries and applications

• Current algorithms handle non-uniform communication in the same way as uniform communication

• Demonstrated that more sophisticated algorithms can give close to 10x improvements in performance

• The designs are part of MPICH2-1.0.5 and 1.0.6

– To be picked up by MPICH2 derivatives in later releases

• Future work:

– Skew tolerance in non-uniform communication

– Other libraries and applications

Page 23:

Thank You

Group Web-page: http://www.mcs.anl.gov/radix

Home-page: http://www.mcs.anl.gov/~balaji

Email: [email protected]

Page 24:

Backup Slides

Page 25:

Noncontiguous Communication in PETSc

[Figure: memory at offsets 0, 8, 16, ..., 192, ..., 384 copied into a contiguous copy buffer; the layout is described as a vector (count = 8, stride = 8) whose blocks are contiguous runs (count = 3) of doubles]

• Data might not always be contiguously laid out in memory

– E.g., the second dimension of a structured grid

• Communication is performed by packing data

• Pipelining the copy and the communication is important for performance (see the datatype sketch below)
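The layout in the figure maps directly onto MPI derived datatypes. A minimal sketch, assuming the figure's parameters (8 blocks of 3 doubles, with the stride of 8 taken to be in units of doubles); with such a datatype the application can hand the noncontiguous layout to MPI instead of packing it by hand.

    #include <mpi.h>

    /* Describe 8 blocks of 3 contiguous doubles, with block starts spaced
     * 8 doubles apart (the "vector(count=8, stride=8) of contiguous(count=3)"
     * layout from the figure) and send it in a single call. */
    static void send_grid_column(const double *data, int dest, MPI_Comm comm)
    {
        MPI_Datatype coltype;
        MPI_Type_vector(8 /*count*/, 3 /*blocklen*/, 8 /*stride*/,
                        MPI_DOUBLE, &coltype);
        MPI_Type_commit(&coltype);

        /* MPI packs (or pipelines) the noncontiguous data internally */
        MPI_Send(data, 1, coltype, dest, 0, comm);

        MPI_Type_free(&coltype);
    }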

Page 26:

Hand-tuning vs. Automated Optimization

• Non-uniformity and non-contiguity in data communication are inherent in several applications

– Communicating unequal amounts of data to the different peer processes

– Communicating data from noncontiguous memory locations

• Previous research has primarily focused on uniform and contiguous data communication

• Accordingly, applications and libraries have tried hand-tuning to convert their communication to those formats

– Manually packing noncontiguous data

– Re-implementing collective operations in the application

Page 27:

Non-contiguous Communication in MPI

• MPI derived datatypes

– The common approach for non-contiguous communication

– Application describes the noncontiguous data layout to MPI

– Data is either packed into contiguous memory (sparse layouts) or sent as independent segments (dense layouts)

• Pipelining of packing and communication improves performance, but requires context information!

[Figure: noncontiguous data layout packed into a packing buffer; the context is saved before each "send data" step]

Page 28:

Issues with Non-contiguous Communication

• The current approach is simple and works as long as there is a single parse over the noncontiguous data

• More intelligent algorithms might suffer:

– E.g., looking up upcoming datatype content to pre-decide which algorithm to use

– Multiple parses on the datatype lose the context!

– Searching for the lost context every time takes time that grows quadratically with datatype size

• PETSc non-contiguous communication suffers from such high search times

Page 29:

MPI-level Evaluation

[Chart: noncontiguous communication time (us) vs. grid size (64 to 1024) for MVAPICH2-0.9.6 and MVAPICH2-New]

[Chart: Allgatherv time (us) vs. message size (1 byte to 16 KB) for MVAPICH2-0.9.6 and MVAPICH2-New]

[Chart: Allgatherv time (us) vs. number of processors (2 to 64) for MVAPICH2-0.9.6 and MVAPICH2-New]

[Chart: Alltoallw time (us) vs. number of processors (2 to 128) for MVAPICH2-0.9.6 and MVAPICH2-New]

Page 30:

Experimental Results

• MPI-level Micro-benchmarks

– Non-contiguous data communication time

– Non-uniform collective communication

• Allgatherv Operation

• Alltoallw Operation

• PETSc Vector Scatter Benchmark

– Performs communication only

• 3-D Laplacian Multigrid Solver Application

– Partial differential equation solver

– Utilizes PETSc numerical solver operations