
Page 1: Parallel IO Library Benchmarking on GPFS

NERSC is supported by the Office of Advanced Scientific Computing Research in the Department of Energy Office of Science under contract number DE-AC02-05CH11231.

Parallel IO Library Benchmarking on GPFS

NERSC
Lawrence Berkeley National Lab
Katie Antypas
July 18, 2007

Page 2: Parallel IO Library Benchmarking on GPFS

Acknowledgements

Many thanks to Hongzhang Shan, John Shalf, and Jay Srinivasan for their input and feedback.

Page 3: Parallel IO Library Benchmarking on GPFS

Motivation

• With simulations creating larger and larger data files, we are approaching the point where a naive one-file-per-processor IO approach is no longer feasible:
  – Difficult for post-processing
  – Not portable
  – Many small files are bad for storage systems

• Parallel IO approaches that write to a single file offer an alternative, but they do carry an overhead

Page 4: Parallel IO Library Benchmarking on GPFS

Objective

• Explore the overhead of the parallel IO libraries HDF5 and Parallel NetCDF compared to one-file-per-processor IO

• Examine the effects of GPFS file hints on application performance

Page 5: Parallel IO Library Benchmarking on GPFS

Outline

• Benchmark Application Methodology
• Bassi Benchmarking
  – IO File System Hints
  – IO Library Comparison
    • HDF5
    • Parallel NetCDF
    • Fortran one-file-per-processor
  – Overhead compared to direct IO
• Jacquard Benchmarking
  – IO Library Comparison
  – Overhead compared to direct IO

Page 6: Parallel IO Library Benchmarking on GPFS

Bassi and Jacquard Details

Machine  | Vendor        | Proc Arch | File system | Interconnect | Peak IO Bandwidth                         | Total Procs
Bassi    | IBM           | Power 5   | GPFS        | Federation   | ~8 GB/sec (6 VSDs * ~1-2 GB/sec each)     | 888 (111 8-proc nodes)
Jacquard | Linux Networx | Opteron   | GPFS        | Infiniband   | ~3.1 GB/sec (5 DDN couplets * 620 MB/sec) | 712 (356 2-proc nodes)

Page 7: Parallel IO Library Benchmarking on GPFS

FLASH3 IO Benchmark

• IO from the FLASH3 code
  – Astrophysics code designed primarily for compressible reactive flow
  – Parallel, scales well to thousands of processors
  – Typical IO pattern of many large physics codes
  – Writes large contiguous chunks of grid data

• Multiple IO output formats
  – Parallel IO libraries built on top of MPI-IO
    • HDF5 to a single file
    • Parallel NetCDF to a single file
  – One-file-per-processor Fortran unformatted write (see the C sketch below)

• Similar data format to the IOR benchmark

[Figure: example FLASH simulations – helium burning on neutron stars, flame-vortex interactions, magnetic Rayleigh-Taylor, Orszag-Tang MHD vortex, Rayleigh-Taylor instability]
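As a point of reference for the one-file-per-processor output format listed above, here is a minimal C sketch of that pattern. The benchmark itself uses Fortran unformatted writes; the file-name scheme and buffer contents below are illustrative assumptions, not FLASH3 code.

/* Sketch: one-file-per-processor write pattern (C analog of the
 * Fortran unformatted write; file-name scheme is illustrative). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* ~64 MB of doubles per processor, matching the small test case */
    size_t n = 96UL * 96UL * 96UL * 9UL;
    double *buf = malloc(n * sizeof(double));
    for (size_t i = 0; i < n; i++) buf[i] = (double)rank;

    char fname[64];
    snprintf(fname, sizeof(fname), "flash_io_rank%05d.dat", rank);
    FILE *fp = fopen(fname, "wb");          /* one file per MPI task */
    if (fp != NULL) {
        fwrite(buf, sizeof(double), n, fp);
        fclose(fp);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}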

Page 8: Parallel IO Library Benchmarking on GPFS

Modified FLASH3 IO Benchmark

• Only writes out 4-dimensional datasets (x, y, z, proc)
• Layout of data is contiguous
• Two weak scaling IO experiments (a size check follows below):
  – Each processor writes 64 MB of data
    • Fits in the block buffer cache
    • 96x * 96y * 96z * 8 bytes * 9 vars = 64 MB
  – Each processor writes 576 MB of data
    • Above 256 MB/proc we no longer see the caching effect
    • 200x * 200y * 200z * 8 bytes * 9 vars = 576 MB

[Figure: each processor (Proc 0, Proc 1, Proc 2, ..., Proc n) holds its own x-y-z block of grid data]
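The small sketch below simply reproduces the per-processor size arithmetic quoted above and flags which configuration stays under the ~256 MB/proc point at which the caching effect disappears; the threshold and sizes are taken from this slide.

/* Sketch: per-rank write sizes for the two weak-scaling configurations
 * and whether they exceed the ~256 MB/proc buffer-cache point. */
#include <stdio.h>

int main(void)
{
    const double cache_limit_mb = 256.0;
    const struct { int n; const char *label; } cfg[] = {
        { 96,  "small (96^3, 9 vars)"  },
        { 200, "large (200^3, 9 vars)" },
    };

    for (int i = 0; i < 2; i++) {
        /* n^3 cells * 9 variables * 8 bytes per double, in MB */
        double mb = (double)cfg[i].n * cfg[i].n * cfg[i].n * 9 * 8 / 1.0e6;
        printf("%-24s %7.1f MB/proc  %s\n", cfg[i].label, mb,
               mb > cache_limit_mb ? "beyond buffer cache"
                                   : "fits in buffer cache");
    }
    return 0;
}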

Page 9: Parallel IO Library Benchmarking on GPFS

MPI-IO/GPFS File Hints

• MPI-IO allows the user to pass file hints to optimize performance

• IBM has taken this a step further by implementing additional file hints in its MPI-IO implementation to take advantage of GPFS features

• File hint IBM_largeblock_io = true
  – Disables data shipping
  – Data shipping is used to prevent multiple MPI tasks from accessing conflicting GPFS file blocks
  – Each GPFS file block is bound to a single task
  – Data shipping aggregates small write calls
  – Disabling data shipping saves the data shipping overhead, but is only recommended when writing/reading large contiguous IO chunks

MPI_Info_set(FILE_INFO_TEMPLATE, "IBM_largeblock_io", "true");
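For context, a minimal sketch of how such a hint can be attached when a file is opened through MPI-IO is shown below; the wrapper function, communicator, and file name are placeholders, not code from the benchmark.

/* Sketch: create an MPI_Info object carrying the GPFS hint and pass it
 * to MPI_File_open. File name and wrapper function are illustrative. */
#include <mpi.h>

void open_with_largeblock_hint(MPI_Comm comm, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);
    /* IBM MPI-IO on GPFS: disable data shipping for large contiguous IO */
    MPI_Info_set(info, "IBM_largeblock_io", "true");

    MPI_File_open(comm, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);

    MPI_Info_free(&info);
}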

Page 10: Parallel IO Library Benchmarking on GPFS

Effects of IBM_largeblock_io = true on Bassi
Weak Scaling IO Test (64 MB/Proc)

[Figure: HDF5 and Parallel NetCDF IO performance vs. processors (1-384) on Bassi, each with and without the hint; single-Bassi-node region marked; annotations: '12 Seconds', '4 Seconds'; aggregate data written ranges from 64 MB to 24.5 GB]

On average, for runs on more than 8 processors, HDF5 received a 135% performance increase compared with a 45% improvement for Parallel NetCDF.

Page 11: Parallel IO Library Benchmarking on GPFS

Effects of IBM_largeblock_io = true
Weak Scaling IO Test (576 MB/Proc)

[Figure: HDF5 and Parallel NetCDF IO performance vs. processors (1-384) on Bassi, each with and without the hint; single-Bassi-node region marked; the 4 GB variable size limitation prevented runs at higher concurrencies; aggregate data written ranges from 576 MB to 221 GB]

The effects of the file hint are more significant for both HDF5 and Parallel NetCDF at the larger file sizes. And while the rates for HDF5 and Parallel NetCDF are similar without the file hint, the effect of turning data shipping off is again much larger for HDF5.

Page 12: Parallel IO Library Benchmarking on GPFS

Bassi IO Library Comparison, Weak Scaling IO (64 MB/proc)

Assumptions/Setup
• Each processor writes 64 MB of data
• Contiguous chunks of data
• HDF5 and PnetCDF use IBM_largeblock_io = true to turn off data shipping (see the HDF5 sketch below)

[Figure: IO performance vs. processors (1-384) for HDF5, Parallel NetCDF, and Fortran one file per processor; annotations: '10 Seconds', '16 GB/sec'; aggregate data written ranges from 64 MB to 24.5 GB]

• Buffering/cache effect in place
• Parallel IO libraries have overhead for opening files, creating data objects, and synchronization
• One file per processor outperforms the parallel IO libraries, but the user must consider post-processing and file management costs
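As background on the HDF5 setup in the assumptions above: HDF5 receives MPI-IO hints through its file access property list. A minimal sketch, assuming an MPI_Info object already carries the IBM_largeblock_io hint as on the earlier slides; the function and file name are placeholders.

/* Sketch: opening a parallel HDF5 file with an MPI_Info that carries
 * the IBM_largeblock_io hint. File name is a placeholder. */
#include <hdf5.h>
#include <mpi.h>

hid_t create_parallel_hdf5(MPI_Comm comm, MPI_Info info)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);   /* hand the MPI-IO hints to HDF5 */

    hid_t file = H5Fcreate("flash_checkpoint.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;                          /* caller closes with H5Fclose */
}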

Page 13: Parallel IO Library Benchmarking on GPFS

Bassi IO Library Comparison, Weak Scaling IO (576 MB/proc)

Assumptions/Setup
• Each processor writes 576 MB of data
• Contiguous chunks of data
• HDF5 and PnetCDF use IBM_largeblock_io = true to turn off data shipping

[Figure: IO performance vs. processors (1-384) for HDF5, Parallel NetCDF, Fortran one file per processor, and C one file per processor; the 4 GB variable size limitation limited PnetCDF runs; annotation: '10 Seconds'; aggregate data written ranges from 576 MB to 221 GB]

One-file-per-processor IO begins to diverge from the parallel IO strategies around 16 and 32 processors; however, the absolute time difference between the two is still relatively low.

Page 14: Parallel IO Library Benchmarking on GPFS

Jacquard IO Library Comparison, Weak Scaling IO (64 MB/proc)

Assumptions/Setup
• Each processor writes 64 MB of data
• Contiguous chunks of data
• No file hints passed to the mvapich implementation of MPI-IO

• MPI-IO file hints for GPFS optimization are not implemented in the mvapich MPI-IO
• A pure MPI-IO approach produces results similar to HDF5 and Parallel NetCDF, indicating the performance bottleneck lies in the MPI-IO implementation rather than the higher-level libraries (a sketch of such a pure MPI-IO write follows below)

[Figure: IO performance vs. processors (1-128) for HDF5, Parallel NetCDF, and Fortran one file per processor on Jacquard; annotation: '40 Seconds'; aggregate data written ranges from 64 MB to 8.1 GB]
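A hedged sketch of the "pure MPI-IO approach" mentioned above: each rank writes one contiguous ~64 MB block into a single shared file with a collective call. The offsets, counts, and file name are illustrative assumptions, not the benchmark's actual code.

/* Sketch: collective MPI-IO write of one contiguous ~64 MB block per
 * rank into a single shared file. Offsets and file name are illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const MPI_Offset nvals = 96LL * 96 * 96 * 9;   /* doubles per rank */
    double *buf = malloc(nvals * sizeof(double));
    for (MPI_Offset i = 0; i < nvals; i++) buf[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Rank r's block lands at offset r * block_bytes in the shared file */
    MPI_Offset offset = (MPI_Offset)rank * nvals * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, (int)nvals, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}

On Bassi one would additionally pass the IBM_largeblock_io info object shown earlier instead of MPI_INFO_NULL; on Jacquard's mvapich MPI-IO that hint is not implemented.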

Page 15: Parallel IO Library Benchmarking on GPFS

Jacquard IO Library Comparison, Weak Scaling IO (576 MB/proc)

Assumptions/Setup
• Each processor writes 576 MB of data
• Contiguous chunks of data
• No file hints passed to the mvapich implementation of MPI-IO

[Figure: IO performance vs. processors (1-128) for HDF5, Parallel NetCDF, and Fortran one file per processor on Jacquard; annotation: '~60 Seconds'; aggregate data written ranges from 576 MB to 73.7 GB]

IO rates level off for HDF5 and Fortran one-file-per-processor IO at 16 and 32 processors.

Page 16: Parallel IO Library Benchmarking on GPFS

Conclusions

• The MPI-IO file hint IBM_largeblock_io gives a significant performance boost for HDF5, less for Parallel NetCDF; we are exploring why.

• The IBM_largeblock_io file hint must be used with IBM's MPI-IO implementation and GPFS (i.e., it does not work on Jacquard).

• Yes, there is an overhead for using parallel IO libraries, but it is probably not as bad as users expect.

Page 17: Parallel IO Library Benchmarking on GPFS