
High Performance Computing: Concepts, Methods & Means

Parallel I/O : File Systems and Libraries

Prof. Thomas Sterling

Department of Computer Science

Louisiana State University

March 29th, 2007


Topics

• Introduction

• RAID

• Distributed File Systems (NFS)

• Parallel File Systems (PVFS2)

• Parallel I/O Libraries (MPI-IO)

• Parallel File Formats (HDF5)

• Additional Parallel File Systems (GPFS)

• Summary – Materials for Test

Permanent Storage: Hard Disks Review

• Storage capacity: 1 TB per drive

• Areal density: 132 Gbit/in² (perpendicular recording)

• Rotational speed: 15,000 RPM

• Average latency: 2 ms

• Seek time
– Track-to-track: 0.2 ms
– Average: 3.5 ms
– Full stroke: 6.7 ms

• Sustained transfer rate: up to 125 MB/s

• Non-recoverable error rate: 1 in 10^17

• Interface bandwidth:
– Fibre Channel: 400 MB/s
– Serial Attached SCSI (SAS): 300 MB/s
– Ultra320 SCSI: 320 MB/s
– Serial ATA (SATA): 300 MB/s

Storage – SATA & Overview (Review)

• Serial ATA (SATA) is the newest commodity hard disk standard.

• SATA uses a serial bus, as opposed to the parallel buses used by ATA and SCSI.

• The cables attached to SATA drives are smaller and run faster (around 150 MB/s).

• The basic disk technologies remain the same across the three buses.

• Disk platters spin at a variety of speeds; the faster the platters spin, the faster data can be read off the disk, and the sooner data on the far side of the platter becomes available.

• Rotational speeds range from 5,400 RPM to 15,000 RPM.

• The faster the platters rotate, the lower the latency and the higher the bandwidth.
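The link between rotational speed and latency can be made concrete with a little arithmetic. The sketch below assumes the usual rule of thumb that the average rotational latency is half a revolution; the function names are hypothetical and the model ignores controller and queueing overheads.

```c
#include <assert.h>

/* Average rotational latency in milliseconds: on average the head waits
 * half a revolution for the requested sector to come around. */
double avg_rotational_latency_ms(double rpm)
{
    assert(rpm > 0.0);
    double ms_per_revolution = 60.0 * 1000.0 / rpm;
    return ms_per_revolution / 2.0;
}

/* Rough service time for one random request: average seek, plus average
 * rotational latency, plus transfer time at the sustained rate
 * (1 MB taken as 10^6 bytes). */
double random_access_time_ms(double avg_seek_ms, double rpm,
                             double bytes, double sustained_mb_per_s)
{
    assert(sustained_mb_per_s > 0.0);
    double transfer_ms = bytes / (sustained_mb_per_s * 1.0e6) * 1000.0;
    return avg_seek_ms + avg_rotational_latency_ms(rpm) + transfer_ms;
}
```

With the figures from the hard disk review, a 15,000 RPM drive completes a revolution in 4 ms, which gives the quoted 2 ms average latency.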


PATA vs SATA

I/O Needs on Parallel Computers

• High performance
– Take advantage of parallel I/O paths (when available)
– Support application-level data access and throughput needs

• Data integrity
– Deal sanely with hardware and power failures

• Single namespace
– All nodes and users “see” the same file systems
– Equal access from anywhere on the resource

• Ease of use
– Where possible, a parallel file system should be accessible in a consistent way, in the same way as a traditional UNIX-style file system

Source: Ohio Supercomputer Center

Parallel I/O - RAID

• RAID stands for Redundant Array of Inexpensive Disks; it provides a mechanism by which the performance and storage properties of individual disks can be aggregated.

• A group of disks appears to be a single large disk; the performance of multiple disks is better than that of a single disk.

• Storing data in multiple places across the disks allows the system to continue functioning when a disk fails.

• Both software and hardware RAID solutions are available.

• Hardware solutions are more expensive, but provide better performance without CPU overhead.

• Software solutions provide various levels of flexibility but have associated computational overhead.

RAID: Key Concepts

• A variety of RAID allocation schemes exist.

• RAID 0 (disk striping without redundant storage):
– Data is striped across multiple disks.
– The result of striping is a logical storage device whose capacity is that of a single disk times the number of disks in the RAID array.
– Both read and write performance are accelerated.
– Consecutive stripes reside on different disks, so reads interleaved between the disks proceed in parallel; with two disks this can roughly double read performance.
– No fault tolerance
– High transfer rates
– High request rates

Source: http://www.drivesolutions.com/datarecovery/raid.shtml
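To illustrate how striping spreads data, here is a minimal sketch of the address mapping for RAID 0. It assumes round-robin placement with a stripe unit of one block; the type and function names are hypothetical, and real controllers typically use larger, configurable stripe units.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t disk;        /* which disk holds the block */
    size_t disk_block;  /* block index within that disk */
} stripe_location;

/* Round-robin striping: consecutive logical blocks go to consecutive disks. */
stripe_location raid0_locate(size_t logical_block, size_t num_disks)
{
    assert(num_disks > 0);
    stripe_location loc;
    loc.disk = logical_block % num_disks;
    loc.disk_block = logical_block / num_disks;
    return loc;
}

/* Capacity of the logical device: one disk's capacity times the disk count. */
size_t raid0_capacity_blocks(size_t blocks_per_disk, size_t num_disks)
{
    return blocks_per_disk * num_disks;
}
```

Because neighbouring logical blocks land on different disks, a large sequential request keeps all disks busy at once, which is where the higher transfer rate comes from.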

RAID: Key Concepts

• RAID 1 (disk mirroring):
– Complete copies of the data are stored in multiple locations.
– The capacity of such a RAID set is half of its raw capacity (for two copies).
– Read performance is accelerated and is comparable to RAID 0.
– Writes are slowed down, as new data needs to be transmitted multiple times.

• RAID 5:
– As in RAID 0, data is striped across multiple disks, but with parity distributed across the disks.
– For each stripe of data blocks stored across the drives, a parity checksum is computed and stored on a predetermined disk.
– Read performance of RAID 5 is somewhat reduced because parity blocks are interleaved with data across the drives, and write performance lags behind because of the checksum computation.

Source: http://www.drivesolutions.com/datarecovery/raid.shtml
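The parity idea behind RAID 5 can be sketched with byte-wise XOR, which is the usual choice of parity function. The helper names below are hypothetical, and the sketch leaves out how parity blocks are rotated across disks.

```c
#include <assert.h>
#include <stddef.h>

/* Parity block = byte-wise XOR of the n data blocks of one stripe. */
void raid5_parity(const unsigned char *const *blocks, size_t n,
                  size_t block_len, unsigned char *parity)
{
    for (size_t i = 0; i < block_len; i++) {
        unsigned char p = 0;
        for (size_t b = 0; b < n; b++)
            p ^= blocks[b][i];
        parity[i] = p;
    }
}

/* Rebuild the block at index `lost` by XOR-ing the parity with all
 * surviving data blocks; blocks[lost] is never read. */
void raid5_rebuild(const unsigned char *const *blocks, size_t n, size_t lost,
                   const unsigned char *parity, size_t block_len,
                   unsigned char *out)
{
    assert(lost < n);
    for (size_t i = 0; i < block_len; i++) {
        unsigned char p = parity[i];
        for (size_t b = 0; b < n; b++)
            if (b != lost)
                p ^= blocks[b][i];
        out[i] = p;
    }
}
```

The write penalty follows from the first function: updating any data block requires the parity of its stripe to be recomputed and rewritten as well.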

Distributed File Systems

• A distributed file system (DFS) is a file system that is stored locally on one system (the server) but is accessible by processes on many systems (the clients).

• Multiple processes access multiple files simultaneously.

• Other attributes of a DFS may include:
– Access control lists (ACLs)
– Client-side file replication
– Server- and client-side caching

• Some examples of DFSes:
– NFS (Sun)
– AFS (CMU)
– DCE/DFS (Transarc / IBM)
– CIFS (Microsoft)

• Distributed file systems can be used by parallel programs, but they have significant disadvantages:
– The network bandwidth of the server system is a limiting factor on performance
– To retain UNIX-style file consistency, the DFS software must implement some form of locking, which has significant performance implications


Distributed File System : NFS

• Popular means for accessing remote file systems in a local area network.

• Based on the client-server model: the remote file systems are “mounted” via NFS and accessed through the Linux virtual file system (VFS) layer.

• NFS clients cache file data, periodically checking with the original file for any changes.

• The loosely-synchronous model makes for convenient, low-latency access to shared spaces.

• NFS avoids the common locking systems used to implement POSIX semantics.


Why NFS is bad for Parallel I/O

• Clients can cache data indiscriminately, and tend to do so at block boundaries.

• When nearby regions of a file are written by different processes on different clients, the result is undefined due to the lack of consistency control.

• Second, all file operations are remote operations; extensive file locking would be required to implement sequential consistency.

• Communication between client and server typically uses relatively slow communication channels, adding to performance degradation.

• Inefficient specification (e.g. a read operation involves two RPC operations: one to look up the file handle and a second to read the file data)
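The hazard of block-granular client caching can be sketched as a simple check: two writes that do not overlap byte-for-byte may still fall into the same cached block. The function name and the block size used in the example are assumptions for illustration, not NFS parameters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if byte ranges [off_a, off_a + len_a) and [off_b, off_b + len_b)
 * touch at least one common block of size block_size, even when the
 * ranges themselves are disjoint. */
bool ranges_share_block(size_t off_a, size_t len_a,
                        size_t off_b, size_t len_b, size_t block_size)
{
    assert(block_size > 0 && len_a > 0 && len_b > 0);
    size_t first_a = off_a / block_size;
    size_t last_a  = (off_a + len_a - 1) / block_size;
    size_t first_b = off_b / block_size;
    size_t last_b  = (off_b + len_b - 1) / block_size;
    return first_a <= last_b && first_b <= last_a;
}
```

For example, with an assumed 4096-byte block, writes to bytes 0..99 and 100..199 from two clients share a block, so each client's cached copy can overwrite the other's update when it is flushed.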

Parallel File Systems

• A parallel file system is one in which there are multiple servers as well as clients for a given file system; it is the equivalent of RAID across several file systems.

• Multiple processes can access the same file simultaneously.

• Parallel file systems are usually optimized for high performance rather than general-purpose use; common characteristics include:
– Very large block sizes (≥ 64 kB)
– Relatively slow metadata operations (e.g. fstat()) compared to reads and writes
– Special APIs for direct access

• Examples of parallel file systems include:
– GPFS (IBM)
– Lustre (Cluster File Systems)
– PVFS2 (Clemson/ANL)


Characteristics of Parallel File Systems

• Three key characteristics:
– Various hardware I/O data storage resources
– Multiple connections between these hardware devices and compute resources
– High-performance, concurrent access to these I/O resources

• Multiple physical I/O devices and paths ensure sufficient bandwidth for the high performance desired.

• Parallel I/O systems include both the hardware and a number of layers of software.

[Figure: layers of a parallel I/O system, from bottom to top: Storage Hardware, Parallel File System, Parallel I/O (MPI I/O), High-Level I/O Library]

Parallel File Systems: Hardware Layer

• I/O Hardware is usually comprised of disks, controllers, and interconnects for data movement.

• Hardware determines the maximum raw bandwidth and the minimum latency of the system.

• Bisection bandwidth of the underlying transport determines the aggregate bandwidth of the resulting parallel I/O system.

• At the hardware level, data is accessed at the granularity of blocks, either physical disk blocks or logical blocks spread across multiple physical devices such as in a RAID array.

• Parallel file systems:
– manage data on the storage hardware,
– present this data as a directory hierarchy,
– coordinate access to files and directories in a consistent manner

• File systems usually provide a UNIX like interface, allowing users to access contiguous regions of files.
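As a rough illustration of how the hardware bounds performance, the sketch below takes the aggregate bandwidth to be limited by whichever is smaller: the summed bandwidth of the storage devices or the bisection bandwidth of the interconnect. This is a simplified model with a hypothetical function name; real systems lose further bandwidth to software layers and contention.

```c
#include <assert.h>

/* Upper bound on aggregate I/O bandwidth in MB/s: the slower of the
 * combined device bandwidth and the interconnect bisection bandwidth. */
double aggregate_bandwidth_bound(int num_devices, double per_device_mb_s,
                                 double bisection_mb_s)
{
    assert(num_devices >= 0);
    double devices_mb_s = num_devices * per_device_mb_s;
    return devices_mb_s < bisection_mb_s ? devices_mb_s : bisection_mb_s;
}
```

Under this model, adding disks helps only until their combined rate reaches the bisection bandwidth; beyond that point the interconnect is the bottleneck.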






Parallel File Systems: Other Layers

• Lower-level interfaces may be provided by the file system for higher-performance access.

• Above the parallel file systems are the parallel I/O layers, provided in the form of libraries such as MPI I/O.

• The parallel I/O layer provides a low-level interface and operations such as collective I/O.

• Scientific applications work with structured data, for which a higher-level API written on top of MPI-IO, such as HDF5 or parallel netCDF, is used.

• HDF5 and parallel netCDF allow scientists to represent data sets in terms closer to those used in their applications.





PVFS2

• PVFS2 is designed to provide:
– modular networking and storage subsystems
– a structured data request format modeled after MPI datatypes
– flexible and extensible data distribution models
– distributed metadata
– tunable consistency semantics
– support for data redundancy

• Supports a variety of network technologies, including Myrinet, Quadrics, and InfiniBand.

• Also supports a variety of storage devices, including locally attached hardware, SANs, and iSCSI.

• Key abstractions include:
– Buffered Message Interface (BMI): non-blocking network interface
– Trove: non-blocking storage interface
– Flows: mechanism to specify a flow of data between network and storage

PVFS2 Software Architecture

• Buffered Message Interface (BMI):
– Non-blocking interface that can be used with many high-performance network fabrics
– Implementations currently exist for TCP/IP and Myrinet (GM) networks

• Trove:
– Non-blocking interface that can be used with a number of underlying storage mechanisms
– Trove storage objects consist of a stream of bytes and a keyword/value pair space
– Keyword/value pairs are convenient for arbitrary metadata storage and directory entries, while the stream of bytes is the natural place for file data

[Figure: PVFS2 software architecture. Client: Client API, Job Sched, BMI, Flows, Dist. Server: Request Processing, Job Sched, BMI, Flows, Dist, Trove. Also labeled: Network, Disk.]

PVFS2 Software Architecture

• Flows:
– Combine the network and storage subsystems by providing a mechanism to describe the flow of data between network and storage
– Provide a point at which data movement between a particular network and storage pair can be optimized to exploit fast paths

• The job scheduling layer provides a common interface for interacting with BMI, Flows, and Trove, and checks on their completion.

• The job scheduler is tightly integrated with a state machine that is used to track operations in progress.


The PVFS2 Components

• The four major components of a PVFS system are:
– Metadata server (mgr)
– I/O server (iod)
– PVFS native API (libpvfs)
– PVFS Linux kernel support

• Metadata server (mgr):
– Manages all the file metadata for PVFS files, using a daemon that operates atomically on the file metadata.
– PVFS thereby avoids the pitfalls of many storage area network approaches, which have to implement complex locking schemes to ensure that metadata stays consistent in the face of multiple accesses.

The PVFS2 Components

• I/O daemon (iod):
– Handles storing and retrieving file data on local disks connected to a node, using traditional read(), write(), etc. to access these files.

• The PVFS native API provides user-space access to the PVFS servers.

• The library handles the operations necessary to move data between user buffers and PVFS servers.

[Figure: metadata access and data access paths]

Source: http://csi.unmsm.edu.pe/paralelo/pvfs/desc.html

Parallel File Systems Comparison

Comparison of NFS vs. GPFS

File-System Features | NFS | GPFS
Introduced: | 1985 | 1998
Original vendor: | Sun | IBM
Example at LC: | /nfs/tmpn | /p/gx1
Primary role: | Share files among machines | Fast parallel I/O for large files
Easy to scale? | No | Yes
Network needed: | Any TCP/IP network | Only IBM SP "switch"
Access control method: | UNIX permission bits (CHMOD) | UNIX permission bits (CHMOD)
Block size: | 256 byte | 512 Kbyte (White)
Stripe width: | Depends on RAID | 256 Kbyte
Maximum file size: | 2 Gbyte (longer with v3) | 26 Gbyte
File consistency: | |
.....uses client buffering? | Yes | Yes (see diagram)
.....uses server buffering? | Yes | (see diagram)
.....uses locking? | No | Yes (token passing)
.....lock granularity? | | Byte range
.....lock managed by? | | Requesting compute node
Purged at LC? | Home, No; Tmp, Yes | Yes
Supports file quotas? | Yes | No

MPI-IO Overview

• Initially developed as a research project at the IBM T. J. Watson Research Center in 1994

• Voted by the MPI Forum to be included in MPI-2 standard (Chapter 9)

• Most widespread open-source implementation is ANL’s ROMIO, written by Rajeev Thakur (http://www-unix.mcs.anl.gov/romio/ )

• Integrates file access with the message passing infrastructure, using similarities between send/receive and file write/read operations

• Allows MPI datatypes to describe data layouts in files meaningfully, instead of dealing with unorganized streams of bytes

• Provides potential for performance optimizations through the mechanism of “hints”, collective operations on file data, or relaxation of data access atomicity

• Enables better file portability by offering alternative data representations

MPI-IO Features (I)

• Basic file manipulation (open/close, delete, space preallocation, resize, storage synchronization, etc.)

• File views (define what part of a file each process can see and how it is interpreted)
– Processes can view file data independently, with possible overlaps
– Users may define patterns to describe data distributions both in file and in memory, including non-contiguous layouts
– Permit skipping over fixed header blocks (“displacements”)
– Views can be changed by tasks at any time

• Data access positioning
– Explicitly specified offsets (suffix “_at”)
– Independent data access by each task via individual file pointers (no suffix)
– Coordinated access through shared file pointer (suffix “_shared”)

• Access synchronism
– Blocking
– Non-blocking (includes split-collective operations)

MPI-IO Features (II)

• Access coordination
– Non-collective (no additional suffix)
– Collective (suffix: “_all” for most blocking calls, “_begin” and “_end” for split-collective, or “_ordered” for equivalent of shared pointer access)

• File interoperability (ensures portability of data representation)
– Native: for purely homogeneous environments
– Internal: heterogeneous environments with implementation-defined data representation (subset of “external32”)
– External32: heterogeneous environments using data representation defined by the MPI-IO standard

• Optimization hints (the “info” interface)
– Access style (e.g. read_once, write_once, sequential, random, etc.)
– Collective buffering components (buffer and block sizes, number of target nodes)
– Striping unit and factor
– Chunked I/O specification
– Preferred I/O devices

• C, C++ and Fortran bindings
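What a portable data representation buys can be shown with a small sketch. The “external32” representation stores values in big-endian byte order regardless of the host; the helpers below (hypothetical names, not part of MPI) encode and decode a 32-bit integer that way. This only illustrates the idea and is not how an MPI library implements the conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 32-bit value most-significant byte first, whatever the
 * byte order of the host machine. */
void encode_be32(uint32_t value, unsigned char out[4])
{
    out[0] = (unsigned char)((value >> 24) & 0xFFu);
    out[1] = (unsigned char)((value >> 16) & 0xFFu);
    out[2] = (unsigned char)((value >> 8) & 0xFFu);
    out[3] = (unsigned char)(value & 0xFFu);
}

/* Reassemble a 32-bit value from its big-endian byte sequence. */
uint32_t decode_be32(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

A file written this way reads back identically on little-endian and big-endian machines, at the cost of a conversion step that the “native” representation avoids.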

MPI-IO Types

• Etype (elementary datatype): the unit of data access and positioning; all data accesses are performed in etype units and offsets are measured in etypes

• Filetype: basis for partitioning the file among processes: a template for accessing the file; may be identical to or derived from the etype

Source: http://www.mhpcc.edu/training/workshop2/mpi_io/MAIN.html

MPI-IO File Views

• A view defines the current set of data visible and accessible from an open file as an ordered set of etypes

• Each process has its own view of the file, defined by a displacement, an etype, and a filetype

• Displacement: an absolute byte position relative to the beginning of the file; defines where a view begins
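How displacement, etype, and filetype combine can be sketched without MPI. The model below assumes a deliberately simple filetype: a repeating tile in which one contiguous run of etypes is visible. The struct and function names are hypothetical; real filetypes can describe far more general patterns.

```c
#include <assert.h>

/* A simplified file view: after `disp` bytes, the file is tiled with a
 * pattern `tile_etypes` etypes long, of which `visible_etypes` etypes
 * starting at `first_visible` belong to this process. */
typedef struct {
    long long disp;            /* displacement in bytes */
    long long etype_size;      /* size of one etype in bytes */
    long long tile_etypes;     /* filetype extent, in etypes */
    long long first_visible;   /* first visible etype within a tile */
    long long visible_etypes;  /* visible etypes per tile */
} simple_view;

/* Absolute byte offset in the file of the k-th etype seen through the view. */
long long view_byte_offset(const simple_view *v, long long k)
{
    assert(k >= 0 && v->visible_etypes > 0);
    assert(v->first_visible + v->visible_etypes <= v->tile_etypes);
    long long tile = k / v->visible_etypes;     /* which repetition */
    long long within = k % v->visible_etypes;   /* position inside the run */
    long long etype_index = tile * v->tile_etypes + v->first_visible + within;
    return v->disp + etype_index * v->etype_size;
}
```

For a view of plain integers with no header (4-byte etype, tile of one etype), offset 6 maps to byte 24. For four processes interleaving 256-byte blocks, the process with first_visible = 256 sees its first byte at file offset 256 and its 257th byte at offset 1280.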


MPI-IO: File Open

Function: MPI_File_open()

    int MPI_File_open(MPI_Comm comm, char *filename, int amode,
                      MPI_Info info, MPI_File *fh);

Description: Opens the file identified by filename on all processes in the comm group, using the access mode specified in amode. The operation is collective; all participating processes must pass identical values for amode and use a filename referencing the same file. A successful call returns the open file handle in fh, which can be used to subsequently access the file.

It is possible to open a file independently of other processes by passing MPI_COMM_SELF in the comm argument.

    #include <mpi.h>
    ...
    MPI_File fh;
    int err;
    ...
    /* create a writable file with default parameters */
    err = MPI_File_open(MPI_COMM_WORLD, "/mnt/piofs/testfile",
                        MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    if (err != MPI_SUCCESS) {
        /* handle error here */
    }
    ...

MPI-IO: File Close

Function: MPI_File_close()

    int MPI_File_close(MPI_File *fh);

Description: Synchronizes file state (equivalent to an implicit invocation of MPI_File_sync), and then closes the file associated with handle fh. The user must ensure that all outstanding non-blocking requests and split-collective operations associated with handle fh have completed. If the file was opened with access mode MPI_MODE_DELETE_ON_CLOSE, it is deleted from the file system.

    #include <mpi.h>
    ...
    MPI_File fh;
    int err;
    ...
    /* open a file storing the handle in fh */
    /* perform file access */
    ...
    err = MPI_File_close(&fh);
    if (err != MPI_SUCCESS) {
        /* handle error here */
    }
    ...

MPI-IO: Set File View

Function: MPI_File_set_view()

    int MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype,
                          MPI_Datatype filetype, char *datarep, MPI_Info info);

Description: Changes the process’ view of the data in the file, setting the start of the view to disp, the type of file data to etype, the distribution of file data to processes to filetype, and the data representation to datarep. Resets the individual and shared file pointers to zero. The call is collective, requiring the values for datarep and the etype extents to be identical for all processes. The data representation must be one of: “native”, “internal” or “external32”.

    #include <mpi.h>
    ...
    MPI_File fh;
    int err;
    ...
    /* open file storing the handle in fh */
    ...
    /* view the file as stream of integers with no header,
       using native data representation */
    err = MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    if (err != MPI_SUCCESS) {
        /* handle error */
    }
    ...

MPI-IO: Read File with Explicit Offset

Function: MPI_File_read_at()

    int MPI_File_read_at(MPI_File fh, MPI_Offset offs, void *buf, int count,
                         MPI_Datatype type, MPI_Status *status);

Description: Reads count elements of type type from the file represented by fh at offset offs, storing them in the buffer pointed to by buf. Offset offs is expressed in etype units relative to the current view associated with the file handle fh. A successful call returns the amount of data transferred in status.

    #include <mpi.h>
    ...
    MPI_File fh;
    MPI_Status stat;
    int buf[3], err;
    ...
    /* open file storing the handle in fh */
    ...
    MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    /* read the third triad of integers from file */
    err = MPI_File_read_at(fh, 6, buf, 3, MPI_INT, &stat);
    ...

MPI-IO: Write to File with Explicit Offset

Function: MPI_File_write_at()

    int MPI_File_write_at(MPI_File fh, MPI_Offset offs, void *buf, int count,
                          MPI_Datatype type, MPI_Status *status);

Description: Writes count elements of type type from buffer buf to the file represented by fh at offset offs. Offset offs is expressed in etype units relative to the current view associated with the file handle fh. A successful call returns the amount of data transferred in status.

    #include <mpi.h>
    ...
    MPI_File fh;
    MPI_Status stat;
    int err;
    double dt = 0.0005;
    ...
    /* open file storing the handle in fh */
    ...
    MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
    /* store timestep as the first item in file */
    err = MPI_File_write_at(fh, 0, &dt, 1, MPI_DOUBLE, &stat);
    ...

MPI-IO: Read File Collectively with Individual File Pointers

Function: MPI_File_read_all()

    int MPI_File_read_all(MPI_File fh, void *buf, int count, MPI_Datatype type,
                          MPI_Status *status);

Description: All processes in the communicator group associated with the file handle fh read their respective count elements of type type from the file at the offsets determined by the current values of the file pointers cached on their file handles, storing them in the buffers pointed to by buf. A successful call returns the amount of data transferred in status.

    #include <mpi.h>
    ...
    MPI_File fh;
    MPI_Status stat;
    int buf[20], err;
    ...
    /* open file storing the handle in fh */
    ...
    MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    /* read 20 integers at current file offset in every process */
    err = MPI_File_read_all(fh, buf, 20, MPI_INT, &stat);
    ...

MPI-IO: Write to File Collectively with Individual File Pointers

Function: MPI_File_write_all()

    int MPI_File_write_all(MPI_File fh, void *buf, int count, MPI_Datatype type,
                           MPI_Status *status);

Description: All processes in the communicator group associated with the file handle fh write their respective count elements of type type from buffers buf to the file at the offsets determined by the current values of the file pointers cached on their file handles. A successful call returns the amount of data transferred in status.

    #include <mpi.h>
    ...
    MPI_File fh;
    MPI_Status stat;
    double t;
    int err, rank;
    ...
    /* open file storing the handle in fh; compute t */
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* interleave time values t from each process at the beginning of file */
    MPI_File_set_view(fh, rank*sizeof(t), MPI_DOUBLE, MPI_DOUBLE, "native",
                      MPI_INFO_NULL);
    err = MPI_File_write_all(fh, &t, 1, MPI_DOUBLE, &stat);
    ...

MPI-IO: File Seek

Function: MPI_File_seek()

    int MPI_File_seek(MPI_File fh, MPI_Offset offs, int whence);

Description: Updates the value of the individual file pointer according to whence, which has the following possible values:

• MPI_SEEK_SET: the pointer is set to offs

• MPI_SEEK_CUR: the pointer is set to the current value plus offs

• MPI_SEEK_END: the pointer is set to the end of file plus offs

    #include <mpi.h>
    ...
    MPI_File fh;
    MPI_Status stat;
    double t;
    int rank;
    ...
    /* open file storing the handle in fh; compute t */
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* interleave time values t from each process at the beginning of file */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);
    /* the offset comes before whence in the argument list */
    MPI_File_seek(fh, rank, MPI_SEEK_SET);
    MPI_File_write_all(fh, &t, 1, MPI_DOUBLE, &stat);
    ...
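The three whence modes amount to a small piece of arithmetic, sketched here without MPI. The enum and function are hypothetical stand-ins for illustration (positions are in etype units of the current view); they are not the MPI constants or the library's implementation.

```c
#include <assert.h>

enum seek_whence { SEEK_FROM_START, SEEK_FROM_CURRENT, SEEK_FROM_END };

/* New position of an individual file pointer, or -1 if the request would
 * place the pointer before the start of the view. */
long long seek_target(long long current, long long end, long long offs,
                      enum seek_whence whence)
{
    long long target;
    switch (whence) {
    case SEEK_FROM_START:   target = offs;           break;
    case SEEK_FROM_CURRENT: target = current + offs; break;
    case SEEK_FROM_END:     target = end + offs;     break;
    default:                return -1;
    }
    return target < 0 ? -1 : target;
}
```

In the interleaving pattern of the seek example, each process seeks from the start with its rank as the offset, so process 3 ends up positioned at the fourth double in the file.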

MPI-IO Data Access Classification

Source: http://www.mpi-forum.org/docs/mpi2-report.pdf

Example: Scatter to File

Example created by Jean-Pierre Prost from IBM Corp.

Scatter Example Source

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "mpi.h"

    static int buf_size = 1024;
    static int blocklen = 256;
    static char filename[] = "scatter.out";

    int main(int argc, char **argv)
    {
        char *buf;
        int myrank, commsize;
        MPI_Datatype filetype, buftype;
        int length[3];
        MPI_Aint disp[3];
        MPI_Datatype type[3];
        MPI_File fh;
        int mode, nbytes;
        MPI_Offset offset;
        MPI_Status status;

        /* initialize MPI */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &commsize);

        /* initialize buffer */
        buf = (char *) malloc(buf_size);
        memset((void *)buf, '0' + myrank, buf_size);

        /* create and commit buftype */
        MPI_Type_contiguous(buf_size, MPI_CHAR, &buftype);
        MPI_Type_commit(&buftype);

        /* create and commit filetype: one block of blocklen chars per
           process, repeating every blocklen * commsize bytes.
           Note: MPI_Type_struct, MPI_LB and MPI_UB are MPI-1 constructs
           that were removed in MPI-3; newer code uses
           MPI_Type_create_resized to set the extent. */
        length[0] = 1;
        length[1] = blocklen;
        length[2] = 1;
        disp[0] = 0;
        disp[1] = blocklen * myrank;
        disp[2] = blocklen * commsize;
        type[0] = MPI_LB;
        type[1] = MPI_CHAR;
        type[2] = MPI_UB;
        MPI_Type_struct(3, length, disp, type, &filetype);
        MPI_Type_commit(&filetype);

        /* open file */
        mode = MPI_MODE_CREATE | MPI_MODE_WRONLY;
        MPI_File_open(MPI_COMM_WORLD, filename, mode, MPI_INFO_NULL, &fh);

        /* set file view */
        offset = 0;
        MPI_File_set_view(fh, offset, MPI_CHAR, filetype, "native",
                          MPI_INFO_NULL);

        /* write buffer to file */
        MPI_File_write_at_all(fh, offset, (void *)buf, 1, buftype, &status);

        /* print out number of bytes written */
        MPI_Get_elements(&status, MPI_CHAR, &nbytes);
        printf("TASK %d ====== number of bytes written = %d ======\n",
               myrank, nbytes);

        /* close file */
        MPI_File_close(&fh);

        /* free datatypes */
        MPI_Type_free(&buftype);
        MPI_Type_free(&filetype);

        /* free buffer */
        free(buf);

        /* finalize MPI */
        MPI_Finalize();
        return 0;
    }

Data Access Optimizations

[Figure panels: Data Sieving; 2-phase I/O (Collective Read Implementation in ROMIO)]

Source: http://www-unix.mcs.anl.gov/~thakur/papers/romio-coll.pdf
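The idea behind data sieving can be sketched in a few lines: instead of issuing one small read per non-contiguous request, read the single contiguous extent that covers all of them and pick out the wanted pieces in memory. The code below uses an in-memory array as a stand-in for the file and hypothetical names throughout; it illustrates the technique and is not ROMIO's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    size_t offset;  /* byte offset in the file */
    size_t length;  /* number of bytes wanted */
} io_request;

/* Serve all requests with one large contiguous "read" into scratch, then
 * copy the requested pieces, packed back to back, into out.
 * Returns the number of bytes placed in out. */
size_t sieve_read(const unsigned char *file, size_t file_len,
                  const io_request *reqs, size_t nreqs,
                  unsigned char *scratch, size_t scratch_len,
                  unsigned char *out)
{
    if (nreqs == 0)
        return 0;

    /* find the extent [lo, hi) covering every request */
    size_t lo = reqs[0].offset;
    size_t hi = reqs[0].offset + reqs[0].length;
    for (size_t i = 1; i < nreqs; i++) {
        if (reqs[i].offset < lo)
            lo = reqs[i].offset;
        if (reqs[i].offset + reqs[i].length > hi)
            hi = reqs[i].offset + reqs[i].length;
    }
    assert(hi <= file_len && hi - lo <= scratch_len);

    /* the single contiguous access (a real file read in practice) */
    memcpy(scratch, file + lo, hi - lo);

    /* extract the wanted pieces from the scratch buffer */
    size_t copied = 0;
    for (size_t i = 0; i < nreqs; i++) {
        memcpy(out + copied, scratch + (reqs[i].offset - lo), reqs[i].length);
        copied += reqs[i].length;
    }
    return copied;
}
```

The trade-off is that bytes in the gaps between requests are read and discarded, which pays off when the gaps are small relative to the cost of issuing many separate requests.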

ROMIO Scaling Examples

• Bandwidths obtained for 512³ arrays (astrophysics benchmark) on the Argonne IBM SP

Write Operations

Processors | Independent I/O | Collective I/O
16 | 1.26 MB/s | 64.8 MB/s
32 | 1.25 MB/s | 69.5 MB/s
48 | 1.36 MB/s | 70.6 MB/s

Read Operations

Processors | Independent I/O | Collective I/O
16 | 12.8 MB/s | 68.5 MB/s
32 | 6.46 MB/s | 82.6 MB/s
48 | 5.83 MB/s | 88.4 MB/s

Source: http://www-unix.mcs.anl.gov/~thakur/sio-demo/astro.html

Independent vs. Collective Access

[Figure panels: Collective I/O on IBM SP; Individual I/O on IBM SP]

Source: http://www-unix.mcs.anl.gov/~thakur/sio-demo/upshot.html

Introduction to HDF5

• HDF stands for Hierarchical Data Format: a portable, freely distributable, and well-supported file format, together with a library and a set of utilities to manipulate it

• Explicitly designed for use with scientific data and applications

• The initial HDF version was created at NCSA/University of Illinois at Urbana-Champaign in 1988

• The first revision in widespread use was HDF4

• Main HDF features include:
– Versatility: supports different data models and associated metadata
– Self-describing: allows an application to interpret the structure and contents of a file without any extraneous information
– Flexibility: permits mixing and grouping various objects together in one file in a user-defined hierarchy
– Extensibility: accommodates new data models, added both by users and developers
– Portability: can be shared across different platforms without preprocessing or modifications

• HDF5 is the most recent incarnation of the format, adding support for new type and data models, parallel I/O, and streaming; removing a number of existing restrictions (maximal file size, number of objects per file, flexibility of type use, storage management configurability, etc.); and improving performance

HDF5 File Layout

• Major object classes: groups and datasets

• Namespace resembles file system directory hierarchy (groups ≡ directories, datasets ≡ files)

• Alias creation supported through links (both soft and hard)

• Mounting of sub-hierarchies is possible

[Figure: user’s view and low-level organization of an HDF5 file]

HDF5 API & Tools

Library functionality grouped by function name prefix

• H5: general purpose functions

• H5A: attribute interface

• H5D: dataset manipulation

• H5E: error handling

• H5F: file interface

• H5G: group creation and access

• H5I: object identifiers

• H5P: property lists

• H5R: references

• H5S: dataspace definition

• H5T: datatype manipulation

• H5Z: inline data filters and compression

Command-line utilities

• h5cc, h5c++, h5fc: C, C++ and Fortran compiler wrappers

• h5redeploy: updates compiler tools after installation in a new location

• h5ls, h5dump: list the hierarchy and contents of an HDF5 file

• h5diff: compares two HDF5 files

• h5repack, h5repart: rearrange or repartition a file

• h5toh4, h4toh5: convert between HDF5 and HDF4 formats

• h5import: imports data into an HDF5 file

• gif2h5, h52gif: convert image data between GIF and HDF5 formats

Basic HDF5 Concepts

• Group

– Structure containing zero or more HDF5 objects (possibly other groups)

– Provides a mechanism for mapping a name (path) to an object

– “Root” group is a logical container of all other objects in a file

• Dataset

– A named array of data elements (possibly multi-dimensional)

– Specifies, through its associated datatype and dataspace, how the data will be represented when stored in the HDF5 file

• Dataspace

– Defines dimensionality of a dataset (rank and dimension sizes)

– Determines the effective subset of data to be stored or retrieved in subsequent file operations (aka selection)

• Datatype

– Describes the atomically accessed element of a dataset

– Permits construction of derived (compound) types, such as arrays, records, enumerations

– Influences conversion of numeric values between different platforms or implementations

• Attribute

– A small, user-defined structure attached to a group, dataset or named datatype, providing additional information

52
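The dataspace "selection" idea can be illustrated without the HDF5 library. The sketch below assumes the (start, stride, count, block) parameters that HDF5 hyperslab selections use, reduced to one dimension, and simply enumerates which element indices such a selection covers. The type `slab1d` and the functions `slab_npoints` and `slab_indices` are illustrative names, not part of the HDF5 API.

```c
#include <assert.h>
#include <stddef.h>

/* One-dimensional hyperslab description, mirroring the (start, stride,
 * count, block) parameters of HDF5 hyperslab selections.
 * The struct and function names are illustrative, not HDF5 API. */
typedef struct {
    size_t start;   /* index of the first selected element */
    size_t stride;  /* distance between the starts of consecutive blocks */
    size_t count;   /* number of blocks */
    size_t block;   /* number of contiguous elements per block */
} slab1d;

/* Number of elements the selection covers. */
size_t slab_npoints(const slab1d *s)
{
    return s->count * s->block;
}

/* Writes the selected indices into out[] (capacity cap) and returns how
 * many were written. Indices at or beyond extent fail an assertion. */
size_t slab_indices(const slab1d *s, size_t extent, size_t *out, size_t cap)
{
    size_t n = 0;
    assert(s->stride >= s->block);          /* blocks must not overlap */
    for (size_t b = 0; b < s->count; b++) {
        for (size_t k = 0; k < s->block; k++) {
            size_t idx = s->start + b * s->stride + k;
            assert(idx < extent);           /* stay inside the dataspace */
            if (n < cap)
                out[n++] = idx;
        }
    }
    return n;
}
```

For example, start 1, stride 4, count 3, block 2 on a 12-element dataspace picks three two-element blocks beginning at indices 1, 5 and 9. A multi-dimensional selection applies the same rule independently along each dimension.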

HDF5 Spatial Subset Examples

53

Source: http://hdf.ncsa.uiuc.edu/HDF5/RD100-2002/All_About_HDF5.pdf

HDF5 Virtual File Layer


• Developed to cope with large number of available storage subsystem variations

• Permits custom file driver implementations and related optimizations

Overview of Data Storage Options


Simultaneous Spatial and Type Transformation Example


Simple HDF5 Code Example

57

/* Writing and reading an existing dataset. */
#include "hdf5.h"
#define FILE "dset.h5"

int main() {
    hid_t  file_id, dataset_id;  /* identifiers */
    herr_t status;
    int    i, j, dset_data[4][6];

    /* Initialize the dataset. */
    for (i = 0; i < 4; i++)
        for (j = 0; j < 6; j++)
            dset_data[i][j] = i * 6 + j + 1;

    /* Open an existing file. */
    file_id = H5Fopen(FILE, H5F_ACC_RDWR, H5P_DEFAULT);

    /* Open an existing dataset (HDF5 1.6 signature; HDF5 1.8 and later
       provide H5Dopen2(file_id, "/dset", H5P_DEFAULT)). */
    dataset_id = H5Dopen(file_id, "/dset");

    /* Write the dataset. */
    status = H5Dwrite(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                      H5P_DEFAULT, dset_data);

    /* Read the dataset back. */
    status = H5Dread(dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL,
                     H5P_DEFAULT, dset_data);

    /* Close the dataset. */
    status = H5Dclose(dataset_id);

    /* Close the file. */
    status = H5Fclose(file_id);

    /* herr_t is negative on failure. */
    return (status < 0) ? 1 : 0;
}


Parallel HDF5

• Relies on MPI-IO as the file layer driver

• Uses MPI for internal communications

• Most of the functionality controlled through property lists (requires minimal HDF5 interface changes)

• Supports both individual and collective file access

• Three raw data storage layouts: contiguous, chunked and compact

• Enables additional optimizations through derived MPI datatypes (esp. for regular collective accesses)

• Limitations

– Chunked storage with overlapping chunks (results are non-deterministic)

– Compression is read-only

– Writes with variable-length datatypes not supported
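In a parallel HDF5 program, each MPI process usually selects its own portion of a shared dataset before an individual or collective write. The sketch below shows only that bookkeeping step, assuming a simple row-block decomposition. It calls neither MPI nor HDF5, and `row_block` and `rows_for_rank` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* The contiguous range of dataset rows owned by one rank (illustrative). */
typedef struct {
    size_t first_row;  /* offset of this rank's block within the dataset */
    size_t num_rows;   /* number of rows this rank owns */
} row_block;

/* Splits nrows rows across nprocs ranks as evenly as possible: the first
 * (nrows % nprocs) ranks each receive one extra row. */
row_block rows_for_rank(size_t nrows, int nprocs, int rank)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    size_t p = (size_t)nprocs;
    size_t r = (size_t)rank;
    size_t base = nrows / p;
    size_t extra = nrows % p;
    row_block rb;
    rb.num_rows = base + (r < extra ? 1 : 0);
    /* Ranks below 'extra' are each preceded by r larger blocks; the rest
     * are preceded by all 'extra' larger blocks. */
    rb.first_row = r * base + (r < extra ? r : extra);
    return rb;
}
```

With 10 rows and 4 processes this gives blocks of 3, 3, 2 and 2 rows starting at rows 0, 3, 6 and 8. In a real program the two fields would typically serve as the start and count of each rank's selection in the file dataspace, so that no two ranks write overlapping regions.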

58


General Parallel File System (GPFS)

• Brief history:

– Based on the Tiger Shark parallel file system developed at the IBM Almaden Research Center in 1993 for AIX

  • Originally targeted at dedicated video servers

  • The multimedia orientation influenced GPFS command names: they all contain “mm”

– First commercial release was GPFS V1.1 in 1998

– Linux port released in 2001; Linux-AIX interoperability supported since V2.2 in 2004

• Highly scalable

– Distributed metadata management

– Permits incremental scaling

• High-performance

– Large block size with wide striping

– Parallel access to files from multiple nodes

– Deep prefetching

– Adaptable mechanism for recognizing access patterns

– Multithreaded daemon

• Highly available and fault tolerant

– Data protection through journaling, replication, mirroring and shadowing

– Ability to recover from multiple disk, node and connectivity failures (heartbeat mechanism)

– Recovery mechanism implemented in all layers

60

GPFS Features (I)

61

Source: http://www-03.ibm.com/systems/clusters/software/gpfs.pdf

GPFS Features (II)

62

GPFS Architecture

63

Source: http://www.redbooks.ibm.com/redbooks/pdfs/sg245610.pdf

Components Internal to GPFS Daemon

• Configuration Manager (CfgMgr)

– Selects the node acting as Stripe Group Manager for each file system

– Checks for the quorum of nodes required for file system usage to continue

– Appoints a successor node in case of failure

– Initiates and controls the recovery procedure

• Stripe Group Manager (FSMgr, aka File System Manager)

– Exactly one per GPFS file system

– Maintains availability information for the disks comprising the file system (physical storage)

– Processes modifications (disk removals and additions)

– Repairs the file system and coordinates data migration when required

• Metanode

– Manages metadata (directory block updates)

– Its location may change (e.g. a node obtaining access to the file may become the metanode)

• Token Manager Server

– Synchronizes concurrent access to files and ensures consistency among caches

– Manages tokens, or per-object locks

  • Mediates token migration when another node requests a token that conflicts with an existing token (token stealing)

– Always located on the same node as the Stripe Group Manager

64

GPFS Management Functions & Their Dependencies

65

Source: http://www.redbooks.ibm.com/redbooks/pdfs/sg246700.pdf

Components External to GPFS Daemon

• Virtual Shared Disk (VSD, aka logical volume)

– Enables nodes in one SP system partition to share disks with the other nodes in the same system partition

– A VSD node can be a client, a server (owning a number of VSDs, and performing data reads and writes requested by client nodes), or both at the same time

• Recoverable Virtual Shared Disk (RVSD)

– Used together with VSD to provide high availability against node failures reported by Group Services

– Runs recovery scripts and notifies client applications

• Switch (interconnect) Subsystem

– Starts the switch daemon, responsible for initializing and monitoring the switch

– Discovers and reacts to topology changes; reports and services status/error packets

• Group Services

– Fault-tolerant, highly available and partition-sensitive service monitoring and coordinating changes related to another subsystem operating in the partition

– Operates on each node within the partition, plus the control workstation for the partition

• System Data Repository (SDR)

– Location where the configuration data are stored

66

Read Operation Flow in GPFS

67

Write Operation Flow in GPFS

68

Token Management in GPFS

69

• First lock request for an object requires a message from node N1 to the token manager

• Token server grants token to N1 (subsequent lock requests can be granted locally)

• Node N2 requests token for the same file (lock conflict)

• Token server detects conflicting lock request and revokes token from N1

• If N1 was writing to file, the data is flushed to disk before the revocation is complete

• Node N2 gets the token from N1
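The token sequence above can be traced with a toy model. The sketch below tracks a single token and counts how often the token server must be contacted and how often a revocation forces a flush. It is a simplified illustration under the assumption of one exclusive token per object. All names (`token_state`, `token_acquire`, `token_write`) are hypothetical, and none of this is GPFS code.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_HOLDER (-1)

/* Toy model of one token (a per-object lock) tracked by a token server. */
typedef struct {
    int  holder;       /* node currently holding the token, or NO_HOLDER */
    bool dirty;        /* holder has written data that is not yet on disk */
    int  server_msgs;  /* requests that had to go to the token server */
    int  flushes;      /* flushes forced by a revocation */
} token_state;

void token_init(token_state *t)
{
    t->holder = NO_HOLDER;
    t->dirty = false;
    t->server_msgs = 0;
    t->flushes = 0;
}

/* A node asks for the lock. Returns true if the request was granted
 * locally (the node already holds the token), false if the token server
 * had to be involved. */
bool token_acquire(token_state *t, int node)
{
    assert(node >= 0);
    if (t->holder == node)
        return true;            /* subsequent requests need no message */

    t->server_msgs++;           /* first request or conflict: ask the server */
    if (t->holder != NO_HOLDER && t->dirty) {
        t->flushes++;           /* old holder flushes before revocation ends */
        t->dirty = false;
    }
    t->holder = node;           /* token migrates to the requester */
    return false;
}

/* Records a write performed by the current token holder. */
void token_write(token_state *t, int node)
{
    assert(t->holder == node);  /* only the holder may write */
    t->dirty = true;
}
```

Running N1 acquire, N1 acquire, N1 write, N2 acquire through this model yields two server messages and one forced flush, matching the sequence in the list: only the first request and the conflicting request involve the token server.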

GPFS Write-behind and Prefetch

70

• As soon as the application’s write buffer is copied into the local pagepool, the write operation is complete from the client’s perspective

• GPFS daemon schedules a worker thread to finalize the request by issuing I/O calls to the device driver

• GPFS estimates the number of blocks to read ahead based on disk performance and rate at which application is reading the data

• Additional prefetch requests are processed asynchronously with the completion of the current read
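The slide states only that read-ahead depth depends on disk performance and on the application's read rate; it does not give the policy GPFS actually uses. The sketch below is therefore a hypothetical estimate, not the GPFS algorithm: it assumes the goal is to keep enough blocks in flight to cover the data the application consumes during one disk access. The function name, its parameters, and the cap `max_blocks` are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical read-ahead estimate (not the GPFS algorithm): number of
 * blocks needed to cover what the application reads during one disk
 * access, rounded up, at least 1, and capped at max_blocks. */
size_t prefetch_blocks(double app_bytes_per_s, double disk_latency_s,
                       size_t block_bytes, size_t max_blocks)
{
    assert(block_bytes > 0 && max_blocks >= 1);
    assert(app_bytes_per_s >= 0.0 && disk_latency_s >= 0.0);

    /* Data the application consumes while one disk request is serviced. */
    double bytes_needed = app_bytes_per_s * disk_latency_s;

    size_t blocks = (size_t)(bytes_needed / (double)block_bytes);
    if ((double)blocks * (double)block_bytes < bytes_needed)
        blocks++;                       /* round up to whole blocks */

    if (blocks < 1)
        blocks = 1;                     /* always read at least one ahead */
    if (blocks > max_blocks)
        blocks = max_blocks;            /* bound pagepool usage */
    return blocks;
}
```

Under these assumptions, a faster reader or a slower disk both increase the estimate, which is the qualitative behavior the slide describes. The cap stands in for the limit that available buffer space would impose.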

Some GPFS Cluster Models

71

[Figure panels: Network Shared Disk (NSD) with dedicated server model; direct attached model; mixed (NSD and direct attached) model; joined (AIX and Linux) model]

Comparison of GPFS to Other File Systems

72


74

Summary – Material for the Test

• Need for Parallel I/O (slide 6)

• RAID concepts (slides 8-10)

• Distributed File System Concepts NFS (slides 12, 13)

• Why NFS is bad for parallel I/O (slide 14)

• Parallel File System Concepts (slides 16-19)

• PVFS (slides 20-24)

• MPI-IO concepts & features (slides 29-32)

• MPI-IO API & functionalities (slides 33-41)
