Page 1

Project 4

SciDAC All Hands Meeting

March 26-27, 2002

PIs: Alok Choudhary, Wei-keng Liao

Grad Students: Avery Ching, Kenin Coloma, Jianwei Li

ANL Collaborators:

Bill Gropp, Rob Ross, Rajeev Thakur

Enabling High Performance Application I/O

Wei-keng Liao, Northwestern University

Page 2

Outline

1. Design of parallel netCDF APIs
– Built on top of MPI-IO (student: Jianwei Li)
– Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur (ANL)

2. Non-contiguous data access on PVFS
– Design of non-contiguous access APIs (student: Avery Ching)
– Interfaces to MPI-IO (student: Kenin Coloma)
– Applications: FLASH, tiled visualization
– Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur (ANL)

3. High-level data access patterns
– ENZO astrophysics application
– Access patterns of an AMR application

Page 3

NetCDF Overview

NetCDF (network Common Data Form) is an interface for array-oriented data access. It defines a machine-independent file format for representing multi-dimensional arrays with ancillary data, and provides an I/O library for the creation, access, and sharing of array-oriented data.

Each netCDF file is a dataset, which contains a set of named arrays.

Dataset components

• Dimensions: name, length
– Fixed dimensions
– UNLIMITED dimension

• Variables: named arrays with name, type, shape, attributes, and array data
– Fixed-size variables: arrays of fixed dimensions
– Record variables: arrays whose most-significant dimension is UNLIMITED
– Coordinate variables: 1-D arrays with the same name as their dimension

• Attributes: name, type, values, length
– Variable attributes
– Global attributes

netCDF example { // CDL notation for a netCDF dataset

dimensions:  // dimension names and lengths
    lat = 5, lon = 10, level = 4, time = unlimited;

variables:   // variable types, names, shapes, attributes
    float temp(time,level,lat,lon);
        temp:long_name = "temperature";
        temp:units = "celsius";
    float rh(time,lat,lon);
        rh:long_name = "relative humidity";
        rh:valid_range = 0.0, 1.0;        // min and max
    int lat(lat), lon(lon), level(level), time(time);
        lat:units = "degrees_north";
        lon:units = "degrees_east";
        level:units = "millibars";
        time:units = "hours since 1996-1-1";

// global attributes:
    :source = "Fictional Model Output";

data:        // optional data assignments
    level = 1000, 850, 700, 500;
    lat   = 20, 30, 40, 50, 60;
    lon   = -160,-140,-118,-96,-84,-52,-45,-35,-25,-15;
    time  = 12;
    rh    = .5,.2,.4,.2,.3,.2,.4,.5,.6,.7,
            .1,.3,.1,.1,.1,.1,.5,.7,.8,.8,
            .1,.2,.2,.2,.2,.5,.7,.8,.9,.9,
            .1,.2,.3,.3,.3,.3,.7,.8,.9,.9,
             0,.1,.2,.4,.4,.4,.4,.7,.9,.9;  // 1 record allocated
}

Page 4

Design of Parallel netCDF APIs

• Goal
– Maintain exactly the same original netCDF file format
– Provide parallel I/O functionality on top of MPI-IO

• High-level parallel APIs
– Minimize changes to the argument lists of the netCDF APIs
– For legacy codes, requiring minimal changes (see the sketch below)

• Low-level parallel APIs
– Expose MPI-IO components, e.g. derived datatypes
– For users experienced with MPI-IO
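To make the intended interface concrete, here is a minimal sketch of the high-level parallel prototypes, inferred from the table on Page 9 and the example code on Pages 10-11; the exact argument types (e.g. size_t for start/count) are assumptions carried over from the serial netCDF API.

#include <mpi.h>
#include <stddef.h>                /* size_t */

/* High-level parallel APIs: same names and argument lists as serial netCDF,
 * except that dataset create/open additionally take an MPI communicator. */
int nc_create(MPI_Comm comm, const char *path, int cmode, int *ncidp);
int nc_open  (MPI_Comm comm, const char *path, int omode, int *ncidp);

/* Data-mode calls keep the serial argument list; the collective variants
 * carry the _all suffix (types assumed, following serial netCDF). */
int nc_put_vara_int_all(int ncid, int varid, const size_t start[],
                        const size_t count[], const int *buf);
int nc_get_vara_int_all(int ncid, int varid, const size_t start[],
                        const size_t count[], int *buf);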

Page 5

NetCDF File Structure

● Header (dataset definitions, extendable)
  - Number of records allocated
  - Dimension list
  - Global attribute list
  - Variable list

● Data (row-major, big-endian, 4-byte aligned)
  - Fixed-size (non-record) data: the data for each variable is stored contiguously, in the defined order
  - Record data (non-contiguous between records of a variable): a variable number of fixed-size records, each of which contains one record for each of the record variables, in the defined order

Page 6

NetCDF APIs

• Dataset APIs -- create/open/close a dataset, switch the dataset between define and data mode, and synchronize dataset changes to disk
  • Input: path and mode for create/open; dataset ID for an opened dataset
  • Output: dataset ID for create/open

• Define-mode APIs -- define the dataset: add dimensions and variables
  • Input: opened dataset ID; dimension name and length to define a dimension; or variable name, number of dimensions, and shape to define a variable
  • Output: dimension ID, or variable ID

• Attribute APIs -- add, change, and read attributes of datasets
  • Input: opened dataset ID; attribute number or attribute name to access an attribute; or attribute name, type, and value to add/change an attribute
  • Output: attribute value for a read attribute

• Inquiry APIs -- inquire dataset metadata (in memory): dim (id, name, len), var (name, ndims, shape, id)
  • Input: opened dataset ID; dimension name or ID, or variable name or ID
  • Output: dimension info, or variable info

• Data-mode APIs -- read/write a variable (access methods: single value, whole array, subarray, strided subarray, sampled subarray)
  • Input: opened dataset ID; variable ID; element start index, count, stride, index map

Page 7

Design of Parallel APIs• Two file descriptors

– NetCDF file descriptor: For header I/O (reuse of old code)Performed only by process 0

– MPI_File handle: For data array I/OPerformed by all processes

• Implicit MPI file handle and communicator– Added into the internal data structure– MPI communicator passed as an argument in create/open

• I/O implementation using MPI-IO– File view and offsets are computed from metadata in

header and user-provided arguments (start, count, stride)– Users choose either collective or non-collective I/O calls
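As an illustration of how a data-mode call can be mapped onto MPI-IO internally, here is a minimal sketch for a fixed-size 2-D integer variable whose global shape and starting byte offset are known from the header; the function and variable names are hypothetical, and byte-order conversion and error handling are omitted.

#include <mpi.h>

/* Sketch: write one process's subarray of a 2-D integer variable using an
 * MPI-IO file view derived from netCDF metadata. */
void write_subarray(MPI_File fh, MPI_Offset var_offset,
                    const int gsizes[2],   /* global variable shape */
                    const int counts[2],   /* local subarray shape  */
                    const int starts[2],   /* local subarray start  */
                    const int *buf)
{
    MPI_Datatype filetype;
    MPI_Status   status;

    /* Describe which elements of the variable this process touches. */
    MPI_Type_create_subarray(2, (int *)gsizes, (int *)counts, (int *)starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    /* var_offset is the byte offset of the variable's data in the file,
     * taken from the netCDF header. (Real code must also convert to the
     * netCDF big-endian representation.) */
    MPI_File_set_view(fh, var_offset, MPI_INT, filetype,
                      "native", MPI_INFO_NULL);

    /* Collective write; MPI_File_write would be the non-collective choice. */
    MPI_File_write_all(fh, (void *)buf, counts[0] * counts[1],
                       MPI_INT, &status);

    MPI_Type_free(&filetype);
}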

Page 8

Collective/Non-collective APIs

• Dataset APIs
– Collective calls over the communicator passed into the create or open call
– All processes collectively switch between define and data mode

• Define-mode, attribute, and inquiry APIs
– Collective or non-collective calls
– Operate on local memory (all processes hold identical header structures)

• Data-mode APIs
– Collective or non-collective calls
– Access methods: single value, whole array, subarray, strided subarray

Page 9

Changes in the High-level Parallel APIs

Category           Original netCDF API   Parallel API           Argument change   Needs MPI-IO
-----------------  --------------------  ---------------------  ----------------  ------------
Dataset            nc_create             nc_create              Add MPI_Comm      yes
                   nc_open               nc_open                Add MPI_Comm      yes
                   nc_enddef             nc_enddef              No change         yes
                   nc_redef              nc_redef               No change         yes
                   nc_close              nc_close               No change         yes
                   nc_sync               nc_sync                No change         yes
Define mode,       all                   No change              No change         no
attribute, inquiry
Data mode          nc_put_var_type*      nc_put_var_type        No change         yes
                   nc_get_var_type       nc_get_var_type,       No change         yes
                                         nc_get_var_type_all

* type = text | uchar | schar | short | int | long | float | double

Page 10

Example Code - Write

• Create a dataset
– Collective
– The input arguments should be the same on all processes
– The returned ncid differs among processes (but refers to the same dataset)
– All processes are put in define mode

• Define dimensions
– Non-collective
– All processes should make the same definitions

• Define variables
– Non-collective
– All processes should make the same definitions

• Add attributes
– Non-collective
– All processes should put the same attributes

• End define
– Collective
– All processes switch from define mode to data mode

• Write variable data
– All processes issue a number of collective writes to write the data for each variable
– Independent writes can be used instead, if desired
– Each process provides different argument values, which are set locally

• Close the dataset
– Collective

status = nc_create(comm, "test.nc", NC_CLOBBER, &ncid);   /* the only change */

/* dimensions */
status = nc_def_dim(ncid, "x", 100L, &dimid1);
status = nc_def_dim(ncid, "y", 100L, &dimid2);
status = nc_def_dim(ncid, "z", 100L, &dimid3);
status = nc_def_dim(ncid, "time", NC_UNLIMITED, &udimid);

square_dim[0] = cube_dim[0] = xytime_dim[1] = dimid1;
square_dim[1] = cube_dim[1] = xytime_dim[2] = dimid2;
cube_dim[2] = dimid3;
xytime_dim[0] = udimid;
time_dim[0] = udimid;

/* variables */
status = nc_def_var(ncid, "square", NC_INT, 2, square_dim, &square_id);
status = nc_def_var(ncid, "cube",   NC_INT, 3, cube_dim,   &cube_id);
status = nc_def_var(ncid, "time",   NC_INT, 1, time_dim,   &time_id);
status = nc_def_var(ncid, "xytime", NC_INT, 3, xytime_dim, &xytime_id);

/* attributes */
status = nc_put_att_text(ncid, NC_GLOBAL, "title", strlen(title), title);
status = nc_put_att_text(ncid, square_id, "description", strlen(desc), desc);

status = nc_enddef(ncid);

/* variable data */
nc_put_vara_int_all(ncid, square_id, square_start, square_count, buf1);
nc_put_vara_int_all(ncid, cube_id,   cube_start,   cube_count,   buf2);
nc_put_vara_int_all(ncid, time_id,   time_start,   time_count,   buf3);
nc_put_vara_int_all(ncid, xytime_id, xytime_start, xytime_count, buf4);

status = nc_close(ncid);

(The only change from the serial netCDF code is the MPI communicator passed to nc_create.)

Page 11

Example Code - Read

status = nc_open(comm, filename, 0, &ncid);   /* the only change */

status = nc_inq(ncid, &ndims, &nvars, &ngatts, &unlimdimid);

/* global attributes */
for (i = 0; i < ngatts; i++) {
    status = nc_inq_attname(ncid, NC_GLOBAL, i, name);
    status = nc_inq_att(ncid, NC_GLOBAL, name, &type, &len);
    status = nc_get_att_text(ncid, NC_GLOBAL, name, valuep);
}

/* variables */
for (i = 0; i < nvars; i++) {
    status = nc_inq_var(ncid, i, name, vartypes+i, varndims+i, vardims[i],
                        varnatts+i);

    /* variable attributes */
    for (j = 0; j < varnatts[i]; j++) {
        status = nc_inq_attname(ncid, varids[i], j, name);
        status = nc_inq_att(ncid, varids[i], name, &type, &len);
        status = nc_get_att_text(ncid, varids[i], name, valuep);
    }
}

/* variable data */
for (i = 0; i < NC_MAX_VAR_DIMS; i++)
    start[i] = 0;

for (i = 0; i < nvars; i++) {
    varsize = 1;

    /* dimensions: partition each variable along its first dimension */
    for (j = 0; j < varndims[i]; j++) {
        status = nc_inq_dim(ncid, vardims[i][j], name, shape + j);
        if (j == 0) {
            shape[j] /= nprocs;
            start[j] = shape[j] * rank;
        }
        varsize *= shape[j];
    }

    status = nc_get_vara_int_all(ncid, i, start, shape, (int *)valuep);
}

status = nc_close(ncid);

• Open the dataset
– Collective
– The input arguments should be the same on all processes
– The returned ncid differs among processes (but refers to the same dataset)
– All processes are put in data mode

• Dataset inquiries
– Non-collective
– Count, name, len, datatype

• Read variable data
– All processes issue a number of collective reads to read the data of each variable in a (B, *, *) manner (block-partitioned along the first dimension)
– Independent reads can be used instead, if desired
– Each process provides different argument values, which are set locally

• Close the dataset
– Collective

(Again, the only change from the serial netCDF code is the MPI communicator passed to nc_open.)

Page 12

Non-contiguous Data Access on PVFS

• Problem definition
• Design approaches
– Multiple I/O
– Data sieving
– PVFS list_io
• Integration into MPI-IO
• Experimental results
– Artificial benchmark
– FLASH application I/O
– Tile visualization

Page 13

Non-contiguous Data Access

• Data access that is not adjacent in memory or in the file
– Non-contiguous in memory, contiguous in file
– Non-contiguous in file, contiguous in memory
– Non-contiguous in both file and memory

• Two applications
– FLASH astrophysics application
– Tile visualization

[Figure: memory and file layouts for the three cases -- non-contiguous in memory/contiguous in file, contiguous in memory/non-contiguous in file, and non-contiguous in both.]

Page 14

Multiple I/O Requests

• Intuitive strategy
– One I/O request per contiguous data segment (see the sketch below)

• Large number of I/O requests to the file system
– Communication costs between the application and the I/O servers become significant and can dominate the I/O time

[Figure: the application issues one I/O request per contiguous data region, each going to an I/O server; a strided read requires a first and a second read request to cover the file.]
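For concreteness, a minimal sketch of the multiple-I/O strategy, assuming the non-contiguous file request is already described as arrays of file offsets and lengths (the function and parameter names are hypothetical):

#include <unistd.h>     /* pread */
#include <sys/types.h>

/* Multiple I/O: issue one read per contiguous file segment. n_segments
 * requests reach the file system, which is what makes this strategy
 * expensive when the segments are small and numerous. */
ssize_t read_multiple(int fd, char *buf,
                      const off_t file_offsets[],
                      const size_t file_lengths[],
                      int n_segments)
{
    ssize_t total = 0;
    for (int i = 0; i < n_segments; i++) {
        ssize_t r = pread(fd, buf + total, file_lengths[i], file_offsets[i]);
        if (r < 0)
            return -1;          /* propagate the error */
        total += r;
    }
    return total;
}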

Page 15

Data Sieving I/O

• Read a contiguous chunk from the file into a temporary buffer (see the sketch below)

• Extract/update the requested portions
– Number of requests is reduced
– I/O amount is increased
– The number of I/O requests depends on the size of the sieving buffer

• Write the buffer back to the file (for write operations)

[Figure: the application issues one large contiguous I/O request covering several contiguous data regions; a first and a second request span the file, and the requested pieces are extracted from the sieving buffer.]
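A minimal sketch of the read side of data sieving, assuming the request is given as file offsets/lengths and that a single sieving buffer covers the whole extent (real implementations such as ROMIO loop over a bounded buffer); all names here are hypothetical:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

/* Data sieving (read): fetch one contiguous chunk covering the whole
 * requested extent, then copy out only the pieces the caller asked for.
 * One file-system request instead of n_segments, at the cost of reading
 * extra bytes. */
int read_data_sieving(int fd, char *user_buf,
                      const off_t file_offsets[],
                      const size_t file_lengths[],
                      int n_segments)
{
    off_t  first  = file_offsets[0];
    off_t  last   = file_offsets[n_segments - 1]
                  + (off_t)file_lengths[n_segments - 1];
    size_t extent = (size_t)(last - first);

    char *sieve_buf = malloc(extent);
    if (sieve_buf == NULL)
        return -1;

    /* One large contiguous read covering the holes as well. */
    if (pread(fd, sieve_buf, extent, first) != (ssize_t)extent) {
        free(sieve_buf);
        return -1;
    }

    /* Extract the requested portions into the user buffer. */
    size_t copied = 0;
    for (int i = 0; i < n_segments; i++) {
        memcpy(user_buf + copied,
               sieve_buf + (file_offsets[i] - first),
               file_lengths[i]);
        copied += file_lengths[i];
    }

    free(sieve_buf);
    return 0;
}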

Page 16

PVFS List_io

• Combine non-contiguous I/O requests into a single request

• Client support
– APIs pvfs_list_read, pvfs_list_write
– An I/O request carries a list of file offsets and file lengths

• I/O server support
– Wait for the trailing list of file offsets and lengths that follows the I/O request

[Figure: the application's contiguous data regions are described to the PVFS library as one list-I/O request, which is forwarded to the I/O servers.]

Page 17

Artificial Benchmark

• Contiguous in memory, non-contiguous in file

• Parameters:
– Number of accesses
– Number of processors
– Stride size = file size / number of accesses
– Block size = stride size / number of processors

(See the sketch below for this pattern expressed as an MPI-IO file view.)

[Figure: each of 4 strided accesses covers one stride of the file; within a stride, Proc 0, Proc 1, and Proc 2 each read one block into contiguous memory.]
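One way to express this strided pattern is through an MPI-IO file view; the following is a minimal sketch under the parameter definitions above (stride = file size / number of accesses, block = stride / number of processes), not the benchmark's actual code:

#include <mpi.h>

/* Each process reads num_accesses blocks of `block` bytes, spaced `stride`
 * bytes apart in the file, into a contiguous memory buffer. */
void strided_read(MPI_File fh, int rank, int num_accesses,
                  MPI_Offset stride, MPI_Offset block, char *buf)
{
    MPI_Datatype filetype;
    MPI_Status   status;

    /* num_accesses blocks of `block` bytes, one per stride. */
    MPI_Type_vector(num_accesses, (int)block, (int)stride,
                    MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    /* Offset each process by its own block within the first stride. */
    MPI_File_set_view(fh, rank * block, MPI_BYTE, filetype,
                      "native", MPI_INFO_NULL);

    /* Collective read; MPI_File_read would give the independent version. */
    MPI_File_read_all(fh, buf, (int)(num_accesses * block),
                      MPI_BYTE, &status);

    MPI_Type_free(&filetype);
}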

Page 18

Benchmark Results

• Parameter configuration
– 8 clients
– 8 I/O servers
– 1 GB file size

• To avoid caching effects at the I/O servers, four files are read/written alternately, since each I/O server has 512 MB of memory

[Charts: read time (seconds) versus number of accesses (20k-800k) for multiple I/O, data sieving, and list_io; write time (seconds) versus number of accesses (10k-90k) for multiple I/O and list_io.]

Page 19

FLASH Application

• An astrophysics application developed at the University of Chicago
– Simulates the accretion of matter onto a compact star and the subsequent stellar evolution, including nuclear burning either on the surface of the compact star or in its interior

• The I/O benchmark measures the performance of the FLASH output, which produces checkpoint files and plot-files
– A typical large production run generates ~0.5 TB (100 checkpoint files and 1,000 plot-files)

[Image: the interior of an exploding star, depicting the distribution of pressure during the explosion.]

Page 20

FLASH -- I/O Access Pattern

• Memory organization: each processor has 80 cubes
– Each cube has guard cells and a sub-cube which holds the data to be output

• Each element in the cube contains 24 variables, each of type double (8 bytes)
– Each variable is partitioned among all processors

• Output pattern
– All variables are saved into a single file, one after another

[Figure: the FLASH block structure along the X, Y, and Z axes, with guard cells surrounding each block; a slice of the block is cut for output, and each element holds 24 variables (Variable 0 through Variable 23).]

Page 21

FLASH I/O Results

Access patterns:

• In memory
– Each contiguous segment is small: 8 bytes
– The stride between two segments is small: 192 bytes

• From memory to file
– Multiple I/O: 8*8*8*80*24 = 983,040 requests per processor
– Data sieving: 24 requests per processor
– List_io: 8*8*8*80*24/64 = 15,360 requests per processor (64 is the maximum number of offset-length pairs per request)

• In file
– Each contiguous segment written by a processor is of size 8*8*8*8 = 4096 bytes
– The output file is of size 8 MB * number of processors

[Chart: FLASH I/O time in seconds (log scale, 1 to 100,000) for multiple I/O, data sieving I/O, and list I/O, with 2 and 4 clients.]

Page 22

Tile Visualization

• Preprocess "frames" into streams of tiles by staging tile data on visualization nodes
• Read operations only
• Each node reads one sub-tile
• Each sub-tile has ghost regions that overlap with other sub-tiles
• The non-contiguous nature of this file access becomes apparent in its logical file representation (see the sketch below)

Example layout
• 3x2 display (Tile 1 through Tile 6)
• Frame size of 2532x1408 pixels
• Tile size of 1024x768 with overlap
• 3-byte RGB pixels
• Each frame is stored as a file of size 10 MB

[Figure: the 3x2 arrangement of tiles and a single node's file view, showing the non-contiguous regions read by Proc 0, Proc 1, and Proc 2.]
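As a sketch of how one node's sub-tile read could be described to MPI-IO, the following uses the example layout above (2532x1408 frame, 1024x768 tiles, 3-byte pixels); the tile origin arguments and the function name are hypothetical, and overlap handling is simplified:

#include <mpi.h>

/* Read one node's sub-tile (with its ghost/overlap region already folded
 * into tile_rows x tile_cols) out of a 2532x1408 3-byte-RGB frame file. */
void read_subtile(MPI_File fh,
                  int tile_row0, int tile_col0,   /* tile origin in the frame */
                  int tile_rows, int tile_cols,   /* e.g. 768 x 1024          */
                  unsigned char *buf)
{
    /* Treat the frame as a 2-D array of rows, each pixel taking 3 bytes. */
    int gsizes[2]   = { 1408, 2532 * 3 };
    int subsizes[2] = { tile_rows, tile_cols * 3 };
    int starts[2]   = { tile_row0, tile_col0 * 3 };

    MPI_Datatype filetype;
    MPI_Status   status;

    MPI_Type_create_subarray(2, gsizes, subsizes, starts,
                             MPI_ORDER_C, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);

    /* Collective read of the whole sub-tile into contiguous memory. */
    MPI_File_read_all(fh, buf, tile_rows * tile_cols * 3, MPI_BYTE, &status);

    MPI_Type_free(&filetype);
}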

Page 23

Integrating List_io into ROMIO

• ROMIO uses the internal ADIO function flatten to break both the filetypes and datatypes down into lists of offset and length pairs
• Using these lists, ROMIO steps through both the file and memory addresses
• ROMIO generates the memory and file offsets and lengths to pass to pvfs_list_io
• ROMIO calls pvfs_list_io after all data has been read, or when the set maximum array size has been reached, in which case a new list is generated (see the sketch below)

[Figure: filetype offsets & lengths and datatype offsets & lengths are paired up and passed to pvfs_read_list(memory offsets/lengths, file offsets/lengths).]
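A minimal sketch of the client-side call pattern, assuming a list-read interface of the shape suggested by the figure (a memory offset/length list plus a file offset/length list); the pvfs_read_list signature shown here is an assumption, as is the 64-pair batch limit (taken from the FLASH results slide):

#include <stdint.h>

#define MAX_ARRAY_SIZE 64   /* assumed per-request limit on offset-length pairs */

/* Assumed shape of the PVFS list-I/O read call (see the figure above). */
int pvfs_read_list(int fd,
                   int mem_count,  char   *mem_offsets[],  int mem_lengths[],
                   int file_count, int64_t file_offsets[], int32_t file_lengths[]);

/* Sketch: feed flattened memory/file offset-length lists to the list-I/O
 * call in batches of at most MAX_ARRAY_SIZE pairs, as described above. */
int list_read_batched(int fd, int n_pairs,
                      char *mem_offsets[], int mem_lengths[],
                      int64_t file_offsets[], int32_t file_lengths[])
{
    for (int done = 0; done < n_pairs; done += MAX_ARRAY_SIZE) {
        int batch = n_pairs - done;
        if (batch > MAX_ARRAY_SIZE)
            batch = MAX_ARRAY_SIZE;

        if (pvfs_read_list(fd,
                           batch, &mem_offsets[done],  &mem_lengths[done],
                           batch, &file_offsets[done], &file_lengths[done]) < 0)
            return -1;      /* propagate the error */
    }
    return 0;
}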

Page 24

Tile I/O Results

[Charts: accumulated read time versus number of I/O nodes (4, 8, 12, 16) for collective data sieving, collective read_list, non-collective data sieving, and non-collective read_list, on 4, 8, and 16 compute nodes and tile data sets of 1740 MB, 435 MB, 108 MB, and 40 MB.]

Page 25

Analysis of Tile I/O Results

• Collective operations should theoretically be faster, but...

• Hardware problem
– Fast Ethernet: the overhead of the collective I/O takes too long to be recouped relative to the independent I/O requests

• Software problem
– A lot of extra data movement in the ROMIO collectives -- the aggregation isn't as smart as it could be

• Plans
– Use the MPE logging facilities to pinpoint the problem
– Study the ROMIO implementation, find the bottlenecks in the collectives, and try to weed them out

Page 26

High-level Data Access Patterns

• Study of the file access patterns of astrophysics applications
– FLASH from the University of Chicago
– ENZO from NCSA

• Design of a data management framework using XML and a database
– Essential metadata collection
– Trigger rules for automatic I/O optimization

Page 27

ENZO Application

• Simulates the formation of a cluster of galaxies, starting near the big bang and continuing to the present day
• Used to test theories of how galaxies form by comparing the results with what is actually observed in the sky today
• File I/O using HDF-4
• Dynamic load balancing using MPI
• Data partitioning: Adaptive Mesh Refinement (AMR)

Page 28

AMR Data Access Pattern

• Adaptive Mesh Refinement partitions the problem domain into sub-domains recursively and dynamically

• A grid is owned by only one processor, but one processor can own many grids

• Check-pointing
– Each grid is written to a separate file (independent writes)

• During restart
– The sub-domain hierarchy need not be reconstructed
– Grids at the same time stamp are read together

• During visualization
– All grids are combined into a top grid

[Figure: an AMR problem domain refined recursively into nested sub-grids.]

Page 29

AMR Hierarchy Represented in XML

• The AMR hierarchy is naturally mapped into an XML hierarchy
• The XML is embedded in a relational database
• Metadata queries/updates go through the database
• The database can handle multiple queries simultaneously -- ideal for parallel applications

[Figure: an example grid.xml document (Producer, DataSet, nested Grid elements with id/level attributes, GridRank, Dimension, FileName, and a typed Array such as "density") and the corresponding relational "grid.xml table" mapping each XML node to a key, parent key, node type, name, and value; a partial reconstruction of the example appears below.]
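The XML on the slide is too garbled to restore exactly; the following is a plausible reconstruction assembled from the fragments that are visible (element and attribute names are taken from the slide, while the nesting and any omitted values are assumptions):

<!-- grid.xml: assumed reconstruction of the slide's example -->
<DataSet name="grid">
  <Producer name="astro" />
  <Grid id="0" level="0">
    <GridRank value="3" />
    <Dimension value="22 22 22" />
    <FileName value="grid0.dat" />
    <type IsComplex="1">
      float32, int32, double64
    </type>
    <Array name="density" dim="3">
      <Dimension value="10 8 12" />
    </Array>
    <Grid id="1" level="1">
      <Dimension value="5 6 4" />
      <Grid id="3" level="2">
        <Dimension value="2 3 2" />
      </Grid>
    </Grid>
    <Grid id="2" level="1">
      <!-- contents not legible on the slide -->
    </Grid>
  </Grid>
</DataSet>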

Page 30

File System Based XML

• The file system is used to support the decomposition of XML documents into files and directories

• This representation consists of an arbitrary hierarchy of directories and files; it preserves the XML philosophy of being textual in representation, but requires no further use of an XML parser to process the document

• The metadata is located near the scientific data

[Figure: an example document

  <ns1:x>
    <ns2:y y1="text1" y2="text2">Some text</ns2:y>
    <ns2:y y1="text3">More text</ns2:y>
  </ns1:x>

(file_based.xml) decomposed into a directory tree of nodes r0 through r4, where _E_ holds the element map, _A_ the attributes of an element node, _N_ the namespace, and _T_ the text used in character nodes.]

Page 31

Summary

• List_io API incorporated into PVFS for non-contiguous data access
– Read operation is complete
– Write operation is in progress

• Parallel netCDF APIs
– High-level APIs -- will be completed soon
– Low-level APIs -- interfaces already defined
– Validator

• High-level data access patterns
– Access patterns of AMR applications
– Other types of applications