The HDF5‐iRODS Module
A Data Grid System for Object Level Access

Peter Cao, The HDF Group (xcao@hdfgroup.org)
Michael Wan, San Diego Supercomputer Center

DATA CHALLENGES

• Simulations can generate very large and complex datasets.
• Researchers at different sites need fast access to both recent and historical data.
• Storage, networking, and compute platforms vary; some cannot handle full datasets.
• Frequently, only subsets of the data are of interest.

HDF5, iRODS, and the HDF5‐iRODS module address these challenges.

HDF5

Hierarchical Data Format Version 5 (HDF5) is a unique technology suite that makes it possible to manage extremely large and complex data collections. More than 600 organizations, over 200 types of applications, and millions of individuals are using HDF5. Terabytes of data are stored in HDF5 every day.

The HDF5 suite includes:
• A versatile data model
• A portable file format
• A library optimized for access time and storage space
• Tools and applications to manage, manipulate, view, and analyze data in HDF5 files

iRODS

The Integrated Rule‐Oriented Data System (iRODS) is a data grid system developed by the Data Intensive Cyber Environments (DICE) group. The most powerful feature of iRODS is the Distributed Rule Engine, which allows users to automate enforcement of management policies by applying iRODS Rules that control the execution of all data access and manipulation operations at distributed sites.

iRODS equips users to handle a full range of tasks:
• Manage distributed data
• Extract metadata
• Move data efficiently
• Share data securely
• Publish data in digital library
• Archive data for long‐term preservation

HDFView APPLICATION

HDFView, a visual tool for browsing and editing HDF files, was extended to use the HDF5‐iRODS module so that users can view HDF5 files stored remotely on the iRODS server.

PROJECT WEBSITE

http://www.hdfgroup.org/projects/irods

ACKNOWLEDGMENTS

This project was sponsored by CIP/NLADR, an NSF PACI Project in support of NCSA‐SDSC collaboration, and managed by the CyberInfrastructure Partnership (CIP), a joint effort led by NCSA and SDSC. The work was carried out by The HDF Group and the SDSC SRB team. The ASC/Alliance Center for Astrophysical Thermonuclear Flashes at the University of Chicago provided the FLASH simulation data (HDF5 files) and other assistance. The islice tool was based on extract_slice_from_chkpnt, a slice tool previously developed by Paul Ricker (NCSA/UIUC).

BENEFITS of the HDF5‐iRODS MODULE

• Reduces storage needed on local machine. Terabytes of data reside remotely; only small subsets are staged locally.
• Facilitates data sharing. Scientists can easily access updated data after a new simulation run, as well as prior results. Clients do not require the HDF5 library; all HDF5 calls are handled by the iRODS server.
• Supports fast browsing of data objects. Users can examine the structure of a file without loading the data content.
• Provides rapid access to selected data content and metadata. By transferring only the selected data content and metadata, access time is reduced.

HDF5‐iRODS MODULE

The HDF5‐iRODS module components, together with HDF5 and iRODS, implement a client‐server system that provides interactive and efficient access to HDF5 files managed by a remote iRODS server. Applications on the local machine use client functions to access specific data and metadata in HDF5 files stored remotely. Only the requested data and metadata are transferred to the local machine, not the entire file.

The HDF5‐iRODS module includes two main parts: a set of HDF5 micro‐services and a set of HDF5 objects. The HDF5 micro‐services perform simple well‐defined HDF5 tasks, such as open a file, read from a dataset, read group attributes, or close a file. The HDF5 objects, representing objects in HDF5 files, are used to specify requests from the client and to transfer results (data) from the server.

[Figure: Example HDF5 File Structure. Labels: Root (file entry point), Group A, Group B, Data array.]

[Figure: HDF5‐iRODS client‐server system. Labels: Application, Client Interface, iRODS message (pack/unpack), HDF5 object (H5Dataset), Rule Engine, HDF5 microservices, HDF5 Library, HDF5 file; request and result arrows in both directions.]

The figure above depicts how the HDF5‐iRODS client‐server system works. A user on a local (client) machine requests a slice of data from an HDF5 file managed by a remote iRODS server. To serve this request, H5DATASET_OP_READ is set in an H5Dataset (HDF5 object). The HDF5 object is then packed into an iRODS message and sent to the iRODS server. The server unpacks the message and checks the rule engine for matches. It finds and executes the associated HDF5 micro‐service, msiH5Dataset_read, which calls H5Dataset.read() to get the requested data from the HDF5 file. The HDF5 object, which contains the requested slice of data, is packed into an iRODS message and returned to the client, where it is unpacked and delivered to the application.

islice APPLICATION

The islice tool uses the HDF5‐iRODS module to extract a slice of data from a FLASH file stored remotely on the server. The slice of interest is transferred and stored on the local (client) system.

FLASH simulation results: Data Transfer Time

Network Bandwidth    Full dataset (20 GB)    Slice of interest (16 MB)
100 MB/sec           3.3 min                 0.16 sec
10 MB/sec            33 min                  1.6 sec
1 MB/sec             5.5 hours               16 sec
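
The transfer times listed for the FLASH simulation results are consistent with simply dividing data size by network bandwidth. The sketch below reproduces that arithmetic under two assumptions that the poster does not state: decimal units (1 GB = 1000 MB) and an idealized link with no latency or protocol overhead. The function name transfer_seconds is illustrative only.

```python
def transfer_seconds(size_mb: float, bandwidth_mb_per_sec: float) -> float:
    """Idealized transfer time: data size divided by sustained bandwidth."""
    if bandwidth_mb_per_sec <= 0:
        raise ValueError("bandwidth must be positive")
    return size_mb / bandwidth_mb_per_sec


# Sizes from the poster; the GB-to-MB factor is an assumption (decimal units).
FULL_DATASET_MB = 20 * 1000
SLICE_MB = 16

for bandwidth in (100, 10, 1):  # MB/sec, as in the table
    full = transfer_seconds(FULL_DATASET_MB, bandwidth)
    part = transfer_seconds(SLICE_MB, bandwidth)
    print(f"{bandwidth:>3} MB/sec: full {full / 60:.1f} min, slice {part:.2f} sec")
```

At every bandwidth the ratio between the two columns is the same, 20000 / 16 = 1250, which is the size ratio between the full dataset and the slice.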

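
The read request flow described for the HDF5‐iRODS module (an H5Dataset with H5DATASET_OP_READ set, packed into a message, matched by the rule engine to msiH5Dataset_read, then packed again and returned) can be illustrated with a toy, single-process model. This is not the iRODS or HDF5 API: the names H5Dataset, H5DATASET_OP_READ, and msiH5Dataset_read are borrowed from the poster, but their fields, the JSON message format, the in-memory FAKE_FILE, and the dataset path /GroupA/density are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict, field

# Name taken from the poster; the value used here is made up.
H5DATASET_OP_READ = "H5DATASET_OP_READ"

# Stand-in for an HDF5 file on the server: dataset path -> 1-D data.
FAKE_FILE = {"/GroupA/density": list(range(100))}


@dataclass
class H5Dataset:
    """Toy request/result object; the real H5Dataset layout is not shown in the poster."""
    path: str
    op: str = ""
    start: int = 0
    count: int = 0
    value: list = field(default_factory=list)


def pack(obj: H5Dataset) -> bytes:
    """Stand-in for iRODS message packing (JSON here, not the real wire format)."""
    return json.dumps(asdict(obj)).encode("utf-8")


def unpack(msg: bytes) -> H5Dataset:
    """Rebuild the object from a packed message."""
    return H5Dataset(**json.loads(msg.decode("utf-8")))


def msiH5Dataset_read(ds: H5Dataset) -> H5Dataset:
    """Toy micro-service: fills ds.value with only the requested slice."""
    data = FAKE_FILE[ds.path]
    ds.value = data[ds.start:ds.start + ds.count]
    return ds


# Toy "rule engine": maps an operation to the micro-service that serves it.
RULES = {H5DATASET_OP_READ: msiH5Dataset_read}


def server_handle(msg: bytes) -> bytes:
    """Server side: unpack, look up the micro-service, run it, pack the result."""
    request = unpack(msg)
    micro_service = RULES[request.op]
    return pack(micro_service(request))


def client_read(path: str, start: int, count: int) -> list:
    """Client side: build the request object, send it, unpack the reply."""
    request = H5Dataset(path=path, op=H5DATASET_OP_READ, start=start, count=count)
    reply = server_handle(pack(request))  # a network round trip in the real system
    return unpack(reply).value


print(client_read("/GroupA/density", 10, 5))  # prints [10, 11, 12, 13, 14]
```

Only the five requested values cross the pack/unpack boundary in the reply; the other 95 values in the stand-in file never leave the server side, which is the property the module relies on.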

