Bulk Data Copy Description Generalizations (some DMI/JSDL overlap) Bulk Copying: Recursive file/dir copying between multiple sources and sinks (potentially a draft straw-man proposal for a ‘Bulk Copy Document’?) [email protected]

Upload: claire-pierce

Post on 27-Dec-2015


Bulk Data Copy Description Generalizations (some DMI/JSDL overlap)

Bulk Copying: Recursive file/dir copying between multiple sources and sinks

(potentially a draft straw-man proposal for a ‘Bulk Copy Document’?)

[email protected]


Overview

• Some overlap in data copy activity descriptions (JSDL and DMI)
• JSDL data staging and bulk copies
• DMI and bulk copies
• Some new draft proposals for DMI to address bulk data copying
• Reuse of proposed DMI-common element set
• Some other stuff to consider


Some Overlap in Data Copy Activity Descriptions (JSDL and DMI)

• Some overlap between JSDL Data Staging and DMI.
• The Source/Target <jsdl:DataStaging/> element is roughly similar to the Source/Sink <dmi:DEPR/> element.
• Both capture the source/target URI and credentials.
• At present, neither JSDL DS nor DMI fully captures our requirements (this is not a criticism: each is intended to address its existing use cases, which only partially overlap with the requirements for a bulk data copy activity).

Other
• Condor Stork: based on Condor ClassAds (see supplementary slides)
• Not sure if Globus has/intends a similar definition in its new developments (e.g. SaaS). Anyone?


JSDL DATA STAGING AND BULK COPIES

Using JSDL Data Staging elements to simulate a bulk data copy activity

Bulk Copy: Recursive file/dir copying between multiple sources and sinks


JSDL Data Staging and the HPC File Staging Profile for Bulk Data Copying

<jsdl:DataStaging>
  <jsdl:FileName>fileA</jsdl:FileName>
  <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
  <jsdl:DeleteOnTermination>true</jsdl:DeleteOnTermination>
  <jsdl:Source>
    <jsdl:URI>gsiftp://griddata1.dl.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Source>
  <jsdl:Target>
    <jsdl:URI>ftp://ngs.oerc.ox.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Target>
  <Credentials> … </Credentials>
</jsdl:DataStaging>

Define both the source and target within the same <DataStaging/> element, which is permitted in JSDL.

But the HPC File Staging Profile (Wasson et al. 2008) limits a data staging element to a single credential definition.

Possibility: profile the use of Credentials within the Source/Target elements?

JSDL Staging 1


<jsdl:DataStaging>
  <jsdl:FileName>fileA</jsdl:FileName>
  <jsdl:FilesystemName>MY_SCRATCH_DIR</jsdl:FilesystemName>
  <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
  <jsdl:DeleteOnTermination>true</jsdl:DeleteOnTermination>
  <jsdl:Source>
    <jsdl:URI>gsiftp://griddata1.dl.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Source>
  <Credentials> e.g. MyProxyToken </Credentials>
</jsdl:DataStaging>

<jsdl:DataStaging>
  <jsdl:FileName>fileA</jsdl:FileName>
  <jsdl:FilesystemName>MY_SCRATCH_DIR</jsdl:FilesystemName>
  <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
  <jsdl:Target>
    <jsdl:URI>ftp://ngs.oerc.ox.ac.uk:2811/myhome/fileA</jsdl:URI>
  </jsdl:Target>
  <Credentials> e.g. wsa:Username/password token </Credentials>
</jsdl:DataStaging>

• A source element for fileA and a corresponding target element for staging-out of the same file.

• Link <DataStaging/> elements via common <FileName/> and <FilesystemName/>.

• By specifying that the input file is deleted after the job has executed, staging can be used to perform a data copy from one location to another via the staging host (intermediary).

JSDL Staging 2
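To illustrate the linking rule above, here is a minimal Python sketch that pairs stage-in and stage-out <DataStaging/> elements sharing the same <FilesystemName/> and <FileName/>, and so recovers the implied source-to-target copies. The helper name `copy_pairs` and the wrapping of the elements in a bare <jsdl:JobDescription/> are assumptions made for illustration only; credentials are omitted.

```python
import xml.etree.ElementTree as ET

# JSDL 1.0 namespace; the document below is a trimmed version of the slide example.
JSDL_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
NS = {"jsdl": JSDL_NS}

DOC = f"""
<jsdl:JobDescription xmlns:jsdl="{JSDL_NS}">
  <jsdl:DataStaging>
    <jsdl:FileName>fileA</jsdl:FileName>
    <jsdl:FilesystemName>MY_SCRATCH_DIR</jsdl:FilesystemName>
    <jsdl:Source>
      <jsdl:URI>gsiftp://griddata1.dl.ac.uk:2811/myhome/fileA</jsdl:URI>
    </jsdl:Source>
  </jsdl:DataStaging>
  <jsdl:DataStaging>
    <jsdl:FileName>fileA</jsdl:FileName>
    <jsdl:FilesystemName>MY_SCRATCH_DIR</jsdl:FilesystemName>
    <jsdl:Target>
      <jsdl:URI>ftp://ngs.oerc.ox.ac.uk:2811/myhome/fileA</jsdl:URI>
    </jsdl:Target>
  </jsdl:DataStaging>
</jsdl:JobDescription>
"""


def copy_pairs(xml_text):
    """Return (source_uri, target_uri) pairs linked by (FilesystemName, FileName)."""
    root = ET.fromstring(xml_text)
    sources, targets = {}, {}
    for staging in root.findall("jsdl:DataStaging", NS):
        # The staged file's identity on the intermediary host is the link key.
        key = (
            staging.findtext("jsdl:FilesystemName", default="", namespaces=NS).strip(),
            staging.findtext("jsdl:FileName", default="", namespaces=NS).strip(),
        )
        source = staging.findtext("jsdl:Source/jsdl:URI", namespaces=NS)
        target = staging.findtext("jsdl:Target/jsdl:URI", namespaces=NS)
        if source:
            sources[key] = source.strip()
        if target:
            targets[key] = target.strip()
    # Only keys that are both staged in and staged out describe a copy.
    return [(sources[key], targets[key]) for key in sources if key in targets]
```

The pairing only works because both elements name the same intermediary file, which is exactly the indirection a dedicated bulk copy description would not need.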


Using Staging to Enact Bulk Copies

• In the context of bulk copying, the file staging host (intermediary) is redundant:

– No need to explicitly name and aggregate (stage) files on a staging host (when copying between a source and sink, the staging host is a hidden implementation detail).

• No equivalent <dmi:DataLocations/> for defining alternative locations for a source and sink (a nice feature of DMI).

• JSDL is designed to describe a single activity which is atomic from the perspective of an external user (staging is part of this atomic activity). In bulk copying, we need to identify and report on the status of each copy operation.

• Some additional elements are required (e.g. <dmi:TransferRequirements/>, <other:FileSelector/>, and an abstract <URIConnectionProperties/> for connecting to different URI schemes; iRODS/SRB, for example, require the ‘McatZone’ and ‘defaultResource’ properties). Are these new elements out of scope (i.e. would they remain proprietary)?


DMI AND BULK COPIES

An overview of OGSA DMI and some current limitations for Bulk Copying


OGSA DMI Overview

• The OGSA Data Movement Interface (DMI) (Antonioletti et al. 2008) defines a number of elements for describing and interacting with a data transfer activity.

• The data source and destination are each described separately with a Data End Point Reference (DEPR), which is a specialized form of WS-Addressing endpoint reference (Box et al. 2004).

• In contrast to the JSDL data staging model, a DEPR facilitates the definition of one or more <Data/> elements within a <DataLocations/> element. This is used to define alternative locations for the data source and/or sink.

• An implementation can select between its supported protocols and select/retry different source/sink combinations (improves resilience and the likelihood of performing a successful copy).
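The select/retry behaviour described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary keys `protocol` and `url`, the function names, and the `transfer` callable (which is assumed to raise `OSError` on failure) are all hypothetical and are not part of DMI.

```python
from itertools import product


def candidate_pairs(source_locations, sink_locations, supported):
    """Yield (source_url, sink_url) for combinations whose protocols are both supported."""
    usable_sources = [loc for loc in source_locations if loc["protocol"] in supported]
    usable_sinks = [loc for loc in sink_locations if loc["protocol"] in supported]
    for source, sink in product(usable_sources, usable_sinks):
        yield source["url"], sink["url"]


def copy_with_fallback(source_locations, sink_locations, supported, transfer):
    """Try each usable source/sink combination in turn until one transfer succeeds.

    Returns the (source_url, sink_url) that worked, or None if every
    combination failed or no combination was usable.
    """
    for source_url, sink_url in candidate_pairs(source_locations, sink_locations, supported):
        try:
            transfer(source_url, sink_url)  # hypothetical callable, raises OSError on failure
            return source_url, sink_url
        except OSError:
            continue  # fall through to the next alternative location
    return None
```

Each additional alternative <Data/> location adds candidate combinations, which is where the improved resilience comes from.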


DMI DEPR and Transfer Requirements

Source or Sink (wsa:EndpointReference type):

<dmi:SourceOrSinkDataEPR>
  <wsa:Address>http://www.ogf.org/ogsa/2007/08/addressing/none</wsa:Address>
  <wsa:Metadata>
    <dmi:DataLocations>
      <dmi:Data ProtocolUri="http://www.ogf.org/ogsadmi/2006/03/im/protocol/gridftp-v20"
                DataUrl="gsiftp://example.org/name/of/the/dir/">
        <dmi:Credentials><other:MyProxyToken/></dmi:Credentials>
        <other:stuff/>
      </dmi:Data>
      <dmi:Data ProtocolUri="urn:my-project:srm"
                DataUrl="srm://example.org/name/of/the/dir/">
        <dmi:Credentials><wsse:UsernameToken/></dmi:Credentials>
        <other:stuff/>
      </dmi:Data>
    </dmi:DataLocations>
  </wsa:Metadata>
</dmi:SourceOrSinkDataEPR>

DEPR defines alternative locations for the data source/sink, and each <Data/> nests its own credentials.

TransferRequirements (needs some extending):

<dmi:TransferRequirements>
  <dmi:StartNotBefore/> ?
  <dmi:EndNoLaterThan/> ?
  <dmi:StayAliveTime/> ?
  <dmi:MaxAttempts/> ?
</dmi:TransferRequirements>

DMI Data Transfer Factory Interface (representation):

[supported protocols] +
[service instance] GetDataTransferInstance([SourceDEPR], [SinkDEPR], [TransferRequirements]);
[factory attributes] GetFactoryAttributesDocument();


Current DMI Limitations for Bulk Copying (for multiple sources and sinks)

• DMI is intended to describe only a single data copy operation between one source and one sink (this is not a criticism; it is by design, for managing low-level transfers of single data units). To perform several transfers, a client would need to invoke the DMI service factory multiple times, creating multiple DMI service instances.

• We require a single message packet that wraps multiple transfers into a single ‘atomic’ activity rather than having to repeatedly invoke the DMI service factory (broadly similar to defining multiple JSDL data staging elements).

• Some of the existing functional spec elements require extension / slight modification (in particular addition of <xsd:any/> and <xsd:anyAttribute/> extension points to embed proprietary info in suitable locations).


SOME NEW DRAFT PROPOSALS FOR DMI TO ADDRESS BULK DATA COPYING

Note: the draft proposals presented here for bulk data copying are intended only for review/discussion/sanity-check/agreement (or not).


Draft Proposal 1 – New <BulkDataCopy/> and <DataCopy/> Elements

• Add new elements to describe a bulk copy activity – effectively wrap multiple source-sink pairs within a single (standalone) document e.g. <BulkDataCopy/> with nested <DataCopy/>

<!-- Draft: TO REVISE/DISCUSS/SANITY-CHECK -->
<BulkDataCopy id="xsd:ID"?>
  <DataCopy id="xsd:ID"?> + <!-- one-to-many -->
    <SourceDEPR/>
    <SinkDEPR/>
    <DataCopyTransferRequirements/> ? <!-- needed? -->
    <xsd:any##other/> *
  </DataCopy>
  <TransferRequirements/> ?
  <xsd:any##other/> *
</BulkDataCopy>

• The outer <TransferRequirements/> applies to the whole bulk copy (wrapping elements that span all the sub-copies, e.g. including the <dmi:MaxAttempts/>, <dmi:StartNotBefore/> and other batch-window properties).

• Define an optional <DataCopyTransferRequirements/> for each <DataCopy/> in order to specify an additional and overriding requirement sub-set (e.g. for defining <FileSelector/> elements etc).

Big Disclaimer: needs discussion, revision, sanity check, agreement (or not) etc.
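One possible reading of the "additional and overriding" rule is a simple merge in which per-copy values win over bulk-wide values. The Python sketch below assumes that reading. The instance document is hypothetical: it is un-namespaced, the requirement elements are given simple text content, and the <FileSelector/> value is invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of the draft <BulkDataCopy/> document (illustration only).
DOC = """
<BulkDataCopy id="bulk1">
  <DataCopy id="copyA">
    <SourceDEPR/>
    <SinkDEPR/>
    <DataCopyTransferRequirements>
      <MaxAttempts>5</MaxAttempts>
      <FileSelector>*.dat</FileSelector>
    </DataCopyTransferRequirements>
  </DataCopy>
  <DataCopy id="copyB">
    <SourceDEPR/>
    <SinkDEPR/>
  </DataCopy>
  <TransferRequirements>
    <MaxAttempts>2</MaxAttempts>
    <StartNotBefore>2010-01-01T00:00:00Z</StartNotBefore>
  </TransferRequirements>
</BulkDataCopy>
"""


def as_dict(element):
    """Flatten an element's children into {tag: text}; a missing element gives {}."""
    if element is None:
        return {}
    return {child.tag: (child.text or "").strip() for child in element}


def effective_requirements(xml_text):
    """Map each DataCopy id to the bulk-wide requirements overlaid with its own."""
    root = ET.fromstring(xml_text)
    outer = as_dict(root.find("TransferRequirements"))
    result = {}
    for copy in root.findall("DataCopy"):
        own = as_dict(copy.find("DataCopyTransferRequirements"))
        result[copy.get("id")] = {**outer, **own}  # per-copy values override
    return result
```

Here copyA would retry up to 5 times while copyB inherits the bulk-wide limit of 2; whether overriding a batch-window property per copy should be allowed at all is one of the points still open for discussion.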


Draft Proposal 2 – Introduce a New DMI Port Type

• Add a new DMI port type to accept <BulkDataCopy/> doc (current port type defines separate [SourceDEPR], [SinkDEPR], [TransferRequirements] arguments).

• Choice of two port types.

• Some minor changes to the existing functional spec (mostly adding xsd:any extension points and other small stuff).

Possible DMI Data Transfer Factory Interface Extension (draft representation):

[supported protocols] +
[service instance] GetDataTransferInstance([BulkDataCopy]);
[factory attributes] GetFactoryAttributesDocument();

Big Disclaimer: needs discussion, revision, sanity check, agreement (or not) etc….

• As per the existing Functional Spec, completely separate the activity description (BulkDataCopy) from the service interface rendering in order to define a generic and reusable element set.


Draft Proposal 3 – Extend <State/> and <InstanceAttributes/> and describe usage for bulk copying

• Since a Bulk Copy consists of multiple transfers, we need to optionally provide a way to report the status of each sub-copy.

• The (sub) state of each <DataCopy/> could be optionally nested within the <dmi:Detail/> element as part of the parent <dmi:State/> (i.e. in place of the existing <xsd:any/> extension point). In order to specify each sub-copy identifier, the <dmi:State/> could be extended by adding an <xsd:anyAttribute /> :

<!-- Draft: TO REVISE/DISCUSS/SANITY-CHECK -->
<dmi:State value="Transferring">
  <dmi:Detail>
    <dmi:State dataCopyId="subcopy1" value="Done"/>
    <dmi:State dataCopyId="subcopy3" value="Failed:Unclean"/>
    <dmi:State dataCopyId="subcopy2" value="Transferring"/>
    . . .
  </dmi:Detail>
</dmi:State>

Big Disclaimer: needs discussion, revision, sanity check, agreement (or not) etc….

• Similarly, child <dmi:InstanceAttributes/> could be optionally nested within a parent <dmi:InstanceAttributes/> to represent each sub-copy using a similar approach. But is this actually necessary ? (don’t think so since the <dmi:TotalDataSize/> could be calculated across all the sub-copies).
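As a rough sketch of how a client might consume such a nested state report, the Python below extracts the per-copy sub-states from the <dmi:Detail/> element and lists the failed sub-copies. The namespace URI `urn:example:dmi` is a placeholder (not the real DMI namespace), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace for illustration; substitute the real DMI namespace in practice.
NS = {"dmi": "urn:example:dmi"}

DOC = """
<dmi:State xmlns:dmi="urn:example:dmi" value="Transferring">
  <dmi:Detail>
    <dmi:State dataCopyId="subcopy1" value="Done"/>
    <dmi:State dataCopyId="subcopy3" value="Failed:Unclean"/>
    <dmi:State dataCopyId="subcopy2" value="Transferring"/>
  </dmi:Detail>
</dmi:State>
"""


def sub_states(xml_text):
    """Return {dataCopyId: state value} for each sub-copy nested in <dmi:Detail/>."""
    root = ET.fromstring(xml_text)
    return {
        state.get("dataCopyId"): state.get("value")
        for state in root.findall("dmi:Detail/dmi:State", NS)
    }


def failed_copies(xml_text):
    """Return the ids of sub-copies whose state starts with 'Failed'."""
    return [
        copy_id
        for copy_id, value in sub_states(xml_text).items()
        if value.startswith("Failed")
    ]
```

A client that does not understand the nested detail can still read the parent `value` attribute and ignore everything else, which keeps the extension optional.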


Draft Proposal 4 – Other proposed modifications (possibly some more not listed here)

• Add <xsd:any/> and <xsd:anyAttribute/> extension points to the existing DMI elements, e.g. in the dmi:DataType and dmi:DataLocationsType complex types, anyAttribute in dmi:State, etc.

<complexType name="DataType">
  <annotation> . . . </annotation>
  <sequence>
    <element name="Credentials" type="dmi:CredentialsType" minOccurs="0" />
    <xsd:any namespace="##other" processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
  </sequence>
  <attribute name="ProtocolUri" type="anyURI" use="required" />
  <attribute name="DataUrl" type="anyURI" use="required" />
  <xsd:anyAttribute namespace="##other" processContents="lax"/>
</complexType>

• Move elements referred to in the text of the functional spec into the functional spec schema, such as <FactoryAttributes/> and the fault types (currently defined in the plain WS Rendering schema).

• Some additional elements are required (e.g. <dmi:TransferRequirements/>, <other:FileSelector/>, and an abstract <URIConnectionProperties/> for connecting to different URI schemes; iRODS/SRB, for example, require the ‘McatZone’ and ‘defaultResource’ properties). Are these new elements out of scope, or should they remain proprietary?

<complexType name="DataLocationType">
  <annotation> . . . </annotation>
  <sequence>
    <element name="Data" type="dmi:DataType" maxOccurs="unbounded" />
    <xsd:any namespace="##other" processContents="lax" minOccurs="0" maxOccurs="unbounded"/>
  </sequence>
  <xsd:anyAttribute namespace="##other" processContents="lax"/>
</complexType>


REUSE OF PROPOSED DMI-COMMON ELEMENT SET

As per the existing DMI Functional Spec, the Bulk Copy activity description would be clearly separated from the service interface rendering. This promotes a generic and reusable element set which can be adopted for use within other specs/profiles, e.g. a new bulk copy application definition for the <jsdl:Application/> element.


<jsdl:JobDefinition>
  <jsdl:JobDescription>
    <jsdl:JobIdentification ... />
    <jsdl:Application>
      <!-- Possibility? Embed new 'BulkDataCopy' doc as a new Application element,
           akin to POSIXApplication or HPCProfileApplication elems -->
      <other:BulkDataCopyApplication>
        <dmi:BulkDataCopy> . . . </dmi:BulkDataCopy>
      </other:BulkDataCopyApplication>
    </jsdl:Application>
    <jsdl:Resources/>
  </jsdl:JobDescription>
</jsdl:JobDefinition>

JSDL is intended to be a generic compute activity description language (not solely for HPC).

a) In this example, a bulk data copy activity document is described as a JSDL application.
b) The proposed <BulkDataCopy/> document could be nested within the <jsdl:Application/> element. The <jsdl:Application/> element is a generic wrapper intended for this very purpose, e.g. akin to nesting <POSIXApplication/> or <HPCProfileApplication/>.

Draft usage in JSDL 1


<jsdl:JobDefinition>
  <jsdl:JobDescription>
    <jsdl:JobIdentification ... />
    <jsdl:Application>
      <!-- Possibility? Stage BulkDataCopy doc and explicitly name the copy agent
           that would enact the copy activity -->
      <jsdl-posix:POSIXApplication>
        <jsdl-posix:Executable>/usr/bin/datacopyagent.sh</jsdl-posix:Executable>
        <jsdl-posix:Argument>my_BulkDataCopyDoc.xml</jsdl-posix:Argument>
      </jsdl-posix:POSIXApplication>
    </jsdl:Application>
    <jsdl:Resources/>
    <!-- In JSDL, DataStaging is a child of JobDescription (a sibling of Resources) -->
    <jsdl:DataStaging>
      <jsdl:FileName>my_BulkDataCopyDoc.xml</jsdl:FileName>
      . . .
    </jsdl:DataStaging>
  </jsdl:JobDescription>
</jsdl:JobDefinition>

This is a less ‘contract-driven’ approach, but represents a perfectly valid re-use of the proposed <BulkDataCopy/> Document.

Stage-in <BulkDataCopy/> document as input for the executable.

Draft usage in JSDL 2


• Profile the OGSA BES state model to account for DMI sub-state specializations and DMI lifecycle events.

• Adds optional DMI sub-state specializations. A client/service may recognize only the main BES states if necessary.

• Adds optional DMI lifecycle events (dmi:suspend, dmi:resume).

• Add DMI fault types?

Draft DMI sub-state specialisations in BES

[Figure: state diagram distinguishing BES states from DMI sub-states. State labels shown: Pending, Running:Transferring, Running:Suspended, Finished, Cancelled, Failed:Clean, Unclean, Unknown. BES and DMI lifecycle events (i.e. requests/operations, in italics on the slide): dmi:Suspend(), dmi:Resume(), bes:TerminateActivities().]
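The idea that a client may recognize only the main BES states can be sketched in Python by treating each state as "BESState:SubState" and discarding the sub-state when it is not understood. The transitions below are assumptions made for illustration (suspend and resume toggling between the two Running sub-states, termination allowed from Pending or Running); the slide's diagram may define them differently.

```python
def bes_state(state):
    """Reduce a specialised state such as 'Running:Transferring' to its main BES state."""
    return state.split(":", 1)[0]


def apply_event(state, event):
    """Return the next state for a lifecycle request, or raise ValueError.

    Assumed transitions (illustration only, not taken from a specification):
      dmi:Suspend              Running:Transferring -> Running:Suspended
      dmi:Resume               Running:Suspended    -> Running:Transferring
      bes:TerminateActivities  Pending or Running:* -> Cancelled
    """
    if event == "dmi:Suspend" and state == "Running:Transferring":
        return "Running:Suspended"
    if event == "dmi:Resume" and state == "Running:Suspended":
        return "Running:Transferring"
    if event == "bes:TerminateActivities" and bes_state(state) in ("Pending", "Running"):
        return "Cancelled"
    raise ValueError(f"event {event!r} is not permitted in state {state!r}")
```

A BES-only client would call `bes_state` on every reported state and see just Pending, Running, Finished, Cancelled or Failed, while a DMI-aware client can additionally act on the sub-state.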


• JSDL-BES may be a better route to more widespread adoption of a bulk copy document? (e.g. consider existing BES implementations)

• Is orchestration of the proposed <DataCopy/> activities required (e.g. sequential ordering or even a DAG)? As yet, no compelling use cases.

• For the proposed bulk copy doc: what about using element references rather than defining solely ‘in-line’ XML docs, to cut down on element repetition (e.g. akin to the <jsdl:FileSystem/> element, which can be referenced through <jsdl:FilesystemName/> elements)? Abstract elements and substitution groups may also be useful here.

Some other stuff to consider

<BulkDataCopy id="MyBulkTransferA">
  <CopyResources>
    <Credential id="cred1" .../>
    <Credential id="cred2" .../>
    <TransferRequirements id="tr1" .../>
    <TransferRequirements id="tr2" .../>
    <DataEPR id="data1" .../>
    <DataEPR id="data2" .../>
    <DataEPR id="data3" .../>
  </CopyResources>
  <DataCopy id="subTransferA">
    <SourceDEPR idref="data1"/>
    <SinkDEPR idref="data3"/>
    <TransferRequirementsRef idref="tr1"/>
  </DataCopy>
  <DataCopy id="subTransferB">
    <SourceDEPR idref="data2"/>
    <SinkDEPR idref="data3"/>
    <TransferRequirementsRef idref="tr2"/>
  </DataCopy>
</BulkDataCopy>

Element ‘id’ and subsequent ‘idref’s

Reduces XML repetition but validation does not check for the correct types of referenced elements.
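Because xsd:ID/xsd:IDREF validation only checks that a referenced id exists, not what kind of element it identifies, an implementation would need its own check. The Python sketch below shows one way this might look. The `EXPECTED` mapping, the function name and the trimmed, un-namespaced document are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Which element type each referencing element is expected to point at (assumed mapping).
EXPECTED = {
    "SourceDEPR": "DataEPR",
    "SinkDEPR": "DataEPR",
    "TransferRequirementsRef": "TransferRequirements",
}

GOOD = """
<BulkDataCopy id="MyBulkTransferA">
  <CopyResources>
    <Credential id="cred1"/>
    <TransferRequirements id="tr1"/>
    <DataEPR id="data1"/>
    <DataEPR id="data3"/>
  </CopyResources>
  <DataCopy id="subTransferA">
    <SourceDEPR idref="data1"/>
    <SinkDEPR idref="data3"/>
    <TransferRequirementsRef idref="tr1"/>
  </DataCopy>
</BulkDataCopy>
"""


def check_references(xml_text):
    """Return (copy_id, ref_tag, idref, actual_tag) for each bad reference.

    actual_tag is None when the idref does not resolve to any shared resource.
    """
    root = ET.fromstring(xml_text)
    pool = root.find("CopyResources")
    resources = {} if pool is None else {el.get("id"): el.tag for el in pool}
    problems = []
    for copy in root.findall("DataCopy"):
        for ref in copy:
            idref = ref.get("idref")
            expected = EXPECTED.get(ref.tag)
            if idref is None or expected is None:
                continue  # in-line element or unknown extension: nothing to check
            actual = resources.get(idref)
            if actual != expected:
                problems.append((copy.get("id"), ref.tag, idref, actual))
    return problems
```

For example, pointing a <SinkDEPR/> at `cred1` would pass plain ID/IDREF validation but is reported here as a reference to a Credential where a DataEPR was expected.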


OTHER STUFF / EXTRA SLIDES

Supplementary slides


Message Model Requirements

Document Message
• Bulk Data Copy Activity description
• Capture all information required to connect to each source URI and sink URI and subsequently enact the data copy activity.
• Transfer requirements, e.g. additional URI properties, file selectors (regular expressions), scheduling parameters to define a batch window, retry count, source/sink alternatives, checksums?, sequential ordering?, DAG?
• Serialized user credential definitions for each source and sink.

Control Messages
• Interact with a state/lifecycle model (e.g. stop, resume, cancel)

Event Messages
• Standard fault types and status updates

Information Model
• To advertise the service capabilities / properties / supported protocols


In-Scope

1. Job Submission Description Language (JSDL)
• An activity description language for generic compute applications.
2. OGSA Data Movement Interface (DMI)
• Low-level schema for defining the transfer of bytes between a single source and sink.
3. JSDL HPC File Staging Profile (HPCFS)
• Designed to address file staging, not bulk copying.
4. OGSA Basic Execution Service (BES)
• Defines a basic framework for defining and interacting with generic compute activities: JSDL + extensible state and information models.
5. Others that I am sure I have missed! (…ByteIO)

• None of these fully captures our requirements (not a criticism; they are designed to address their own use cases, which only partially overlap with the requirements for our bulk data copy activity).

Other
• Condor Stork: based on Condor ClassAds
• Not sure if Globus has/intends a similar definition in its new developments (e.g. SaaS). Anyone? I believe Ravi was originally supportive of a DMI for data transfers between multiple sources/sinks.


Stork – Condor ClassAds

Example of a Stork job request:

[
  dest_url = "gsiftp://eric1.loni.org/scratch/user/";
  arguments = "-p 4 dbg -vb";
  src_url = "file:///home/user/test/";
  dap_type = "transfer";
  verify_checksum = true;
  verify_filesize = true;
  set_permission = "755";
  recursive_copy = true;
  network_check = true;
  checkpoint_transfer = true;
  output = "user.out";
  err = "user.err";
  log = "userjob.log";
]

• Purportedly the first batch scheduler for data placement and data movement in a heterogeneous environment. Developed with respect to Condor.

• Uses Condor’s ClassAd job description language and is designed to understand the semantics and characteristics of data placement tasks

• Recent NSF funding to develop as a production service