GeoDataspace: Simplifying Data Management Tasks with Globus

Simplifying Data Management Tasks with Globus

Tanu Malik, Ian Foster, Kyle Chard, Roselyne Tchoua, Joseph Baker, Mike Gurnis, Jonathan Goodall, Scott Peckham

Upload: tanu-malik

Post on 18-Feb-2017

TRANSCRIPT

Page 1: GeoDataspace: Simplifying Data Management Tasks with Globus

Simplifying Data Management Tasks with Globus

Tanu Malik, Ian Foster, Kyle Chard, Roselyne Tchoua, Joseph Baker, Mike Gurnis, Jonathan Goodall, Scott Peckham

GeoDataspace

Page 2: GeoDataspace: Simplifying Data Management Tasks with Globus

Share and Reproduce

Alice wants to share her models and simulation output with Bob, and Bob wants to re-execute Alice’s application to validate her inputs and outputs.

Page 3: GeoDataspace: Simplifying Data Management Tasks with Globus

Alice’s Options

1. Create a tar and gzip archive

2. Build a website with model code, parameters, and data

3. Create a virtual machine

Page 4: GeoDataspace: Simplifying Data Management Tasks with Globus

Bob’s Frustration

1. I do not find the lib.so required for building the model.

2. How do I?

Lack of easy and efficient methods for sharing and reproducibility

[Figure: amount of pain Bob suffers versus amount of pain Alice suffers]

Page 5: GeoDataspace: Simplifying Data Management Tasks with Globus

GeoDataspace

• Goal: Sharing and reproducibility hand-in-hand

• Target users: Computational geoscientists

• Data and model integration

• Research Output is More Than "Just" a Research Paper

Page 6: GeoDataspace: Simplifying Data Management Tasks with Globus

GeoDataspace CI Components

• The geounits

• Units of scientific activity/research output

• How to capture and track this activity

• Globus Catalog

• A scalable, flexible catalog for annotations conforming to open-world assumption

• Globus Publish and reproduce geounits

• Share/Publish geounits for others

• Replay geounits for analysis

Page 7: GeoDataspace: Simplifying Data Management Tasks with Globus

geounits: package data, source code, and environment

Page 8: GeoDataspace: Simplifying Data Management Tasks with Globus

geounit Client: Provenance is key

1. audit <program name>

2. PROV-compliant database

3. exec <program name> [activity]
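
The audit, record, and re-execute cycle above can be illustrated with a minimal sketch. It is not the geounit client: the functions `audit` and `replay`, the record layout, and the explicit input and output lists are all hypothetical, and the record only borrows PROV vocabulary (activity, used, generated) rather than producing a PROV-compliant database.

```python
import hashlib
import subprocess
import sys
from pathlib import Path


def sha256(path):
    """Content hash used to identify a file entity."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def audit(command, inputs, outputs):
    """Run `command` and return a PROV-style record of the run.

    Hypothetical stand-in for step 1: the caller lists inputs and outputs
    explicitly, whereas a tracing auditor would discover them itself.
    """
    used = {str(p): sha256(p) for p in inputs}
    result = subprocess.run(command, check=True)
    generated = {str(p): sha256(p) for p in outputs}
    return {
        "activity": {"command": list(command), "returncode": result.returncode},
        "used": used,            # entities the activity read
        "generated": generated,  # entities the activity wrote
    }


def replay(record):
    """Re-run the recorded command (step 3) and report whether every
    output hash matches the audited run."""
    subprocess.run(record["activity"]["command"], check=True)
    return all(sha256(p) == h for p, h in record["generated"].items())
```

A record returned by `audit` could be stored in any database; `replay` then gives a simple yes/no answer to Bob's validation question for deterministic programs.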

Page 9: GeoDataspace: Simplifying Data Management Tasks with Globus

geounit Client: Features

• Based on the ptrace and okapi functionality of CDE (Code, Data, Environment)

• Data/code can be local or distributed

• Data/code files are not copied into the package until it is ready to share; until then the package holds only their descriptions

• Specify granularity of auditing

• Partial replay

• Unpack into Docker or Vagrant
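
The idea of keeping only descriptions in the package until it is shared can be sketched as follows. This is an illustration under stated assumptions, not the client's implementation: `FileDescription`, `describe`, and `materialize` are hypothetical names, and the choice of size plus SHA-256 as the description is an assumption.

```python
import hashlib
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class FileDescription:
    """What the package stores before sharing: a description, not the bytes."""
    source: str   # where the file actually lives
    size: int     # size in bytes when described
    sha256: str   # content hash when described


def describe(path):
    """Record a description of `path` without copying it anywhere."""
    data = Path(path).read_bytes()
    return FileDescription(str(path), len(data), hashlib.sha256(data).hexdigest())


def materialize(descriptions, package_dir):
    """Copy the described files into the package when it is ready to share.

    Files are stored under their base names, so this sketch assumes the
    base names are unique.
    """
    package_dir = Path(package_dir)
    package_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for d in descriptions:
        current = hashlib.sha256(Path(d.source).read_bytes()).hexdigest()
        if current != d.sha256:
            raise ValueError(f"{d.source} changed since it was described")
        target = package_dir / Path(d.source).name
        shutil.copy2(d.source, target)
        copied.append(target)
    return copied
```

Deferring the copy keeps the package small while work is in progress, and the hash check at share time catches files that changed after they were audited.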

Page 10: GeoDataspace: Simplifying Data Management Tasks with Globus

Globus Catalog: hosts geounits

• Dataset Management Model

• Catalog: a hosted resource that enables the grouping of related datasets

• Dataset: a virtual collection of (schemaless) metadata and distributed data elements, such as files and provenance

• Annotation: a piece of metadata that exists within the context of a dataset or data member
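
The three concepts above can be pictured as a small in-memory model. This is only a sketch for orientation: the class and method names are hypothetical and do not reflect how Globus Catalog is implemented or the interface it exposes.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A piece of metadata; names are free-form, so no schema is imposed."""
    name: str
    value: str


@dataclass
class Member:
    """A distributed data element, e.g. a file or a provenance record."""
    uri: str
    annotations: list = field(default_factory=list)


@dataclass
class Dataset:
    """A virtual collection of metadata and data members."""
    name: str
    members: list = field(default_factory=list)
    annotations: list = field(default_factory=list)

    def annotate(self, name, value):
        self.annotations.append(Annotation(name, value))


@dataclass
class Catalog:
    """Groups related datasets."""
    name: str
    datasets: dict = field(default_factory=dict)

    def add(self, dataset):
        self.datasets[dataset.name] = dataset

    def find(self, name, value):
        """Names of datasets carrying the annotation name=value."""
        return [
            d.name
            for d in self.datasets.values()
            if any(a.name == name and a.value == value for a in d.annotations)
        ]
```

For example, a catalog named `geodataspace` could hold one dataset per geounit, each annotated with whatever names its author finds useful.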

Page 11: GeoDataspace: Simplifying Data Management Tasks with Globus

Globus Catalog

• Dataset Service

• Virtual views of data based on user-defined and/or automatically extracted metadata (annotations)

• Implemented as a service with web and REST interfaces

• Relies on Globus Nexus for user authentication and group management

• Client-side Tooling

• Dataset ingest

• Automatic creation of datasets and extraction of metadata from various common data formats and directory structures

• Globus endpoints

• Associate data (in files and directories) with one or more datasets

• Python Client library

• Integration with external services

• Transfer: Moving datasets from their storage endpoint(s) to a selected destination

• Faceted Browser Search

• Search based on provenance entities and activities
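
The dataset ingest step above (creating datasets and extracting metadata from directory structures) could look roughly like the sketch below. The function `ingest`, the rule of one dataset per directory, and the three annotation names are assumptions made for illustration; the actual tooling may extract quite different metadata.

```python
import os
from pathlib import Path


def ingest(root):
    """Walk `root` and describe one dataset per directory that contains files."""
    root = Path(root)
    datasets = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if not filenames:
            continue
        rel = Path(dirpath).relative_to(root)
        # Files directly under root form a dataset named after root itself.
        name = rel.as_posix() if rel.parts else root.name
        files = sorted(filenames)
        datasets[name] = {
            "members": [str(Path(dirpath) / f) for f in files],
            "annotations": {
                "file_count": len(files),
                "formats": sorted({Path(f).suffix.lstrip(".") or "none" for f in files}),
                "total_bytes": sum((Path(dirpath) / f).stat().st_size for f in files),
            },
        }
    return datasets
```

Annotations extracted this way are the kind of name and value pairs a faceted search could then filter on.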

Page 12: GeoDataspace: Simplifying Data Management Tasks with Globus

Globus Catalog: REST interface

Approach

•  Hosted user-defined catalogs

•  Based on annotation model <dataset/member, name, value>

•  Association of data members

•  Fine grained access control

•  Flexible query language –  Name:value, free text, facets,…

•  Integrated with other services

/geodataspace

/geodataspace/annotation

/geodataspace/geounit

/geodataspace/geounit/annotation

/geodataspace/geounit/acl

/geodataspace/geounit/members

/geodataspace/geounit/members/annotation

/geodataspace/geounit/provenance

/geodataspace/geounit/version
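
The resource paths listed above follow a regular pattern, which the sketch below reproduces. It assumes that `geodataspace` is a catalog name and `geounit` a dataset name, which the slide does not state; `catalog_path` and `dataset_path` are hypothetical helpers that only build path strings and make no request to any service.

```python
from urllib.parse import quote

# Sub-resources of a dataset that appear in the slide's path list.
DATASET_RESOURCES = ("annotation", "acl", "members", "provenance", "version")


def _segment(name):
    """Percent-encode one path segment so names cannot inject extra slashes."""
    return quote(name, safe="")


def catalog_path(catalog, annotation=False):
    """Path of a catalog, or of its annotations."""
    path = "/" + _segment(catalog)
    return path + "/annotation" if annotation else path


def dataset_path(catalog, dataset, *resource):
    """Path of a dataset or of one of its sub-resources.

    `resource` may have several parts, e.g. ("members", "annotation").
    """
    if resource and resource[0] not in DATASET_RESOURCES:
        raise ValueError(f"unknown dataset resource: {resource[0]}")
    parts = [catalog, dataset, *resource]
    return "/" + "/".join(_segment(p) for p in parts)
```

For instance, `dataset_path("geodataspace", "geounit", "members", "annotation")` yields the seventh path in the list.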

Page 13: GeoDataspace: Simplifying Data Management Tasks with Globus

Publish and Re-execute geounits

• Still in the works

• Each geounit can be published through Globus Publish and re-executed through an analysis platform

Page 14: GeoDataspace: Simplifying Data Management Tasks with Globus

Science Drivers

Solid Earth

Space Science

Hydrology

CSDMS

Page 15: GeoDataspace: Simplifying Data Management Tasks with Globus

Solid Earth

• Allow reproducible, replayable geounits of GPlates

• GPlates

• Software package has several dependencies

• Create geounits of the kinematic representation of the Earth's surface (3D and 4D models)

• GPlates software,

• GPML files (XML for plate tectonics) used in the model,

• output GPML files, which are in a simple X/Y format or can be visualization files (a global set of visualization output, and images as well).

• Integrating geounits in Python workflows

• Incorporate metadata from workflows and use geounit metadata to inform workflows

Page 16: GeoDataspace: Simplifying Data Management Tasks with Globus

Hydrology

• Data processing steps for the VIC model

[Figure: the processing steps, labeled geounit 1, geounit 2, geounit 3, and geounit 4]

Objective: Monitor changes in the data processing steps and compare them across the various runs
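
One way to picture this objective is to fingerprint each processing step and compare the fingerprints between runs. The sketch below assumes each step is described by a JSON-serializable dictionary; the functions `fingerprint` and `compare_runs` and the example fields (`script`, `params`) are hypothetical and not taken from the VIC workflow.

```python
import hashlib
import json


def fingerprint(step):
    """Stable hash of a step description (key order does not matter)."""
    canonical = json.dumps(step, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def compare_runs(baseline, candidate):
    """Classify every step as unchanged, changed, added, or removed.

    Both arguments map a step name (e.g. "geounit 1") to its description.
    """
    report = {}
    for name in sorted(set(baseline) | set(candidate)):
        if name not in candidate:
            report[name] = "removed"
        elif name not in baseline:
            report[name] = "added"
        elif fingerprint(baseline[name]) != fingerprint(candidate[name]):
            report[name] = "changed"
        else:
            report[name] = "unchanged"
    return report


# Hypothetical step descriptions for two runs.
run_a = {
    "geounit 1": {"script": "prepare.py", "params": {"resolution": 0.5}},
    "geounit 2": {"script": "regrid.py", "params": {"method": "bilinear"}},
}
run_b = {
    "geounit 1": {"script": "prepare.py", "params": {"resolution": 0.5}},
    "geounit 2": {"script": "regrid.py", "params": {"method": "nearest"}},
    "geounit 3": {"script": "plot.py", "params": {}},
}
print(compare_runs(run_a, run_b))
```

In practice the step descriptions would come from the audited provenance of each geounit rather than being written by hand.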

Page 17: GeoDataspace: Simplifying Data Management Tasks with Globus

Space Science

• Create geounits of SuperDARN data and its plotting products

• Publish them for validation

Page 18: GeoDataspace: Simplifying Data Management Tasks with Globus

CSDMS

• How geounits should be coupled

• Metadata alignment issues

• If we create geounits of CSDMS models, how do we enable suitable search interfaces with the provenance metadata and CSDMS metadata?

Page 19: GeoDataspace: Simplifying Data Management Tasks with Globus

Current Work

• Working with use cases to bootstrap geounits

• Populating geounits based on Python workflows and incorporating geounits in workflows

• Interfacing geounit Client with Globus Catalog

• Improving distributed search functionality

Page 20: GeoDataspace: Simplifying Data Management Tasks with Globus

Track it!

• http://workspace.earthcube.org/geodataspace

• Software, Source Code, Science Use Cases, Reports, Presentations, News

Page 21: GeoDataspace: Simplifying Data Management Tasks with Globus

Acknowledgements

• National Science Foundation

• EarthCube Community

• Globus team

• CI team
