RDA’s Recently Endorsed Outputs September 16, 2015


Page 1: RDA’s Recently Endorsed Outputs September 16, 2015

RDA’s Recently Endorsed Outputs

September 16, 2015

Page 2: RDA’s Recently Endorsed Outputs September 16, 2015

Agenda

Introduction
Data Foundation and Terminology
Data Type Registries
PID Information Types
Practical Policy
Questions

Page 3: RDA’s Recently Endorsed Outputs September 16, 2015

Data Foundation and Terminology: Talking the Same Language

Peter Wittenburg, Gary Berg-Cross, Raphael Ritz

Page 4: RDA’s Recently Endorsed Outputs September 16, 2015

Summary of the Problem

What is the problem?
  Data organizations (DOrgs) and the ideas about them all differ
  We are all speaking different languages, wasting time and misunderstanding each other in any project involving data
  Different DOrgs make data discovery and integration very time consuming, inefficient and thus expensive
  Different DOrgs prevent us from developing maintainable support software

Who is impacted?
  All efforts to integrate data (federations, BDA projects, etc.)

What are the ramifications of not having the problem resolved?
  Combining data of all sorts across different origins (projects, repositories, disciplines, etc.) is a nightmare and requires a lot of curation and transformation before the actual scientific analysis can start

Page 5: RDA’s Recently Endorsed Outputs September 16, 2015

Highlights of Data Foundation and Terminology Working Group

Structure
  60 members
  Almost all regions
  Different types of institutions and disciplines
  Skillsets ranged from relative newcomers to members with much experience from data-intensive projects

Outputs
  List of core terms essential to harmonize the conceptualization of data organizations
  Graphical model relating the terms
  Set of auxiliary documents, including many use cases, to demonstrate the bottom-up approach and research of the WG
  Term Tool (using Semantic MediaWiki) to store definitions and allow editing, classification and discussion of terms (also open to other groups)

Page 6: RDA’s Recently Endorsed Outputs September 16, 2015

Active Contributors to the Work

Institute/Project | Country/Region | Domain
CNRI | US | IT Research and Systems
U Cardiff | UK | IT Research and Systems
AWI | DE | Oceanography & Environment
MPG | DE | Research Organisation
EUDAT | EU | Data Infrastructure
CLARIN | EU | Linguistic Research Infrastructure
EPOS | EU | Earth Observation Res. Infrastructure
ENES | Int | World Climate Res. Infrastructure
ENVRI | EU | Environmental Res. Infrastructure
DataOne | US | Environmental Infrastructure
ESSD/RENCI | US | Earth Science System Data
NCGEN/RENCI | US | Clinical Genomics
Europeana | EU | Humanities Infrastructure
DataCite/EPIC | Int | PID Infrastructures
DICE | US | IT Research and Systems
CAS | CN | Earth Science Model
ADCIRC/RENCI | US | Ocean and Storm modeling

Page 7: RDA’s Recently Endorsed Outputs September 16, 2015

Impact of Outputs

The European data infrastructure, EUDAT
  Federating data from many discipline repositories where each data collection has a different data organization.
  If integration is not simply done at the physical level (file structures), this heterogeneity makes it very costly to integrate all data to enable re-purposing and to make it accessible at different repositories.

The International CLARIN Project
  According to the Technology Director: Very handy to have a lingua franca when discussing research infrastructure architectures. It was good to be involved as adopting community from the start of the work.

Similar experiences from international colleagues who work on large scale data integration

Harmonization greatly reduces integration time

Page 8: RDA’s Recently Endorsed Outputs September 16, 2015

Endorsements/Adopters

EUDAT, CLARIN and others with dramatic problems in data integration
  Approach aligned with the progress of the DFT Working Group discussion
  Their repository setups now adhere to the DFT model, and their interaction with different communities is based on it
  The Digital Object is described by metadata, is associated with a Persistent ID, and has instances stored in trustworthy repositories (see simplified diagram)

Other projects (humanities, health, bioinformatics, neuroinformatics and atmosphere research) adopted these models and the terminology

[Simplified diagram. Terms: digital object, bitstream, repository, persistent ID, metadata. Relations: isRepresentedBy, isStoredIn, isReferencedBy, isDescribedBy, isa]
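The terms and relations of the simplified diagram can be sketched as plain Python data classes. This is only an illustration and not part of the DFT outputs: the class names, field names and example identifiers are hypothetical, and the pairing of each relation with two terms follows the slide text where it is stated and is otherwise an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Repository:
    name: str


@dataclass
class Bitstream:
    content: bytes
    stored_in: Repository                      # isStoredIn (assumed pairing)


@dataclass
class DigitalObject:
    pid: str                                   # isReferencedBy: persistent ID
    bitstreams: List[Bitstream] = field(default_factory=list)  # isRepresentedBy
    described_by: Optional["Metadata"] = None  # isDescribedBy


@dataclass
class Metadata(DigitalObject):                 # isa (assumed: metadata is itself a digital object)
    attributes: Dict[str, str] = field(default_factory=dict)


# Hypothetical example values, for illustration only.
repo = Repository("example-repository")
meta = Metadata(pid="example/meta-1", attributes={"title": "Sea surface temperature"})
obj = DigitalObject(
    pid="example/obj-1",
    bitstreams=[Bitstream(b"\x01\x02", stored_in=repo)],
    described_by=meta,
)
```

Keeping the persistent ID, the metadata and the stored bitstreams as separate parts is what lets each be managed or replaced independently, which is the re-purposing argument made on the slides.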

Page 9: RDA’s Recently Endorsed Outputs September 16, 2015

Endorsements/Adopters

Institute/Project | Country/Region | Domain
CNRI | US | IT Research and Systems
U Cardiff | UK | IT Research and Systems
MPG | DE | Research Organisation
EUDAT | EU | Data Infrastructure
CLARIN | EU | Linguistic Research Infrastructure
EPOS | EU | Earth Observation Res. Infrastructure
ENES | Int | World Climate Res. Infrastructure
ENVRI | EU | Environmental Res. Infrastructure
ESSD/RENCI | US | Earth Science System Data
NCGEN/RENCI | US | Clinical Genomics
DICE | US | IT Research and Systems
ADCIRC/RENCI | US | Ocean and Storm modeling
Deep Carbon Project | US | Environmental/Atmospheric Research

Note: There may be more projects/institutes that have endorsed or adopted the DFT model without notifying us.

Page 10: RDA’s Recently Endorsed Outputs September 16, 2015

How You Can Endorse

Outputs are openly available to:
  Anyone who wants to run a project, including those with large data collections
    Organizations should be strictly compliant with the basic model to guarantee independence and thus easy re-purposing of all components
  Anyone who is working in a data federation project, integrating data from different sources, or wants to re-purpose data for data intensive science
    Projects could use the model as a common reference model to design transformations
    Projects could use the suggested terminology to achieve quick, mutual understanding
  Software developers, who can adopt the basic model to ensure their software can be used by almost everyone adhering to state of the art principles

Page 11: RDA’s Recently Endorsed Outputs September 16, 2015

How to Access and Use Outputs

“Core Terms and Model” document available on website
  Provides the final model and corresponding terms that can be applied to your project

Additional Resources
  Supplementary documents providing information on conceptualization and background for choices
  Contact the Working Group co-chairs via email or at an upcoming plenary
  Contribute to the now functioning DFT Interest Group via email, wiki, Term Tool
  Send a request to the RDA Europe support team

Page 12: RDA’s Recently Endorsed Outputs September 16, 2015

Next Steps

Since the Working Group focused only on the basic set of core terms, work needs to be continued
  Much more out there, in particular also in other RDA groups, where terminology harmonization would help substantially
  We also see the need to consider the dynamics of the field and to be ready to adapt current definitions and perhaps even the model

A follow-up Data Foundation and Terminology Interest Group has been established and will meet at Plenary 6
  Group is meeting at RDA’s 6th Plenary in Paris next week
  A larger scope of integrated work is being discussed as part of the Data Fabric IG

Page 13: RDA’s Recently Endorsed Outputs September 16, 2015

Contact Information

DFT WG: https://rd-alliance.org/groups/data-foundation-and-terminology-wg.html
DFT IG: https://rd-alliance.org/groups/data-foundations-and-terminology-ig.html
TeD-T Term Definition Tool: http://smw-rda.esc.rzg.mpg.de/index.php/Main_Page
RDA EU Support Team: [email protected]

Page 15: RDA’s Recently Endorsed Outputs September 16, 2015

Data Type Registries

Larry Lannom, CNRI
Daan Broeder, Meertens Institute, KNAW

Page 16: RDA’s Recently Endorsed Outputs September 16, 2015

Summary of the Problem

Data sharing requires that data can be parsed, understood, and reused by people and applications other than those that created the data

How do we do this now?
  For documents, formats are enough (e.g., PDF), and then the document explains itself to humans
  This doesn’t work well with data: numbers are not self-explanatory
    What does the number 7 mean in cell B27?

Data producers may not have explicitly specified certain details in the data: measurement units, coordinate systems, variable names, etc.
  Need a way to precisely characterize those assumptions such that they can be identified by humans and machines that were not closely involved in the data’s creation

Affects all data producers and consumers
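The "number 7 in cell B27" question can be made concrete with a small sketch. It assumes a type record is just a named bundle of the otherwise implicit details (unit, description); the record layout, the identifier `example/water-temperature` and the function name are hypothetical and do not come from the DTR specification.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TypeRecord:
    """Explicit statement of assumptions the producer had in mind."""
    identifier: str
    name: str
    unit: str
    description: str


@dataclass(frozen=True)
class TypedValue:
    """A raw value plus a pointer to the type record that explains it."""
    value: float
    type_id: str


def describe(cell: TypedValue, registry: Dict[str, TypeRecord]) -> str:
    """Turn a bare number into something a stranger can interpret."""
    record = registry[cell.type_id]
    return f"{cell.value} {record.unit} ({record.name}: {record.description})"


# Hypothetical registry content, for illustration only.
registry = {
    "example/water-temperature": TypeRecord(
        identifier="example/water-temperature",
        name="water_temperature",
        unit="degC",
        description="water temperature at the sampling site",
    )
}
cell_b27 = TypedValue(7, "example/water-temperature")
print(describe(cell_b27, registry))
```

Without the type identifier the consumer sees only `7`; with it, the unit and meaning travel with the data instead of living in the producer's head.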

Page 17: RDA’s Recently Endorsed Outputs September 16, 2015

Goal of the DTR Effort: Explicate and Share Assumptions using Types and Type Registries

Evaluate and identify a few assumptions in data that can be codified and shared in order to…

Produce a functioning Registry system that can easily be evaluated by organizations before adoption
  Highly configurable for changing scope of captured and shared assumptions depending on the domain or organization
  Supports several Type record dissemination variations
  Design for allowing federation between multiple Registry instances

The emphasis is not on
  Identifying every possible assumption and data characteristic applicable for all domains
  Technology

Page 18: RDA’s Recently Endorsed Outputs September 16, 2015

Highlights of the Output

Confirmation that detailed and precise data typing is a key consideration in data sharing and reuse and that a federated registry system for such types is highly desirable and needs to accommodate each community’s own requirements

Deployment of a prototype registry implementing one potential data model, against which various use cases can be tested

Involvement of multiple ongoing scientific data management efforts, across a variety of domains, in actively planning for and testing the use of data types and associated registries in their data management efforts

Integration with one additional RDA WG (Persistent Identifier Types) and at least one Interest Group (RDA/CODATA Materials Data, Infrastructure & Interoperability IG)

Development of a set of questions that require further consideration before a detailed recommendation on data typing can be issued

Page 19: RDA’s Recently Endorsed Outputs September 16, 2015

Impact of Use Case: Process Use Case

[Diagram. Elements: Users; Typed Data (records of ID, Type, Payload); Federated Set of Type Registries; Rights (Terms: …, I Agree); Services (Visualization, Data Processing, Data Set Dissemination). Numbered arrows 1 to 4 correspond to the steps below.]

1. Client (process or people) encounters unknown data type.
2. Resolved to Type Registry.
3. Response includes type definitions, relationships, properties, and possibly service pointers. Response can be used locally for processing, or, optionally,
4. typed data or reference to typed data can be sent to service provider.
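The four steps of the process use case can be sketched roughly as follows. This is an illustrative toy, not the prototype's API: the `TypeRegistry` class, the record keys (`name`, `service`), the type identifiers and the service table are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional


class TypeRegistry:
    """One registry instance holding type records keyed by type identifier."""

    def __init__(self, records: Dict[str, dict]) -> None:
        self._records = dict(records)

    def lookup(self, type_id: str) -> Optional[dict]:
        return self._records.get(type_id)


def resolve(type_id: str, federation: List[TypeRegistry]) -> dict:
    """Step 2: ask each registry in the federation until one knows the type."""
    for registry in federation:
        record = registry.lookup(type_id)
        if record is not None:
            return record
    raise KeyError(f"type {type_id!r} is not registered in any registry")


def process(obj: dict,
            federation: List[TypeRegistry],
            services: Dict[str, Callable[[dict], str]]) -> str:
    """Handle one typed object that has 'id', 'type' and 'payload' keys."""
    record = resolve(obj["type"], federation)   # steps 1 and 2
    service_name = record.get("service")        # optional service pointer
    if service_name in services:
        return services[service_name](obj)      # step 4: hand off to a service
    # Step 3: no usable service pointer, so use the definition locally.
    return f"{record.get('name', obj['type'])}: {obj['payload']}"


# Hypothetical federation of two registries and one service.
local = TypeRegistry({"example/temperature": {"name": "temperature"}})
remote = TypeRegistry({"example/xrd-pattern": {"name": "XRD pattern", "service": "plot"}})
services = {"plot": lambda o: "plotted " + o["id"]}

print(process({"id": "obj-1", "type": "example/temperature", "payload": 7},
              [local, remote], services))
print(process({"id": "obj-2", "type": "example/xrd-pattern", "payload": [1, 2]},
              [local, remote], services))
```

The client only needs the type identifier carried with the data; which registry answers, and whether processing happens locally or at a service, is decided at resolution time.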

Page 20: RDA’s Recently Endorsed Outputs September 16, 2015

Endorsements/Adopters

DOI, Materials Science, DCO, EUDAT

Materials Science Adoption Project
  Demo at RDA’s 6th Plenary in Paris
  X-ray diffraction use case
  Normalize data sets resulting from multiple proprietary instruments
  Enable a homogeneous analysis platform for data consumers to perform their analyses

Deep Carbon Observatory
  Goal: given a dataset identifier, discover detailed information about the structure(s) within that dataset, and act accordingly
  DTR is a registry used for explicating structures in the form of type records
  Facilitate norms of behavior relevant to data curation and re-use

Digital Object Identifier
  Given a DOI, what services are relevant and applicable?
  Having chosen a service, how can a client invoke that service?
  Having invoked a service, how can a client process the returned data?

Page 21: RDA’s Recently Endorsed Outputs September 16, 2015

How You Can Endorse

Start a new prototype effort
Follow existing prototype efforts
Attend the BoF at P6
Join the Data Typing WG when it starts
Try the public prototype at typeregistry.org

Page 22: RDA’s Recently Endorsed Outputs September 16, 2015

Next Steps and Contact Information

A follow-up Working Group (WG) is planned: Data Typing
  Leverage results of Data Type Registries Working Group
  Collect results from multiple prototypes
  Best practices for federation

Birds of a Feather session on Data Typing at RDA’s 6th Plenary in Paris (24 Sept., Breakout #6)

Proposed Chairs of Data Typing WG
  Giridhar Manepalli, CNRI
  Simon Cox, CSIRO
  Tobias Weigel, DKRZ

Larry and Daan are still around

Page 23: RDA’s Recently Endorsed Outputs September 16, 2015

PID Information Types: Towards PID Interoperability

Tobias Weigel (DKRZ / University of Hamburg)
Tim DiLauro (Data Conservancy / Johns Hopkins University)

Page 24: RDA’s Recently Endorsed Outputs September 16, 2015

Summary of the Problem

Move from management of files towards management of objects
  How does object management scale with increasing numbers?
  How do we further automate our processes?
  Issues independent from particular disciplines, repositories, management approaches

Understanding the most elemental characteristics of digital objects, for machine agents and human users

Facilitate interoperability across PID systems and simplify PID record usage

Avoid insular solutions and reiteration of efforts: open licenses

[Figure label: IDENTIFIER]

Page 25: RDA’s Recently Endorsed Outputs September 16, 2015

Highlights of the Outputs

More than 50 group members from EU/US/AU
  A lot of technical expertise and community experience

Key Outputs (cf. summary report):
  Conceptual insights on types and their possible structures
  Practical type examples geared towards diverse use cases
  Openly licensed API specification and Java-based prototype

[Figure. An IDENTIFIER with properties: size, checksum, timestamps, aggregation, version, license, format. Two example records, one listing Size, Format, Checksum, Date and one listing Size, Checksum, Format, License. A verification service.]
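A minimal sketch of how typed properties in a PID record could support a verification service like the one in the figure. It is not the WG's API specification or its Java prototype: the record layout, the property names as dictionary keys, and the choice of SHA-256 as the checksum are assumptions made for illustration.

```python
import hashlib
from typing import Dict

# Hypothetical PID record: property name -> value stored with the identifier.
PidRecord = Dict[str, str]


def make_record(data: bytes, fmt: str, license_name: str) -> PidRecord:
    """Build the properties a PID service might store next to the identifier."""
    return {
        "size": str(len(data)),
        "checksum": hashlib.sha256(data).hexdigest(),  # SHA-256 is an assumption
        "format": fmt,
        "license": license_name,
    }


def verify(data: bytes, record: PidRecord) -> bool:
    """Check retrieved bytes against the size and checksum in the PID record."""
    return (
        len(data) == int(record["size"])
        and hashlib.sha256(data).hexdigest() == record["checksum"]
    )


payload = b"temperature,7\n"
record = make_record(payload, fmt="text/csv", license_name="CC-BY-4.0")
print(verify(payload, record))          # intact copy
print(verify(payload + b"x", record))   # altered copy
```

Because size and checksum live in the PID record, a client can check a copy without consulting the repository that holds the object, which is one reading of the "more lightweight access to information" point made on the impact slide.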

Page 26: RDA’s Recently Endorsed Outputs September 16, 2015

Impact of the Outputs

Some initial types were registered in the TR prototype, making it possible to explore further applications
  Information on how to register new types available in the report

Prompted plans in communities and projects for concrete applications

PIDs and typing increasingly seen as a crucial component to decouple management of objects from contents
  Simplify client access to data across domains, implementations and changes in information models
  More lightweight access to information on less accessible objects

Page 27: RDA’s Recently Endorsed Outputs September 16, 2015

Endorsements/Adopters

Adopters can be:
  Communities who can use existing types and share custom types, as well as build tools and services that exploit them
  PID service providers who can offer a typing service as added value beyond registration and resolution, increasing PID interoperability

Adopter | Category | Country | Scope / Goal
ENES/ESGF | Community | Int. | Climate data management (CMIP6)
DCO-DS/RPI | Community | US | Enhancing existing PID usage
EUDAT | Community/Service provider | EU | Added-value service to various disciplinary communities
MGI/NIST | Community | US | Automation of data type conversions
EPIC | Service provider | EU | Generic added-value service
CNRI | Service provider | US | Generic added-value service
DONA | Service provider | Int. | Generic added-value service

Page 28: RDA’s Recently Endorsed Outputs September 16, 2015

How You Can Endorse

Make use of existing type examples, invent your own types and please tell us about it!
  Follow-up RDA WGs on Collections and Data Typing will continue the work on concrete types. The PID Interest Group is also a good place to provide general feedback.

Specification and prototype source code are openly available
  Possible development by EUDAT, DCO, ENES and others as interested adopters

Offer by PID service providers as a service beyond registration and resolution

Contribution to a unified type registry is encouraged

Page 29: RDA’s Recently Endorsed Outputs September 16, 2015

Next Steps and Contact Information

PID Information Types WG: https://rd-alliance.org/groups/pid-information-types-wg.html
PID Interest Group: https://rd-alliance.org/groups/pid-interest-group.html
PID Collections candidate WG: https://rd-alliance.org/groups/pid-collections-wg.html
  https://rd-alliance.org/pid-collections-p6-bof-session.html
Data Typing BoF: https://rd-alliance.org/data-typing-p6-bof-session.html

Page 30: RDA’s Recently Endorsed Outputs September 16, 2015

Practical Policy

Reagan Moore, Rainer Stotzka

Page 31: RDA’s Recently Endorsed Outputs September 16, 2015

Summary of the Problem

Computer actionable policies are used to
  enforce data management
  automate administrative tasks
  validate compliance with assessment criteria
  automate scientific data processing and analyses

Users motivated by issues related to scale, distribution

Practical Policy: Assertion or assurance that is enforced about a (data) collection (data set, digital object, file) by the creators of the collection

Page 32: RDA’s Recently Endorsed Outputs September 16, 2015

Policy Templates

Practical Policy members represented
  11 types of data management systems
  30 institutions
  2 testbeds
    iRODS: Renaissance Computing Institute, DataNet Federation Consortium (DFC)
    GPFS: Institute of Physics of the Academy of Sciences, CESNET; Garching Computing Centre (RZG)

Published two documents
  Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, “Practical Policy Templates”, February, 2015, http://dx.doi.org/10.15497/83E1B3F9-7E17-484A-A466-B3E5775121CC.
  Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, “Practical Policy Implementations”, February, 2015, http://dx.doi.org/10.15497/83E1B3F9-7E17-484A-A466-B3E5775121CC.

Page 33: RDA’s Recently Endorsed Outputs September 16, 2015

Production Environments

Computer actionable rules to enforce:
  Preservation standards
    Authenticity, integrity, chain of custody, arrangement
  Data management plans
    Collection creation, product generation, publication, storage, archives
  Data distribution
    Replication, content distribution network
  Publication
    Descriptive metadata, time dependent access controls
  Processing pipelines
    Workflow execution

Page 34: RDA’s Recently Endorsed Outputs September 16, 2015

Endorsements/Adopters

Distributed data management environments
  EUDAT Data Policy Manager
    B2SAFE use case
  International Neuroinformatics Coordinating Facility
  Institut national de physique nucléaire et de physique des particules
  New Zealand BESTGRID
  DataNet Federation Consortium
    NSF data management plans
    Odum Institute preservation archive
    The iPlant Collaborative genomics data grid
    Science Observatory Network digital library
    SILS LifeTime Library
    HydroShare
  NOAA National Climatic Data Center
  NASA Center for Climate Simulations

Page 35: RDA’s Recently Endorsed Outputs September 16, 2015

Applications

Policy-based collection management
  Purpose for assembling the collection
  Properties required to support the purpose
  Policies that control when and where the properties are enforced
  Procedures that execute operations controlled by the policies
  Persistent state information that is generated by the procedures
  Periodic assessment criteria that verify compliance

RDA Publications
  Policy templates
    Constraints, operations, required state information
  Policy implementations
    Computer actionable rules to automate policy enforcement
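The chain from purpose to periodic assessment can be illustrated with one small integrity policy. This is a sketch under stated assumptions, not one of the published policy implementations (those target systems such as iRODS): the `Collection` class, the function names and the use of SHA-256 are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Collection:
    purpose: str                                           # purpose for assembling the collection
    files: Dict[str, bytes] = field(default_factory=dict)
    state: Dict[str, str] = field(default_factory=dict)    # persistent state: name -> checksum


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()                # hash choice is an assumption


def ingest(collection: Collection, name: str, data: bytes) -> None:
    """Procedure: the integrity policy fires at ingest and records state."""
    collection.files[name] = data
    collection.state[name] = checksum(data)


def assess(collection: Collection) -> List[str]:
    """Periodic assessment: list files that no longer satisfy the property."""
    return [
        name
        for name, data in collection.files.items()
        if collection.state.get(name) != checksum(data)
    ]


archive = Collection(purpose="preservation")
ingest(archive, "a.csv", b"temperature,7\n")
ingest(archive, "b.csv", b"temperature,9\n")
print(assess(archive))                  # nothing violates the property yet

archive.files["a.csv"] = b"corrupted"   # simulate silent corruption
print(assess(archive))                  # the damaged file is reported
```

Here the property is integrity, the policy says when it is enforced (at ingest), the procedure computes the checksum, the stored checksum is the persistent state, and `assess` is the periodic compliance check.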

Page 36: RDA’s Recently Endorsed Outputs September 16, 2015

Next Steps and Contact Information

Data Fabric Interest Group
  Policies to support
    Federation
    Interoperability

Data Foundations and Terminology Interest Group
  Vocabulary for policy management

Interoperability testbeds
  EUDAT: http://eudat.eu/data-access-and-reuse-policies-darup
  National Data Service: http://www.nationaldataservice.org
  DataNet Federation Consortium: http://datafed.org

Page 37: RDA’s Recently Endorsed Outputs September 16, 2015

Thank you.

Questions?