RDA’s Recently Endorsed Outputs
September 16, 2015
Agenda
  Introduction
  Data Foundation and Terminology
  Data Type Registries
  PID Information Types
  Practical Policy
  Questions
Data Foundation and Terminology: Talking the Same Language
Peter Wittenburg, Gary Berg-Cross, Raphael Ritz
Summary of the Problem
What is the problem?
  Data organizations (DOrgs) and the ideas about them all differ
  We are all speaking different languages, wasting time and misunderstanding each other in any project involving data
  Different DOrgs make data discovery and integration very time consuming, inefficient and thus expensive
  Different DOrgs prevent us from developing maintainable support software
Who is impacted?
  All efforts to integrate data (federations, BDA projects, etc.)
What are the ramifications of not having the problem resolved?
  Combining data of all sorts across different origins (projects, repositories, disciplines, etc.) is a nightmare and requires a lot of curation and transformation before the actual scientific analysis can start
Highlights of Data Foundation and Terminology Working Group
Structure
  60 members
  Almost all regions
  Different types of institutions and disciplines
  Skillsets ranged from relative newcomers up to members with much experience from data intensive projects
Outputs
  List of core terms essential to harmonize conceptualization of data organizations
  Graphical model relating the terms
  Set of auxiliary documents including many use cases to demonstrate the bottom-up approach and research of the WG
  Term Tool (using Semantic MediaWiki) to store definitions and allow editing, classification and discussion of terms (which is also open for other groups)
Active Contributors to the Work
Institute/Project | Country/Region | Domain
CNRI | US | IT Research and Systems
U Cardiff | UK | IT Research and Systems
AWI | DE | Oceanography & Environment
MPG | DE | Research Organisation
EUDAT | EU | Data Infrastructure
CLARIN | EU | Linguistic Research Infrastructure
EPOS | EU | Earth Observation Res. Infrastructure
ENES | Int | World Climate Res. Infrastructure
ENVRI | EU | Environmental Res. Infrastructure
DataOne | US | Environmental Infrastructure
ESSD/RENCI | US | Earth Science System Data
NCGEN/RENCI | US | Clinical Genomics
Europeana | EU | Humanities Infrastructure
DataCite/EPIC | Int | PID Infrastructures
DICE | US | IT Research and Systems
CAS | CN | Earth Science Model
ADCIRC/RENCI | US | Ocean and Storm modeling
Impact of Outputs
The European data infrastructure, EUDAT
  Federating data from many discipline repositories where each data collection has a different data organization
  If integration is not simply done at the physical level (file structures), this heterogeneity makes it very costly to integrate all data to enable re-purposing and to make it accessible at different repositories
The international CLARIN project
  According to the Technology Director: very handy to have a lingua franca when discussing research infrastructure architectures. It was good to be involved as an adopting community from the start of the work.
Similar experiences from international colleagues who work on large scale data integration
Harmonization greatly reduces integration time
Endorsements/Adopters
EUDAT, CLARIN and others with dramatic problems in data integration
  Approach aligned with the progress of the DFT Working Group discussion
  Their repository setups now adhere to the DFT model, and interaction with different communities is based on it
  The Digital Object is described by metadata, is associated with a Persistent ID, and has instances stored in trustworthy repositories (see simplified diagram)
Other projects (humanities, health, bioinformatics, neuroinformatics and atmosphere research) adopted these models and the terminology
[Simplified diagram. Nodes: digital object, bitstream, repository, persistent ID, metadata. Relation labels: isRepresentedBy, isStoredIn, isReferencedBy, isDescribedBy, isa]
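The relations named in the simplified diagram can be sketched as a small data model. This is only an illustration, not part of the DFT output: the class and field names are hypothetical, the identifiers are made up, and the assignment of each relation label to a pair of nodes (for example, that a bitstream isStoredIn a repository, or that metadata isa digital object) is an assumption, since the transcript preserves the labels but not the arrows.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Repository:
    name: str


@dataclass
class Bitstream:
    content: bytes
    # isStoredIn (assumed edge): copies of the bitstream live in repositories
    stored_in: List[Repository] = field(default_factory=list)


@dataclass
class DigitalObject:
    pid: str              # isReferencedBy: the persistent ID
    bitstream: Bitstream  # isRepresentedBy
    # isDescribedBy: metadata records describing this object
    described_by: List["Metadata"] = field(default_factory=list)


@dataclass
class Metadata(DigitalObject):
    """isa (assumed edge): a metadata record is itself a digital object."""


def describe(obj: DigitalObject) -> str:
    """Summarize an object using only the relations of the model."""
    repos = ", ".join(r.name for r in obj.bitstream.stored_in)
    return (f"{obj.pid} ({len(obj.bitstream.content)} bytes) "
            f"stored in [{repos}], {len(obj.described_by)} metadata record(s)")


# Made-up example values, for illustration only.
repo = Repository("repo-a")
meta = Metadata(pid="hdl:example/meta-1",
                bitstream=Bitstream(b"<title>Sample</title>", [repo]))
data = DigitalObject(pid="hdl:example/data-1",
                     bitstream=Bitstream(b"\x01\x02\x03", [repo]),
                     described_by=[meta])
summary = describe(data)
```

Because the metadata record carries its own persistent ID and bitstream, it can be stored, referenced and described in the same way as the data it describes, which is the kind of independence between components that the model aims for.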
Endorsements/Adopters
Institute/Project | Country/Region | Domain
CNRI | US | IT Research and Systems
U Cardiff | UK | IT Research and Systems
MPG | DE | Research Organisation
EUDAT | EU | Data Infrastructure
CLARIN | EU | Linguistic Research Infrastructure
EPOS | EU | Earth Observation Res. Infrastructure
ENES | Int | World Climate Res. Infrastructure
ENVRI | EU | Environmental Res. Infrastructure
ESSD/RENCI | US | Earth Science System Data
NCGEN/RENCI | US | Clinical Genomics
DICE | US | IT Research and Systems
ADCIRC/RENCI | US | Ocean and Storm modeling
Deep Carbon Project | US | Environmental/Atmospheric Research
Note: There may be more projects/institutes that have endorsed or adopted the DFT model without notifying us.
How You Can Endorse
Outputs are openly available to:
  Anyone who wants to run a project, including those with large data collections
    Organizations should be strictly compliant with the basic model to guarantee independence and thus easy re-purposing of all components
  Anyone who is working in a data federation project, integrating data from different sources, or wants to re-purpose data for data intensive science
    Projects could use the model as a common reference model to design transformations
    Projects could use the suggested terminology to achieve quick, mutual understanding
  Software developers, who can adopt the basic model to ensure their software can be used by almost everyone adhering to state of the art principles
How to Access and Use Outputs
“Core Terms and Model” document available on website
  Provides the final model and corresponding terms that can be applied to your project
Additional Resources
  Supplementary documents providing information on conceptualization and background for choices
  Contact the Working Group co-chairs via email or at upcoming plenary
  Contribute to the now functioning DFT Interest Group via email, wiki, Term Tool
  Send a request to the RDA Europe support team
Next Steps
Since the Working Group focused only on the basic set of core terms, work needs to be continued
  Much more is out there, in particular in other RDA groups, where terminology harmonization would help substantially
  We also see the need to consider the dynamics of the field and to be ready to adapt current definitions and perhaps even the model
A follow-up Data Foundation and Terminology Interest Group has been established
  Group is meeting at RDA’s 6th Plenary in Paris next week
  A larger scope of integrated work is being discussed as part of the Data Fabric IG
Contact Information
DFT WG: https://rd-alliance.org/groups/data-foundation-and-terminology-wg.html
DFT IG: https://rd-alliance.org/groups/data-foundations-and-terminology-ig.html
TeD-T Term Definition Tool: http://smw-rda.esc.rzg.mpg.de/index.php/Main_Page
RDA EU Support Team: [email protected]
Data Type Registries
Larry Lannom, CNRI
Daan Broeder, Meertens Institute, KNAW
Summary of the Problem
Data sharing requires that data can be parsed, understood, and reused by people and applications other than those that created the data
How do we do this now?
  For documents, formats are enough, e.g., PDF, and then the document explains itself to humans
  This doesn’t work well with data: numbers are not self-explanatory
    What does the number 7 mean in cell B27?
Data producers may not have explicitly specified certain details in the data: measurement units, coordinate systems, variable names, etc.
Need a way to precisely characterize those assumptions such that they can be identified by humans and machines that were not closely involved in the data's creation
Affects all data producers and consumers
Goal of the DTR Effort: Explicate and Share Assumptions using Types and Type Registries
Evaluate and identify a few assumptions in data that can be codified and shared in order to…
  Produce a functioning Registry system that can easily be evaluated by organizations before adoption
    Highly configurable for changing scope of captured and shared assumptions depending on the domain or organization
    Supports several Type record dissemination variations
  Design for allowing federation between multiple Registry instances
The emphasis is not on
  Identifying every possible assumption and data characteristic applicable for all domains
  Technology
Highlights of the Output
Confirmation that detailed and precise data typing is a key consideration in data sharing and reuse and that a federated registry system for such types is highly desirable and needs to accommodate each community’s own requirements
Deployment of a prototype registry implementing one potential data model, against which various use cases can be tested
Involvement of multiple ongoing scientific data management efforts, across a variety of domains, in actively planning for and testing the use of data types and associated registries in their data management efforts
Integration with one additional RDA WG (Persistent Identifier Types) and at least one Interest Group (RDA/CODATA Materials Data, Infrastructure & Interoperability IG)
Development of a set of questions that require further consideration before a detailed recommendation on data typing can be issued
Impact of Use Case: Process Use Case
[Diagram labels: Users; Typed Data (records each with ID, Type, Payload); 1010011010101….; Visualization; I Agree; Terms:…; Rights; Services; Data Processing; Data Set; Dissemination; Federated Set of Type Registries]
1. Client (process or people) encounters unknown data type.
2. Resolved to Type Registry.
3. Response includes type definitions, relationships, properties, and possibly service pointers. Response can be used locally for processing, or, optionally,
4. typed data or reference to typed data can be sent to service provider.
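The four steps of the process use case can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions, not the API of the DTR prototype: the `TypeRegistry` class, the `resolve_in_federation` and `handle` functions, the type identifier and the service name are all hypothetical, and federation is modelled simply as asking each registry in turn.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TypeRecord:
    type_id: str
    description: str
    properties: Dict[str, str] = field(default_factory=dict)
    services: List[str] = field(default_factory=list)  # optional service pointers


@dataclass
class TypedData:
    object_id: str
    type_id: str
    payload: bytes


class TypeRegistry:
    """Hypothetical in-memory registry holding type records."""

    def __init__(self) -> None:
        self._records: Dict[str, TypeRecord] = {}

    def register(self, record: TypeRecord) -> None:
        self._records[record.type_id] = record

    def resolve(self, type_id: str) -> Optional[TypeRecord]:
        return self._records.get(type_id)


def resolve_in_federation(registries: List[TypeRegistry],
                          type_id: str) -> Optional[TypeRecord]:
    """Step 2: ask each registry in turn until one knows the type."""
    for registry in registries:
        record = registry.resolve(type_id)
        if record is not None:
            return record
    return None


def handle(data: TypedData, known: Dict[str, TypeRecord],
           registries: List[TypeRegistry]) -> str:
    record = known.get(data.type_id)
    if record is None:  # Step 1: the client meets a type it does not know
        record = resolve_in_federation(registries, data.type_id)
        if record is None:
            return f"{data.object_id}: type {data.type_id} not found"
        known[data.type_id] = record  # Step 3: keep the response for local use
    if record.services:  # Step 4: optionally hand off to a service provider
        return f"{data.object_id}: send to {record.services[0]}"
    unit = record.properties.get("unit", "unspecified unit")
    return f"{data.object_id}: interpret locally as {record.description} in {unit}"


# Made-up example: the second registry in the federation knows the type.
registry_a, registry_b = TypeRegistry(), TypeRegistry()
registry_b.register(TypeRecord("example/temperature", "air temperature",
                               {"unit": "degC"}))
cache: Dict[str, TypeRecord] = {}
result = handle(TypedData("obj-1", "example/temperature", b"7"),
                cache, [registry_a, registry_b])
```

In this sketch the number 7 stops being ambiguous once the type record supplies the description and the unit, which is the kind of assumption the registry is meant to make explicit.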
Endorsements/Adopters
DOI, Materials Science, DCO, EUDAT
Materials Science Adoption Project
  Demo at RDA’s 6th Plenary in Paris
  X-ray diffraction use case
  Normalize data sets resulting from multiple proprietary instruments
  Enable a homogeneous analysis platform for data consumers to perform their analyses
Deep Carbon Observatory
  Goal: given a dataset identifier, discover detailed information about the structure(s) within that dataset, and act accordingly
  DTR is a registry used for explicating structures in the form of type records
  Facilitate norms of behavior relevant to data curation and re-use
Digital Object Identifier
  Given a DOI, what services are relevant and applicable?
  Having chosen a service, how can a client invoke that service?
  Having invoked a service, how can a client process the returned data?
How You Can Endorse
Start a new prototype effort
Follow existing prototype efforts
Attend the BOF at P6
Join the Data Typing WG when it starts
Try the public prototype at typeregistry.org
Next Steps and Contact Information
A follow-up Working Group (WG) is planned: Data Typing
  Leverage results of Data Type Registries Working Group
  Collect results from multiple prototypes
  Best practices for federation
Birds of a Feather session on Data Typing at RDA’s 6th Plenary in Paris (24 Sept., Breakout #6)
Proposed Chairs of Data Typing WG
  Giridhar Manepalli, CNRI
  Simon Cox, CSIRO
  Tobias Weigel, DKRZ
Larry and Daan are still around
PID Information Types: Towards PID Interoperability
Tobias Weigel (DKRZ / University of Hamburg)
Tim DiLauro (Data Conservancy / Johns Hopkins University)
Summary of the Problem
Move from management of files towards management of objects
  How does object management scale with increasing numbers?
  How do we further automate our processes?
  Issues independent from particular disciplines, repositories, management approaches
Understanding the most elemental characteristics of digital objects, for machine agents and human users
Facilitate interoperability across PID systems and simplify PID record usage
Avoid insular solutions and reiteration of efforts: open licenses
Highlights of the Outputs
More than 50 group members from EU/US/AU
  A lot of technical expertise and community experience
Key Outputs (cf. summary report):
  Conceptual insights on types and their possible structures
  Practical type examples geared towards diverse use cases
  Openly licensed API specification and Java-based prototype
[Diagram labels: IDENTIFIER; properties (size, checksum, timestamps, aggregation, version, license, format); example records listing Size, Format, Checksum, Date and Size, Checksum, Format, License; Verification service]
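The idea of a verification service working from typed PID properties can be sketched as follows. This is an assumption-laden illustration rather than the WG's API specification or its Java prototype: the record layout, the `make_pid_record` and `verify` functions, the `sha256:` prefix convention and the identifier are all hypothetical. Only the property names (size, checksum, format, license) come from the slide.

```python
import hashlib
from typing import Dict


def make_pid_record(content: bytes, fmt: str, license_name: str) -> Dict[str, str]:
    """Build the typed properties stored alongside an identifier."""
    return {
        "size": str(len(content)),
        "checksum": "sha256:" + hashlib.sha256(content).hexdigest(),
        "format": fmt,
        "license": license_name,
    }


def verify(record: Dict[str, str], content: bytes) -> bool:
    """Check retrieved bytes against the size and checksum in the record."""
    algorithm, _, expected = record["checksum"].partition(":")
    actual = hashlib.new(algorithm, content).hexdigest()
    return int(record["size"]) == len(content) and actual == expected


# In-memory stand-in for a PID resolver; the identifier is made up.
pid_store = {
    "hdl:example/obj-1": make_pid_record(b"1,2,3\n", "text/csv", "CC-BY-4.0"),
}
record = pid_store["hdl:example/obj-1"]
ok = verify(record, b"1,2,3\n")        # unchanged content
tampered = verify(record, b"1,2,4\n")  # one byte differs
```

Because the client reads only the properties in the PID record, it can decide whether an object is intact, or whether its format and license are acceptable, before fetching or opening the object itself.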
Impact of the Outputs
Some initial types were registered in the TR prototype, making it possible to explore further applications
  Information on how to register new types is available in the report
Incited plans in communities and projects about concrete applications
PIDs and typing increasingly seen as a crucial component to decouple management of objects from contents
  Simplify client access to data across domains, implementations and changes in information models
  More lightweight access to information on less accessible objects
Endorsements/Adopters
Adopters can be:
  Communities who can use existing types and share custom types, as well as build tools and services that exploit them
  PID service providers who can offer a typing service as added value beyond registration and resolution, increasing PID interoperability
Adopter | Category | Country | Scope / Goal
ENES/ESGF | Community | Int. | Climate data management (CMIP6)
DCO-DS/RPI | Community | US | Enhancing existing PID usage
EUDAT | Community/Service provider | EU | Added-value service to various disciplinary communities
MGI/NIST | Community | US | Automation of data type conversions
EPIC | Service provider | EU | Generic added-value service
CNRI | Service provider | US | Generic added-value service
DONA | Service provider | Int. | Generic added-value service
How You Can Endorse
Make use of existing type examples, invent your own types and please tell us about it!
  Follow-up RDA WGs on Collections and Data Typing will continue the work on concrete types. The PID Interest Group is also a good place to provide general feedback.
Specification and prototype source code are openly available
  Possible development by EUDAT, DCO, ENES and others as interested adopters
Offer by PID service providers as a service beyond registration and resolution
Contribution to a unified type registry is encouraged
Next Steps and Contact Information
PID Information Types WG: https://rd-alliance.org/groups/pid-information-types-wg.html
PID Interest Group: https://rd-alliance.org/groups/pid-interest-group.html
PID Collections candidate WG:
  https://rd-alliance.org/groups/pid-collections-wg.html
  https://rd-alliance.org/pid-collections-p6-bof-session.html
Data Typing BoF: https://rd-alliance.org/data-typing-p6-bof-session.html
Practical Policy
Reagan Moore, Rainer Stotzka
Summary of the Problem
Computer actionable policies are used to
  enforce data management
  automate administrative tasks
  validate compliance with assessment criteria
  automate scientific data processing and analyses
Users motivated by issues related to scale, distribution
Practical Policy: Assertion or assurance that is enforced about a (data) collection (data set, digital object, file) by the creators of the collection
Policy Templates
Practical Policy members represented
  11 types of data management systems
  30 institutions
  2 testbeds
    iRODS: Renaissance Computing Institute, DataNet Federation Consortium (DFC)
    GPFS: Institute of Physics of the Academy of Sciences, CESNET; Garching Computing Centre (RZG)
Published two documents
  Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, “Practical Policy Templates”, February, 2015, http://dx.doi.org/10.15497/83E1B3F9-7E17-484A-A466-B3E5775121CC.
  Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, “Practical Policy Implementations”, February, 2015, http://dx.doi.org/10.15497/83E1B3F9-7E17-484A-A466-B3E5775121CC.
Production Environments
Computer actionable rules to enforce:
  Preservation standards
    Authenticity, integrity, chain of custody, arrangement
  Data management plans
    Collection creation, product generation, publication, storage, archives
  Data distribution
    Replication, content distribution network
  Publication
    Descriptive metadata, time dependent access controls
  Processing pipelines
    Workflow execution
Endorsements/Adopters
Distributed data management environments
  EUDAT Data Policy Manager
    B2SAFE use case
  International Neuroinformatics Coordinating Facility
  Institut national de physique nucléaire et de physique des particules
  New Zealand BESTGRID
  DataNet Federation Consortium
    NSF data management plans
    Odum Institute preservation archive
    The iPlant Collaborative genomics data grid
    Science Observatory Network digital library
    SILS LifeTime Library
    HydroShare
  NOAA National Climatic Data Center
  NASA Center for Climate Simulations
Applications
Policy-based collection management
  Purpose for assembling the collection
  Properties required to support the purpose
  Policies that control when and where the properties are enforced
  Procedures that execute operations controlled by the policies
  Persistent state information that is generated by the procedures
  Periodic assessment criteria that verify compliance
RDA Publications
  Policy templates
    Constraints, operations, required state information
  Policy implementations
    Computer actionable rules to automate policy enforcement
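The chain from policy to procedure to persistent state to periodic assessment can be illustrated with a toy replication policy. This is a sketch under stated assumptions: it is plain Python, not iRODS rule language or any template from the published documents, and the `Policy` class, the `min_replicas` helper, the site names and the file names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataObject:
    name: str
    # Persistent state information: where replicas currently live
    replicas: List[str] = field(default_factory=list)


@dataclass
class Policy:
    name: str
    is_satisfied: Callable[[DataObject], bool]  # assessment criterion
    enforce: Callable[[DataObject], None]       # procedure run when it fails


def min_replicas(count: int, sites: List[str]) -> Policy:
    """Policy: every object must have at least `count` replicas."""
    def is_satisfied(obj: DataObject) -> bool:
        return len(obj.replicas) >= count

    def enforce(obj: DataObject) -> None:
        for site in sites:
            if len(obj.replicas) >= count:
                break
            if site not in obj.replicas:
                obj.replicas.append(site)  # stands in for a real copy operation

    return Policy(f"min-replicas-{count}", is_satisfied, enforce)


def apply_policies(collection: List[DataObject],
                   policies: List[Policy]) -> List[str]:
    """Run each policy's procedure wherever its criterion is not met."""
    log = []
    for obj in collection:
        for policy in policies:
            if not policy.is_satisfied(obj):
                policy.enforce(obj)
                log.append(f"{policy.name}: enforced on {obj.name}")
    return log


def assess(collection: List[DataObject],
           policies: List[Policy]) -> Dict[str, bool]:
    """Periodic assessment: which objects comply with every policy?"""
    return {obj.name: all(p.is_satisfied(obj) for p in policies)
            for obj in collection}


# Made-up collection: one object is under-replicated.
collection = [DataObject("a.dat", ["site-1"]),
              DataObject("b.dat", ["site-1", "site-2"])]
policies = [min_replicas(2, ["site-1", "site-2", "site-3"])]
before = assess(collection, policies)
log = apply_policies(collection, policies)
after = assess(collection, policies)
```

Keeping the assessment criterion separate from the enforcement procedure lets the same policy be used both to repair a collection and to verify compliance later without changing anything.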
Next Steps and Contact Information
Data Fabric Interest Group
  Policies to support
    Federation
    Interoperability
Data Foundations and Terminology Interest Group
  Vocabulary for policy management
Interoperability testbeds
  EUDAT: http://eudat.eu/data-access-and-reuse-policies-darup
  National Data Service: http://www.nationaldataservice.org
  DataNet Federation Consortium: http://datafed.org
Thank you.
Questions?