Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation
Andreas Rauber
Vienna University of Technology
Favoritenstr. 9-11/188
1040 Vienna, Austria
[email protected] http://ww.ifs.tuwien.ac.at/~andi

Motivation
Research data is fundamental for science
- Results are based on data
- Data serves as input for workflows and experiments
- Data is the source for graphs and visualisations in publications
Data is needed for reproducibility
- Repeat experiments
- Verify / compare results
Need to provide the specific data set
- Service for data repositories
Put data in a data repository, assign a PID (DOI, ARK, URI, …), make it accessible: done?
Source: open.wien.gv.at
https://commons.wikimedia.org/w/index.php?curid=30978545

Identification of Dynamic Data
Usually, datasets have to be static
- Fixed set of data, no changes: no corrections to errors, no new data being added
But: (research) data is dynamic
- Adding new data, correcting errors, enhancing data quality, …
- Changes sometimes highly dynamic, at irregular intervals
Current approaches
- Identifying entire data stream, without any versioning
- Using “accessed at” date
- “Artificial” versioning by identifying batches of data (e.g. annual), aggregating changes into releases (time-delayed!)
Would like to identify precisely the data as it existed at a specific point in time

Granularity of Subsets
What about the granularity of data to be identified?
- Enormous amounts of CSV data
- Researchers use specific subsets of data
- Need to identify precisely the subset used
Current approaches
- Storing a copy of subset as used in study -> scalability
- Citing entire dataset, providing textual description of subset -> imprecise (ambiguity)
- Storing list of record identifiers in subset -> scalability, not for arbitrary subsets (e.g. when not entire record selected)
Would like to be able to identify precisely the subset of (dynamic) data used in a process

What we do NOT want…
Common approaches to data management…
(from PhD Comics: A Story Told in File Names, 28.5.2010)
Source: http://www.phdcomics.com/comics.php?f=1323

RDA WG Data Citation
Research Data Alliance WG on Data Citation: Making Dynamic Data Citeable
March 2014 – September 2015
- Concentrating on the problems of large, dynamic (changing) datasets
Final version presented Sep 2015 at P7 in Paris, France
Endorsed September 2016 at P8 in Denver, CO
https://www.rd-alliance.org/groups/data-citation-wg.html

RDA WGDC - Solution
We have
- Data & some means of access (“query”)
Make
- Data: time-stamped and versioned
- Query: time-stamped
Data Citation:
- Store query
- Assign the PID to the time-stamped query (which, dynamically, leads to the data)
Access:
- Re-execute query on versioned data according to timestamp
Dynamic Data Citation:
- Dynamic data & dynamic citation of data
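
The principle on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the working group's implementation: it uses an in-memory SQLite database, a hypothetical table `obs`, integer logical timestamps, and hypothetical column names `valid_from` / `valid_to`. Records are never overwritten; a stored, time-stamped query is re-executed against the state that was valid at its timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical versioned table: every row carries its validity interval.
conn.execute(
    "CREATE TABLE obs (id INTEGER, value REAL, valid_from INTEGER, valid_to INTEGER)"
)

def insert(rec_id, value, ts):
    conn.execute("INSERT INTO obs VALUES (?, ?, ?, NULL)", (rec_id, value, ts))

def delete(rec_id, ts):
    # Mark the live version as ended instead of removing it.
    conn.execute(
        "UPDATE obs SET valid_to = ? WHERE id = ? AND valid_to IS NULL", (ts, rec_id)
    )

def update(rec_id, value, ts):
    # An update closes the old version and adds a new one.
    delete(rec_id, ts)
    insert(rec_id, value, ts)

def run_as_of(where_clause, params, ts):
    """Re-execute a stored selection on the data as it existed at time ts."""
    sql = (
        "SELECT id, value FROM obs WHERE (" + where_clause + ") "
        "AND valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) ORDER BY id"
    )
    return conn.execute(sql, (*params, ts, ts)).fetchall()

insert(1, 10.0, ts=1)
insert(2, 20.0, ts=1)
# The "citation": the query and its timestamp are stored, not the data.
cited = {"where": "value >= ?", "params": (5.0,), "timestamp": 2}
update(1, 11.5, ts=3)   # a later correction
delete(2, ts=4)
insert(3, 30.0, ts=5)

as_cited = run_as_of(cited["where"], cited["params"], cited["timestamp"])
current = run_as_of(cited["where"], cited["params"], ts=6)
print(as_cited)  # data as it existed when the query was stored
print(current)   # same query against the current state
```

The same stored query yields either the subset as cited or the subset as it is now, depending only on the timestamp passed in.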

Data Citation – Deployment
Researcher uses workbench to identify subset of data
Upon executing selection (“download”) user gets
- Data (package, access API, …)
- PID (e.g. DOI) (query is time-stamped and stored)
- Hash value computed over the data for local storage
- Recommended citation text (e.g. BibTeX)
PID resolves to landing page
- Provides detailed metadata, link to parent data set, subset, …
- Option to retrieve original data OR current version OR changes
Upon activating PID associated with a data citation
- Query is re-executed against time-stamped and versioned DB
- Results as above are returned
Query store aggregates data usage
Note: the query string provides excellent provenance information on the data set!
This is an important advantage over traditional approaches relying on, e.g., storing a list of identifiers or a DB dump!
Identify which parts of the data are used. If data changes, identify which queries (studies) are affected.

Data Citation – Output
14 Recommendations grouped into 4 phases:
- Preparing data and query store
- Persistently identifying specific data sets
- Resolving PIDs
- Upon modifications to the data infrastructure
2-page flyer: https://rd-alliance.org/recommendations-working-group-data-citation-revision-oct-20-2015.html
More detailed report: IEEE TCDL 2016 http://www.ieee-tcdl.org/Bulletin/v12n1/papers/IEEE-TCDL-DC-2016_paper_1.pdf

Data Citation – Recommendations
Preparing Data & Query Store
- R1 – Data Versioning
- R2 – Timestamping
- R3 – Query Store
When Data should be persisted
- R4 – Query Uniqueness
- R5 – Stable Sorting
- R6 – Result Set Verification
- R7 – Query Timestamping
- R8 – Query PID
- R9 – Store Query
- R10 – Citation Text
When Resolving a PID
- R11 – Landing Page
- R12 – Machine Actionability
Upon Modifications to the Data Infrastructure
- R13 – Technology Migration
- R14 – Migration Verification

A) Preparing the Data and the Query Store
R1 – Data Versioning: Apply versioning to ensure earlier states of data sets can be retrieved
R2 – Timestamping: Ensure that operations on data are timestamped, i.e. any additions, deletions are marked with a timestamp
R3 – Query Store: Provide means to store the queries and metadata to re-execute them in the future
Note:
- R1 & R2 are already pretty much standard in many (RDBMS-) research databases
- Different ways to implement
- A bit more challenging for some data types (XML, LOD, …)
Note:
- R3: query store usually pretty small, even for extremely high query volumes

B) Persistently Identify Specific Data Sets (1/2)
When a data set should be persisted:
R4 – Query Uniqueness: Re-write the query to a normalized form so that identical queries can be detected. Compute a checksum of the normalized query to efficiently detect identical queries
R5 – Stable Sorting: Ensure an unambiguous sorting of the records in the data set
R6 – Result Set Verification: Compute fixity information/checksum of the query result set to enable verification of the correctness of a result upon re-execution
R7 – Query Timestamping: Assign a timestamp to the query based on the last update to the entire database (or the last update to the selection of data affected by the query or the query execution time). This allows retrieving the data as it existed at query time
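
A minimal sketch of R4 to R6, assuming SHA-256 as the checksum and a deliberately crude normalization (whitespace collapsing, lowercasing, trailing semicolon removed). A real query re-writer would have to parse the query, since lowercasing would also change string literals; the function names here are hypothetical.

```python
import hashlib
import re

def normalize_query(sql):
    """Crude normalization (assumption): collapse whitespace, drop ';', lowercase."""
    collapsed = re.sub(r"\s+", " ", sql.strip().rstrip(";").strip())
    return collapsed.lower()

def query_checksum(sql):
    # R4: identical queries are detected by comparing checksums of the normal form.
    return hashlib.sha256(normalize_query(sql).encode("utf-8")).hexdigest()

def result_checksum(rows, key_index=0):
    # R5: sort on a unique key so the order is unambiguous.
    # R6: hash the serialized rows to obtain fixity information.
    digest = hashlib.sha256()
    for row in sorted(rows, key=lambda r: r[key_index]):
        digest.update(repr(tuple(row)).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

q1 = "SELECT id, value FROM obs WHERE value >= 5;"
q2 = "  select id,  value\nfrom obs where value >= 5 "
same_query = query_checksum(q1) == query_checksum(q2)
same_result = (
    result_checksum([(2, 20.0), (1, 10.0)]) == result_checksum([(1, 10.0), (2, 20.0)])
)
print(same_query, same_result)
```

Without the stable sort in `result_checksum`, two executions that return the same records in a different order would produce different checksums and verification would fail spuriously.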

B) Persistently Identify Specific Data Sets (2/2)
When a data set should be persisted:
R8 – Query PID: Assign a new PID to the query if either the query is new or if the result set returned from an earlier identical query is different due to changes in the data. Otherwise, return the existing PID
R9 – Store Query: Store query and metadata (e.g. PID, original and normalized query, query & result set checksum, timestamp, superset PID, data set description and other) in the query store
R10 – Citation Text: Provide citation text including the PID in the format prevalent in the designated community to lower barrier for citing data.
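
The decision rule in R8 and the record described in R9 could look roughly like this. The in-memory dictionary, the field names, and the `hypothetical-pid:` prefix are assumptions for illustration; a real deployment would mint PIDs through a registration service such as a DOI provider.

```python
import uuid

query_store = {}  # PID -> metadata record (stand-in for a real query store)

def persist_query(original, query_checksum, result_checksum, timestamp, superset_pid):
    # R8: same normalized query and same result set -> return the existing PID.
    for pid, rec in query_store.items():
        if (rec["query_checksum"] == query_checksum
                and rec["result_checksum"] == result_checksum):
            return pid
    # New query, or identical query whose result changed: assign a new PID.
    pid = "hypothetical-pid:" + uuid.uuid4().hex  # placeholder, not a real PID scheme
    # R9: store the query together with its metadata.
    query_store[pid] = {
        "original_query": original,
        "query_checksum": query_checksum,
        "result_checksum": result_checksum,
        "timestamp": timestamp,
        "superset_pid": superset_pid,
    }
    return pid

sql = "SELECT id, value FROM obs WHERE value >= 5"
pid_a = persist_query(sql, "q-abc", "r-111", 2, "hypothetical-pid:parent")
pid_b = persist_query(sql, "q-abc", "r-111", 3, "hypothetical-pid:parent")  # unchanged data
pid_c = persist_query(sql, "q-abc", "r-222", 6, "hypothetical-pid:parent")  # data changed
print(pid_a == pid_b, pid_a == pid_c)
```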

C) Resolving PIDs and Retrieving Data
R11 – Landing Page: Make the PIDs resolve to a human readable landing page that provides the data (via query re-execution) and metadata, including a link to the superset (PID of the data source) and citation text snippet
R12 – Machine Actionability: Provide an API / machine actionable landing page to access metadata and data via query re-execution
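
One way to picture R11 and R12 is a resolver that serves the same stored record either as a human-readable page or as JSON. The record contents, the `resolve` function, and the use of a media-type string to switch between the two are assumptions for this sketch; the recommendations do not prescribe a specific API.

```python
import json

# Hypothetical query-store entry; all values are placeholders.
QUERY_STORE = {
    "hypothetical-pid:0001": {
        "query": "SELECT id, value FROM obs WHERE value >= 5 ORDER BY id",
        "timestamp": 2,
        "result_checksum": "(placeholder)",
        "superset_pid": "hypothetical-pid:parent",
        "citation_text": "(placeholder citation text)",
    }
}

def resolve(pid, accept="text/html"):
    """Return (status, body) for a PID; JSON for machines, plain text for humans."""
    record = QUERY_STORE.get(pid)
    if record is None:
        return 404, "unknown PID"
    if accept == "application/json":
        # R12: machine-actionable view of the same metadata.
        return 200, json.dumps({"pid": pid, **record}, sort_keys=True)
    # R11: human-readable landing page (reduced to plain text here).
    lines = [f"{key}: {value}" for key, value in record.items()]
    return 200, "Landing page for " + pid + "\n" + "\n".join(lines)

status, body = resolve("hypothetical-pid:0001", accept="application/json")
print(status, body)
```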

D) Upon Modifications to the Data Infrastructure
R13 – Technology Migration: When data is migrated to a new representation (e.g. new database system, a new schema or a completely different technology), migrate also the queries and associated checksums
R14 – Migration Verification: Verify successful data and query migration, ensuring that queries can be re-executed correctly
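
R14 can be read as a regression test over the query store: re-execute every stored query on the new system and compare against the stored result checksum. The sketch below assumes SHA-256 checksums over sorted rows and a caller-supplied function `run_on_new_system`; both are hypothetical, and the checksum must be computed the same way before and after migration.

```python
import hashlib

def checksum(rows):
    """Order-independent checksum of a result set (assumed scheme)."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()

def verify_migration(stored_queries, run_on_new_system):
    """Return the PIDs whose re-executed result differs from the stored checksum."""
    failed = []
    for pid, rec in stored_queries.items():
        rows = run_on_new_system(rec["query"], rec["timestamp"])
        if checksum(rows) != rec["result_checksum"]:
            failed.append(pid)
    return failed

stored = {
    "pid-1": {"query": "q1", "timestamp": 2,
              "result_checksum": checksum([(1, 10.0), (2, 20.0)])},
    "pid-2": {"query": "q2", "timestamp": 5,
              "result_checksum": checksum([(3, 30.0)])},
}

def new_system(query, timestamp):
    # Stand-in for the migrated infrastructure; q2 returns a corrupted value.
    return {"q1": [(2, 20.0), (1, 10.0)], "q2": [(3, 30.5)]}[query]

failed = verify_migration(stored, new_system)
print(failed)  # PIDs that need attention after the migration
```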

Pilot Adopters
Several adopters
- Different types of data, different settings, …
- CSV & SQL reference implementation (SBA/TUW)
Pilots:
- Biomedical Big Data Sharing, Electronic Health Records (Center for Biomedical Informatics, Washington Univ. in St. Louis)
- Marine Research Data: Biological & Chemical Oceanography Data Management Office (BCO-DMO)
- Vermont Monitoring Cooperative: Forest Ecosystem Monitoring
- Argo Buoy Network, British Oceanographic Data Centre (BODC)
- Virtual Atomic and Molecular Data Centre (VAMDC)
- UK National River Flow Archive, Centre for Ecology and Hydrology

WG Data Citation Pilot
CBMI @ WUSTL
Cynthia Hudson Vitale, Leslie McIntosh, Snehil Gupta
Washington University in St. Louis

Biomedical Adoption Project Goals

RDA-MacArthur Grant Focus
[Diagram: Researchers, Data Broker]

R1 and R2 Implementation
PostgreSQL extension “temporal_tables”
[Diagram: RDC.table (columns c1, c2, c3, sys_period), triggers (steps 1, 2, 3), RDC.hist_table* (columns c1, c2, c3, sys_period)]
*stores history of data changes

Return on Investment (ROI) - Estimated
20 hours to complete 1 study
$150/hr (unsubsidized)
$3000 per study
115 research studies per year
14 replication studies

Adoption of Data Citation Outcomes by BCO-DMO
Cynthia Chandler, Adam Shepherd

An existing repository (http://bco-dmo.org/)
Marine research data curation since 2006
Faced with new challenges, but no new funding, e.g. data publication practices to support citation
Used the outcomes from the RDA Data Citation Working Group to improve data publication and citation services
A story of success enabled by RDA
https://www.rd-alliance.org/groups/data-citation-wg.html

Adoption of Data Citation Outputs
Evaluation
- Evaluate recommendations (done December 2015)
- Try implementation in existing BCO-DMO architecture (work began 4 April 2016)
Trial
- BCO-DMO: R1-11 fit well with current architecture; R12 doable; test as part of DataONE node membership; R13-14 are consistent with Linked Data approach to data publication and sharing
Timeline:
- Redesign/prototype completed by 1 June 2016
- New citation recommendation by 1 Sep 2016
- Report out at RDA P8 (Denver, CO) September 2016
- Final report by 1 December 2016

Vermont Monitoring Cooperative
James Duncan, Jennifer Pontius
VMC

Ecosystem Monitoring
Collaborator Network
Data Archive, Access and Integration

USER WORKFLOW TO DATE
Modify a dataset
- changes tracked
- original data table unchanged
Commit to version, assign name
- computes result hash (table pkid, col names, first col data) and query hash
- updates data table to new state
- formalizes version
Restore previous version
- creates new version table from current data table state, walks it back using stored SQL
- garbage collected after a period of time
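
The slide only names the inputs to the result hash (table pkid, column names, first-column data). A sketch of how such a hash might be computed follows; the choice of SHA-256, the separator bytes, and the function name are assumptions, not the VMC implementation.

```python
import hashlib

def version_result_hash(table_pkid, column_names, first_column_values):
    """Hash the table id, the column names, and the first column's data."""
    digest = hashlib.sha256()
    digest.update(str(table_pkid).encode("utf-8"))
    for name in column_names:
        # Separator bytes keep ("ab", "c") distinct from ("a", "bc").
        digest.update(b"\x1f" + name.encode("utf-8"))
    for value in first_column_values:
        digest.update(b"\x1e" + str(value).encode("utf-8"))
    return digest.hexdigest()

h1 = version_result_hash(42, ["plot", "year"], [1, 2, 3])
h2 = version_result_hash(42, ["plot", "year"], [1, 2, 3])
h3 = version_result_hash(42, ["plot", "year"], [1, 2, 4])
print(h1 == h2, h1 == h3)
```

Hashing only the first column keeps the computation cheap, at the cost of not detecting changes confined to other columns.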

Global Water Information Interest Group meeting
RDA 8th Plenary, 15th September 2016, Denver
UK National River Flow Archive
• ORACLE relational database
• Time series and metadata tables
• ~20M daily flow records, + monthly / daily catchment rainfall series
• Metadata (station history, owners, catchment soils / geology, etc.)
• Total size of ~5GB
• Time series tables automatically audited,
  • But reconstruction is complex
• Users generally download simple files
  • But public API is in development / R-NRFA package is out there
• Fortunately all access is via single codeset

Versioning / citation solution
• Automated archiving of entire database – version controlled scripts defining tables, creating / populating archived tables (largely complete)
• Fits in with data workflow – public / dev versions – this only works because we have irregular / occasional updates
• Simplification of the data model (complete)
• API development (being undertaken independently of dynamic citation requirements):
  • allows subsetting of dataset in a number of ways – initially simply
  • need to implement versioning (started) to ensure it will cope with changes to data structures
• Fit to dynamic data citation recommendations?
  • Largely
  • Need to address mechanism for users to request / create citable version of a query
• Resource required: estimated ~2 person months

Reference Implementation for CSV Data (and SQL)
Stefan Pröll, SBA

CSV/SQL Reference Implementation 1
Reference implementation available on GitHub: https://github.com/datascience/RDA-WGDC-CSV-Data-Citation-Prototype
Upload interface: upload CSV files
Migrate CSV file into RDBMS
- Generate table structure, identify primary key
- Add metadata columns for versioning (transparent)
- Add indices (transparent)
Dynamic data: upload new version of file
- Versioned update / delete existing records in RDBMS
Access interface
- Track subset creation
- Store queries -> PID + landing page
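
The migration step can be illustrated as follows. This is not the code of the linked prototype: it substitutes SQLite for the RDBMS, takes the primary key as a parameter instead of detecting it, and uses hypothetical metadata columns `inserted_at` / `deleted_at` with integer timestamps. Uploading a new version of the file closes changed or missing records rather than overwriting them.

```python
import csv
import io
import sqlite3

def upload(conn, table, csv_text, key, ts):
    """Load one CSV version; superseded rows are kept and marked deleted at ts."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    cols = list(rows[0].keys())  # assumes a non-empty file with the same header each time
    quoted = ['"' + c + '"' for c in cols]
    col_list = ", ".join(quoted)
    col_defs = ", ".join(q + " TEXT" for q in quoted)
    # Table structure plus two metadata columns for versioning.
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" '
        f"({col_defs}, inserted_at INTEGER, deleted_at INTEGER)"
    )
    conn.execute(f'CREATE INDEX IF NOT EXISTS "idx_{table}" ON "{table}" ("{key}")')
    close_sql = f'UPDATE "{table}" SET deleted_at = ? WHERE "{key}" = ? AND deleted_at IS NULL'
    marks = ", ".join("?" for _ in cols)
    insert_sql = f'INSERT INTO "{table}" ({col_list}, inserted_at) VALUES ({marks}, ?)'
    key_idx = cols.index(key)
    live = {r[key_idx]: r for r in conn.execute(
        f'SELECT {col_list} FROM "{table}" WHERE deleted_at IS NULL')}
    seen = set()
    for row in rows:
        values = tuple(row[c] for c in cols)
        seen.add(row[key])
        if live.get(row[key]) == values:
            continue                                 # unchanged record
        if row[key] in live:
            conn.execute(close_sql, (ts, row[key]))  # changed: close the old version
        conn.execute(insert_sql, values + (ts,))
    for gone in set(live) - seen:
        conn.execute(close_sql, (ts, gone))          # missing from the new file

def snapshot(conn, table, ts):
    """Rows as they existed at ts, without the two metadata columns."""
    sql = (f'SELECT * FROM "{table}" WHERE inserted_at <= ? '
           "AND (deleted_at IS NULL OR deleted_at > ?) ORDER BY 1")
    return [row[:-2] for row in conn.execute(sql, (ts, ts))]

conn = sqlite3.connect(":memory:")
upload(conn, "obs", "id,temp\n1,10.0\n2,20.0\n", key="id", ts=1)
upload(conn, "obs", "id,temp\n1,11.5\n3,30.0\n", key="id", ts=2)
first_version = snapshot(conn, "obs", ts=1)
second_version = snapshot(conn, "obs", ts=2)
print(first_version, second_version)
```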

CSV/Git Reference Implementation 2
Based on Git only (no SQL database)
Upload CSV files to Git repository
SQL-style queries operate on CSV file via API
Data versioning with Git
Store scripts versioned as well
Make subset creation reproducible

CSV/Git Reference Implementation 2
Stefan Pröll, Christoph Meixner, Andreas Rauber: Precise Data Identification Services for Long Tail Research Data. Proceedings of the Intl. Conference on Preservation of Digital Objects (iPRES2016), Oct. 3-6 2016, Bern, Switzerland.
Source at GitHub: https://github.com/datascience/RDA-WGDC-CSV-Data-Citation-Prototype
Videos:
- Login: https://youtu.be/EnraIwbQfM0
- Upload: https://youtu.be/xJruifX9E2U
- Subset: https://www.youtube.com/watch?v=it4sC5vYiZQ
- Resolver: https://youtu.be/FHsvjsUMiiY
- Update: https://youtu.be/cMZ0xoZHUyI

Benefits
Retrieval of precise subset with low storage overhead
Subset as cited or as it is now (including e.g. corrections)
Query provides provenance information
Query store supports analysis of data usage
Checksums support verification
Same principles applicable across all settings
- Small and large data
- Static and dynamic data
- Different data representations (RDBMS, CSV, XML, LOD, …)
Would work also for more sophisticated/general transformations on data beyond select/project

Thank you!
https://rd-alliance.org/working-groups/data-citation-wg.html

Reference Implementation for CSV Data (and SQL)
Stefan Pröll, SBA

Large Scale Research Settings
RDA recommendations implemented in data infrastructures
Required adaptations
- Introduce versioning, if not already in place
- Capture sub-setting process (queries)
- Implement dedicated query store to store queries
- A bit of additional functionality (query re-writing, hash functions, …)
Done!?
- “Big data”, database driven
- Well-defined interfaces
- Trained experts available
- “Complex, only for professional research infrastructures”?

Long Tail Research Data
[Figure: long-tail curve with axes “Amount of data sets” and “Data set size”. Head: big data, well organized, often used and cited. Tail: less well organized, non-standardised, no dedicated infrastructure, “dark data”.]
[1] Heidorn, P. Bryan. "Shedding light on the dark data in the long tail of science." Library Trends 57.2 (2008): 280-299.

Prototype Implementations
Solution for small-scale data
- CSV files, no “expensive” infrastructure, low overhead
2 reference implementations:
Git-based prototypes: widely used versioning system
- A) Using separate folders
- B) Using branches
MySQL-based prototype:
- C) Migrates CSV data into relational database
Data backend responsible for versioning data sets
Subsets are created with scripts or queries via API or web interface
Transparent to user: always CSV

CSV Reference Implementation 2
Git Implementation 1
Upload CSV files to Git repository (versioning)
Subsets created via scripting language (e.g. R)
- Select rows/columns, sort, returns CSV + metadata file
- Metadata file with script parameters stored in Git (scripts stored in Git as well)
PID assigned to metadata file
Use Git to retrieve proper data set version and re-execute script on retrieved file
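
The retrieval step can be sketched with the standard `git show <commit>:<path>` command, which prints a file as it existed at a given commit. The prototype uses scripts in a language such as R; the Python stand-in below, the metadata field names, and the helper names are assumptions for illustration. Running `re_execute` requires Git and an existing repository.

```python
import csv
import io
import subprocess

def git_show_args(commit, path):
    """Arguments for `git show <commit>:<path>`."""
    return ["git", "show", f"{commit}:{path}"]

def read_csv_at_commit(repo_dir, commit, path):
    # Retrieve the data set version that was current when the subset was created.
    result = subprocess.run(
        git_show_args(commit, path),
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return list(csv.DictReader(io.StringIO(result.stdout)))

def make_subset(rows, columns, sort_by):
    """Stand-in for the subsetting script: project columns, sort for a stable order."""
    ordered = sorted(rows, key=lambda r: r[sort_by])
    return [{c: r[c] for c in columns} for r in ordered]

def re_execute(repo_dir, metadata):
    """Re-create a cited subset from the stored metadata record (hypothetical fields)."""
    rows = read_csv_at_commit(repo_dir, metadata["commit"], metadata["path"])
    return make_subset(rows, metadata["columns"], metadata["sort_by"])

# The metadata file would hold something like this; the values are placeholders.
metadata = {"commit": "abc123", "path": "data/obs.csv",
            "columns": ["id", "temp"], "sort_by": "id"}
demo_rows = [{"id": "2", "temp": "20.0", "site": "B"},
             {"id": "1", "temp": "10.0", "site": "A"}]
demo_subset = make_subset(demo_rows, metadata["columns"], metadata["sort_by"])
print(git_show_args(metadata["commit"], metadata["path"]), demo_subset)
```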

Git-based Prototype
Git Implementation 2
Addresses issues
- common commit history, branching data
Using Git branching model: orphaned branches for queries and data
- Keeps commit history clean
- Allows merging of data files
Web interface for queries (CSV2JDBC)
Use commit hash for identification
- Assigned PID hashed with SHA1
- Use hash of PID as filename (ensure permissible characters)
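
Hashing the PID yields a name made only of hexadecimal characters, which is safe as a filename even when the PID itself contains slashes or colons. A small sketch; the example PID and the `.json` suffix are hypothetical.

```python
import hashlib

def pid_to_filename(pid, suffix=".json"):
    """Derive a filesystem-safe filename from a PID via its SHA-1 hex digest."""
    return hashlib.sha1(pid.encode("utf-8")).hexdigest() + suffix

example_pid = "hypothetical:10.1234/abc?x=1"  # placeholder, not a registered PID
filename = pid_to_filename(example_pid)
print(filename)
```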

Git-based Prototype
Step 1: Select a CSV file in the repository
Step 2: Create a subset with a SQL query (on CSV data)
Step 3: Store the query script and metadata
Step 4: Re-execute!
Prototype: https://github.com/Mercynary/recitable

MySQL-based Prototype
Data upload
- User uploads a CSV file into the system
Data migration from CSV file into RDBMS
- Generate table structure
- Add metadata columns (versioning)
- Add indices (performance)
Dynamic data
- Insert, update and delete records
- Events are recorded with a timestamp
Subset creation
- User selects columns, filters and sorts records in web interface
- System traces the selection process
- Exports CSV
![Page 51: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/51.jpg)
MySQL-based Prototype
Source at Github:
- https://github.com/datascience/RDA-WGDC-CSV-Data-Citation-Prototype
Videos:
- Login: https://youtu.be/EnraIwbQfM0
- Upload: https://youtu.be/xJruifX9E2U
- Subset: https://www.youtube.com/watch?v=it4sC5vYiZQ
- Resolver: https://youtu.be/FHsvjsUMiiY
- Update: https://youtu.be/cMZ0xoZHUyI
![Page 52: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/52.jpg)
CSV Reference Implementation 2
Stefan Pröll, Christoph Meixner, Andreas Rauber. Precise Data Identification Services for Long Tail Research Data. Proceedings of the intl. Conference on Preservation of Digital Objects (iPRES2016), Oct. 3-6 2016, Bern, Switzerland.
![Page 53: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/53.jpg)
WG Data Citation Pilot
CBMI @ WUSTL
Cynthia Hudson Vitale, Leslie McIntosh, Snehil Gupta
Washington University in St. Louis
![Page 54: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/54.jpg)
Moving Biomedical Big Data Sharing Forward
An adoption of the RDA Data Citation of Evolving Data Recommendation to Electronic Health Records
Leslie McIntosh, PHD, MPH (@mcintold)
Cynthia Hudson Vitale, MA (@cynhudson)
RDA P8, Denver, USA, September 2016
![Page 55: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/55.jpg)
Background
![Page 56: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/56.jpg)
Leslie McIntosh, PHD, MPH: Director, Center for Biomedical Informatics
Cynthia Hudson Vitale, MA: Data Services Coordinator
![Page 57: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/57.jpg)
BDaaS: Biomedical Data as a Service
[Diagram: Researchers, Data Broker, i2b2 Application, Biomedical Data Repository]
![Page 58: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/58.jpg)
Move some of the responsibility for reproducibility
[Diagram: Biomedical Researcher, Biomedical Pipeline]
![Page 59: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/59.jpg)
RDA/MacArthur Grant
![Page 60: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/60.jpg)
Biomedical Adoption Project Goals
- Implement RDA Data Citation WG recommendation to local Washington U i2b2
- Engage other i2b2 community adoptees
- Contribute source code back to i2b2 community
![Page 61: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/61.jpg)
RDA Data Citation WG Recommendations
- R1: Data Versioning
- R2: Data Timestamping
- R3, R9: Query Store
- R7: Query Timestamping
- R8: Query PID
- R10: Query Citation
![Page 62: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/62.jpg)
Internal Implementation Requirements
- Scalable
- Available for PostgreSQL
- Actively supported
- Easy to maintain
- Easy for data brokers to use
![Page 63: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/63.jpg)
RDA-MacArthur Grant Focus
[Diagram: Researchers, Data Broker]
![Page 64: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/64.jpg)
R1 and R2 Implementation
PostgreSQL Extension “temporal_tables”
[Diagram: (1) RDC.table with columns c1, c2, c3, sys_period; (2) RDC.hist_table* with columns c1, c2, c3, sys_period; (3) triggers]
*stores history of data changes
![Page 65: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/65.jpg)
ETL Incrementals
Source Data
- Update? New data goes to RDC.table with sys_period (2016-9-9 00:00, NULL); old data goes to RDC.hist_table with sys_period (2016-9-8 00:00, 2016-9-9 00:00)
- Insert? New data goes to RDC.table with sys_period (2016-9-9 00:00, NULL)
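The update and insert paths above can be mimicked in a few lines of plain Python. This is only a sketch of the behaviour the slide describes, not the API of the temporal_tables extension: the dictionaries standing in for RDC.table and RDC.hist_table, and the function names `upsert` and `as_of`, are assumptions.

```python
from datetime import datetime

current = {}   # stands in for RDC.table: key -> (data, (start, None))
history = []   # stands in for RDC.hist_table: (key, data, (start, end))


def upsert(key, data, now):
    """Insert a new row, or update an existing one and archive the old version."""
    if key in current:
        old_data, (start, _) = current[key]
        # The old version gets a closed sys_period and moves to the history.
        history.append((key, old_data, (start, now)))
    # The new version has an open-ended sys_period (end is None, i.e. NULL).
    current[key] = (data, (now, None))


def as_of(key, when):
    """Return the data for key as it was valid at time 'when', or None."""
    versions = list(history)
    if key in current:
        data, period = current[key]
        versions.append((key, data, period))
    for k, data, (start, end) in versions:
        if k == key and start <= when and (end is None or when < end):
            return data
    return None


upsert("p1", {"c1": 1}, datetime(2016, 9, 8))
upsert("p1", {"c1": 2}, datetime(2016, 9, 9))
print(as_of("p1", datetime(2016, 9, 8, 12)), as_of("p1", datetime(2016, 9, 10)))
```

In the PostgreSQL setup the archiving step is performed by triggers, so the ETL process itself only issues ordinary inserts and updates.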
![Page 66: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/66.jpg)
R3, R7, R8, R9, and R10 Implementation
PostgreSQL Extension “temporal_tables”
[Diagram: (1) RDC.table and RDC.hist_table; (2) RDC.table_with_history (view); (3) functions, triggers, query audit tables]
![Page 67: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/67.jpg)
Data Reproducibility Workflow
![Page 68: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/68.jpg)
Bonus Feature: Determine if Change Occurred
![Page 69: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/69.jpg)
Future Developments
Develop a process for sharing Query PID with researchers in an automated way
Resolve Query PIDs to a landing page with Query metadata
Implement research reproducibility requirements in other systems as possible
![Page 70: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/70.jpg)
Outcomes and Support
![Page 71: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/71.jpg)
Obtained Outcomes
- Implemented WG recommendations
- Engaged with other i2b2 adoptees (Harvard, Nationwide Children’s Hospital)
![Page 72: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/72.jpg)
Dissemination
- Poster presentation (Harvard U, July 2016)
- Scientific manuscript based on our proof of concept to AMIA TBI/CRI 2017 conference
- Sharing the code with the community
![Page 73: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/73.jpg)
Return on Investment (ROI) - Estimated
- 20 hours to complete 1 study
- $150/hr (unsubsidized)
- $3000 per study
- 115 research studies per year
- 14 replication studies
![Page 74: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/74.jpg)
Funding Support
- MacArthur Foundation 2016 Adoption Seeds program, through a sub-contract with Research Data Alliance
- Washington University Institute of Clinical and Translational Sciences, NIH CTSA Grant Number UL1TR000448 and UL1TR000448-09S1
- Siteman Cancer Center at Washington University, NIH/NCI Grant P30 CA091842-14
![Page 75: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/75.jpg)
Center for Biomedical Informatics @WUSTL
Teams for Reproducible Research
![Page 76: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/76.jpg)
WashU CBMI Research Reproducibility Resources
![Page 77: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/77.jpg)
Adoption of Data Citation Outcomes by BCO-DMO
Cynthia Chandler, Adam Shepherd
![Page 78: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/78.jpg)
An existing repository (http://bco-dmo.org/)
- Marine research data curation since 2006
- Faced with new challenges, but no new funding
- e.g. data publication practices to support citation
- Used the outcomes from the RDA Data Citation Working Group to improve data publication and citation services
A story of success enabled by RDA
https://www.rd-alliance.org/groups/data-citation-wg.html
![Page 79: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/79.jpg)
BCO-DMO Curated Data
BCO-DMO is a thematic, domain-specific repository funded by NSF Ocean Sciences and Polar Programs
BCO-DMO curated data are
- Served: http://bco-dmo.org (URLs, URIs)
- Published: at an Institutional Repository (CrossRef DOI) http://dx.doi.org/10.1575/1912/4847
- Archived: at NCEI, a US National Data Center http://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.nodc:0078575
for Linked Data URI: http://lod.bco-dmo.org/id/dataset/3046
![Page 80: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/80.jpg)
BCO-DMO Dataset Landing Page (Mar ‘16)
![Page 81: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/81.jpg)
Initial Architecture Design Considerations (Jan 2016)
![Page 82: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/82.jpg)
Modified Architecture (March 2016)
![Page 83: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/83.jpg)
BCO-DMO Data Publication System Components
BCO-DMO publishes data to WHOAS and a DOI is assigned. The BCO-DMO architecture now supports data versioning.
![Page 84: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/84.jpg)
BCO-DMO Data Citation System Components
![Page 85: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/85.jpg)
BCO-DMO Data Set Landing Page
![Page 86: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/86.jpg)
![Page 87: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/87.jpg)
BCO-DMO Data Set Landing Page
![Page 88: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/88.jpg)
Linked to Publication via DOI
![Page 89: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/89.jpg)
New Capabilities … BCO-DMO becoming a DataONE Member Node
https://search.dataone.org/
![Page 90: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/90.jpg)
New Capabilities … BCO-DMO Data Set Citation
![Page 91: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/91.jpg)
Thank you …
- To the Data Citation Working Group for their efforts https://www.rd-alliance.org/groups/data-citation-wg.html
- RDA US for funding this adoption project
TIMELINE:
- Redesign/prototype completed by 1 June 2016
- New citation recommendation by 1 Sep 2016
- Report out at RDA P8 (Denver, CO) September 2016
- Final report by 1 December 2016
Cyndy Chandler @cynDC42 @bcodmo ORCID: 0000-0003-2129-1647 [email protected]
![Page 92: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/92.jpg)
Removed these to reduce talk to 10-15 minutes
EXTRA SLIDES
![Page 93: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/93.jpg)
Adoption of Data Citation Outputs
Evaluation
- Evaluate recommendations (done December 2015)
- Try implementation in existing BCO-DMO architecture (work began 4 April 2016)
Trial
- BCO-DMO: R1-11 fit well with current architecture; R12 doable; test as part of DataONE node membership; R13-14 are consistent with Linked Data approach to data publication and sharing
NOTE: adoption grant received from RDA US (April 2016)
![Page 94: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/94.jpg)
RDA Data Citation (DC) of evolving data
DC goals: to create identification mechanisms that:
- allow us to identify and cite arbitrary views of data, from a single record to an entire data set in a precise, machine-actionable manner
- allow us to cite and retrieve that data as it existed at a certain point in time, whether the database is static or highly dynamic
DC outcomes: 14 recommendations and associated documentation
- ensuring that data are stored in a versioned and timestamped manner
- identifying data sets by storing and assigning persistent identifiers (PIDs) to timestamped queries that can be re-executed against the timestamped data store
https://www.rd-alliance.org/groups/data-citation-wg.html
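The second outcome, assigning PIDs to timestamped queries, can be sketched as a small in-memory query store. This is an illustration under assumptions, not a reference implementation: the class and field names, the whitespace-only query normalisation, the SHA-256 hashes and the `pid:`-prefixed UUIDs are all hypothetical choices.

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass
class StoredQuery:
    pid: str          # hypothetical PID format, not a real DOI or ARK
    query: str        # normalised query text
    executed_at: str  # timestamp of the first execution
    result_hash: str  # fingerprint of the result set


class QueryStore:
    def __init__(self):
        self._by_key = {}

    @staticmethod
    def normalise(query: str) -> str:
        # Naive normalisation: collapse whitespace only.
        return " ".join(query.split())

    def register(self, query: str, executed_at: str, result_rows) -> StoredQuery:
        """Return the stored entry for this query and result, creating it if new."""
        normalised = self.normalise(query)
        result_hash = hashlib.sha256(repr(sorted(result_rows)).encode("utf-8")).hexdigest()
        key = (hashlib.sha256(normalised.encode("utf-8")).hexdigest(), result_hash)
        if key not in self._by_key:
            # A new PID is only assigned when the query or its result differs.
            self._by_key[key] = StoredQuery(
                f"pid:{uuid.uuid4()}", normalised, executed_at, result_hash
            )
        return self._by_key[key]


store = QueryStore()
first = store.register("SELECT *  FROM obs", "2016-09-16T10:00:00Z", [(1, "a")])
again = store.register("SELECT * FROM obs", "2016-09-17T10:00:00Z", [(1, "a")])
changed = store.register("SELECT * FROM obs", "2016-09-18T10:00:00Z", [(1, "a"), (2, "b")])
print(first.pid == again.pid, first.pid == changed.pid)
```

Re-running an identical query on unchanged data returns the existing PID and its original timestamp, while the same query on changed data receives a new PID.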
![Page 95: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/95.jpg)
RDA Data Citation WG Recommendations
- Data Versioning: For retrieving earlier states of datasets the data need to be versioned. Markers shall indicate inserts, updates and deletes of data in the database.
- Data Timestamping: Ensure that operations on data are timestamped, i.e. any additions and deletions are marked with a timestamp.
- Data Identification: The data used shall be identified via a PID pointing to a time-stamped query, resolving to a landing page.
Oct 2015 version w/ 14 recommendations
DC WG chairs: Andreas Rauber, Ari Asmi, Dieter van Uytvanck
![Page 96: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/96.jpg)
New capability (implemented)
Procedure: when a BCO-DMO data set is updated …
- A copy of the previous version is preserved
- Request a DOI for the new version of data
- Publish data, and create new landing page for new version of data, with new DOI assigned
- BCO-DMO database has links to all versions of the data (archived and published)
- Both archive and published dataset landing pages have links back to best version of full dataset at BCO-DMO
- BCO-DMO data set landing page displays links to all archived and published versions
![Page 97: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/97.jpg)
REFERENCES
- Extended description of recommendations
- Altman and Crosas. 2013. “Evolution of Data Citation …”
- CODATA-ICSTI 2013. “Out of cite, out of mind”
- FORCE11 https://www.force11.org/about/mission-and-guiding-principles
- R. E. Duerr, et al. “On the utility of identification schemes for digital earth science data”, ESI, 2011.
![Page 98: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/98.jpg)
Vermont Monitoring Cooperative
James Duncan, Jennifer Pontius
VMC
![Page 99: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/99.jpg)
Implementation of Dynamic Data Citation
James Duncan and Jennifer Pontius
9/16/2016
[email protected], www.uvm.edu/vmc
![Page 100: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/100.jpg)
- Ecosystem Monitoring
- Collaborator Network
- Data Archive, Access and Integration
![Page 101: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/101.jpg)
Many Disciplines, Many Contributors
VMC houses any data related to forest ecosystem condition, regardless of affiliation or discipline
![Page 102: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/102.jpg)
Why We Need It
Continually evolving datasets
Some errors not caught till next field season
Frequent reporting and publishing
![Page 103: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/103.jpg)
Dynamic Data Citation – Features Needed
- Light footprint on database resources
- Works on top of existing catalog and metadata
- Works in an institutionally managed PHP/MySQL environment
- User-driven control of what quantity of change constitutes a version
- Integration with management portal
- Track granular changes in data
![Page 104: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/104.jpg)
User Workflow to Date
Modify a dataset
- changes tracked
- original data table unchanged
Commit to version, assign name
- computes result hash (table pkid, col names, first col data) and query hash
- updates data table to new state
- formalizes version
Restore previous version
- creates new version table from current data table state
- walks it back using stored SQL
- garbage collected after a period of time
![Page 105: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/105.jpg)
User Workflow, still to come
- Web-based data editing validation
- DOI minting integration
- Public display
- Subsetting workflow
- Other methods of data modification?
- Upgrade to rest of system
[Diagram: versions V1.1, V1.2, V1.3, Query 1 and an Unsaved state, labeled DOI:, DOI:, DOI?]
![Page 106: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/106.jpg)
Technical Details
Version Info Table
Columns: Version PID, Dataset ID, Version Name, Version ID, Person ID, Query Hash, Result Hash, Timestamp
- 23456, 3525, Version 1.5, 3, …., …., ….
- 23574, 3525, Unsaved, -1, …., NULL, ….
Step Tracking Table (Child of Version Info)
Columns: Step ID, Version PID, Forward, Backward, Order
- 983245, 23574, DELETE FROM…, INSERT INTO…, 1
- 983245, 23574, UPDATE SET site=“Winhall”…, UPDATE SET site=“Lye Brook”…, 2
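The forward and backward statements in the step tracking table suggest how a previous version can be restored. The following sketch uses SQLite rather than MySQL; the table `plots`, the site name "Mansfield" and all function names are hypothetical, and, as a simplification, forward steps are applied to the data table immediately rather than at commit time.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE plots (pkid INTEGER PRIMARY KEY, site TEXT)")
con.executemany("INSERT INTO plots VALUES (?, ?)", [(1, "Lye Brook"), (2, "Mansfield")])

# Stand-in for the step tracking table: (order, forward SQL, backward SQL).
# The SQL is stored as templates so it can be replayed against another table.
steps = []


def apply_step(forward, backward):
    """Run a modification and record how to redo and how to undo it."""
    con.execute(forward.format(table="plots"))
    steps.append((len(steps) + 1, forward, backward))


apply_step(
    "DELETE FROM {table} WHERE pkid = 2",
    "INSERT INTO {table} VALUES (2, 'Mansfield')",
)
apply_step(
    "UPDATE {table} SET site = 'Winhall' WHERE pkid = 1",
    "UPDATE {table} SET site = 'Lye Brook' WHERE pkid = 1",
)


def restore_previous(copy="plots_restored"):
    """Copy the current state into a new table and walk it back step by step."""
    con.execute(f"CREATE TABLE {copy} AS SELECT * FROM plots")
    for _, _, backward in sorted(steps, reverse=True):
        con.execute(backward.format(table=copy))
    return con.execute(f"SELECT pkid, site FROM {copy} ORDER BY pkid").fetchall()


restored = restore_previous()
current_rows = con.execute("SELECT pkid, site FROM plots ORDER BY pkid").fetchall()
print(current_rows, restored)
```

The backward statements are replayed in reverse order, which is why each step needs the order column; the restored copy can then be dropped again after some time.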
![Page 107: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/107.jpg)
Implementation Challenges and Questions
Challenges
- Large updates
- Re-creation of past versions, in terms of garbage collection and storage
Questions
- Query uniqueness checking and query normalization
- Efficient but effective results hashing strategies
- Linear progression of data, versus branching network
![Page 108: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/108.jpg)
Acknowledgments
Adoption seed funding - MacArthur Foundation and the Research Data Alliance
The US Forest Service State and Private Forestry program for core operational funding of the VMC
Fran Berman, Yolanda Meleco and the other adopters who have been sharing their experiences.
All the VMC cooperators that contribute
![Page 109: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/109.jpg)
Thank you!
![Page 110: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/110.jpg)
ARGO
Justin Buck, Helen Glaves
BODC
![Page 111: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/111.jpg)
WG Data Citation: Adoption meeting
Argo DOI pilot
Justin Buck, National Oceanography Centre (UK), [email protected]
Thierry Carval, Ifremer (France), [email protected]
Loubrieu, Ifremer (France), [email protected]
Merceur, Ifremer (France), [email protected]
RDA P8 2016, Denver, 16th September 2016
![Page 112: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/112.jpg)
300+ citations per year
How to cite Argo data at a given point in time?
Possible with a single DOI?
![Page 113: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/113.jpg)
Argo data system (simplified)
- Raw data
- Data Assembly Centres (DAC) (national level, 10 of, RT processing and QC)
- GTS (+GTSPP archive)
- Global Data Assembly Centres (GDAC) (2 of, mirrored, ftp access, repository of NetCDF files, no version control, latest version served)
- Data users
- Delayed mode quality control
![Page 114: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/114.jpg)
Key associated milestones
2014 – Introduction of dynamic data into the DataCite metadata schema
- https://schema.datacite.org/
- Snapshots were used as an interim solution for Argo
2015 – RDA recommendations on evolving data:
- https://rd-alliance.org/system/files/RDA-DC-Recommendations_151020.pdf
- Legacy GDAC architecture does not permit full implementation
![Page 115: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/115.jpg)
How to apply a DOI for Argo
[Figure: repeated plots of observations (+) with axes Obs time and State time; caption: Archive (collection of snapshots/granules)]
To cite a particular snapshot one can potentially cite a time slice of an archive i.e. the snapshot at a given point in time.
![Page 116: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/116.jpg)
New single DOI
Argo (2000). Argo float data and metadata from Global Data Assembly Centre (Argo GDAC). SEANOE. http://doi.org/10.17882/42182
![Page 117: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/117.jpg)
# key used to identify snapshot
![Page 118: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/118.jpg)
# key used to identify snapshot
http://www.seanoe.org/data/00311/42182/#45420
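How a snapshot key combines with the single DOI can be shown with a short sketch. It assumes the key is simply appended to the DOI URL as a fragment, in the same way it appears in the SEANOE landing-page URL above; the function name `snapshot_reference` is hypothetical.

```python
from urllib.parse import urldefrag


def snapshot_reference(doi_url: str, snapshot_key: str) -> str:
    """Append the snapshot key to the DOI URL as a '#' fragment."""
    return f"{doi_url}#{snapshot_key}"


ref = snapshot_reference("http://doi.org/10.17882/42182", "45420")

# The fragment can be split off again, leaving the plain DOI URL.
base, key = urldefrag(ref)
print(ref, base, key)
```

A URL fragment is not sent to the server in an HTTP request, so the DOI itself still resolves as usual and the key is interpreted on the landing page; this may be one reason why '#' fits the existing DOI resolution architecture better than '?'.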
![Page 119: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/119.jpg)
Step towards RDA recommendation
Archive snapshots enable R1 and R2 at monthly granularity
- R1 – Data Versioning
- R2 – Timestamping
Argo Pilot effectively uses predetermined referencing of snapshots, removing the need for requirements R3 to R7. # keys are PIDs for the snapshots and have associated citation texts.
- R3 – Query Store Facilities
- R4 – Query Uniqueness
- R5 – Stable Sorting
- R6 – Result Set Verification
- R7 – Query Timestamping
- R8 – Query PID
- R9 – Store Query
- R10 – Automated Citation Texts
SEANOE landing page architecture means R11 and R12 effectively met
- R11 – Landing Page
- R12 – Machine Actionability
Final two requirements untested at this stage
- R13 – Technology Migration
- R14 – Migration Verification
![Page 120: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/120.jpg)
Summary
There is now a single DOI for Argo
- Takes account of legacy GDAC architecture
- Monthly temporal granularity
- Enables reproducible research and simplifies the tracking of citations
- ‘#’ rather than ‘?’ in identifier takes account of current DOI resolution architecture
Extensible to other observing systems such as OceanSITES and EGO
The concept allows for different subsets of Argo data, e.g. ocean basins, Bio-Argo data
![Page 122: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/122.jpg)
From RDA Data Citation Recommendations to new paradigms for citing data from VAMDC
C.M. Zwölf and VAMDC consortium
RDA 8th Plenary - Denver
![Page 123: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/123.jpg)
The Virtual Atomic and Molecular Data Centre
VAMDC: Single and unique access to heterogeneous A+M Databases
- Federates 29 heterogeneous databases http://portal.vamdc.org/
- The “V” of VAMDC stands for Virtual in the sense that the e-infrastructure does not contain data. The infrastructure is a wrapping for exposing in a unified way a set of heterogeneous databases.
- The consortium is politically organized around a Memorandum of Understanding (15 international members have signed the MoU, 1 November 2014)
- High quality scientific data come from different Physical/Chemical Communities
- Provides data producers with a large dissemination platform
- Removes bottleneck between data producers and wide body of users
![Page 124: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/124.jpg)
Existing independent A+M database
The VAMDC infrastructure technical architecture
![Page 125: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/125.jpg)
The VAMDC infrastructure technical architecture
Existing independent A+M database
VAMDC wrapping layer (VAMDC Node)
Standard Layer
• Accept queries submitted in standard grammar
• Results formatted into standard XML file (XSAMS)
Registry Layer
• Resource registered into the Registry Layer
![Page 126: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/126.jpg)
The VAMDC infrastructure technical architecture
Existing independent A+M database
VAMDC wrapping layer (VAMDC Node)
Standard Layer
• Accept queries submitted in standard grammar
• Results formatted into standard XML file (XSAMS)
Registry Layer
• Resource registered into the Registry Layer
• VAMDC clients ask it for available resources
VAMDC Clients (dispatch query on all the registered resources)
• Portal
• SpecView
• SpectCol
Unique A+M query
Set of standard files
![Page 127: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/127.jpg)
• VAMDC is agnostic about the local data storage strategy on each node.
• Each node implements the access/query/result protocols.
• There is no central management system.
• Decisions about technical evolutions are made by consensus in the Consortium.
It is both technically and politically challenging to implement the WG recommendations.
![Page 128: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/128.jpg)
Let us implement the recommendation!!
Tagging and versioning data
The problem is more anthropological than technical…
What does data citation really mean?
![Page 129: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/129.jpg)
![Page 130: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/130.jpg)
Let us implement the recommendation!!
Tagging and versioning data
• We see technically how to do that.
• Ok, but what is the data granularity for tagging?
• Naturally it is the dataset (A+M data have no meaning outside this given context).
• But each data provider defines differently what a dataset is.
![Page 131: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/131.jpg)
What does data citation really mean?
• Everyone knows what it is!
• Yes, but everyone has their own definition:
• RDA: cite database records or output files (an extracted data file may have an H-factor).
• VAMDC: cite all the papers used for compiling the content of a given output file.
![Page 132: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/132.jpg)
Let us implement the recommendation!!
Tagging versions of data
Implementation will be an overlay to the standard / output layer, thus independent from any specific data-node
Two-layer mechanism:
1 Fine-grained granularity: evolution of the XSAMS output standard for tracking data modifications
2 Coarse-grained granularity: at each data modification to a given data node, the version of the Data-Node changes
With the second mechanism we know that something changed: in other words, we know that the result of an identical query may differ from one version to the other. The detail of which data changed is accessible using the first mechanism.
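A minimal sketch of how the two levels can work together, assuming a toy in-memory node: every modification bumps a node-wide version (coarse), and each record remembers the version that last changed it (fine). The class and method names are hypothetical and are not part of the VAMDC software.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Toy data node: coarse node version plus fine-grained per-record tags."""
    version: int = 0
    records: dict = field(default_factory=dict)  # key -> (value, version tag)

    def modify(self, key, value):
        # Coarse-grained: any modification bumps the node version.
        self.version += 1
        # Fine-grained: the record remembers the version that last changed it.
        self.records[key] = (value, self.version)

    def query(self, keys):
        return {k: self.records[k] for k in keys if k in self.records}

    def changed_since(self, version):
        return sorted(k for k, (_, tag) in self.records.items() if tag > version)


node = DataNode()
node.modify("state-1", 1.0)
node.modify("state-2", 2.0)
cited_version = node.version                 # version recorded with a citation
before = node.query(["state-1", "state-2"])
node.modify("state-2", 2.5)                  # a later correction
after = node.query(["state-1", "state-2"])
```

Comparing `node.version` with `cited_version` tells us that an identical query may now return something different; `node.changed_since(cited_version)` tells us which records actually changed.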
![Page 133: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/133.jpg)
Let us implement the recommendation!!
Query Store
• Is built over the versioning of data
• Is plugged over the existing VAMDC data-extraction mechanisms
• Due to the distributed VAMDC architecture, the Query Store architecture is similar to a log-service
![Page 134: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/134.jpg)
Data-Versioning: overview of the fine grained mechanisms
We adopted a change of paradigm (weak structuration):
Output XSAMS file
• Radiative process 1, Radiative process 2, …, Radiative process n
• Collisional process 1, Collisional process 2, …, Collisional process m
• Energy State 1, Energy State 2, …, Energy State p
• Element 1, Element 2, …, Element q
![Page 135: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/135.jpg)
Data-Versioning: overview of the fine grained mechanisms
We adopted a change of paradigm (weak structuration):
Output XSAMS file
• Radiative process 1, Radiative process 2, …, Radiative process n
• Collisional process 1, Collisional process 2, …, Collisional process m
• Energy State 1, Energy State 2, …, Energy State p
• Element 1, Element 2, …, Element q
Version tags shown on the diagram:
• Version 1 (tagged according to infrastructure state & updates)
• Version 2 (tagged according to infrastructure state & updates)
![Page 136: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/136.jpg)
![Page 137: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/137.jpg)
Data-Versioning: overview of the fine grained mechanisms
We adopted a change of paradigm (weak structuration):
This approach has several advantages:
• It solves the data tagging granularity problem
• It is independent from what is considered a dataset
• The new files are compliant with old libraries & processing programs
• We add a new feature, an overlay to the existing structure
• We induce a structuration, without changing the structure (weak structuration)
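The compatibility claim can be illustrated with a small sketch: version information is added as an overlay on existing elements, so a reader written before the change still produces the same result. The element names and the `version` attribute below are hypothetical and are not the actual XSAMS schema; the real mechanism is the one described in the VAMDC publications.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified output file.
doc = ET.fromstring(
    "<Output>"
    "<State id='s1'><Energy>1.5</Energy></State>"
    "<State id='s2'><Energy>2.5</Energy></State>"
    "</Output>"
)


def old_reader(root):
    """Pre-existing consumer: only knows ids and energies."""
    return {s.get("id"): float(s.findtext("Energy")) for s in root.iter("State")}


def tag_versions(root, versions):
    """Overlay: annotate each element with a version tag, structure untouched."""
    for s in root.iter("State"):
        s.set("version", versions[s.get("id")])


def group_by_version(root):
    """New consumer: recover the grouping induced by the tags."""
    groups = {}
    for s in root.iter("State"):
        groups.setdefault(s.get("version"), []).append(s.get("id"))
    return groups


baseline = old_reader(doc)
tag_versions(doc, {"s1": "v1", "s2": "v2"})
```

After tagging, `old_reader(doc)` still equals `baseline`, while `group_by_version(doc)` exposes the new structuration to tools that know about it.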
![Page 138: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/138.jpg)
Data-Versioning: overview of the fine grained mechanisms
Technical details described in: C.M. Zwölf, N. Moreau, M.-L. Dubernet, "New model for datasets citation and extraction reproducibility in VAMDC", in press, J. Mol. Spectrosc. (2016), http://dx.doi.org/10.1016/j.jms.2016.04.009
arXiv version: https://arxiv.org/abs/1606.00405
![Page 139: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/139.jpg)
Let us focus on the query store:
The difficulties we have to cope with:
• Handle a query store in a distributed environment (RDA did not design it for these configurations).
• Integrate the query store with the existing VAMDC infrastructure.
![Page 140: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/140.jpg)
The implementation of the query store is the goal of a joint collaboration between VAMDC and RDA-Europe.
• Development started during spring 2016.
• Final product released during 2017.
![Page 141: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/141.jpg)
Collaboration with Elsevier for embedding the VAMDC query store into the pages displaying the digital version of papers. Designing technical solutions for:
• Paper / data linking at the paper submission (for authors)
• Paper / data linking at the paper display (for readers)
![Page 142: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/142.jpg)
Let us focus on the query store:
Sketching the functioning – From the final-user point of view:
Data extraction procedure
• VAMDC portal (query interface): submitting query
• VAMDC infrastructure: query in, computed response out
• VAMDC portal (result part):
• Access to the output data file
• Digital Unique Identifier associated to the current extraction
![Page 143: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/143.jpg)
Let us focus on the query store:
Sketching the functioning – From the final-user point of view:
The Digital Unique Identifier resolves to a landing page in the Query Store, showing the query metadata:
• The original query
• Date & time when the query was processed
• Version of the infrastructure when the query was processed
• List of publications needed for answering the query
• When supported (by the VAMDC federated DB): retrieve the output data-file as it was computed (query re-execution)
![Page 144: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/144.jpg)
Let us focus on the query store:
Sketching the functioning – From the final-user point of view:
In the Query Store, users can:
• Manage queries (with authorisation/authentication)
• Group arbitrary set of queries (with related DUI) and assign them a DOI to use in publications
• Use the DOI in papers
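The user-facing pieces described above can be sketched as two small data structures: one record per extraction (what a landing page would show) and a bundle that groups several Digital Unique Identifiers under one DOI. All names, identifiers and field choices here are hypothetical; how the DOI itself is minted is outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryRecord:
    dui: str                      # Digital Unique Identifier of one extraction
    query: str                    # the original query
    processed_at: str             # date & time the query was processed
    infrastructure_version: str   # infrastructure version at that time
    publications: tuple = ()      # references needed for answering the query


def landing_page(record: QueryRecord) -> dict:
    """Metadata a landing page could display for one DUI."""
    return {
        "Original query": record.query,
        "Processed at": record.processed_at,
        "Infrastructure version": record.infrastructure_version,
        "Publications": list(record.publications),
    }


def group_for_citation(records, doi: str) -> dict:
    """Bundle several extractions under one DOI for use in a paper."""
    refs = sorted({p for r in records for p in r.publications})
    return {"doi": doi, "members": [r.dui for r in records], "publications": refs}


r1 = QueryRecord("dui-0001", "query A", "2016-05-01T10:00:00Z", "v12", ("ref-b", "ref-a"))
r2 = QueryRecord("dui-0002", "query B", "2016-05-02T09:30:00Z", "v12", ("ref-a", "ref-c"))
bundle = group_for_citation([r1, r2], doi="10.9999/hypothetical-bundle")
```

The bundle keeps the member DUIs, so a reader following the DOI can still reach each individual extraction and its publications.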
![Page 145: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/145.jpg)
Let us focus on the query store:
Sketching the functioning – From the final-user point of view:
Editors may follow the citation pipeline: credit delegation applies
![Page 146: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/146.jpg)
Let us focus on the query store:
Sketching the functioning – Technical internal point of view:
VAMDC Node → (notification) → Query Store Listener Service
1 When a node receives a user query, it sends the Listener Service the following information:
• The identity of the user (optional)
• The client software used
• The identifier of the node receiving the query
• The version (with related timestamp) of the node receiving the query
• The version of the output standard used by the node for returning the results
• The query submitted by the user
• The link to the result data
![Page 147: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/147.jpg)
Let us focus on the query store:
Sketching the functioning – Technical internal point of view:
VAMDC Node → (notification) → Query Store Listener Service
2 For each received notification, the listener checks whether there is an already existing query:
• Having the same query
• Having the same node version
• Submitted to the same node and having the same node version
![Page 148: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/148.jpg)
Let us focus on the query store:
Sketching the functioning – Technical internal point of view:
Not existing:
• Provide the query with a unique time-stamped identifier
• Following the link, get the data and process them to extract relevant metadata (e.g. BibTeX of the references used for compiling the output file)
• Store these relevant metadata
• If the identity of the user is provided, associate the user with the query ID
Existing:
• Get the already existing unique identifier and (incrementally) associate the new query timestamp (and, if provided, the identifier of the user) with the query identifier
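The notification and the existing / not-existing branches can be sketched as follows. This is an illustration under stated assumptions, not the VAMDC implementation: the class names, the in-memory dictionaries, the identifier format and the `extract_metadata` callable are all hypothetical.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Notification:
    node_id: str
    node_version: str
    output_standard_version: str
    client: str
    query: str
    result_link: str
    user: Optional[str] = None    # the user identity is optional


class QueryStoreListener:
    def __init__(self, extract_metadata):
        self._extract = extract_metadata   # callable: result link -> metadata
        self._ids = {}                     # (query, node, node version) -> id
        self.entries = {}                  # id -> stored entry

    def handle(self, n: Notification) -> str:
        now = datetime.now(timezone.utc).isoformat()
        key = (n.query, n.node_id, n.node_version)
        query_id = self._ids.get(key)
        if query_id is None:
            # Not existing: mint a time-stamped identifier, fetch and store metadata.
            query_id = f"{now}-{uuid.uuid4().hex[:8]}"
            self._ids[key] = query_id
            self.entries[query_id] = {
                "notification": n,
                "metadata": self._extract(n.result_link),
                "timestamps": [],
                "users": [],
            }
        # Both branches: record this execution (and the user, if known).
        entry = self.entries[query_id]
        entry["timestamps"].append(now)
        if n.user is not None:
            entry["users"].append(n.user)
        return query_id


listener = QueryStoreListener(extract_metadata=lambda link: {"source": link})
n = Notification(
    node_id="node-a", node_version="12", output_standard_version="1.0",
    client="portal", query="query A", result_link="https://example.org/result/1",
)
first = listener.handle(n)
second = listener.handle(n)   # same query, same node, same node version
```

Because the key includes the node version, the same query sent after a data update gets a new identifier, while repeated executions against an unchanged node only add timestamps to the existing entry.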
![Page 149: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/149.jpg)
Let us focus on the query store:
Sketching the functioning – Technical internal point of view:
Remark on query uniqueness:
• The query language supported by the VAMDC infrastructure is VSS2 (VAMDC SQL Subset 2, http://vamdc.eu/documents/standards/queryLanguage/vss2.html).
• We are working on a specific VSS2 parser (based on ANTLR) which should identify, among queries expressed in different ways, the ones that are semantically identical.
• We are designing this analyzer as an independent module, hoping to extend it to all SQL.
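To make the uniqueness problem concrete, here is a deliberately naive stand-in for such an analyzer. It is not a parser and not the ANTLR-based module mentioned above: it only collapses whitespace, normalises spacing around comparison operators and sorts the terms of a flat `AND` chain, and it would give wrong answers for `OR`, parentheses or string literals containing these keywords. The query strings are illustrative only.

```python
import re


def canonical_form(query: str) -> str:
    """Toy canonicalisation: collapse whitespace and sort flat AND terms."""
    text = " ".join(query.split())
    parts = re.split(r"\s+WHERE\s+", text, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 1:
        return text.upper()
    head, where = parts
    terms = re.split(r"\s+AND\s+", where, flags=re.IGNORECASE)
    # Remove optional spaces around comparison operators.
    terms = [re.sub(r"\s*(<=|>=|=|<|>)\s*", r"\1", t) for t in terms]
    return f"{head.upper()} WHERE {' AND '.join(sorted(terms))}"


q1 = "select * where AtomSymbol='Fe' and RadTransWavelength > 4000"
q2 = "SELECT *  WHERE RadTransWavelength>4000 AND AtomSymbol = 'Fe'"
same = canonical_form(q1) == canonical_form(q2)
```

Two differently written queries map to the same canonical string, which could then serve as the "same query" part of the lookup key in a query store.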
![Page 150: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/150.jpg)
Final remarks:
• Our aims:
• Provide the VAMDC infrastructure with an operational query store
• Share our experience with other data-providers
• Provide data-providers with a set of libraries/tools/methods for an easy implementation of a query store.
• We will try to build a generic query store (i.e. using generic software blocks)
![Page 152: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/152.jpg)
UK National River Flow Archive
Curation and dissemination of regulatory river flow data for research and other access
Data used for significant research outputs, and a large number of citations annually
Updated annually but also regular revision of entire flow series through time (e.g. stations resurveyed)
Internal auditing, but history is not exposed to users
![Page 153: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/153.jpg)
UK National River Flow Archive
ORACLE relational database
Time series and metadata tables
~20M daily flow records, + monthly / daily catchment rainfall series
Metadata (station history, owners, catchment soils / geology, etc.)
Total size of ~5GB
Time series tables automatically audited
- But reconstruction is complex
Users generally download simple files
But public API is in development / R-NRFA package is out there
Fortunately all access is via single codeset
![Page 154: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/154.jpg)
Our data citation requirements
Cannot currently cite whole dataset
Allow citation of a subset of the data, as it was at the time
Fit with current workflow / update schedule, and requirements for reproducibility
Fit with current (file download) and future (API) user practices
Resilient to gradual or fundamental changes in technologies used
Allow tracking of citations in publications
![Page 155: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/155.jpg)
Options for versioning / citation
“Regulate” queries:
- limitations on service provided
Enable any query to be timestamped / cited / reproducible:
- does not readily allow verification (e.g. checksum) of queries (R7), or storage of queries (R9)
Manage which queries can be citable:
- limitation on publishing workflow?
![Page 156: Enabling Precise Identification and Citability of Dynamic Data: Recommendations of the RDA Working Group on Data Citation](https://reader031.vdocuments.mx/reader031/viewer/2022011721/5878adab1a28ab724c8b4c65/html5/thumbnails/156.jpg)
Versioning / citation solution
Automated archiving of entire database – version controlled scripts defining tables, creating / populating archived tables (largely complete)
Fits in with data workflow – public / dev versions – this only works because we have irregular / occasional updates
Simplification of the data model (complete)
API development (being undertaken independently of dynamic citation requirements):
- allows subsetting of dataset in a number of ways – initially simply
- need to implement versioning (started) to ensure will cope with changes to data structures
Fit to dynamic data citation recommendations?
- Largely
- Need to address mechanism for users to request / create citable version of a query
Resource required: estimated ~2 person months
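The archiving approach can be sketched with SQLite standing in for the ORACLE database: each release copies the live table into an archived table named after the version, and a cited query is re-run against the archive of the version it was cited at. Table, column and version names are hypothetical, and whole-table copies are only reasonable here because, as noted above, updates are occasional.

```python
import sqlite3


def archive_snapshot(conn, version: str) -> None:
    """Copy the live table into an archived table named after the version."""
    if not version.isidentifier():
        raise ValueError("version must be a simple identifier")
    conn.execute(f"CREATE TABLE flow_{version} AS SELECT * FROM flow")


def flows_as_of(conn, version: str, station: str):
    """Re-run a subset query against the archive of a given version."""
    if not version.isidentifier():
        raise ValueError("version must be a simple identifier")
    sql = f"SELECT day, value FROM flow_{version} WHERE station = ? ORDER BY day"
    return conn.execute(sql, (station,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flow (station TEXT, day TEXT, value REAL)")
conn.execute("INSERT INTO flow VALUES ('S1', '2016-01-01', 3.2)")
archive_snapshot(conn, "v1")
conn.execute("UPDATE flow SET value = 3.5 WHERE station = 'S1'")  # later revision
archive_snapshot(conn, "v2")

cited = flows_as_of(conn, "v1", "S1")     # data as it was when cited
current = flows_as_of(conn, "v2", "S1")   # data after the revision
```

A citation then only needs to carry the version name and the subset parameters to reproduce the data as they were at the time.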