TRANSCRIPT
Todd J. Vision National Evolutionary Synthesis Center
Department of Biology University of North Carolina at Chapel Hill
ORCID Participant Meeting, Harvard, May 2011
Leveraging publication metadata to help overcome the data ingest bottleneck
• The End: To make data archiving integral to scientific publishing.
• The Scope: Data underlying findings in the peer-reviewed biological literature.
• The Means:
  Integrated submission of data with the manuscript
  Low barrier to submission (at the datafile level)
  Free reuse of data (free as in both speech & beer)
  Journals share responsibility for governance and sustainability
The long tail of orphan data in “small science”
[Figure: volume plotted against rank frequency of datatype. Specialized repositories (e.g. GenBank, PDB) cover the head of the curve; orphan data make up the long tail. After B. Heidorn.]
Bumpus HC (1898) The Elimination of the Unfit as Illustrated by the Introduced Sparrow, Passer domesticus. A Fourth Contribution to the Study of Variation. pp. 209-226 in Biological Lectures from the Marine Biological Laboratory, Woods Hole, Mass.
A publication package
1. Integrated manuscript and data submission
2. Handshaking with specialized repositories
Submission workflows (diagram built up over successive slides)
Integrated
  Submit manuscript
  Submit data
  Manuscript metadata
  Peer review / Review passcode
  Acceptance notification / Curation
  Data DOI / Production
  Article metadata / Curation
  Article publication / Data publication
Non-integrated
  Submit data
  Data DOI
  Author adds DOI
  Article publication
  Article metadata harvested
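The integrated workflow in the diagram can be sketched as an ordered checklist that a submission moves through. This is only an illustration: the stage labels come from the slides, but the strict linear ordering, the `Submission` class, and the `advance` method are assumptions, not Dryad's actual implementation.

```python
from dataclasses import dataclass, field

# Stage labels are taken from the workflow diagram. Treating them as one
# strictly linear sequence is an assumption made for this sketch.
INTEGRATED_STAGES = [
    "submit manuscript",
    "manuscript metadata",
    "submit data",
    "peer review",
    "acceptance notification",
    "curation",
    "data DOI",
    "production",
    "article metadata",
    "curation",
    "article publication",
    "data publication",
]


@dataclass
class Submission:
    """Hypothetical tracker for one manuscript and its data package."""
    manuscript_id: str
    completed: list = field(default_factory=list)

    @property
    def is_published(self) -> bool:
        return len(self.completed) == len(INTEGRATED_STAGES)

    def advance(self, stage: str) -> None:
        """Record the next stage, refusing anything out of order."""
        if self.is_published:
            raise ValueError("workflow already complete")
        expected = INTEGRATED_STAGES[len(self.completed)]
        if stage != expected:
            raise ValueError(f"expected {expected!r}, got {stage!r}")
        self.completed.append(stage)
```

A tracker like this makes the point of the diagram concrete: in the integrated path, the data DOI exists before production, so the article can cite it without the author having to add it by hand.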
Article
Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011
Dryad data package
Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Data from: Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. Dryad Digital Repository. doi:10.5061/dryad.8384
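The pair of citations above shows how a data package citation can be derived almost entirely from article metadata. A minimal sketch of that derivation follows; the function name `data_citation` and its parameters are hypothetical, and the pattern is simply read off the example rather than taken from any Dryad specification.

```python
def data_citation(authors: str, year: int, article_title: str, data_doi: str) -> str:
    """Build a data package citation from article metadata.

    Follows the pattern visible in the example: same authors and year,
    title prefixed with 'Data from:', repository name, data DOI.
    """
    return (
        f"{authors} ({year}) Data from: {article_title}. "
        f"Dryad Digital Repository. doi:{data_doi}"
    )


citation = data_citation(
    authors="Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA",
    year=2011,
    article_title=(
        "Stalking the fourth domain in metagenomic data: searching for, "
        "discovering, and interpreting novel, deep branches in phylogenetic "
        "trees of phylogenetic marker genes"
    ),
    data_doi="10.5061/dryad.8384",
)
```

Only the data DOI is new information here; everything else is already known to the journal, which is why receiving publication metadata from the journal lowers the burden on the depositor.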
• Integrated submission
  Currently integrated or in process: 20
  All journals with Dryad content: >70
  A minority require data prior to review
• Affiliated publishers:
  Traditional (e.g. Oxford University Press, Wiley-Blackwell)
  Open Access (e.g. BMC, BMJ Open)
  Society publishers (e.g. with Allen Press, or independent)
Dryad vs. Supplementary Online Materials (SOM)
Criterion | Dryad | SOM
Article citations: reuse of data leads to article citations | ✔ | ✔
Identifiable: Data DOIs within articles serve as permanent, resolvable identifiers | ✔ | ✔/✗
Curated: quality control of data submissions and indexing metadata | ✔ | ✔/✗
Economy of scale: cost efficiency from shared infrastructure | ✔ | ✔/✗
Discoverable: indexed and exposed to both web and bibliographic search engines | ✔ | ✔/✗
Ease of deposit: streamlined deposit, allow large and complex datasets | ✔/✗ | ✔/✗
Formatted for reuse, i.e. not PDF | ✔/✗ | ✔/✗
Updatable: new versions of data files can be added, metadata can be enhanced | ✔ | ✗
Preservation planning: integrity audits, format migration, replication, etc. | ✔ | ?
Support for embargoes: can delay release of data in accordance with journal policy | ✔ | ?
Free reuse: no paywall, no unnecessary IP restrictions/ambiguities | ✔ | ?
Data citations: reuse of data leads to data citations | ? | ✗
612 downloads
Investigator toolkit
Member nodes: Dryad, ORNL DAAC, Knowledge Network for Biocomplexity, etc.
Coordinating nodes
Why Dryad yearns for ORCIDs
• Replace name strings with identities
  Disambiguation of like names
  Confidently recognizing different data packages that share an author
  Not being fooled by different names for the same author
• Enabling
  Accurate author searches
  Internal and external author hyperlinks
  Aggregation of author contributions
  Inclusion of data records in the profiles of coauthors
  Propagation of ORCIDs with Dryad metadata
• Manual curation of names not feasible
  Only ~20% of authors match LC name authority file
  Manual control would explode curation costs
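The two failure modes of name strings listed above (like names for different people, different names for the same person) can be shown with a toy example. All records and identifiers below are invented for illustration; `group` is a hypothetical helper, not part of any Dryad code.

```python
from collections import defaultdict

# Hypothetical records: (author name string, author identifier, data package).
records = [
    ("Smith J", "id-001", "package A"),
    ("Smith J", "id-002", "package B"),      # a different J. Smith
    ("Smith, John", "id-001", "package C"),  # same person as the first row
]


def group(rows, key_index):
    """Collect data packages under the chosen key (name or identifier)."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_index]].add(row[2])
    return dict(groups)


by_name = group(records, 0)  # merges two people, splits one person
by_id = group(records, 1)    # one entry per actual author
```

Grouping by name string puts packages A and B under one "author" and leaves C stranded, while grouping by identifier recovers the correct aggregation without any manual curation of names.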
How to get ORCIDs into Dryad
• Ideally sent to Dryad by integrated journals
  Pre-review/Pre-production: allows coauthors to edit data packages
  Post-production: works for all other uses
• Non-integrated journals
  Lookup API based on article or affiliation data
• To be avoided
  Authors required to enter ORCIDs during submission
  Authors required to register during submission
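For the non-integrated case, a lookup based on article data might look roughly like the sketch below. Everything here is hypothetical: the in-memory `registry`, the DOI `10.1234/example.1`, the placeholder identifiers, and the functions `name_key` and `lookup` stand in for whatever service a real lookup API would provide. The idea is only that an article DOI narrows the candidates enough for an abbreviated name to pick out one identity.

```python
def name_key(name: str) -> tuple:
    """Reduce 'Doe J', 'Doe, Jane' or 'Jane Doe' to (surname, first initial)."""
    if "," in name:
        surname, given = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split()
        # 'Doe J': treat a short upper-case trailing token as initials.
        if len(parts[-1]) <= 3 and parts[-1].isupper():
            surname, given = " ".join(parts[:-1]), parts[-1]
        else:
            surname, given = parts[-1], parts[0]
    return surname.lower(), given[:1].lower()


def lookup(registry: dict, article_doi: str, author_name: str):
    """Return the single identifier matching the author on this article, else None."""
    candidates = registry.get(article_doi.lower(), [])
    matches = [
        ident for full_name, ident in candidates
        if name_key(full_name) == name_key(author_name)
    ]
    # Refuse to guess when the match is missing or ambiguous.
    return matches[0] if len(matches) == 1 else None


# Hypothetical registry contents: article DOI -> [(full name, identifier)].
registry = {
    "10.1234/example.1": [("Jane Doe", "ID-0001"), ("Ravi Kumar", "ID-0002")],
}

found = lookup(registry, "10.1234/example.1", "Doe J")
missing = lookup(registry, "10.1234/example.1", "Smith A")
```

Returning nothing on an ambiguous match, rather than the first candidate, keeps the curator in the loop for exactly the cases where an automated guess would be risky.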
What do we know about authors?
• Names
  Often abbreviated except for corresponding or submitting author
• At least one article they have written
  Title, journal, volume, pages, DOI, abstract
• Other identifiable information
  An email for submitting authors
  Sometimes: institutional affiliation and contact information for corresponding authors
Some requirements
• Recognizing ORCIDs for authenticated users
  Mapping to InCommon Silver profiles
  ORCIDs for organizations (e.g. consortia)
• DSpace support
  Curator interface for ORCID lookup/verification
  Ideally optional in submission interface
  Allowing metadata relationships (e.g. of an ORCID with a name)
• Mechanisms for curator to
  At least flag duplicates and errors
  Register provisional ORCIDs
  Map to other profiles (e.g. InCommon)
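One cheap way for a curator interface to flag errors is a check-digit test on the identifier itself. The sketch below assumes the 16-character identifier format with an ISO 7064 MOD 11-2 check character that ORCID adopted after this talk; nothing in the slides specifies it, and the function names are hypothetical. The sample identifier is the fictitious one ORCID uses in its own documentation.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid: str) -> bool:
    """True if the identifier has 16 characters and a matching check character."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


ok = is_well_formed_orcid("0000-0002-1825-0097")   # documented sample iD
bad = is_well_formed_orcid("0000-0002-1825-0098")  # last digit mistyped
```

A check like this catches typing errors only; it says nothing about whether the identifier belongs to the named author, which still needs lookup or curator verification.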
Business model issues
• Dryad is (will be) supported by subscriptions and deposit charges, primarily from journals.
  With a not-for-profit budget
• Feasibility requires wide adoption by publishers
  And manuscript-submission system developers!
• Favored model
  Pay for use of automated lookup services
  Credit for curator contributions
For more information:
http://datadryad.org
http://blog.datadryad.org
http://datadryad.org/wiki
http://code.google.com/p/dryad
[email protected]
Facebook: Dryad
Twitter: @datadryad
"Cherish old knowledge that you may acquire new" The Analects of Confucius
Special thanks to Elena Feinstein, Jane Greenberg, Ryan Scherle
Dryad Metadata Profile (v3.0)
Datafile
• dc.identifier = doi of data file
• dc.relation.isPartOf = doi of data package
• file-specific description: keywords, authors, format, size, checksum, etc.
• embargo information (type, end date)
Data Package
• dc.identifier = doi of data package
• dc.relation.hasPart = dois of data files
• dc.references = handle of article description record
• dc.title = title of data package
• dc.description (not article abstract, optional)
• dc.creator = authors of data package
• dc.date (with refinements: dates associated with submission to Dryad and archiving in the repository)
• dryad.external = GenBank accession number, TreeBASE identifier
• dc.relation = URL of related resource
• dc.subject = general keywords
• DarwinCore.ScientificName = taxon keywords
• dc.spatial = geographic keywords
• dc.temporal = timespan keywords
• dryad.curatorNote
Article
• dc.identifier = doi of article
• bibo.status = article publication status
• dc.creator = authors of article
• dc.issued = article publication date
• dc.title = title of article
• bibo.journal = journal title
• bibo.issn and bibo.eissn
• bibo.volume
• bibo.issue
• bibo.pageStart and bibo.pageEnd
• dc.abstract = article abstract
• dc.isReferencedBy = data package doi
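The profile links the three record types through reciprocal identifier fields, and those links can be checked mechanically. The sketch below uses plain dictionaries with the field names from the profile and the article and data package DOIs from the Wu et al. example; the data file DOI, the handle value, and the `link_problems` function are made up for illustration.

```python
article = {
    "dc.identifier": "doi:10.1371/journal.pone.0018011",
    "dc.isReferencedBy": "doi:10.5061/dryad.8384",
}
package = {
    "dc.identifier": "doi:10.5061/dryad.8384",
    "dc.relation.hasPart": ["doi:10.5061/dryad.8384/1"],  # hypothetical file DOI
    "dc.references": "hdl:example/article-record",        # hypothetical handle
}
files = [
    {
        "dc.identifier": "doi:10.5061/dryad.8384/1",      # hypothetical file DOI
        "dc.relation.isPartOf": "doi:10.5061/dryad.8384",
    },
]


def link_problems(article: dict, package: dict, files: list) -> list:
    """Return human-readable descriptions of any broken reciprocal links."""
    problems = []
    package_id = package["dc.identifier"]
    if article.get("dc.isReferencedBy") != package_id:
        problems.append("article does not point at the data package")
    file_ids = {f["dc.identifier"] for f in files}
    if set(package.get("dc.relation.hasPart", [])) != file_ids:
        problems.append("hasPart does not match the data files present")
    for f in files:
        if f.get("dc.relation.isPartOf") != package_id:
            problems.append(f"{f['dc.identifier']} is not part of {package_id}")
    return problems


issues = link_problems(article, package, files)
```

An empty list means the article, data package, and data files all agree about how they relate, which is the property that lets a data DOI cited in an article lead back to every file in the package.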