Todd J. Vision National Evolutionary Synthesis Center Department of Biology University of North Carolina at Chapel Hill ORCID Participant Meeting, Harvard, May 2011 Leveraging publication metadata to help overcome the data ingest bottleneck

Page 2

•  The End
    •  To make data archiving integral to scientific publishing.
•  The Scope
    •  Data underlying findings in the peer-reviewed biological literature.
•  The Means
    •  Integrated submission of data with the manuscript
    •  Low barrier to submission (at the datafile level)
    •  Free reuse of data (free as in both speech & beer)
    •  Journals share responsibility for governance and sustainability

Page 4

The long tail of orphan data in “small science”

[Figure, after B. Heidorn: volume (y-axis) against rank frequency of datatype (x-axis), with regions labeled "Specialized repositories (e.g. GenBank, PDB)" and "Orphan data".]

Bumpus HC (1898) The Elimination of the Unfit as Illustrated by the Introduced Sparrow, Passer domesticus. A Fourth Contribution to the Study of Variation. pp. 209-226 in Biological Lectures from the Marine Biological Laboratory, Woods Hole, Mass.

Page 7

A publication package

1. Integrated manuscript and data submission
2. Handshaking with specialized repositories

Page 17

[Workflow diagram with two panels, Integrated and Non-integrated. Labels:]

Integrated: Submit manuscript; Manuscript metadata; Submit data; Peer review; Review passcode; Acceptance notification; Curation; Data DOI; Production; Article metadata; Curation; Article publication; Data publication

Non-integrated: Submit data; Author adds DOI; Data DOI; Article publication; Article metadata harvested
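The integrated panel can be read as a small state machine. The following is a minimal sketch under stated assumptions, not Dryad's implementation: the stage names paraphrase the diagram labels, the strictly linear ordering is an assumption (the diagram itself does not survive in this transcript), and `DataPackage` and `advance` are hypothetical names.

```python
from enum import Enum


class Stage(Enum):
    """Integrated-workflow stages, paraphrased from the diagram labels."""
    MANUSCRIPT_SUBMITTED = "submit manuscript"
    DATA_SUBMITTED = "submit data"
    PEER_REVIEW = "peer review"
    ACCEPTED = "acceptance notification"
    DATA_DOI_ISSUED = "data DOI"
    PUBLISHED = "article and data publication"


# Assumed linear ordering; a real process would also handle rejection, revision, etc.
ORDER = list(Stage)


class DataPackage:
    """Hypothetical record tracking one submission through the stages."""

    def __init__(self, manuscript_id):
        self.manuscript_id = manuscript_id
        self.stage = Stage.MANUSCRIPT_SUBMITTED
        self.data_doi = None  # recorded at the "data DOI" step in this sketch

    def advance(self, data_doi=None):
        """Move to the next stage; the DOI step requires a DOI string."""
        idx = ORDER.index(self.stage)
        if idx == len(ORDER) - 1:
            raise ValueError("package is already published")
        nxt = ORDER[idx + 1]
        if nxt is Stage.DATA_DOI_ISSUED:
            if not data_doi:
                raise ValueError("a data DOI is required at this step")
            self.data_doi = data_doi
        self.stage = nxt
        return self.stage


pkg = DataPackage("MS-0001")  # hypothetical manuscript number
pkg.advance()                 # data submitted
pkg.advance()                 # peer review
```

Keeping the DOI check inside `advance` means a package cannot reach publication in this model without an identifier that the article can cite.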

Page 18

Article
Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011

Dryad data package
Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Data from: Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. Dryad Digital Repository. doi:10.5061/dryad.8384

Page 19

•  Integrated submission
    •  Currently integrated or in process: 20
    •  All journals with Dryad content: >70
    •  A minority require data prior to review
•  Affiliated publishers:
    •  Traditional (e.g. Oxford University Press, Wiley-Blackwell)
    •  Open Access (e.g. BMC, BMJ Open)
    •  Society publishers (e.g. with Allen Press, or independent)

Page 20

Dryad vs. Supplementary Online Materials (SOM)

Criterion | Dryad | SOM
Article citations: reuse of data leads to article citations | ✔ | ✔
Identifiable: data DOIs within articles serve as permanent, resolvable identifiers | ✔ | ✔/✗
Curated: quality control of data submissions and indexing metadata | ✔ | ✔/✗
Economy of scale: cost efficiency from shared infrastructure | ✔ | ✔/✗
Discoverable: indexed and exposed to both web and bibliographic search engines | ✔ | ✔/✗
Ease of deposit: streamlined deposit, allows large and complex datasets | ✔/✗ | ✔/✗
Formatted for reuse, i.e. not PDF | ✔/✗ | ✔/✗
Updatable: new versions of data files can be added, metadata can be enhanced | ✔ | ✗
Preservation planning: integrity audits, format migration, replication, etc. | ✔ | ?
Support for embargoes: can delay release of data in accordance with journal policy | ✔ | ?
Free reuse: no paywall, no unnecessary IP restrictions/ambiguities | ✔ | ?
Data citations: reuse of data leads to data citations | ? | ✗

Page 21

612 downloads

Page 22

Investigator toolkit

Member nodes
•  Dryad, ORNL DAAC, Knowledge Network for Biocomplexity, etc.

Coordinating nodes

Page 23

Why Dryad yearns for ORCIDs

•  Replace name strings with identities
    •  Disambiguation of like names
    •  Confidently recognizing different data packages that share an author
    •  Not being fooled by different names for the same author
•  Enabling
    •  Accurate author searches
    •  Internal and external author hyperlinks
    •  Aggregation of author contributions
    •  Inclusion of data records in the profiles of coauthors
    •  Propagation of ORCIDs with Dryad metadata
•  Manual curation of names not feasible
    •  Only ~20% of authors match LC name authority file
    •  Manual control would explode curation costs
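A toy illustration of the two failure modes named above (like names, and different names for the same author). All records, names, and identifiers here are hypothetical, and `group_packages` is an illustrative helper, not part of any Dryad or ORCID interface.

```python
from collections import defaultdict


def group_packages(records, key):
    """Group data package ids by the chosen author key ('name' or 'author_id')."""
    groups = defaultdict(set)
    for rec in records:
        groups[rec[key]].add(rec["package"])
    return dict(groups)


# Hypothetical records: two people share the string "Smith J",
# and one person appears under two different strings.
records = [
    {"name": "Smith J", "author_id": "author-1", "package": "pkg-A"},
    {"name": "Smith J", "author_id": "author-2", "package": "pkg-B"},
    {"name": "Smith, Jane", "author_id": "author-1", "package": "pkg-C"},
]

by_name = group_packages(records, "name")     # merges two people, splits one
by_id = group_packages(records, "author_id")  # one group per person
```

Grouping on the name string wrongly puts pkg-A and pkg-B together and leaves pkg-C on its own; grouping on the identifier recovers the intended aggregation of each author's contributions.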

Page 24

How to get ORCIDs into Dryad

•  Ideally sent to Dryad by integrated journals
    •  Pre-review/Pre-production: allows coauthors to edit data packages
    •  Post-production: works for all other uses
•  Non-integrated journals
    •  Lookup API based on article or affiliation data
•  To be avoided
    •  Authors required to enter ORCIDs during submission
    •  Authors required to register during submission

Page 25

What do we know about authors?

•  Names
    •  Often abbreviated except for corresponding or submitting author
•  At least one article they have written
    •  Title, journal, volume, pages, DOI, abstract
•  Other identifiable information
    •  An email for submitting authors
    •  Sometimes: institutional affiliation and contact information for corresponding authors
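The two facts a repository reliably holds, an abbreviated name and an article DOI, are already enough to drive a simple lookup. The sketch below is only an illustration of that idea: the registry, its profile layout, the example DOIs, and the functions `initials_form` and `lookup` are all hypothetical and do not describe any actual ORCID service.

```python
def initials_form(full_name):
    """Abbreviate 'Jane Q Smith' to 'Smith JQ'.

    Assumes the family name is the last token, which is a simplification.
    """
    parts = full_name.split()
    family, given = parts[-1], parts[:-1]
    initials = "".join(g[0].upper() for g in given)
    return f"{family} {initials}".strip()


def lookup(author_string, article_doi, registry):
    """Return ids of profiles matching both the name string and the article DOI."""
    return [
        profile["id"]
        for profile in registry
        if article_doi in profile["works"]
        and initials_form(profile["name"]) == author_string
    ]


# Hypothetical registry: both profiles abbreviate to "Smith JQ".
REGISTRY = [
    {"id": "profile-1", "name": "Jane Q Smith", "works": {"10.1234/example.1"}},
    {"id": "profile-2", "name": "John Q Smith", "works": {"10.1234/example.2"}},
]

candidates = lookup("Smith JQ", "10.1234/example.1", REGISTRY)
```

The name alone is ambiguous here; it is the known article that singles out one profile, which is why article metadata is a useful lookup key for non-integrated journals.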

Page 26

Some requirements

•  Recognizing ORCIDs for authenticated users
    •  Mapping to InCommon Silver profiles
    •  ORCIDs for organizations (e.g. consortia)
•  DSpace support
    •  Curator interface for ORCID lookup/verification
    •  Ideally optional in submission interface
    •  Allowing metadata relationships (e.g. of an ORCID with a name)
•  Mechanisms for curator to
    •  At least flag duplicates and errors
    •  Register provisional ORCIDs
    •  Map to other profiles (e.g. InCommon)
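One cheap form of verification a curator interface could apply is a syntax and checksum test on the identifier itself. The sketch below relies on the ORCID iD format as later published by ORCID (16 characters in four hyphenated groups, with an ISO 7064 MOD 11-2 check character), which is not described in these slides; the function names are illustrative. A passing check only shows the string is well formed, not that it belongs to the right person.

```python
import re

# Four hyphen-separated groups; the final character may be the check value "X".
ORCID_PATTERN = re.compile(r"^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$")


def orcid_check_char(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if the string has the ORCID iD shape and a correct check character."""
    if not ORCID_PATTERN.match(orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_char(compact[:15]) == compact[15]


# 0000-0002-1825-0097 is the sample iD ORCID uses for a fictitious researcher.
ok = is_valid_orcid("0000-0002-1825-0097")
```

A check like this catches most single-character typos at entry time, which helps with the "flag errors" requirement without any network lookup.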

Page 27

Business model issues

•  Dryad is (will be) supported by subscriptions and deposit charges, primarily from journals.
    •  With a not-for-profit budget
•  Feasibility requires wide adoption by publishers
    •  And manuscript-submission system developers!
•  Favored model
    •  Pay for use of automated lookup services
    •  Credit for curator contributions

Page 28

For more information:
http://datadryad.org
http://blog.datadryad.org
http://datadryad.org/wiki
http://code.google.com/p/dryad
[email protected]
Facebook: Dryad
Twitter: @datadryad

"Cherish old knowledge that you may acquire new" The Analects of Confucius

Special thanks to Elena Feinstein, Jane Greenberg, Ryan Scherle

Page 29

Dryad Metadata Profile (v3.0)

Datafile
•  dc.identifier = doi of data file
•  dc.relation.isPartOf = doi of data package
•  file-specific description: keywords, authors, format, size, checksum, etc.
•  embargo information (type, end date)

Data Package
•  dc.identifier = doi of data package
•  dc.relation.hasPart = dois of data files
•  dc.references = handle of article description record
•  dc.title = title of data package
•  dc.description (not article abstract, optional)
•  dc.creator = authors of data package
•  dc.date (with refinements – dates associated with submission to Dryad and archiving in the repository)
•  dryad.external = GenBank accession number, TreeBASE identifier
•  dc.relation = URL of related resource
•  dc.subject = general keywords
•  DarwinCore.ScientificName = taxon keywords
•  dc.spatial = geographic keywords
•  dc.temporal = timespan keywords
•  dryad.curatorNote

Article
•  dc.identifier = doi of article
•  bibo.status = article publication status
•  dc.creator = authors of article
•  dc.issued = article publication date
•  dc.title = title of article
•  bibo.journal = journal title
•  bibo.issn and bibo.eissn
•  bibo.volume
•  bibo.issue
•  bibo.pageStart and bibo.pageEnd
•  dc.abstract = article abstract
•  dc.isReferencedBy = data package doi
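The package and file records point at each other through dc.relation.hasPart and dc.relation.isPartOf, so the links can be checked mechanically. The sketch below uses plain dictionaries keyed by the field names from the profile; the storage layout, the datafile DOIs, and the function `check_links` are assumptions made for illustration, and only the package DOI comes from the example cited earlier in the talk.

```python
def check_links(package, datafiles):
    """Return a list of inconsistencies between hasPart and isPartOf links."""
    problems = []
    listed = set(package["dc.relation.hasPart"])
    for f in datafiles:
        if f["dc.relation.isPartOf"] != package["dc.identifier"]:
            problems.append(f"{f['dc.identifier']} does not point back to the package")
        if f["dc.identifier"] not in listed:
            problems.append(f"{f['dc.identifier']} is missing from hasPart")
    seen = {f["dc.identifier"] for f in datafiles}
    for doi in sorted(listed - seen):
        problems.append(f"{doi} is listed in hasPart but has no datafile record")
    return problems


# Package DOI from the example above; the datafile DOIs are hypothetical.
package = {
    "dc.identifier": "doi:10.5061/dryad.8384",
    "dc.relation.hasPart": ["doi:10.5061/dryad.8384/1", "doi:10.5061/dryad.8384/2"],
}
datafiles = [
    {"dc.identifier": "doi:10.5061/dryad.8384/1",
     "dc.relation.isPartOf": "doi:10.5061/dryad.8384"},
    {"dc.identifier": "doi:10.5061/dryad.8384/2",
     "dc.relation.isPartOf": "doi:10.5061/dryad.8384"},
]

problems = check_links(package, datafiles)
```

The same pattern would extend to the article record, whose dc.isReferencedBy should name the data package that in turn references the article.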