Varsha Khodiyar, PhD
Data Curation Editor, Scientific Data
Nature Publishing Group
@varsha_khodiyar
@scientificdata
Tweet with #SDJPN16
Gaining credit for sharing research data
Data publishing with Scientific Data
RIKEN Center for Life Science Technologies
4th March 2016
My background
• Joined Scientific Data in October 2014
• Professional data curator since 2003
• PhD in Molecular Biology from the University of Leicester
• Contributed to the Human Genome Project as member of the Human Gene Nomenclature Committee (HGNC)
• Gene Ontology curator for 8 years, at University College London, UK
• 3 years of open data publishing experience
Why share research data?
Generating research data is expensive
Just 18.1% of NIH grant applications were funded in 2014*
• Hours spent writing grants?
• Hours spent reviewing grants?
Resources are finite/expensive
• Modified animals
• Specialized reagents
Time and effort taken in the laboratory to generate good, valid data
* report.nih.gov/success_rates/Success_ByIC.cfm
Irreproducibility of published science
Figure 1 - Ioannidis JPA. et al. Repeatability of published microarray gene expression analyses. Nature Genetics 41, 149–55 (2009) doi:10.1038/ng.295
Withholding data impacts on human health
Clinical study reports, detailed data and software code available at Dryad Digital Repository doi:10.5061/dryad.bv8j6 and www.Study329.org
Sharing data promotes
• Diversity of analyses and opinion
• New research
    • testing of new hypotheses
    • new analysis methods
    • meta-analyses to create new datasets
    • studies on data collection methods
• Education of new researchers
• Increased return on investment in research
Vickers AJ: Whose data set is it anyway? Sharing raw data from randomized trials. Trials 2006, 7:15
Hrynaszkiewicz I, Altman DG: Towards agreement on best practice for publishing raw clinical trial data. Trials 2009, 10:17
Researchers already share data
• Most researchers are sharing data, and using the data of others
• Direct contact between researchers (on request) is a common way of sharing data
• Repositories are the second most common method of sharing
Kratz and Strasser (2015) doi: 10.1371/journal.pone.0117619
Some problems…
• Sharing upon request relies heavily on trust
• Informally stored data associated with published works disappears at a rate of ~17% per year (Vines et al. 2014; doi: 10.1016/j.cub.2013.11.014)
• Datasets not referenced in a manuscript are essentially invisible (a.k.a. “Dark data”)
• If data are available, they are often not interpretable or reusable because sufficient detail is not included
• Data producers do not get appropriate credit for their work
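To give a feel for what a loss of ~17% per year means over time, here is a minimal arithmetic sketch. It assumes the slide's figure can be read as a constant annual loss that compounds, which is a simplification; the cited study may model the decline differently. The function name and the chosen time points are illustrative only.

```python
# Assumption: "~17% per year" is treated as a constant, compounding annual loss.
# This is a simplified reading of the slide, not the cited study's own model.
ANNUAL_LOSS = 0.17


def fraction_remaining(years: int, annual_loss: float = ANNUAL_LOSS) -> float:
    """Fraction of informally stored datasets still available after `years`."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1.0 - annual_loss) ** years


for years in (1, 5, 10, 20):
    share = fraction_remaining(years)
    print(f"after {years:2d} years: {share:.1%} still available")
```

Under this assumption, fewer than one in six datasets would still be available after ten years.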
www.nature.com/scientificdata
Credit – Scholarly credit for publishing data; all publications are indexed and citeable.
Reuse – Standardized and detailed descriptions enable easier reuse of published research data.
Quality – Rigorous peer-review on technical quality and reusability. Editorial Board of experts in their field maintain community standards.
Discovery – Curated, machine-readable metadata for dataset discovery. Validated links to published data in each article.
Open – Use of CC-BY licence for articles and CC0 for metadata. Promote use of open licences for published data.
Service – Commitment to excellent service for authors and readers.
What is a Data Descriptor?
Data Descriptors have human and machine readable components
Human readable representation of study, i.e. article (HTML & PDF)
Machine readable representation of study, i.e. metadata
Comparison of Data Descriptor to traditional article
Synthesis
Analysis
Conclusions
What did I do to generate the data?
How was the data processed?
Where is the data?
Who did what and when?
Methods and technical analyses supporting the quality of the measurements.
Do not contain tests of new scientific hypotheses
What types of data can be published?
• Decades old dataset
• Standalone dataset
• Data that has been used in an analysis article
• Large consortium dataset
• Data from a single experiment
• Data that the researcher finds valuable and that others might find useful too
• Data associated with a high impact analysis article
When can a Data Descriptor be published?
• After data analysis has been published
• Before the analysis has been published
• Publication alongside analysis article
• Authors not intending to analyse data
Data Descriptors can be submitted and published at any point in the research workflow, i.e. whenever it makes most sense for your data
Scientific Data accepts submissions from all quantitative research disciplines
Helping authors find the right place for their data
Scientific Data’s Repository List
Browse our recommended data repositories online.
• We currently list almost 80 repositories, across biological, medical, physical and social sciences
• When required, we provide guidance to authors on the best place to store their data
www.nature.com/sdata/data-policies/repositories
Generation of machine readable metadata
• We want to capture metadata about the dataset being described in each Data Descriptor
• The manuscript captures human readable metadata needed for data reuse
• The curated metadata records capture machine readable metadata needed for machine based data discovery
Metadata at Scientific Data
ISA-Tab format for machine readable metadata
• Study workflow
• Key sample characteristics needed for data discovery
• Relates samples to data files
• Shows location of dataset
• Uses controlled vocabularies and ontologies (where possible)
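As a rough illustration of how tab-delimited metadata can relate samples to data files, here is a small sketch. The two tables are hypothetical and heavily simplified: the column headers are modelled on ISA-Tab conventions as an assumption, real ISA-Tab files carry many more columns, and every sample name and file name below is invented.

```python
import csv
import io

# Hypothetical, simplified study table: one row per sample.
STUDY_TSV = (
    "Source Name\tCharacteristics[organism]\tCharacteristics[organism part]\tSample Name\n"
    "donor1\tHomo sapiens\tliver\tsample1\n"
    "donor2\tMus musculus\tbrain\tsample2\n"
)

# Hypothetical, simplified assay table: links each sample to a data file.
ASSAY_TSV = (
    "Sample Name\tRaw Data File\n"
    "sample1\tsample1_reads.fastq\n"
    "sample2\tsample2_reads.fastq\n"
)


def read_tsv(text):
    """Parse tab-delimited text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def files_by_organism(study_rows, assay_rows):
    """Map each organism to the data files derived from its samples."""
    organism_of = {
        row["Sample Name"]: row["Characteristics[organism]"] for row in study_rows
    }
    result = {}
    for row in assay_rows:
        organism = organism_of[row["Sample Name"]]
        result.setdefault(organism, []).append(row["Raw Data File"])
    return result


index = files_by_organism(read_tsv(STUDY_TSV), read_tsv(ASSAY_TSV))
print(index)
```

Because the sample name appears in both tables, a machine can follow it from a sample characteristic (here, the organism) to the file that holds the data.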
Use of community endorsed ontologies and controlled vocabularies
Controlled vocabulary = list of standardized phrases of scientific concepts
Ontology = controlled vocabulary with defined relationships between terms
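The difference between the two definitions can be sketched as a data structure. The terms and the single "is a" relationship below are invented for illustration and are not taken from any real ontology; real ontologies define several relationship types and far more terms.

```python
# Controlled vocabulary: a flat list of approved phrases (illustrative terms).
VOCABULARY = {"anatomical structure", "organ", "liver", "brain"}

# Ontology: the same terms plus defined relationships between them.
# Only a hypothetical "is a" relationship is modelled here: child -> parent.
IS_A = {
    "organ": "anatomical structure",
    "liver": "organ",
    "brain": "organ",
}


def is_approved(term):
    """A controlled vocabulary can only say whether a phrase is approved."""
    return term in VOCABULARY


def ancestors(term):
    """An ontology can also say what broader terms a phrase belongs to."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def matches(annotation, query):
    """True if a sample annotated with `annotation` answers a search for `query`."""
    return annotation == query or query in ancestors(annotation)


print(is_approved("liver"))       # approved phrase
print(ancestors("liver"))         # broader terms, nearest first
print(matches("liver", "organ"))  # a search for "organ" finds liver samples
```

With only the vocabulary, a search for "organ" would miss samples annotated "liver"; the relationships are what let a broader query find them.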
Structured Summary table from curated metadata
Investigation file
Study file
Sample characteristics reported in Structured Summary table:
• Organism
• Organism part
• Cell line
• Geographical location
• Environment type
Viewing the metadata
Metadata for data discovery
Search by:
• Data Repositories
• Experiment design
• Measurements made
• Technologies used
• Factor types
• Sample Characteristics
    • Organism
    • Environment types
    • Geographic locations
scientificdata.isa-explorer.org
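Searching by these kinds of fields amounts to filtering metadata records on exact field values. The sketch below shows the idea only: it is not the interface of the site listed above, and the records, field names and `search` function are all hypothetical.

```python
# Hypothetical metadata records; titles and values are invented for illustration.
RECORDS = [
    {
        "title": "Descriptor A",
        "repository": "Dryad",
        "technology": "MRI",
        "organism": "Homo sapiens",
    },
    {
        "title": "Descriptor B",
        "repository": "figshare",
        "technology": "DNA sequencing",
        "organism": "Mus musculus",
    },
    {
        "title": "Descriptor C",
        "repository": "Dryad",
        "technology": "DNA sequencing",
        "organism": "Homo sapiens",
    },
]


def search(records, **facets):
    """Return the records whose fields equal every requested facet value."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in facets.items())
    ]


hits = search(RECORDS, repository="Dryad", organism="Homo sapiens")
print([record["title"] for record in hits])
```

Exact matching like this only works when the metadata uses consistent terms, which is why curated records rely on controlled vocabularies.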
Citing Data
Citing my own data
1. In the article text
2. In the Data Citation section
Citing data I’ve reused
1. In the article text
2. In the References section
Clinical researchers support sharing, but…
Rathi V, Dzara K, Gross CP, Hrynaszkiewicz I, Joffe S, Krumholz HM, Strait KM, Ross JS: Sharing of clinical trial data among trialists: a cross sectional survey. BMJ 2012;345:e7570
• Sharing de-identified data via repositories should be required (236 respondents, 74%)
• Investigators should share de-identified data on request (229 respondents, 72%)
…clinical data producers have specific concerns
Rathi V, Dzara K, Gross CP, Hrynaszkiewicz I, Joffe S, Krumholz HM, Strait KM, Ross JS: Sharing of clinical trial data among trialists: a cross sectional survey. BMJ 2012;345:e7570
Example initiatives for sharing clinical data
Yale Open Data Access (YODA) & Clinical Study Data Request (CSDR) projects:
• Data Use Agreements (DUAs)
• Controlled access environment
• Scientific validity of reanalysis checked
• Independent governance
• Data anonymisation checks
http://yoda.yale.edu/
https://www.clinicalstudydatarequest.com/
Clinical data publication at Scientific Data
• Identify repositories able to archive clinical data
• Work with identified repositories to establish workflows for peer review and publication, whilst maintaining patient privacy
• Facilitate specialist peer review process for clinical data, for example by ensuring peer reviewers have agreed to the terms of the data use agreement
Hrynaszkiewicz, I., Khodiyar, V., Hufton, A. & Sansone, S. A. Publishing descriptions of non-public clinical datasets: guidance for researchers, repositories, editors and funding organisations. BioRxiv http://dx.doi.org/10.1101/021667 (2015).
A robust data-on-request workflow?
Published Data Descriptor with clinical data
Data Records section details how to access the data
Links to restricted access data
Data Citations link to repository
Data files requiring permission to access
Freely accessible data files
Data Reuse stories
Data reuse by (some of) the same researchers
Data reuse by other researchers in the same field
“The Data Descriptor made it easier to use the data, for me it was critical that everything was there…all the technical details like voxel size.”
Professor Daniele Marinazzo
According to Google Scholar, cited 43 times! (February 2016)
Data reuse and citation by researchers
www.bbc.co.uk/news/science-environment-33057402
Data reuse by the non-research community
http://www.nytimes.com/interactive/2014/12/30/science/history-of-ebola-in-24-outbreaks.html
Data Descriptors…
• …enable you to gain scholarly credit for your data gathering efforts.
• …are human AND machine readable.
• …can be published with, or independently of, an analysis article.
• …can be published at any point in the research workflow.
• …allow the publication and discovery of clinical data, whilst maintaining your patients' privacy.
• …result in greater reuse and citation by fellow members of your research community.
• …extend the impact of your research data by enabling access to and reuse by the non-research community.
Get more from your data
Preserve it
Encourage reuse
Get credit for it
Visit nature.com/sdata
Email [email protected]
Tweet @ScientificData #SDJPN16