what have scientists planned for data sharing and reuse? a content analysi…
DESCRIPTION
Presentation at the Research Data Access and Preservation Summit (RDAP2013) - Baltimore, MD, 4-5 April 2013.
TRANSCRIPT
What have Scientists Planned for Data Sharing and Reuse?
A Content Analysis of NSF Awardees’ Data Management Plans
Renata Curty, Youngseek Kim & Dr. Jian Qin
Baltimore, 4-5 April 2013
Motivation
While the NSF mandate gives researchers plenty of flexibility to define their own DMP, and many academic institutions provide DMP writing support, little is known about how scientists address their strategies in their DMPs.
Study Design
Online Survey: 20 questions
Target Population: NSF Awardees from January 18, 2011 to November 5, 2012 - Standard Grants - Total 16065
Random Sample: 1606 cases
Pilot Study: 100 Awardees (Survey Reformulation)
Final Deployment: 966 awardees, 169 responses (17.5%) and DMPs (68)
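The sampling figures above can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers on the slide; the variable names and the `pct` helper are illustrative and not part of the original study.

```python
# Figures copied from the Study Design slide.
population = 16065     # NSF standard-grant awardees, 18 Jan 2011 to 5 Nov 2012
random_sample = 1606   # cases drawn at random
deployed = 966         # awardees in the final deployment
responses = 169        # completed surveys


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


sampling_fraction = pct(random_sample, population)  # share of the population sampled
response_rate = pct(responses, deployed)            # matches the 17.5% on the slide

print(f"Sampling fraction: {sampling_fraction}%")
print(f"Response rate: {response_rate}%")
```

The random sample is roughly one tenth of the target population, and the response rate reproduces the 17.5% reported for the final deployment.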
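Awards Info
[Charts: NSF Directorate and Amount Awarded; 166 and 166 printed on the slide]
Directorates: BIO, CISE, EHR, ENG, GEO, MPS, SBE
Chart shares as extracted (pairing with directorates not recoverable): 10%, 16%, 12%, 18%, 16%, 15%, 13%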
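Awardees Info: Age and Organization Type
[Charts; 150 and 151 printed on the slide]
Age groups: 25-34, 35-44, 45-54, 55-64, 65+
Age shares as extracted, in slide order: 7%, 41%, 26%, 19%, 7%
Organization Type: Academia, 93%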
Awardees Info: Position in Academia
[Charts; 143 and 138 printed on the slide]
Assistant Professor 22%
Associate Professor 28%
Full Professor 40%
Researcher 6.77%
Others: Dean (3), Professor Emeritus (1), Professor of Practice (1), Lecturer/Instructor (1), Post-Doctoral Fellow (1), Emeritus Senior Scientist, Director, Expert Consultant, Administrative Faculty Position, Chair.
Tenured 62%
On Tenure Track 25%
Non-Tenure Track 11%
Retired 2%
Geographical Distribution
[Map created with Google Fusion Tables; 109 printed on the slide]
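[Stacked bar chart: agreement with three statements about DMPs on a seven-point scale]
Scale: Strongly disagree, Disagree, Somewhat disagree, Neither agree nor disagree, Somewhat agree, Agree, Strongly agree
Statements and summary statistics (pairing follows the extraction order; the two symbols after N were lost in extraction and are most likely the mean and standard deviation):
DMP is important to formalize data sharing practices in science: N=166, 4.93, 1.62
Writing a DMP for NSF proposal is a challenging task: N=167, 3.89, 1.45
DMP is difficult to execute: N=167, 3.79, 1.51
Percentages as extracted, three values per scale point (one per statement):
Strongly disagree: 4.79%, 0.40%, 3.01%
Disagree: 22.75%, 21.56%, 10.24%
Somewhat disagree: 11.38%, 13.77%, 6.63%
Neither agree nor disagree: 25.75%, 25.75%, 10.84%
Somewhat agree: 23.35%, 23.35%, 22.89%
Agree: 8.98%, 10.18%, 33.13%
Strongly agree: 2.99%, 2.99%, 13.25%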
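A summary value such as 4.93 can be related to the chart percentages by treating the seven scale points as scores 1 to 7 and taking a weighted mean. The sketch below assumes that scoring, and assumes that the third percentage printed for each scale point belongs to the statement "DMP is important to formalize data sharing practices in science"; neither assumption is stated on the slide.

```python
# Seven-point agreement scale, scored 1 (Strongly disagree) to 7 (Strongly agree).
# The 1-7 scoring is an assumption; the slide does not state it.
scale = [
    "Strongly disagree", "Disagree", "Somewhat disagree",
    "Neither agree nor disagree", "Somewhat agree", "Agree", "Strongly agree",
]

# Percentages read from the chart labels (third value at each scale point).
# Attributing them to the "DMP is important" statement is an assumption.
shares = [3.01, 10.24, 6.63, 10.84, 22.89, 33.13, 13.25]


def likert_mean(percentages):
    """Weighted mean of scores 1..n, using the percentages as weights."""
    total = sum(percentages)
    weighted = sum(score * share for score, share in enumerate(percentages, start=1))
    return weighted / total


mean_important = likert_mean(shares)
print(round(mean_important, 2))
```

Under these assumptions the weighted mean rounds to the 4.93 printed next to N=166, which supports reading that figure as the mean response.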
Types of Data
3D Models 13.01% - 19
Audio Files 12.33% - 18
Curriculum Materials 21.23% - 31
Data Models 27.40% - 40
Field Notes 26.03% - 38
Experimental Data 63.70% - 93
Images 36.99% - 54
Interview Transcripts 17.12% - 25
Patient Records 0.68% - 1
Samples 20.55% - 30
Software 35.62% - 52
Spreadsheets 40.41% - 59
Video Files 21.23% - 31
Others: Computational Models, Surveys, DNA Sequences, Computer Codes, Crowdsourcing Data (Reviews)
Documentation of Data
Will follow:
46% - Disciplinary practices
37% - Research project’s needs
17% - Institutional recommendations/guidelines
158
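Because respondents could select several data types, the percentages do not sum to 100, and the slide does not state the denominator. As a rough check, the sketch below searches for a respondent count that reproduces every reported percentage from its count; the search range of 100 to 200 is an arbitrary assumption.

```python
# (count, reported percentage) pairs copied from the Types of Data slide.
reported = {
    "3D Models": (19, 13.01),
    "Audio Files": (18, 12.33),
    "Curriculum Materials": (31, 21.23),
    "Data Models": (40, 27.40),
    "Field Notes": (38, 26.03),
    "Experimental Data": (93, 63.70),
    "Images": (54, 36.99),
    "Interview Transcripts": (25, 17.12),
    "Patient Records": (1, 0.68),
    "Samples": (30, 20.55),
    "Software": (52, 35.62),
    "Spreadsheets": (59, 40.41),
    "Video Files": (31, 21.23),
}


def reproduces_all(n):
    """True if every count divided by n matches its reported percentage to two decimals."""
    return all(abs(100 * count / n - pct) <= 0.005 for count, pct in reported.values())


# Candidate denominators; the 100-200 range is an assumption for illustration.
candidates = [n for n in range(100, 201) if reproduces_all(n)]
print(candidates)
```

Within that range only 146 is consistent with all thirteen rows, which suggests the percentages were computed over 146 respondents to this question.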
Challenges Encountered
None 26%
Lack of guidance from my institution 29%
Lack of guidance from NSF 36%
Appropriate infrastructure to archive/preserve data 41%
Level of granularity of data 25%
Data Description & Documentation 30%
Which stage(s) of research to share the data 25%
Others:
Some projects do not generate data
Conflict between DMP requirement and IRB requirements regarding social and behavioral research data
Conflicts between intellectual property and data protection
Long-term preservation issues
Conflicts between individual/group and institutional strategies
169
Data Access & Availability
167
Open 45%
Available with some restrictions 51%
Restricted 5%
By email request 45.52% - 61
Personal website 17.91% - 24
Research Group/Project Website 51.49% - 69
Institutional Repository 20.15% - 27
Disciplinary Repository 32.84% - 44
Others: “Publications”, “Available to NSF only”
164
Barriers for Data Reuse
Reuse Issues - Privacy, Anonymity & Confidentiality
“IRB restrictions on ability to share even deidentified data. Concern that sharing even deidentified data will discourage participation in the study.”
“For myself, no. But for others to use my data, yes: for qualitative data, under IRB requirements for the protection of human subjects around confidentiality and anonymity, DMPs are nearly impossible to implement without perhaps some kind of temporal restriction on them (like, ‘This archive can only be opened in 20 -30 - 40 years’ or something like that)”
“The project involves human subject; so protections have to be put in place that may limit reuse applications in the future.”
“HIPAA [Health Insurance Portability and Accountability Act] issues - obtaining self reporting data on human subjects.”
Reuse Issues - Context, Time Factor & Documentation
“My past data was collected on a unique system built specifically for the research project. Need lots of context to reuse the data.”
“The only problems I see is that data can be taken out of context in a way that produces results that might not be correct.”
“Data is specific to testing scenarios. The insight gleaned from our experimental data is of more importance than the data itself.”
“My data is for specific purposes and it is hard to conceive of how someone would use it for something else/different. Even with a significant amount of metadata it would be difficult for someone to know all the circumstances under which the data was collected and why it was collected.”
“All scientific data is collected in particular context. Mechanisms that facilitate the description of that context are lacking. The creation of metadata that provides this information is a cumbersome, boring task and there are few resources available to ease the burden.”
Reuse Issues - Format, Tools, Infrastructure Interoperability & Standards
“Systems are always changing...It would be best if we could upload data to NSF so that it will be publicly available in the same way NIST [National Institute of Standards and Technology] publishes data.”
“Our raw data formats are extremely large, and need to be compressed into reduced, on-line archives for sharing. It is not possible for me as an individual PI to archive the raw data for others to examine.”
“My data is generally related to large software artifacts, so using it could involve quite a bit of work to get those artifacts running. This is something that I explicitly try to come up with solutions for in my DMPs.”
“Until NSF provides a free national repository for data archiving, we will not make progress in this area. If such an archive was available, it would be sensible to require researchers to place data there at the end of a grant and would allow other researchers to take advantage of it in a practical way.”
DMPs – Preliminary Content Analysis
• Coding Scheme
Used both deductive and inductive approaches (35 codes)
Deductive: codes drawn from the NSF DMP Policy and University of Virginia's Guideline
Inductive: codes emerged from DMP statements
• Data Analysis Procedure
A total of 766 utterances were identified
642 unique utterances
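The tallying step behind these counts can be sketched as follows. The coded utterances below are invented for illustration, and the reading of "unique" as "an utterance counted once even when it carries more than one code" is an assumption; the slide does not define the term.

```python
from collections import Counter

# Hypothetical coded utterances: (DMP id, code, utterance text).
# None of these rows come from the study's data.
coded_utterances = [
    (1, "Which Repository", "Data will be deposited in a disciplinary repository."),
    (1, "When Available", "Data will be released after publication."),
    (2, "Data Format", "Survey results will be stored as CSV files on the project website."),
    (2, "How Available", "Survey results will be stored as CSV files on the project website."),
    (3, "Which Repository", "Files will be kept in the institutional repository."),
]

# Frequency of each code across all DMPs (the "Freq." column in the coding tables).
code_frequency = Counter(code for _, code, _ in coded_utterances)

# Total code assignments versus distinct utterances (assumed meaning of "unique").
total_assignments = len(coded_utterances)
unique_utterances = len({(dmp_id, text) for dmp_id, _, text in coded_utterances})

print(code_frequency.most_common())
print(total_assignments, unique_utterances)
```

In this toy example one utterance carries two codes, so five code assignments correspond to four distinct utterances.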
DMPs’ Content
<Wordle Cloud Generated Based on Numbers of Each Code across the 68 DMPs>
Coding Scheme
Types of Data
• What to Generate
• What Data Types
• How to Create
• Where to Get Existing Data
Metadata Standards
• Data Format
• Metadata Form
• How to Create
• Which Metadata Standard
• Contextual Details Needed
• Discoverability of the Data
Data Access & Sharing Process
• When Available
• How Available
• What Available
• Process for Gaining Access
• How Long Retain the Right
• Embargo Period
• Ethical/Privacy Issues
• Compliance with IRB Protocol
• Whose Intellectual Property
Data Reuse Plan
• Reusability of the Data
• Restrictions to Access
• Groups Interested In
• Foreseeable Uses/Users
Data Archiving Plan
• Strategy for Archiving Data
• Which Repository
• Procedures for Long-Term Storage
• Data Preservation Period
• What Data Preserved for Long-Term
• Transformation Required
• Data Documentation
• Related Information
Others
• Data Lifecycle
• Data Curation
• Budget
Types of Data
Codes | Freq. | Examples
What to Generate | 58 | Geochemical Data, Physical Samples, Mathematica (programming) Code, Course Materials
What Data Types | 37 | Gene Sequences, Experimental Data, Interview Transcript, Video Recordings
How to Create Data | 25 | Experimental Setup, Field Observation, Simulation, Survey, Interviews
Where to Get Existing Data | 13 | Moore Laboratory of Zoology, ArcView/GIS Inventories, Prior Study’s Database
Metadata Standard
Codes | Freq. | Examples
Data Format | 38 | CSV file, TEMPO data file, XML format, SPSS file, plain text
Metadata Form | 31 | ArcGIS Metadata file, XML-based standard file, GIS database file
How to Create Metadata | 14 | Use existing metadata standards, or develop their own metadata standards
Which Metadata Standard | 15 | Dublin Core, DNA Sequence Metadata, EML (Ecological Metadata Language)
Contextual Details Needed | 10 | All aspects of the development project documented, experimental procedure record
Data Discoverability | 7 | Searches Built into Library, Searchable through Project Website
Data Access & Sharing Process
Codes | Freq. | Examples
When Available | 28 | Post-Publication, Post-Project, After Data Collection
How Available | 37 | Upon Request, Project Website, GMOD CHADO databases, Institutional Repository
What Available | 33 | Original research data (genome assemblies), survey data, educational materials
Process for Gaining Access | 25 | Email Request, Material Transfer Agreement, Direct Access from Web or Repository
How Long Retain the Right | 18 | Withhold until Publication, Years after Project Ends, Years after Data Production
Embargo Period | 5 | Years after data collection, Period for commercialization
Ethical/Privacy Issues | 21 | Private information is not available to the public
Compliance with IRB Protocol | 13 | IRB application submission for human subject research
Whose Intellectual Property | 17 | Property of the PI and Co-PIs, Institutions, Open-Access
Data Archiving
Codes | Freq. | Examples
Strategy for Archiving Data | 31 | Hosted on the Web Servers at (university), ICPSR, disciplinary data repository
Which Repository | 55 | Organization website, institutional or discipline data repository
Procedures for Long-Term Storage | 33 | Submitted to databanks including NCBI GEO, Genbank, DataONE, Dryad
Data Preservation Period | 11 | Minimum of five years post-grant funding, Long-term preservation through disciplinary data repositories
What Data Preserved for Long-Term | 7 | All data and materials generated by this award, Genome Sequencing Data
Transformation Required | 4 | Keeping raw image data in its uncompressed form, transferred to IRI format
Data Documentation Submitted | 11 | Contextual details about experimental procedures, all aspects of the development project
Related Information Submitted | 3 | Metadata files, proposed study information, companion web page
Data Reuse Plan
Codes | Freq. | Examples
Reusability of the Data | 6 | Descriptions about reusable methods (Used by a research community to follow-up)
Restrictions to Access | 6 | Access allowed for a certain group of researchers
Groups Interested In | 8 | Wider research community studying the Great Lakes, academic geography organizations, and geography teacher associations
Foreseeable Uses/Users | 10 | Available to engineers, clinicians, and medical researchers, sociologists and psychologists working in relevant sub-fields.
Others
Codes | Freq. | Examples
Data Lifecycle | 1 | Application of the Life Cycle Inventory databases
Data Curation | 4 | Curation (Consortiums and Partnerships)
Budget | 9 | Institution will absorb costs, no incremental costs, marginal costs
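The code frequencies in the tables above can be totalled per category to compare how much coded content each area of the DMPs received. This is a minimal sketch; the frequencies are copied from the tables, while the dictionary layout and names are illustrative.

```python
# Code frequencies copied from the coding tables, grouped by category.
code_frequencies = {
    "Types of Data": [58, 37, 25, 13],
    "Metadata Standard": [38, 31, 14, 15, 10, 7],
    "Data Access & Sharing Process": [28, 37, 33, 25, 18, 5, 21, 13, 17],
    "Data Archiving": [31, 55, 33, 11, 7, 4, 11, 3],
    "Data Reuse Plan": [6, 6, 8, 10],
    "Others": [1, 4, 9],
}

# Sum the frequencies within each category.
category_totals = {category: sum(freqs) for category, freqs in code_frequencies.items()}

# List categories from most to least coded content.
for category, total in sorted(category_totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{category}: {total}")
```

Data Access & Sharing Process carries the most coded content, while Data Reuse Plan and Others carry the least.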
Data Available
[Bar chart: number of DMPs by when data become available; values listed in slide order]
After data collection 3
After project ends 3
After publication 10
Years after data collection 3
Years after project ends 8
Years after publication 1
Not Specified 27
Not Mentioned 13
Types of Data Repositories for Long-Term Archiving
[Bar chart: number of DMPs by repository type; values listed in slide order]
Disciplinary Repository 11
External/Commercial Storage 4
Institutional Repository 14
Internal/Institutional Storage 11
Journal Repository/Supplement 2
Lab/Organization Website 13
Not mentioned/Specified 13
Some insights – DMPs’ Preliminary Analysis
More informal/personal data sharing procedures rather than formal/institutionalized data sharing and management plans
Most DMPs lack content on “Metadata Standard” and “Data Reuse Plan”
Few have plans for long-term archiving; plans and ideas about long-term use of the data are very vague
Many DMPs addressed data archiving in institutional repositories that do not exist yet but are expected to be created
A few DMPs mentioned that interview transcripts will be available, but without addressing IRB issues
Future Directions
Survey a larger number of Awardees
More exhaustive coding analysis and in-depth exploration of the DMPs’ content
Analysis of DMPs to identify patterns, common challenges and best practices across and within different disciplinary communities