RDAP 15: The Role of Assessment in Research Data Services
TRANSCRIPT
Role of Assessment in RDS
Factors to Assess
• Data Management Plans
• Research Methodologies of Faculty
• Journal Data Policies and Faculty’s Compliance
Using assessment of NSF data management plans to enable evidence-based evolution of research data services
Amanda Whitmire, Jake Carlson, Patricia Hswe,
Susan Wells Parham, Lizzy Rolando & Brian Westra
@DMPResearch
Acknowledgements
Jake Carlson ─ University of Michigan Library
Patricia M. Hswe ─ Pennsylvania State University Libraries
Susan Wells Parham ─ Georgia Institute of Technology Library
Lizzy Rolando ─ Georgia Institute of Technology Library
Brian Westra ─ University of Oregon Libraries
This project was made possible in part by the Institute of Museum and Library Services, grant number LG-07-13-0328.
23 April 2015 3
Levels of data services
• The basics: DMP review, workshops, website
• Mid-level: dedicated “research services”, metadata support, facilitate deposit in DRs, consults
• High level: infrastructure, data curation
From: Reznik-Zellen, Rebecca C.; Adamick, Jessica; and McGinty, Stephen. (2012). "Tiers of Research Data Support Services." Journal of eScience Librarianship 1(1): Article 5. http://dx.doi.org/10.7191/jeslib.2012.1002
Informed data services development
• Surveys
• Data curation profiles
• Data management plans (DMPs)
DART Premise
[Diagram: DMP → researcher's Research Data Management (needs, practices, capabilities, knowledge) → Research Data Services]
“Of the 181 NSF DMPs that were analyzed, 39 (22%) identified Georgia Tech’s institutional repository, SMARTech.”
“We have a clear road ahead of us: we will target specific schools for outreach; develop consistent language about repository services for research data; and focus on the widespread dissemination of information about our new digital preservation strategy.”
We need a tool
Solution: An analytic rubric
Performance Criteria (rows): Thing 1, Thing 2, Thing 3
Performance Levels (columns): High, Medium, Low
Literature review on creating & using analytic rubrics
NSF-tangent & 3rd-party DMP guidance
NSF DMP guidance
NSF Directorate or Division
BIO Biological Sciences
  DBI Biological Infrastructure
  DEB Environmental Biology
  EF Emerging Frontiers Office
  IOS Integrative Organismal Systems
  MCB Molecular & Cellular Biosciences
CISE Computer & Information Science & Engineering
  ACI Advanced Cyberinfrastructure
  CCF Computing & Communication Foundations
  CNS Computer & Network Systems
  IIS Information & Intelligent Systems
EHR Education & Human Resources
  DGE Division of Graduate Education
  DRL Research on Learning in Formal & Informal Settings
  DUE Undergraduate Education
  HRD Human Resources Development
ENG Engineering
  CBET Chemical, Bioengineering, Environmental, & Transport Systems
  CMMI Civil, Mechanical & Manufacturing Innovation
  ECCS Electrical, Communications & Cyber Systems
  EEC Engineering Education & Centers
  EFRI Emerging Frontiers in Research & Innovation
  IIP Industrial Innovation & Partnerships
GEO Geosciences
  AGS Atmospheric & Geospace Sciences
  EAR Earth Sciences
  OCE Ocean Sciences
  PLR Polar Programs
MPS Mathematical & Physical Sciences
  AST Astronomical Sciences
  CHE Chemistry
  DMR Materials Research
  DMS Mathematical Sciences
  PHY Physics
SBE Social, Behavioral & Economic Sciences
  BCS Behavioral & Cognitive Sciences
  SES Social & Economic Sciences
[Asterisks on the slide mark the directorates and divisions that provide division-specific guidance.]
Consolidated guidance
Source: Guidance text
NSF guidelines: The standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies)
BIO: Describe the data that will be collected, and the data and metadata formats and standards used.
CISE: The DMP should cover the following, as appropriate for the project: ...other types of information that would be maintained and shared regarding data, e.g. the means by which it was generated, detailed analytical and procedural information required to reproduce experimental results, and other metadata
ENG: Data formats and dissemination. The DMP should describe the specific data formats, media, and dissemination approaches that will be used to make data available to others, including any metadata
GEO AGS: Data Format: Describe the format in which the data or products are stored (e.g. hardcopy logs and/or instrument outputs, ASCII, XML files, HDF5, CDF, etc).
Advisory Board
Project team testing & revisions
Feedback & iteration
Rubric
Performance Level: Complete / detailed; Addressed issue, but incomplete; Did not address issue

General Assessment Criteria

Performance Criteria: Describes what types of data will be captured, created or collected
  Complete / detailed: Clearly defines data type(s). E.g. text, spreadsheets, images, 3D models, software, audio files, video files, reports, surveys, patient records, samples, final or intermediate numerical results from theoretical calculations, etc. Also defines data as: observational, experimental, simulation, model output or assimilation
  Addressed issue, but incomplete: Some details about data types are included, but DMP is missing details or wouldn’t be well understood by someone outside of the project
  Did not address issue: No details included, fails to adequately describe data types.
  Directorates: All

Directorate- or division-specific assessment criteria

Performance Criteria: Describes how data will be collected, captured, or created (whether new observations, results from models, reuse of other data, etc.)
  Complete / detailed: Clearly defines how data will be captured or created, including methods, instruments, software, or infrastructure where relevant.
  Addressed issue, but incomplete: Missing some details regarding how some of the data will be produced, makes assumptions about reviewer knowledge of methods or practices.
  Did not address issue: Does not clearly address how data will be captured or created.
  Directorates: GEO AGS, GEO EAR SGP, MPS AST

Performance Criteria: Identifies how much data (volume) will be produced
  Complete / detailed: Amount of expected data (MB, GB, TB, etc.) is clearly specified.
  Addressed issue, but incomplete: Amount of expected data (GB, TB, etc.) is vaguely specified.
  Did not address issue: Amount of expected data (GB, TB, etc.) is NOT specified.
  Directorates: GEO EAR SGP, GEO AGS
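One way to work with a rubric like this is to record each DMP's rating per criterion and then tally the performance levels. The following is a minimal Python sketch, not the project's actual tooling: the criterion keys, the `tally_levels` helper, and the sample ratings are all hypothetical.

```python
from collections import Counter

# The three performance levels named in the rubric.
LEVELS = (
    "Complete / detailed",
    "Addressed issue, but incomplete",
    "Did not address issue",
)

def tally_levels(ratings_by_dmp):
    """Count how many DMPs received each performance level, per criterion.

    ratings_by_dmp maps a DMP identifier to {criterion: level}. A criterion
    that does not apply to a DMP's directorate is simply left out.
    """
    tallies = {}
    for ratings in ratings_by_dmp.values():
        for criterion, level in ratings.items():
            if level not in LEVELS:
                raise ValueError(f"unknown performance level: {level!r}")
            tallies.setdefault(criterion, Counter())[level] += 1
    return tallies

# Hypothetical ratings for three DMPs (illustrative only, not project data).
sample = {
    "dmp-001": {"data_types": LEVELS[0], "data_volume": LEVELS[2]},
    "dmp-002": {"data_types": LEVELS[1]},
    "dmp-003": {"data_types": LEVELS[0], "data_volume": LEVELS[1]},
}

tallies = tally_levels(sample)
for criterion, counts in tallies.items():
    # One count per level, in rubric order.
    print(criterion, [counts[level] for level in LEVELS])
```

Leaving a directorate-specific criterion out of a DMP's ratings, rather than scoring it as "Did not address issue", keeps the per-criterion totals honest when a criterion applies to only some directorates.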
“Mini-reviews 1 & 2”
Number of DMPs at each performance level (Complete / detailed; Addressed issue, but incomplete; Did not address the issue):
Describes what types of data will be captured, created or collected: 18 / 3 / 4
Identifies metadata standards or formats that will be used for the proposed project: 4 / 4 / 17
Describes data formats created or used during project: 14 / 4 / 7
Provides details on when the data will be made publicly available: 8 / 6 / 11
Describes how the data will be made publicly available: 22 / 2 / 1
Describes security measures that will be in place to protect the data from unauthorized access: 8 / 1 / 16
Describes the policies or provisions in place governing the use and reuse of the data: 5 / 10 / 10
Describes the policies or provisions for redistribution of the data: 4 / 7 / 14
Describes policies or provisions for building off of the data, such as through the creation of derivatives: 3 / 7 / 15
Indicates whether or not the data will be archived: 17 / 5 / 3
Describes plans for archiving and preserving digital data*: 12 / 6 / 4
Plan discusses the types or formats of data the investigator expects to retain in their possession*: 1 / 2 / 7
Data sharing methods (count per method; chart axis 0 to 12)
Did not specify: 0
Institutional repository: 4
Journal / supplement: 10
National data center: 3
Other data repository or method: 7
Book: 1
Personal website: 8
On request: 9
ETD: 1
Conference / proceedings: 3
Not planning to share data: 0
To sum up…
http://bit.ly/dmpresearch
@DMPResearch
Developing a rubric to empower academic librarians in providing research data support
Understanding Methodological and Disciplinary Differences in the Data Practices of Academic Researchers*
Travis Weller and Amalia Monroe-Gulick
(With special thanks to Brian Rosenblum & Julie Waters for their invaluable work on the survey that generated the data for this project)
*Weller, Travis, and Amalia Monroe-Gulick. "Understanding Methodological and Disciplinary Differences in the Data Practices of Academic Researchers." Library Hi Tech 32, no. 3 (2014): 467-82.
Introduction
Methods
• Survey Instrument
• Distribution
• Response Rate
• Limitations
Results
A Diversity of Methodologies
[Bar chart, axis 0 to 300. Categories: Quantitative, Qualitative, Statistical, Experimental, Historical, Case study, Archival, Comparative…, Survey research, Textual analysis, Field work, Modeling, Data mining, Data…, Meta-analysis, Simulation, Ethnography, Oral history, Geospatial, Other, Feasibility study]
Multiple Methodologies
[Bar chart: Number of Respondents (axis 0 to 100) by Number of Methods (1 to 15)]
Conclusion #1
Researchers are multiple methodologists.
Implication #1
Technical data solutions must be flexible to be useful, even for a single researcher
Future Needs
We asked: in what areas do you anticipate needing assistance in the future?
• Writing Data Plans
• Digitizing Resources
• Data Analysis
• Data Storage/Archiving/Preservation
• Dissemination and Publication
For faculty, the greatest need was in storage/archiving/preservation.
For graduate students, it was in analysis and in dissemination and publication.
But there was also variation between research methods.
Future Needs (by methodology)
[Bar chart, axis 0% to 100%. Groups: Overall, Quantitative, Qualitative, Statistical, Experimental, Historical. Legend: Writing Data Management Plan; Digitization of Resources; Data Analysis; Data Storage, Archiving and Preservation]
Future Needs
Areas of Greatest Anticipated Need
• Statistical, quantitative and experimental researchers: data analysis
• Qualitative researchers: dissemination and publication (analysis close behind)
• Historical researchers: digitization and storage/archiving/preservation
Conclusion #2
Methods matter for data services.
Implication #2
Including methods in assessments may be useful.
Implication #3
Tailor services and/or outreach to be method-specific.
Methods tailored research services – Examples
Center for Research Methods & Data Analysis
Work Group on Qualitative Research
Assessing researcher compliance with publisher requirements for data sharing
Kathleen Fear
@kmfear
RDAP 2015
The team
• Judi Briden, Outreach Librarian for Brain & Cognitive Sciences, Linguistics, Public Health, American Sign Language
• Sue Cardinal, Outreach Librarian for Chemistry
• Diane Cass, Outreach Librarian for Biology, Mathematics & Statistics
• Tyler Dzuba, Head of the Physics, Optics and Astronomy Library
• Fang Wan, Outreach Librarian for Computer Science
Are our researchers sharing their data when they need to? (probably not)
What can we do to help?
Exploring one specific instance of sharing: publication-related data
• What journals do our authors publish in most commonly?
• Of those journals, which require or encourage data sharing?
• And where authors are subject to data sharing policies, how well are they complying?
Publication data
• All 2014 publications in Web of Science with affiliation = University of Rochester
• 2784 articles in 1181 journals
[Bar chart: # of journals (axis 0 to 800) by articles published per journal (1, 2-4, 5+)]
Journal review
• Focus: 109 journals with 5+ articles
• Process:
  • Review journal’s author guidelines, editorial policies and other documentation
  • Is there a data sharing policy?
  • How comprehensive is the data sharing policy?
What’s a data sharing policy?
• For our purposes, a journal has a data sharing policy if it requires or explicitly encourages data sharing.
• PLOS One: “PLOS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception”
• Journal of Neurophysiology: "All authors of articles submitted to APS journals should submit their relevant data to all appropriate data repositories"
• Clinical Toxicology: “The journal offers authors the possibility of publishing supplementary data online.”
What’s a comprehensive data sharing policy?
• A comprehensive data sharing policy specifies:
  • How to share the data
  • How to cite the data or otherwise indicate the data’s availability
  • When the data should be accessible to others
• Rating scale: Excellent = 3/3 criteria covered; Good = 2/3; Fair = 1/3; Poor = 0/3
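The rating scale above is a count of how many of the three criteria a policy covers. A minimal Python sketch of that mapping follows; the function and parameter names are illustrative, not taken from the study.

```python
def rate_policy(how_to_share: bool, how_to_cite: bool, when_accessible: bool) -> str:
    """Map the number of criteria a journal policy covers to a rating."""
    covered = sum([how_to_share, how_to_cite, when_accessible])
    return {3: "Excellent", 2: "Good", 1: "Fair", 0: "Poor"}[covered]

# Hypothetical policy: says how and when to share, but not how to cite.
print(rate_policy(how_to_share=True, how_to_cite=False, when_accessible=True))
```

Recording the three criteria separately, instead of only the final rating, also makes it easy to see which element is most often missing.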
Data policy findings
• About 39% had a policy of some kind
• 91% of those with a policy had an Excellent or Good policy
• Guidelines around deposit of genomic / proteomic data are well-established; about a third of Excellent or Good policies address only these data types
• The most common missing piece was how to cite the data or otherwise indicate that they are available:
[Bar chart, axis 0% to 20%. Categories: How to share, Time, How to cite]
Fair policies:
• FASEB Journal
• Monthly Notices of the Royal Astronomical Society
• Molecular Endocrinology
• Earth and Planetary Science Letters
Once an article has been published in The FASEB Journal, the editors and editorial board strongly encourage authors to archive the original data sets in publicly available and permanent repositories whenever possible and appropriate. The journal makes no recommendation or suggestion as to which repositories are most appropriate, but encourages authors to identify those which best meet their needs as well as the needs of those who will be accessing the data. (FASEB)
The journal list:
http://bit.ly/data_policies
(Feel free to check our work and add/edit/change!)
Article review
Journal title                                        Number of articles   Policy rating
PLOS One                                             57                   Excellent
Biophysical Journal                                  19                   Excellent
PNAS                                                 17                   Excellent
Monthly Notices of the Royal Astronomical Society    10                   Fair
Journal of Physical Chemistry Letters                9                    Good
Journal of Clinical Investigation                    9                    Good
Journal of Neurophysiology                           8                    Excellent
Nucleic Acids Research                               6                    Excellent
PLOS Genetics                                        6                    Excellent
Earth and Planetary Science Letters                  5                    Fair
Nature                                               5                    Excellent
Nature Communications                                5                    Excellent
FASEB Journal                                        5                    Fair
Article review
• Focus: 161 articles across 13 journals with varying levels of data policies
• Process:
  • Skim article
  • Are the data shared? How well are they shared?
• We want to encourage researchers to share their data promptly in an easily accessible and usable form
Article rating
Rating 2:
• Link to where data are or will be accessible
  "The full tree including bootstrap confidence at the nodes is deposited in the Dryad Digital Repository (http://dx.doi.org/10.5061/dryad.3sh6d)"
• Data are included in paper or suppl. information in usable format
  "The authors confirm that all data underlying the findings are fully available without restriction. All relevant data are within the paper and its Supporting Information files."
• Justification for why data are not accessible
  "Deidentified datasets are available upon request to researchers who have obtained IRB approval to conduct secondary analyses on them, since participants in our studies did not consent to publicly posting their data."
Rating 1:
• Indication that data are or will be available, without link
  "All raw and processed data files have been deposited in the National Center for Biotechnology Information Gene Expression Omnibus dataset."
• Instructions to contact author, with no justification
  “Data are available on request”
• “All relevant data are in the paper” but not in a usable format
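The rating categories above can be sketched as a small lookup in Python. This is an illustration under two assumptions that the slide does not state: an article showing none of the listed kinds of evidence is rated 0, and an article showing several kinds takes the highest applicable rating. The `Evidence` names and the `rate_article` helper are hypothetical.

```python
from enum import Enum

class Evidence(Enum):
    """Kinds of data-availability evidence listed on the slide."""
    LINK_TO_DATA = "link to where data are or will be accessible"
    DATA_IN_PAPER_USABLE = "data in paper or suppl. information, usable format"
    JUSTIFIED_RESTRICTION = "justification for why data are not accessible"
    AVAILABLE_NO_LINK = "indication that data are available, without link"
    CONTACT_AUTHOR = "instructions to contact author, no justification"
    DATA_IN_PAPER_UNUSABLE = "data said to be in paper, not in usable format"

SCORES = {
    Evidence.LINK_TO_DATA: 2,
    Evidence.DATA_IN_PAPER_USABLE: 2,
    Evidence.JUSTIFIED_RESTRICTION: 2,
    Evidence.AVAILABLE_NO_LINK: 1,
    Evidence.CONTACT_AUTHOR: 1,
    Evidence.DATA_IN_PAPER_UNUSABLE: 1,
}

def rate_article(evidence_found):
    """Return 0, 1 or 2 for an article.

    Assumptions (not from the slide): no evidence means 0, and the
    highest-scoring piece of evidence decides the rating.
    """
    return max((SCORES[e] for e in evidence_found), default=0)

print(rate_article([Evidence.CONTACT_AUTHOR]))
print(rate_article([Evidence.AVAILABLE_NO_LINK, Evidence.LINK_TO_DATA]))
print(rate_article([]))
```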
Article findings
• 50% of articles rated a 0; 30% rated a 2
• No journal got all 0’s; 1 journal got all 2’s
[Chart: percentage of articles (axis 0% to 100%) by article rating (0, 1, 2)]
Article review
Policy rating   0          1          2
Excellent       57 (54%)   19 (18%)   30 (28%)
Good            8 (53%)    5 (33%)    2 (13%)
Fair            2 (14%)    3 (21%)    9 (64%)
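The percentages in this table are row percentages, so each can be re-derived from the counts. A short Python sketch follows; the variable and function names are illustrative. Summing the columns also reproduces the overall shares quoted under "Article findings" (about 50% of articles rated 0 and about 30% rated 2).

```python
# Counts of articles by journal policy rating (rows) and article rating 0/1/2 (columns).
counts = {
    "Excellent": (57, 19, 30),
    "Good": (8, 5, 2),
    "Fair": (2, 3, 9),
}

def row_percentages(row):
    """Convert a row of counts to whole-number percentages of the row total."""
    total = sum(row)
    return tuple(round(100 * n / total) for n in row)

by_policy = {rating: row_percentages(row) for rating, row in counts.items()}
column_totals = tuple(sum(col) for col in zip(*counts.values()))
overall = row_percentages(column_totals)

print(by_policy)      # row percentages per policy rating
print(column_totals)  # (67, 27, 41)
print(overall)        # (50, 20, 30)
```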
Our takeaways
• Many journals have decent guidance on data sharing (but that doesn’t mean authors follow it)
• We see room for improvement in data sharing
  • Non-sharers to sharers
  • Sharers to better sharers
Areas of focus for outreach
• Linking data and papers
• Monitoring publication output for papers w/o shared data and offering assistance
• Targeting presentations and discussions focusing on how to meet a particular journal’s guidelines
  • Keep tone in line with what the journal states
  • Identify important details in policies (e.g. Human Molecular Genetics: “the availability of results after publication will be considered in decisions regarding publication”)
• Where policies are poor / fair, focus on grad students and new faculty
Amanda Whitmire, Oregon State University
Travis Weller, University of Kansas
Amalia Monroe-Gulick, University of Kansas
Kathleen Fear, University of Rochester