Data Management … a “nuts-and-bolts” part of
Responsible Conduct of Research
March 21, 2015
Enid Karr, Sr. Bibliographer for Biology, Earth & Environmental Sciences, Environmental Studies
Sally Wyman, Collection Development Librarian, Sr. Bibliographer for Chemistry, Physics, Environmental Studies
Barbara Mento, Data/GIS Librarian, Sr. Bibliographer for Computer Science, Economics, Mathematics
First – Why?
Fits into “responsible conduct of research”
Reduces the risk of data loss for you and the University
Facilitates fulfillment of requests from others to see your data
Shared data (“open access”) is associated with a higher citation rate!
More (Really Good) Reasons:
Increasingly, grants require a “Data Management Plan”
NSF
NIH
All larger agencies, coming soon, per the White House Directive on Open Data -- Feb. 22, 2013
More scholarly journal policies (Nature, Science, PNAS, PLoS…) require that data be:
Clearly documented … available for sharing … detailed enough to permit replication of analysis
New “data journals” are starting to appear, including Nature’s Scientific Data, which publishes data sets
A “Typical” Data Management Plan
1-2 pages describing the project and how data will be:
Collected (including formats, size, etc.) … Secured … Analyzed … Shared … Preserved
Details about access/sharing
Potential audience(s) for the data
How access will be provided and how others will find it: “Access” (freely-available) vs. “Sharing” (by request)
Stipulations for privacy, confidentiality, IP or other rights
Allowed re-use of the data, derivative products
Metadata standards to be used
How long data will be retained -- archiving, long-term preservation and format migration
From the NSF FAQ on Data Management Plans:
“DMP” covers recorded factual material commonly accepted in the [specific] scientific community as necessary to validate research findings. May include, but is not limited to:
Data
Publications
Samples
Physical collections
Software and models
But not: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. (Office of Management and Budget (OMB) Circular A-110)
Boston College Libraries
Data Management Plan Research Guide
Guidance on content
Templates/examples
Additional resources
To arrange a consultation with a subject specialist
http://libguides.bc.edu/dataplan
Data Management in Action
Some “best practices” while collecting or generating your data
Storage
Documentation
Loss Prevention
Security
Image: digitalart / FreeDigitalPhotos.net
Handling … Storing … and Backing Up Your Data
Data Storage Elements to Consider:
File Formats and Naming
Directory Structure
Version Control
Assign Responsibility
Document your practices
Think about all of this EARLY
File Formats
Whenever possible, save your data using open standards. Avoid proprietary formats. Some examples:
TXT, PDF/PDF Archival, not Word (doc, docx)
ASCII, not Excel (xls, xlsx)
MPEG-4, not Quicktime (qtff)
TIFF or JPEG2000, not GIF or JPG
XML or RDF, not RDBMS
Ideally, save files in both original format AND one of the preferred ones listed above.
Why Use Open File Formats?
No restrictions on their use
Open source code makes future migration easier
Proprietary formats are offered by companies that may go out of business, taking knowledge of the code with them
Facilitates sharing
Organization
File Naming Conventions/Best Practices
Consistent, descriptive, UNIQUE … avoid spaces and special characters
Use brief names
Can contain:
Project acronyms
Researchers’ initials
File type information
Version number
Date
File Status
IUS_v02_092011_final.csv
Internet Usage Study version 2, Sept 2011, final draft, in csv format
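The naming convention above can be automated so that every file in a project follows the same pattern. The following is a minimal sketch, not a prescribed tool: the function name build_file_name and its parameters are hypothetical, and the month-year date layout simply mirrors the sample name on the slide.

```python
import re
from datetime import date


def build_file_name(project, version, when, status, extension):
    """Join the name parts with underscores, e.g. IUS_v02_092011_final.csv."""
    parts = [project, f"v{version:02d}", when.strftime("%m%Y"), status]
    name = "_".join(parts) + "." + extension
    # Reject spaces and special characters, as the naming guidance advises.
    if not re.fullmatch(r"[A-Za-z0-9_.-]+", name):
        raise ValueError(f"unsafe characters in file name: {name!r}")
    return name


# Rebuilds the sample name from the slide.
print(build_file_name("IUS", 2, date(2011, 9, 1), "final", "csv"))
```

Generating names from one function keeps the order of the parts consistent across a research group; the exact parts and their order are a local choice.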
Organization
Directory Structure
Use folders!
Possible ways to organize:
By types of data
IR, NMR, etc.
By experiment
By collection method
Choose option that works best for your research group … it should be understandable to others
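One way to keep a directory structure consistent is to create the folders from a script rather than by hand. The sketch below assumes a layout organized by experiment and then by data type; the function name create_project_tree and the folder names are illustrative only.

```python
import tempfile
from pathlib import Path


def create_project_tree(root, experiments, data_types):
    """Create root/<experiment>/<data_type>/ folders and return their paths."""
    created = []
    for experiment in experiments:
        for data_type in data_types:
            folder = Path(root) / experiment / data_type
            # exist_ok=True makes the script safe to run more than once.
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created


if __name__ == "__main__":
    # Demonstrated in a temporary directory so nothing is left behind.
    with tempfile.TemporaryDirectory() as scratch:
        for folder in create_project_tree(scratch, ["experiment_01"], ["IR", "NMR"]):
            print(folder.relative_to(scratch))
```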
Version Control
Keep an archival (unmodified) version, and updated versions (clearly labelled)
Use whole numbers (1, 2, 3) for major changes and decimals for minor changes (V1.1, V1.2 …)
Version control software can help, and some software has this built-in… especially instrument software
Data Entry and Quality Control
Whatever you use, be consistent
Define abbreviations in readme.txt file or in a “codebook”
Record dates for best sorting (YYYYMMDD)
Check periodically for data corruption/integrity, for example with a checksum
Flag problematic data
Handling of null values: problematic in moving across software platforms
Consider using blanks: treated as null values by R, Python, Excel
Don’t use text (as in, “no data”) in a data column formatted for numbers
Avoid manual data entry whenever possible
Consider making your raw data files “read only”
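The checksum advice above can be sketched with Python's standard hashlib module. This is one possible approach under stated assumptions: the manifest file name checksums.sha256 and the helper names are hypothetical, and SHA-256 is chosen here only as a widely available algorithm.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(folder, manifest_name="checksums.sha256"):
    """Record a checksum for every file under folder (except the manifest)."""
    folder = Path(folder)
    lines = []
    for path in sorted(folder.rglob("*")):
        if path.is_file() and path.name != manifest_name:
            relative = path.relative_to(folder).as_posix()
            lines.append(f"{sha256_of(path)}  {relative}")
    (folder / manifest_name).write_text("\n".join(lines) + "\n", encoding="utf-8")


def find_changed_files(folder, manifest_name="checksums.sha256"):
    """Return files that are missing or no longer match the manifest."""
    folder = Path(folder)
    changed = []
    manifest = (folder / manifest_name).read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. from an empty manifest
        expected, relative = line.split("  ", 1)
        target = folder / relative
        if not target.is_file() or sha256_of(target) != expected:
            changed.append(relative)
    return changed
```

Running write_manifest once when the raw data is finalized, and find_changed_files on a schedule afterwards, gives a simple periodic integrity check.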
Data Documentation (“Metadata”)
What is metadata?
Benefits of good documentation
What elements should be documented?
For help, contact your subject specialist:
www.bc.edu/libraries/help/askalib.html
ISO suggested Minimum Data Elements
Title
Creator (Principal Investigators)
Date Created (also versions)
Instrument and model
Format (and software required)
Subject
Unique Identifier
Description of the specific data resource
Coverage of the data (spatial or temporal)
Publishing Organization
Type of Resource
Rights
Funding or Grant
Why Metadata?
It helps others discover your research when you share your data.
This “data about your data” captures the most critical information about a particular project. Capture it early on… you think you will remember, but …
Metadata may be required for journal publication/data deposit.
Metadata Standards
These vary …
by discipline
by type of data
by repository
for example: GenBank
We can help.
Sample GenBank Record – example of a standard
Data Documentation – What do you do with it once you have it?
Record it in a readme.txt file
In some fields, “codebooks” are used to record methodology and other data management notes (e.g. IRB compliance statements, etc.)
Consider including a “data dictionary”
When deposited along with the data, these files facilitate “discovery” of your data on the Web
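A readme.txt can be generated from a small template so that no element is forgotten. The sketch below assumes the thirteen minimum data elements listed earlier in this presentation; the function name render_readme and the "Element: value" layout are illustrative choices, not a required standard.

```python
# Element names taken from the minimum data elements list in this presentation.
MINIMUM_ELEMENTS = [
    "Title",
    "Creator",
    "Date Created",
    "Instrument and model",
    "Format",
    "Subject",
    "Unique Identifier",
    "Description",
    "Coverage",
    "Publishing Organization",
    "Type of Resource",
    "Rights",
    "Funding or Grant",
]


def render_readme(metadata):
    """Return readme text with one 'Element: value' line per element."""
    missing = [name for name in MINIMUM_ELEMENTS if not metadata.get(name)]
    if missing:
        # Fail loudly so gaps are filled in early, not at deposit time.
        raise ValueError("missing metadata elements: " + ", ".join(missing))
    lines = [f"{name}: {metadata[name]}" for name in MINIMUM_ELEMENTS]
    return "\n".join(lines) + "\n"
```

A call such as render_readme(record), where record is a dictionary holding a value for each element, returns text that can be saved as readme.txt beside the data files.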
Data Loss Prevention
Regular back-ups protect against data loss
Backup strategy will depend on your needs:
Back up all versions of the files or certain ones?
How often will you back up files?
Have at least two backup locations
internal (your computer)
external (e.g. the BC Research Data Archive or departmental servers)
Assign responsibility for backing up
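The "at least two locations" advice can be sketched as a small script that copies a file to each location and confirms that each copy matches the original. This is a simplified illustration using only the standard library; the function name back_up is hypothetical, and institutional backup services would normally replace a hand-rolled script.

```python
import filecmp
import shutil
from pathlib import Path


def back_up(source, destinations):
    """Copy source into each destination folder and verify every copy."""
    source = Path(source)
    copies = []
    for destination in destinations:
        destination = Path(destination)
        destination.mkdir(parents=True, exist_ok=True)
        copy = destination / source.name
        shutil.copy2(source, copy)  # copy2 also preserves timestamps
        # shallow=False compares file contents, not just size and time.
        if not filecmp.cmp(source, copy, shallow=False):
            raise OSError(f"backup copy differs from source: {copy}")
        copies.append(copy)
    return copies
```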
Physical Storage Options
Local: convenient but less secure (especially external media)
• On your own computer’s hard drive
• External media (hard drive, CD/DVD, flash drive)
• Departmental server, local network access
Centralized: more secure, with automatic back-up … and more space
• ITS
• Departmental server, local network access
Remote: permanent, someone else takes responsibility for future migration
• Disciplinary Repositories, e.g. GenBank, Cambridge Structural Database
• Secure cloud options are in use at other institutions
Data Storage
ITS offers a remote, automated backup of faculty and staff computers using a product called Connected Backup by Autonomy. Users have the ability to recover files from any location using a web browser.
http://www.bc.edu/offices/help/essentials/backup/ironmtn.html
Research Services provides secure archive space for research data that is backed up nightly.
http://www.bc.edu/offices/researchservices/dataresources/archive.html
Your department may provide its own storage options.
Funding Long-term Data Storage
Who will pay for this? NSF DMP guidelines encourage inclusion of cost information … and grants may pay.
How much of your data will you save? Raw data (untouched) always …
In general, data must be stored for three years (contact Dr. Stephen Erickson at the Boston College Office for Research Integrity and Compliance for more information).
Data Security
For additional assistance with security planning, consult the Computer Policy & Security Office of the IT Assurance Department.
Director: David Escalante
www.bc.edu/offices/its/depts/assurance/policysecurity.html
Data Access and Sharing
Options include:
Personal website
Journal “supplementary materials” (ACS, etc.)
Institutional repository, e.g. eScholarship@bc
Disciplinary (or multidisciplinary) repository
Or a combination: journal-designated repository (Nature example)
eScholarship@bc
• A repository for BC data sets and publications
• A portal for pointing to your data wherever it is stored (at BC or beyond)
Data Sharing Options Beyond BC
Subject-based archives – ask your subject librarian
Directories of data repositories:
DataBib (Beta) http://databib.org/index.php#
Simmons Data Repositories Listing http://oad.simmons.edu/oadwiki/Data_repositories
Examples of Repositories
Biomedicine:
GenBank -- sequence data
RCSB Protein Data Bank -- biomolecule crystal structure coordinates, etc.
Chemistry:
Cambridge Structural Database (CSD)
PubChem (Part of NCBI Entrez, covering biological activities of small molecules)
Multidisciplinary: FigShare.com (Open, Free)
DMPs: Data Sharing … also Archiving
What does the Data Sharing Policy Mean?
Example (NSF): “plans for archiving data, samples, and other research products, and for preservation of access to them.”
Archiving data means preserving it not just in the original format but also in a platform-independent format, using a standard that ensures the data can be re-used in the future.
Metadata is vital to ensure data is findable.
Ethics and Privacy
Sensitive data should be redacted before depositing in a public archive or repository.
Access to data may be embargoed (access limited for a time) for confidentiality, legal, patentability or other reasons.
Dark archives ensure permanent protection of confidentiality.
Where human subjects/privacy is involved, BC’s Institutional Review Board (IRB) must approve. http://www.bc.edu/research/oric/human.html
Data Ownership
You may have copyright or ownership concerns when planning to share your data.
For assistance and more information, please contact the Boston College Office for Research Integrity and Compliance:
http://www.bc.edu/content/bc/research/oric/compliance.html
Intellectual Property/Technology Transfer Concerns
Funders/journals expect that you will share your data within a reasonable amount of time …
However, they also recognize the need to protect intellectual property rights and potential commercial value
The DMP should describe your plans to protect those rights
Contact the Boston College Office for Technology Transfer and Licensing as part of your DMP writing process
Research Output
Data Citations
Why should I cite data?
Ensures that original producers of the data (you!) are credited in citation indexes*
Allows researchers to locate research data used in an article
May be required by the archive that stored the data you have repurposed
*Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308
Citing Data Sets
Essential citation elements; style will vary:
• author or creator
• title or description
• year of publication
• publisher and/or the database/archive from which it was retrieved
• the URL or DOI if the data set is online
Mackey, R.A., Mackey, E.F., and O’Brien, B.A. (1990). Lasting relationships research data archive (eScholarship version) [Data file]. Boston College School of Social Work. http://hdl.handle.net/2345/2228
National Center for Biotechnology Information. PubChem Compound Database; CID=5934766, http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=5934766 (accessed Feb. 22, 2011).
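The essential citation elements can be assembled programmatically when many data sets need to be cited consistently. The sketch below follows the layout of the first example above (author, year, title, publisher, URL); the function name format_data_citation is hypothetical, and the layout is only one of many acceptable styles.

```python
def format_data_citation(authors, year, title, publisher, locator):
    """Build an 'Author (Year). Title [Data file]. Publisher. URL' string."""
    elements = {
        "authors": authors,
        "year": year,
        "title": title,
        "publisher": publisher,
        "locator": locator,
    }
    missing = [name for name, value in elements.items() if not value]
    if missing:
        # Every essential element must be present before formatting.
        raise ValueError("missing citation elements: " + ", ".join(missing))
    return f"{authors} ({year}). {title} [Data file]. {publisher}. {locator}"


# Rebuilds the first example citation from its elements.
print(
    format_data_citation(
        authors="Mackey, R.A., Mackey, E.F., and O’Brien, B.A.",
        year=1990,
        title="Lasting relationships research data archive (eScholarship version)",
        publisher="Boston College School of Social Work",
        locator="http://hdl.handle.net/2345/2228",
    )
)
```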
Additional Support
The Libraries
The Data Management LibGuide libguides.bc.edu/dataplan
Subject Specialists www.bc.edu/libraries/help/askalib.html
The Office for Sponsored Programs Research http://www.bc.edu/research/osp.html
ITS/Research Services http://www.bc.edu/offices/researchservices/
Office for Research Integrity and Compliance http://www.bc.edu/research/oric/compliance.html
The Office for Technology Transfer and Licensing http://www.bc.edu/research/ottl/
Some Useful Links
Data Management and Sharing Snafu in 3 Short Acts (NYU Health Sciences Library) https://www.youtube.com/watch?v=N2zK3sAtr-4
DataOne Best Practices https://www.dataone.org/all-best-practices-download-pdf
DCC (Digital Curation Center) Disciplinary Metadata Standards http://www.dcc.ac.uk//resources/metadata-standards
DCC (Digital Curation Center) Metadata Standards – Physical Sciences http://www.dcc.ac.uk/resources/subject-areas/physical-science
Guide to Writing “Readme” Style Metadata (Cornell) http://data.research.cornell.edu/content/readme
Questions?