Data Management Plans: Graduate Research and Beyond
Elizabeth Brown
Scholarly Communications and Library Grants Officer
Binghamton University Libraries
October 18, 2012
What we’ll cover today
What is an NSF Data Management Plan? How and why was it created?
Why are Libraries a part of data management?
(Short Break)
Creating and Implementing NSF Data Management Plans
Preserving Research Data after a project is completed
Learning Objectives
To:
Understand current NSF and government data policy requirements.
Be aware of research support services within the Libraries.
Locate and use various resources to develop data management plans (DMPs) for NSF proposal(s).
Write a comprehensive DMP for NSF proposal(s).
Identify and plan for long-term preservation of research data from funded projects.
What is a Data Management Plan?
Storing Research Data “Forever”
Serge Goldstein
Associate CIO & Director of Academic Services
Princeton University
Fall 2010 Coalition for Networked Information Meeting
URL: http://www.youtube.com/watch?v=fQ-YEcV1k1A
Some Handy Definitions
Cyberinfrastructure: computing resources & networks, services, & people
Data management: technical processing and preparation of data for analysis
Data curation: selection of data for preservation and adding value for current and future use
Data citation: mechanisms to enable easy reuse and verification, track impact of data, and create structures to recognize and reward researchers (DataCite)
Data sharing: must take into account ethical and legal issues; a spectrum with many options
Source: Heather Coates and Kristi Palmer, Data management plans & planning: Meeting the NSF Requirement, March 7, 2012 URL: http://www.slideshare.net/goldenphizzwizards/meeting-the-nsf-dmp-requirement-20120307-final
NSF DMP Requirements by Unit
URL: http://www.nsf.gov/bfa/dias/policy/dmp.jsp
Why were Data Management Plans created?
Why the NSF created this requirement
Source: http://blogs.library.ucla.edu/dmptool/2012/10/09/data-now-recognized/
Why NSF created the NSF requirement
Source: http://www.whitehouse.gov/open
Acknowledging: Open is a movement
Open Access
Open Educational Tools
Open Standards
Open Science
Open Source
http://en.wikipedia.org/wiki/File:Benjamin_Franklin_-_Join_or_Die.jpg
Dorothea Salo, Battle of the Opens, Book of Trogool, March 15, 2010
Acknowledging: Publishing is changing
Houghton, J.W. (2011). "The costs and potential benefits of alternative scholarly publishing models" Information Research, 16(1) paper 469. [Available at http://InformationR.net/ir/16-1/paper469.html]
Acknowledging: Scholarly impact measures
http://altmetrics.org/manifesto/
Acknowledging: Accountability of funding agencies
Source: http://www.whitehouse.gov/open/around
How will DMPs help me?
Let’s think about it… (discussion)
How will DMPs help me?
Saves time
Less reorganization for future projects
Increases efficiency
Compile and prioritize data collection(s)
Anticipate how your data will be used
Consider data preservation requirements and plan for them
Become better aware of funding agency mandates and data preservation culture in your field
How are Libraries a part of this?
Libraries Support Scholarship
Sources: http://iteach.wustl.edu/newsletter/spring-2011-newsletter/199; http://news.lib.uchicago.edu/special-collections/preservation/; http://www.berfrois.com/2012/03/share-books/
• Access
• Services
• Cultural Memory
• Preservation
Support for Scholarship is evolving
URLs: http://futurepath.org/digital-information-presevation/; http://www.nationalgalleries.org/object/GMA A42/2/GKL1012; http://metalink.binghamton.edu:8332/V/DIURBR94GJBKS1C3TBKAGBQ19XQQQ4DBVYRJHJKGH2ADLG2D3Y-08701?func=native-link&resource=BNG05356
Print Archives, Collections
Electronic Content, Databases
Research Data
Libraries’ Research Data Support
URL: http://library.binghamton.edu/services/scholarly/index.html
NSF Data Management Plan Info
NSF Data Management Plan support
URL: http://library.binghamton.edu/services/scholarly/NSFdata.html
Find funder requirements
Locate sample plans
Write, edit, review plans
Copyright information, guidance
URL: http://library.lib.binghamton.edu/services/scholarly/copyrightdemystified.html
Copyright Terms
Locating Owners
Classroom exceptions
Information and Policy Updates
URL: http://library.lib.binghamton.edu/services/scholarly/index.html
Creating and Implementing Data Management Plans
Consider the Research Life Cycle
Source: DDI Structural Reform Group. “DDI Version 3.0 Conceptual Model.” DDI Alliance. 2004. Accessed on 11 August 2008. <http://www.icpsr.umich.edu/DDI/committee-info/Concept-Model-WD.pdf>.
DMPTool: Funder Requirements Info, Templates
URL: https://dmp.cdlib.org/pages/funder_requirements
DMP Sections
1. Types of Data
2. Data and Metadata Standards
3. Policies for Access and Sharing; Data Privacy and Protection
4. Data re-use and re-distribution
5. Data Archiving and Preservation
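The five sections can be kept as a reusable outline while drafting. The following is a minimal sketch in Python, not an NSF or DMPTool product; the function name `build_dmp_outline` and the "[To be completed]" placeholder are illustrative assumptions.

```python
# Hypothetical helper: turns the five DMP section titles from the slide
# into a plain-text outline that a PI can fill in.
DMP_SECTIONS = [
    "Types of Data",
    "Data and Metadata Standards",
    "Policies for Access and Sharing; Data Privacy and Protection",
    "Data Re-use and Re-distribution",
    "Data Archiving and Preservation",
]


def build_dmp_outline(project_title, sections=DMP_SECTIONS):
    """Return a numbered plain-text outline with one placeholder per section."""
    lines = [f"Data Management Plan: {project_title}", ""]
    for number, title in enumerate(sections, start=1):
        lines.append(f"{number}. {title}")
        lines.append("   [To be completed]")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # "Example Project" is a made-up title used only for demonstration.
    print(build_dmp_outline("Example Project"))
```

Individual NSF units may ask for different emphases, so the section list is passed as a parameter rather than fixed.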
1. Types of Data
Expected data. The DMP should describe the types of data, samples, physical collections, software, curriculum materials, or other materials to be produced in the course of the project. It should then describe the expected types of data to be retained.
The Federal government defines ‘data’ in OMB Circular A-110 as:
Research data is defined as the recorded factual material commonly accepted in the scientific community as necessary to validate research findings, but not any of the following: preliminary analyses, drafts of scientific papers, plans for future research, peer reviews, or communications with colleagues. This "recorded" material excludes physical objects (e.g., laboratory samples). Research data also do not include:
(A) Trade secrets, commercial information, materials necessary to be held confidential by a researcher until they are published, or similar information which is protected under law; and
(B) Personnel and medical information and similar information the disclosure of which would constitute a clearly unwarranted invasion of personal privacy, such as information that could be used to identify a particular person in a research study.
PIs should use the opportunity of the DMP to give thought to matters such as:
• The types of data that their project might generate and eventually share with others, and under what conditions
• How data are to be managed and maintained until they are shared with others
• Factors that might impinge on their ability to manage data, e.g. legal and ethical restrictions on access to non-aggregated data
• The lowest level of aggregated data that PIs might share with others in the scientific community, given that community’s norms on data
• The mechanism for sharing data and/or making them accessible to others
• Other types of information that should be maintained and shared regarding data,
Source: http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf
DMP Sample: I. Types of Data
“This research project will generate data resulting from sensor recordings (i.e. earth pressures, accelerations, wall deformation and displacement and soil settlement) during the centrifuge experiments. In addition to the raw, uncorrected sensor data, converted and corrected data (in engineering units), as well as several other forms of derived data will be produced. Metadata that describes the experiments with their materials, loads, experimental environment and parameters will be produced. The experiments will also be recorded with still cameras and video cameras. Photos and videos will be part of the data collection.”
“A total storage demand of 50 GB is anticipated at the University of Michigan, and 50 GB at Auburn University.”
“Based on the previous viscoelastic turbulent channel flow simulations, the amount of resulting binary data is estimated around 40 TB per year. Some text format data files are also required for post-processing in the laboratory and are anticipated to be around 1 TB per year.”
“In one year, we will perform approximately 2 to 3 simulations. This means ~100 3D plots, 30 restart files, 1000 EUV, X-ray and LASCO-like images, 10 satellite files, 1000 2D plot files (total of about 150 GB of data per year).”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
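Several of the sample statements quote storage as a per-year rate, so a total for the whole award has to be worked out separately. The following is a rough sketch of that arithmetic using the 40 TB and 1 TB per-year figures from the sample above; the three-year duration and the two copies are assumptions for illustration and do not come from the samples.

```python
def total_storage_tb(per_year_tb, years, copies=1):
    """Total storage in TB for yearly data volumes kept in `copies` copies."""
    if years < 0 or copies < 1:
        raise ValueError("years must be >= 0 and copies must be >= 1")
    return sum(per_year_tb) * years * copies


# Per-year figures quoted in the sample: 40 TB binary data + 1 TB text data.
yearly_volumes_tb = [40, 1]

# ASSUMPTION: a three-year project keeping one primary copy and one backup.
estimate = total_storage_tb(yearly_volumes_tb, years=3, copies=2)
print(f"Estimated storage over the project: {estimate} TB")
```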
DMP Sample: I. Types of Data
“The data, samples, and materials expected to be produced will consist of laboratory notebooks, raw data files from experiments, experimental analysis data files, simulation data, microscopy images, optical images, LabView acquisition programs, and quantum dot superlattice nanowire thermoelectric samples.... each of these data is described below:
A. Laboratory notebooks: The graduate student and PI will record by hand any observations, procedures, and ideas generated during the course of the research.
B. Experimental raw data files: These files will consist of ASCII text that represents data directly collected from the various electrical instruments used to measure the thermoelectric properties of the superlattice nanowire thermoelectric devices.
C. Experimental analysis data files: These files will consist of spreadsheets and plots of the raw data mentioned in Part A. The data in these files will have been manipulated to yield meaningful and quantitative values for the device efficiency and ZT. The analysis will be performed using best practice and acceptable methods for calculating device efficiency and ZT.
D. Simulation data: These data will represent the results from commercially available simulation and modeling software to model the quantum confinement.
E. Microscopy images: Images of the proposed silicon nanostructures will be generated by scanning electron microscopy (SEM), transmission electron microscopy (TEM) at high resolution to quantify wire diameter and roughness, and atomic force microscopy (AFM).
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
2. Data Formats and Metadata
3. Policies for Access and Sharing; Data Privacy and Protection
4. Data re-use and re-distribution
Data formats and dissemination. The DMP should describe data formats, media, and dissemination approaches that will be used to make data and metadata available to others. Policies for public access and sharing should be described, including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements. Research centers and major partnerships with industry or other user communities must also address how data are to be shared and managed with partners, center members, and other major stakeholders.
Source: http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf
3. Policies for Access and Sharing; Data Privacy and Protection
4. Data re-use and re-distribution
Period of data retention. SBE is committed to timely and rapid data distribution. However, it recognizes that types of data can vary widely and that acceptable norms also vary by scientific discipline. It is strongly committed, however, to the underlying principle of timely access, and applicants should address how this will be met in their DMP statement.
Source: http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf
DMP Samples: II. Data Formats and Metadata
“The Dublin Core will be used as the standard for metadata. The metadata set mainly consists of fifteen elements, including title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, and rights. These elements have been ratified as both national (i.e., ANSI/NISO Standard Z39.85) and international standards (i.e., ISO Standard 15836). Further, they describe resources such as text, video, audio, and data files. These standard formats will be used in our study.”
“For each code made available, a user's manual will be provided with instructions for compiling the source codes, installing and running the codes, formulating input data streams, and visualizing the output. Documentation will be in PDF format.”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
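To make the Dublin Core sample more concrete, here is a minimal sketch of a record built from the fifteen elements it lists, serialized as XML with Python's standard library. The record values and the `metadata` wrapper element are made-up assumptions; a real repository will usually prescribe its own wrapper and required fields.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements named in the sample DMP text.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_xml(record):
    """Serialize a dict of Dublin Core element names to values as XML."""
    unknown = set(record) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"Not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # ASSUMPTION: wrapper element name
    for name in DC_ELEMENTS:
        if name in record:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = record[name]
    return ET.tostring(root, encoding="unicode")


# Hypothetical record; every value below is invented for illustration.
example = {
    "title": "Centrifuge sensor data",
    "creator": "Example Research Group",
    "date": "2012-10-18",
    "format": "text/plain",
}
print(dublin_core_xml(example))
```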
DMP Sample: II. Data Formats and Metadata
“Verilog, SPICE, and MATLAB files generated will be processed and submitted to FTP servers as .mat files with TXT documentation. The data will be distributed in several widely used formats, including ASCII, tab-delimited (for use with Excel), and MAT format. Instructional material and relevant technical reports will be provided as PDF. Digital video data files generated will be processed and submitted to the FTP servers in MPEG-4 (.mp4) and .avi formats. Variables will use a standardized naming convention consisting of a prefix, root, suffix system.”
“Plasma image data will be RGB colored JPG or TIFF format with resolution determined by the camera. Video data will be RGB colored AVI format.”
“Images from the scanning electron microscopes (SEMs) and focused ion beam workstations (FIBs) are saved in tagged image file format (TIFF), which is readily readable by a wide variety of imaging and processing applications.”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
DMP Sample: III. Access and Sharing Policies
III. Policies for access and sharing and provisions for appropriate protection/privacy
As detailed in the project description, the CARE platform is intended to be a research cloud service that provides analytical middleware for use in analyzing health data. During the project, access will be limited to project team members and invited expert stakeholders through a password protected website. Commencing with Task 5 (month 26), means for access by the broader research community will be implemented. At that time, the project team will determine whether there is a need for initiating access charges, which may be appropriate for securing the longer-term sustainability of the CARE platform and analysis tools.
All of the data that will be utilized are publicly available data sets that have been de-identified by public agencies and have passed their standards for privacy protection and assurance so that no individually identifiable data is provided. The datasets to be utilized within this project and other intellectual property have been released without restriction.
Over the course of the study, the project team will meet with both the Community Health Institute and the SafeRoadMaps/CERS team to arrive at a data-sharing agreement for postproject utilization of their data. Such an agreement will provide a model for not only this partnership, but for licensing the CARE Platform analytics for use by other health data sets.
Source: http://rci.ucsd.edu/_files/DMP%20Example%20Chaitan%20Baru%20SDSC.pdf
DMP Sample: IV. Reuse and Distribution
“After uploading the data into the NEES Project Warehouse and allowing public access, all data will be available for re-use and re-distribution with proper acknowledgement of their originators.”
“Researchers and practitioners in diverse fields will be able to readily reuse and redistribute shared data. Terms of use will include the prohibition of commercial use of the work – modifications of the work will be allowed with the proper citations.”
“The simulation code will be developed in C and provided to the public in source code format for non-commercial use under GNU General Public License (GPL).”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
DMP Sample: IV. Data Reuse and Distribution
“Before data is stored, it will be stripped of all institutional and individual identifiers to ensure confidentiality by staff of the Center following procedures developed by the researchers.”
“Audio files of interviews will be stored on a password protected secure server during the study and for two years after, and destroyed subsequently.”
“Exceptions to shared data include proprietary DTE GIS utility information (for security reasons) and software code of commercial interest to the project's GOALI partners or identified licensees. Both exceptions are permitted by the ENG DMP policy.... The research team will however develop a set of 3D GIS datasets for distribution to the public. These datasets will represent non-existent buried infrastructure and will only be useful for the evaluation of the other research products.”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
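The first sample on this slide describes stripping identifiers before data are stored. A minimal sketch of that step is below; the field names are hypothetical, and dropping direct identifiers alone may not be enough to guarantee confidentiality for a given study.

```python
# ASSUMPTION: these field names are hypothetical examples of direct
# identifiers; a real project would define its own list in its procedures.
IDENTIFIER_FIELDS = {"name", "email", "institution"}


def strip_identifiers(records, identifier_fields=IDENTIFIER_FIELDS):
    """Return copies of the records without the listed identifier fields."""
    return [
        {key: value for key, value in record.items()
         if key not in identifier_fields}
        for record in records
    ]


# Invented example record, used only to show the call.
raw = [{"name": "A. Researcher", "institution": "Example University", "response": 4}]
clean = strip_identifiers(raw)
print(clean)
```

The function returns new dictionaries and leaves the input untouched, so the identified originals can be handled (or destroyed) under a separate policy.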
DMP Sample: IV. Reuse and redistribution
IV. Policies and provisions for re-use, re-distribution
As noted in the project description, policies for provision and re-use will be developed as part of the research project. It is anticipated that there will be considerable interest in the platform and tools within the research and practice community, including academic researchers, health research agencies, and cloud service providers, among others. The need for such a tool was identified during a recent NSF sponsored symposium on Health Cyberinfrastructure, which was conducted by the PIs.
Source: http://rci.ucsd.edu/_files/DMP%20Example%20Chaitan%20Baru%20SDSC.pdf
5. Data Archiving and Preservation
Data storage and preservation of access. The DMP should describe physical and cyber resources and facilities that will be used for the effective preservation and storage of research data. These can include third party facilities and repositories.
Source: http://www.nsf.gov/sbe/SBE_DataMgmtPlanPolicy.pdf
DMP Sample: V. Archiving and Preservation
V. Plans for archiving and Preservation of access
The project website and service will contain all appropriate information and documentation for using the CARE platform and tool for health research discovery and analysis. The site will also contain all references, research papers, and related products developed throughout the course of the project.
The San Diego Supercomputer Facility at UC San Diego will host the data throughout the research project and provide a minimum of three years of online access beyond the completion of the project. Data storage will be performed at the nominal rates charged by SDSC to any project using the facility. These are relatively modest (~$1000/TB) and can be borne ahead of time for the 3-year period. Should the CARE platform not extend beyond the three years (post grant), the data could then be archived at SDSC at even lower cost. A decision would have to be made at that point in time regarding how exactly to archive the data, and on paying for the archival storage.
Source: http://rci.ucsd.edu/_files/DMP%20Example%20Chaitan%20Baru%20SDSC.pdf
DMP Sample: V. Archiving and Preservation
“For archiving, the data along with any related publications will be deposited in Libra, the UVA archival system, with an appropriate licensing statement. DOIs will be attached to all data stored from this project. Since the current preservation plan for Libra is indefinite data storage, preservation of access is assured.”
“Materials to be publicly shared will be stored with the Deep Blue repository, a service of the UM Libraries that provides deposit access and preservation services. Deposited items will be assigned a persistent URL that will be registered with the Handle System for assigning, managing, and resolving persistent identifiers (‘handles’) for digital objects and other Internet resources.”
Source: http://deepblue.lib.umich.edu/handle/2027.42/86586
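Both samples rely on persistent identifiers (DOIs and handles) instead of ordinary web addresses. As a small sketch of how such identifiers are turned into links through the public resolver services at doi.org and hdl.handle.net: the handle below is the one that appears in the Deep Blue source URL on these slides, while the DOI is a made-up placeholder.

```python
from urllib.parse import quote


def doi_url(doi):
    """Build a resolver link for a DOI via the doi.org proxy."""
    return "https://doi.org/" + quote(doi, safe="/")


def handle_url(handle):
    """Build a resolver link for a handle via the hdl.handle.net proxy."""
    return "https://hdl.handle.net/" + quote(handle, safe="/")


# Handle taken from the Deep Blue source URL cited on the slides.
print(handle_url("2027.42/86586"))

# HYPOTHETICAL DOI, shown only to illustrate the call; it is not a real record.
print(doi_url("10.1234/example-dataset"))
```

Citing the resolver link rather than the repository's current page address means the citation can keep working if the repository later reorganizes its site.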
After the project is complete
Preserving Research Data
What are your goals?
Who needs access and when?
When/if can data be shared/distributed?
Prepare for future funder mandates
Plan beyond individual PI/grant projects
Who owns your research data?
• Campus Copyright policy
• Collaborator institution copyright and ownership policies, informal agreements
• Patent and provenance issues
• International copyright considerations
• Post-project data retention requirements
• Post-employment data agreements
URL: http://www2.binghamton.edu/academics/provost/faculty-staff-handbook/handbook-xii.html
How large are your research data sets?
Survey sample: 308 campus researchers with externally sponsored projects or submitted proposals (2009-2011); 91 survey respondents
Source: Binghamton University Research Faculty Survey, June 2011, Jim Wolf, Director of Academic Computing (ret.)
Campus Research Data Locations
Source: Jim Wolf, Director of Academic Computing (ret.), June 2011
Chart: storage locations (Local research group server; ITS storage; Library archive; Disciplinary repository, e.g., ICPSR) by retention period (forever; 3-7 yrs; <3 yrs); axis scale 0 to 60.
Who needs access to data? For how long?
Chart: storage locations (Local research group server; ITS storage; Library archive; Disciplinary repository, e.g., ICPSR) by access level (access granted to individuals; openly available to all; proprietary; private); axis scale 0 to 50.
Source: Research Faculty Survey, Jim Wolf, Director of Academic Computing (ret.), June 2011
Data Accessibility
Data Preservation Timeframe
Preservation: more than just backup
Create consistent, standardized metadata
Perform regular file fixity and format checks
Identify, update and migrate file formats
Mitigate and eliminate file degradation
Provide storage space, controlled access and an “exit strategy”
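A regular fixity check can be as simple as recording a checksum for every file and comparing later. The following is a minimal sketch using SHA-256 from Python's standard library; the function names and the in-memory manifest are illustrative assumptions, and repository software usually provides its own tooling for this.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to `directory`) to its checksum."""
    root = Path(directory)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def check_fixity(directory, manifest):
    """Return names of files that are missing or whose checksum changed."""
    current = build_manifest(directory)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

In use, `build_manifest` would run once at deposit time with the result stored away from the data, and `check_fixity` would run on a schedule; an empty list means every recorded file still matches.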
Bit Rot: Files decay over time
Thankó verù mucè foò á lovelù luncheoî anä somå splendiä views® Wå imaginå �yoõ no÷ iî Indiá anä wondeò iæ yoõ arå listeninç tï somå oæ thå samå Indianó �witè whoí wå talkeä yearó ago® Thå artistó anä economistó werå quitå �remarkable¬ buô thå politicaì scientistó useä tï talë abouô atomiã �bombó foò Indiá witè eager¬ burninç eyeó whilå beinç verù carefuì noô tï kilì �anù insects® (Severaì haä theiò beardó covereä iî whitå silk so that no insect �would get caught and be stifled there.)
Sources: Hoover Institution Library and Archives Blog, Nov. 18, 2011; http://timbrison.wordpress.com/2011/06/04/bit-rot-rides-again/
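The garbled sample on the "Bit Rot" slide follows a regular pattern: each odd character sits in the range U+0080 to U+00FF and becomes an ordinary ASCII character once its highest bit is cleared (for example, ó is 0xF3, and 0xF3 with the high bit cleared is 0x73, "s"). The sketch below assumes that pattern holds throughout; the slides do not say what produced the file, so this is one possible reading of the sample and not a general repair method.

```python
def clear_high_bit(text):
    """Clear bit 7 of every character in the range U+0080 to U+00FF."""
    return "".join(
        chr(ord(ch) & 0x7F) if 0x80 <= ord(ch) <= 0xFF else ch
        for ch in text
    )


# First words of the slide's sample, written with escapes so the exact
# code points are unambiguous: "Thankó verù mucè foò á lovelù luncheoî".
sample = "Thank\u00f3 ver\u00f9 muc\u00e8 fo\u00f2 \u00e1 lovel\u00f9 luncheo\u00ee"
print(clear_high_bit(sample))
```

Characters outside that range, such as the replacement marks in the sample, are left unchanged because whatever they originally encoded cannot be recovered this way.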
Data formats, devices, readers evolve
Media Deterioration and Format Obsolescence Demonstrate that “Backups” are Inadequate for Long-Term Preservation
Sources: http://oldcomputers.net/macintosh.html; http://www.classiccmp.org/dunfield/pc/index.htm
Preservation is an iterative process
Build content from one project to the next
Create a set of policies based on current best practices and funder requirements
Refine data collection, access, use, distribution, and preservation policies over time
Thank You
Elizabeth Brown
LS-2504C
(607) [email protected]
Slideshare: http://www.slideshare.net/ebrown