Crowd Sourcing the Harvest of Federal Data · 2017-02-03 · ©2016 DataONE 1312 Basehart SE...



Crowd Sourcing the Harvest of Federal Data

Volume 5 Issue 2

©2016 DataONE 1312 Basehart SE University of New Mexico Albuquerque NM 87106

The 2017 political transition in the United States has starkly illuminated the vulnerabilities of our research infrastructure to changes in political priorities. While significant resources have been invested in building robust data archives, these are most frequently built as stove-piped systems within a single department, agency, or government. It is now abundantly clear that federated data systems such as that developed by DataONE are fundamental to preserving the scientific data that are critical for an informed and healthy society. In a flurry of concerned activity coinciding with the American Geophysical Union meeting in December 2016, and following a tweeted call to scientists to identify critical climate related data sets, the science community began to compile an extensive list of federally funded research data that they feared might be lost during a transition of government. Scientists, researchers, librarians, and others organized and began downloading these data to personal machines, cloud storage, and shared servers. Individuals volunteered to shepherd particular data sets based on perceived priority, research interests, and dataset size. These were the first steps towards a more coordinated activity: the Climate Mirror.

This crowd sourced rush to preserve federal climate data was preceded by other activities. When government leadership changes, or when budget priorities shift during an economic downturn, information and data can be endangered, or even lost. Since 2008, the End of Term Web Archive collaborative has worked to perform both comprehensive, broad sweeps and focused, more detailed crawls in order to harvest the websites of the US Government for preservation, future access, and use. The End of Term project documents changes in websites across the executive, judicial, and legislative branches of the government and includes a nomination tool for sharing the development of the collection between partners and stakeholders.

The growing concern for preserving scientific information has spurred increased community participation. Widespread interest in protecting federally funded resources about climate change has drawn together a range of people, from experts to concerned citizens, to tackle the problem of disappearing data and resources. It is in this environment that crowd sourced projects such as DataRefuge have grown. The DataRefuge project, of which Climate Mirror is a partner and which partially relies on infrastructure from the EoT Web Archive, is described as “a public, collaborative project [..] committed to identifying, assessing, prioritizing, securing, and distributing reliable copies of federal climate and environmental data so that it remains available to researchers.” The founding team, based at the University of Pennsylvania as part of the Penn Program in the Environmental Humanities, has also partnered with the Environmental Data and Governance Initiative (EDGI). EDGI has worked to create a data toolkit, hosted on GitHub, that includes primers, models, coding tools, and other information to support #DataRescue events.

Universities and sites around the US and Canada are hosting these one- to two-day #DataRescue events in support of the community effort. To date, these include activities in Toronto, Philadelphia, Indianapolis, Chicago, and Los Angeles, with at least seven more scheduled in February. The events are generally styled as informal hackathons and include people with a wide range of skills, from coders and scientists to archivists, researchers, and librarians. Contributions can include identifying important resources, nominating URLs to be scraped by the End of Term Web Harvest, creating or adapting scripts to acquire additional, larger datasets, and generating publicity about the aims and reach of the project.

Due to the vast number of resources and data spread across many federal websites and silos, the project is complex and multi-pronged. Individuals can also contribute outside of these events. At a small scale, researchers can assist with the web crawl by identifying the top 10 government datasets that they use. There are other opportunities for individuals and partners to help (particularly with coding and large, uncrawlable datasets) that are further detailed on the DataRefuge site.

This suite of initiatives, including the DataRefuge project, Climate Mirror, and the Environmental Data and Governance Initiative, brings new attention to the ongoing challenge of documenting, preserving, and providing access to federal government resources and data. While these initiatives focus on securing data and resources related to climate change and the environment, given their perceived vulnerability, there are lessons and best practices that can be drawn from these efforts across all data types. DataONE has stimulated an ongoing conversation over the years about the need for sustainable, long-term infrastructure that supports, preserves, and provides access to research data. As a federation of independently operated member repositories with institutionally diverse business models and funding sources, DataONE is uniquely suited to proactively preserve data through institutional cooperation and collaboration. DataONE’s model for replication and data integrity auditing across diverse repositories could be much more effectively leveraged to ensure that multiple copies of well curated data exist. As a community, we should embrace partnerships and initiatives with groups like DataRefuge and Climate Mirror, which may signal increasing opportunities, commitment, and interest in building, funding, and growing effective data preservation infrastructure.

Heather Soyka, Postdoctoral Scholar, DataONE

Amber Budden, Director for Community Engagement and Outreach, DataONE

Matthew Jones, Director of Informatics Research and Development, NCEAS; co-PI, DataONE

Figure: The Crowd by Unitron 6991

Winter 2016/2017

Member Node Description

Each Member Node within the DataONE federation completes a description document summarizing the content, technical characteristics, and policies of their resources. These documents can be found on the DataONE.org site at bit.ly/D1CMNs. In each newsletter issue we will highlight one of our current Member Nodes.

NSF Arctic Data Center https://arcticdata.io

The most recent organization to join the DataONE federation as a Member Node is the Arctic Data Center (https://arcticdata.io/). The NSF-funded Arctic Data Center helps the research community preserve and discover all products of NSF-funded science in the Arctic, including data, metadata, software, documents, and provenance that link these in a coherent, reproducible knowledge model. Key to the initiative is the partnership between the National Center for Ecological Analysis and Synthesis (NCEAS) at UC Santa Barbara, the National Oceanic and Atmospheric Administration’s National Centers for Environmental Information (NCEI), and DataONE, each of which brings critical capabilities to the Arctic Data Center.

The long-term repository allows for the preservation and sharing of data spanning many disciplines from the Arctic, now and into the future. For example, current holdings span oceanography (doi:10.18739/A28K75), terrestrial plant ecology (doi:10.18739/A2DD35), paleo-climate data from ice cores (doi:10.18739/A20306), ethnography (doi:10.18739/A2Z64F), and many other disciplines.

The Arctic Data Center leverages the same search and discovery platform as DataONE, so whether searching through DataONE or directly via the Center, users are able to search the extensive Arctic data collection using filters including the name of the data creator, year, identifier, taxa, location, keywords, and more. The search interface also provides a map-based overview of the spatial distribution of data sets, which is helpful in locating historical data in specific regions. Authors are able to seamlessly upload and share data from their desktop, contributing associated metadata that undergoes both automated and human curation before publishing with a Digital Object Identifier so that their data are easily citable. A newly developed ‘Quality Report’ enables data creators and users to quickly determine the extent and congruency of metadata associated with a data set.

Joining the DataONE federation exposes Arctic Data Center content more widely and provides great preservation options, taking advantage of DataONE’s replication policies to ensure preservation of and access to Arctic Data Center content for decades to come. The Arctic Data Center is one of the largest Member Nodes so far, bringing over 500,000 data objects to the DataONE federation and bringing the total count of publicly readable data objects to over 900,000.


Working Group Focus: Community Engagement and Outreach

One of the primary activities of the CEO Working Group is creating, publishing, communicating, and maintaining DataONE education resources. These span a variety of content types, from databases to videos and webinars. These resources are designed to help users become proficient in managing research data across all stages of the data life cycle, using DataONE and other tools.

Below is an overview of some of the primary educational products and services delivered by the CEO Working Group. A new service, the Education Evaluation tool (EEVA), is described in more detail on page 4.

• Hands-on training in good data management practices and tool use is offered to students and researchers at professional society meetings.

• Webinars are offered frequently and cover all aspects of the data life cycle as well as open science and other topics of general interest; webinars can also be watched online at any time.

• Education modules consist of classroom presentations (in PowerPoint), exercises, and handouts that can be downloaded, modified, and used to teach data management principles and practices to audiences ranging from citizen scientists to undergraduate and graduate students.

• Online best practices, screencasts, and data stories comprise a collection of tips, tutorials, and examples that assist researchers in acquiring, managing, and analyzing research data.

• Librarian outreach kit provides librarians with various resources that can be used in training researchers and students in data management principles and practices.

• Education EVAluation tool (EEVA) is an evaluation survey instrument for training/education resources.

The working group is currently engaged in the transition of our education content to a GitHub repository. This will allow for more streamlined community feedback on content and versioning. It will also provide the materials in a format that is being increasingly used by members of our community. We look forward to announcing completion of that project and inviting you to customize the lessons for your own communities.

Clockwise from top: In-person workshops; Presentation and outreach at disciplinary meetings; DataONE webinar series speakers; Primer on data management; Data management education modules. Center: Exhibit events for outreach.

www.dataone.org

Primer on Data Management: What you always wanted to know but were afraid to ask

Carly Strasser, Robert Cook, William Michener, Amber Budden

Contents

1. Objective of This Primer

2. Why Manage Data?

2.1. It will benefit you and your collaborators

2.2. It will benefit the scientific community

2.3. Journals and sponsors want you to share your data

3. How To Use This Primer

4. The Data Life Cycle: An Overview

5. Data Management Throughout the Data Life Cycle

5.1. Plan

5.2. Collect

5.3. Assure

5.4. Describe: Data Documentation

5.5. Preserve

5.6. Discover, Integrate, and Analyze

6. Conclusion

7. Acknowledgements

8. References

9. Glossary

1. Objective of This Primer

The goal of data management is to produce self-describing data sets. If you give your data to a scientist or colleague who has not been involved with your project, will they be able to make sense of it? Will they be able to use it effectively and properly? This primer describes a few fundamental data management practices that will enable you to develop a data management plan, as well as how to effectively create, organize, manage, describe, preserve and share data.

2. Why Manage Data?

2.1. It will benefit you and your collaborators

Establishing how you will collect, document, organize, manage, and preserve your data at the beginning of your research project has many benefits. You will spend less time on data management and more time on research by investing the time and energy before the first piece of data is collected. Your data also will be easier for you to find, use, and analyze, and it will be easier for your collaborators to understand and use your data. In the long term, following good data management practices means that scientists not involved with the project can find, understand, and use the data in the future. By documenting your data and recommending appropriate ways to cite your data, you can be sure to get credit for your data products and their use [1].

DataONE Best Practices Primer

The DUGout

Dear DUG Members,

As the new year dawns, the DataONE User Group (DUG) will again begin the cycle of organizing the 2017 DataONE User Group Meeting. The DUG Meeting will remain collocated with the 2017 Federation of Earth Science Information Partners (ESIP) Summer Meeting, which is being held at Indiana University, Bloomington, IN. The DUG Steering Committee will begin the discussions on the format and agenda for the 2017 DUG Meeting. More on this will be made available through emails, newsletters, or our website, https://www.dataone.org/dataone-users-group.

All members are invited to contribute their ideas to make the event as fruitful and worthwhile to attend as possible. Email your input to the DUG Chairs ([email protected]) or Amber Budden ([email protected]). We also expect that DUG Meetings will continue to have oral as well as poster presentations from members, and we encourage all members to submit an abstract.

The event will also include the nomination and election of new Chairs, who will hold the position for two years. It is important that all members participate in the nomination and election of their Chairs. DataONE and the DUG Chairs encourage members to join the DUG Steering Committee to assist in the formulation of recommendations for DataONE. For more on these positions, please visit the DUG Procedural Guidelines at https://www.dataone.org/DUG-guidelines.

Co-chairs:

Felimon Gayanilo, Texas A&M University, Corpus Christi, TX

Plato Smith, University of Florida, Gainesville, FL


Featured Resource

Education EVAluation Tool (EEVA) https://www.dataone.org/education-evaulation

DataONE’s Education EVAluation tool (EEVA) is a freely available evaluation survey instrument for training/education resources. The tool was developed in direct response to a need for more systematic evaluation of educational resources in the research data management community. DataONE, like many other organizations, has created many education resources and training opportunities for both online learning and for face-to-face instruction. To ensure these materials are meeting their objectives and are remaining relevant to the needs of the community, DataONE trainers actively collect user feedback in the form of surveys. Through participation in collaborative training activities with other organizations we discovered that many survey instruments are shared across organizations but that there was a need for an evaluation tool to assist in survey design. In response, EEVA was developed through an activity as part of the DataONE 2016 Summer Internship Program, carried out by Sophie Hou in collaboration with the CEO WG.

Why Use EEVA?

EEVA makes it easy to create and download customized survey questions so that you can get quick, actionable feedback on data management training and resources. EEVA provides 89 potential survey questions that users can select from. Each survey question contains the question text; an indication of whether the question is considered mandatory, optional, or recommended; the Kirkpatrick Evaluation Area¹ the question corresponds with; and the question type (Scale, Multi Choice, Open Ended, Dichotomous). Where appropriate, suggested responses are provided (e.g. Strongly Agree - Strongly Disagree). These survey questions can be filtered according to the length of the proposed training (some questions are less relevant for short 1h training activities), when the survey will be implemented (pre- or post-training), and the type of training delivery (in person, online, self paced, etc.). The results of the query can then be downloaded in *.doc or *.xls format, and a full unfiltered question set is also available in *.qsf format.

How to use EEVA

• Select relevant filters for training duration and applicability. If you would like to use all 89 questions, do not select any filters.

• Using the +/- for each section, you may see all questions from ten categories such as “Objectives,” “Content/Substance,” and more.

• Generate and download your customized survey in *.doc and *.xls formats, or the full set in *.qsf format.

• Edit the survey as appropriate.

• Distribute via your preferred method*

* The *.qsf format may be imported directly into Qualtrics.
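The filtering described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not EEVA's implementation: the `SurveyQuestion` fields, the `filter_questions` helper, and the three sample questions are all hypothetical, chosen to mirror the attributes named in the text (requirement level, Kirkpatrick area, question type, training length, timing, and delivery).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurveyQuestion:
    """One entry in a hypothetical EEVA-style question bank."""
    text: str
    requirement: str        # "mandatory", "recommended", or "optional"
    kirkpatrick_area: str   # evaluation area label (illustrative values only)
    question_type: str      # "Scale", "Multi Choice", "Open Ended", "Dichotomous"
    min_hours: float        # shortest training for which the question is relevant
    timing: frozenset       # subset of {"pre", "post"}
    delivery: frozenset     # subset of {"in person", "online", "self paced"}


def filter_questions(bank, hours=None, timing=None, delivery=None):
    """Return the questions that match every filter that was supplied.

    A filter left as None is not applied, so calling this with no filters
    returns the whole bank (the "do not select any filters" case).
    """
    selected = []
    for q in bank:
        if hours is not None and hours < q.min_hours:
            continue  # question is aimed at longer trainings
        if timing is not None and timing not in q.timing:
            continue  # wrong point in time (pre- vs post-training)
        if delivery is not None and delivery not in q.delivery:
            continue  # not applicable to this delivery type
        selected.append(q)
    return selected


# Made-up sample bank; the real tool offers 89 questions.
BANK = [
    SurveyQuestion("The training objectives were clear.", "mandatory", "Reaction",
                   "Scale", 0.0, frozenset({"post"}),
                   frozenset({"in person", "online", "self paced"})),
    SurveyQuestion("What do you hope to learn?", "recommended", "Learning",
                   "Open Ended", 0.0, frozenset({"pre"}),
                   frozenset({"in person", "online"})),
    SurveyQuestion("The pace across sessions was appropriate.", "optional", "Reaction",
                   "Scale", 4.0, frozenset({"post"}), frozenset({"in person"})),
]

# Post-training survey for a one-hour online session.
short_online_post = filter_questions(BANK, hours=1, timing="post", delivery="online")
```

Treating an unset filter as "no restriction" keeps the unfiltered call equivalent to requesting the full question set.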

We are currently in the process of developing a download option for Survey Monkey and would welcome any other feedback or feature suggestions for the tool.

¹ Kirkpatrick, D. L. (1996). Techniques for evaluating training programs. Classic writings on instructional technology, 1(192), 119

Above: Question filter options available within EEVA. Right: EEVA output showing questions found within the proposed ‘Objectives’ and ‘Content/Substance’ sections of the survey.


Upcoming Events

Members of the DataONE Team will be at the following events. Full information on training activities can be found at bit.ly/D1Training and our calendar is available at bit.ly/D1Events.

Feb. 20-23 Int. Data Curation Center Meeting Edinburgh, UK http://www.dcc.ac.uk/events/idcc17

Apr. 5-7 Research Data Alliance Plenary Barcelona, Spain https://www.rd-alliance.org/plenaries/rda-ninth-plenary-meeting-barcelona

May 29-Jun. 2 Linking Environmental Data and Samples Symposium Canberra, Australia https://csiro-enviro-informatics.github.io/environmental-data-symposium-2017/

Jun. 22-23 TaPP 2017 Seattle, Washington http://batesa.web.engr.illinois.edu/tapp17/

Jul. 24-25 DataONE Users Group Bloomington, Indiana https://www.dataone.org/dataone-users-group

Jul. 25-28 Federation of Earth Science Information Partners (ESIP) Bloomington, Indiana http://meetings.esipfed.org/summer-meeting-2017

Aug. 6-11 Ecological Society of America (ESA) Portland, Oregon http://www.esa.org/portland/

Apr. 5-7 Research Data Alliance Plenary Montreal, Canada https://www.rd-alliance.org/plenaries/rda-tenth-plenary-meeting-montreal-canada

Outreach Update

Spring sees the launch of the DataONE Summer Internship Program, and we are pleased that once again we will be able to provide up to four interns the opportunity to work with DataONE on a broad range of projects and activities that support our mission to “Enable new science and knowledge creation through universal access to data about life on earth and the environment that sustains it”. Information on the internship program can be found at https://www.dataone.org/internships and we will be announcing the project opportunities in mid-February. With only a one-month application period, interested individuals should read through the application and eligibility requirements, and explore previous internship projects, in advance.

We are also collaborating on a three-week Open Science for Synthesis workshop being run by the National Center for Ecological Analysis and Synthesis (NCEAS). The workshop is designed to provide skills and training to established and early career researchers working on topics related to the Gulf of Mexico ecosystems. The training opportunity will be held at NCEAS in Santa Barbara, CA during July 2017 and the application period is currently open. Full information about the course and how to apply can be found at: https://www.nceas.ucsb.edu/OSS2017. Don’t delay: the deadline is February 20th.

DataONE continues to provide webinar presentations that engage participants in relevant and cutting-edge topics concerning data management within Earth and environmental sciences. Topics may be broad conceptual themes or more specific instructional webinars focussed on open science, stages of the data life cycle or community tools for data management. The remainder of webinar topics and speakers is listed below and we are always interested in soliciting topics and speakers for our next series, launching September 2017. Reach out via [email protected].

• February: Reproducible Science with Jupyter: Changing our publication models - Fernando Pérez

• March: Data Collection - Title TBA - Bob Arko

• April: Engaging Research Data Users and Contributors in Fostering a Culture of Re-use - Panel: Patricia Condon, Angela Murillo, Adam Kriesberg

• May: Sloan Sky Project and Big Data - Chris Borgman and Irene Pasquetto


1312 Basehart SE

University of New Mexico

Albuquerque, NM 87106

Fax: 505.246.6007

DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement.

Project Director:

William [email protected]

505.814.7601

Executive Director:

Rebecca [email protected]

505.382.0890

Director of Community Engagement and Outreach:

Amber [email protected]

505.205.7675

Director of Development and Operations

Dave [email protected]

CyberSpot

Status Update

The DataONE production environment now has 39 participating Member Nodes after the addition of the Digital Archaeological Record (tDAR), the Gulf of Mexico Research Initiative Information and Data Cooperative (GRIIDC), the Rolling Deck to Repository (R2R) Program, the Arctic Data Center, and the Biological & Chemical Oceanography Data Management Office (BCO-DMO) as Member Nodes within the federation. Combined, these Member Nodes provide access to more than 281,935 publicly readable, current version data sets containing over 934,914 data objects. Including prior revisions, a total of 1,723,603 individual objects are resolvable and retrievable through DataONE and the participating Member Nodes.

In January we released a major upgrade to our Generic Member Node (GMN) package. The GMN is a complete implementation of the DataONE Member Node stack and can be used by organizations to expose their data through DataONE. The GMN version 2 (GMN v2) fully supports the latest version of the DataONE architecture, including support for representing data that gets updated (Series ID) and simplified authentication using JSON Web Tokens (JWT). In addition, GMN v2 simplifies maintenance of system metadata, and it contains a number of performance and usability improvements. Several current DataONE Member Nodes have chosen GMN as their Member Node software solution, including the Northwest Knowledge Network (https://www.northwestknowledge.net/), the International Arctic Research Center (http://www.iarc.uaf.edu/), the Long Term Ecological Research Network (https://lternet.edu/), and the Nevada Research Data Center (https://sensor.nevada.edu/NRDC/).

Development activities continue with our R and Matlab provenance support, enabling manual contribution of provenance metadata. This functionality is now complete within R and we are working to provide the same within Matlab. Look for release announcements coming soon.

Figure 1: Counts of data/metadata/resource maps uploaded to DataONE since release in July 2012. [Plot: y-axis in thousands of objects, 0 to 1000; x-axis date uploaded to DataONE, Jun '12 to Nov '16; series shown: Data, Metadata.]
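For readers unfamiliar with the JSON Web Tokens mentioned in the GMN v2 notes, the sketch below shows the general shape of a token: three base64url segments (header, payload, signature) joined by dots. It is a generic illustration and not GMN code. The sample token, its claims, and the helper names are made up, the signature is a placeholder that is never verified, and sending the token as an `Authorization: Bearer` header is the common HTTP convention, assumed here rather than taken from the newsletter.

```python
import base64
import json


def b64url_encode(raw: bytes) -> str:
    """Base64url-encode bytes and drop the '=' padding, as JWTs do."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring any stripped padding."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def decode_jwt_unverified(token: str):
    """Split a JWT and return (header, payload) as dicts.

    This only inspects the token. It does NOT check the signature, so it
    must never be used to decide whether a token can be trusted.
    """
    header_b64, payload_b64, _signature_b64 = token.split(".")
    header = json.loads(b64url_decode(header_b64))
    payload = json.loads(b64url_decode(payload_b64))
    return header, payload


def bearer_header(token: str) -> dict:
    """Build the conventional HTTP header for presenting a token (assumption)."""
    return {"Authorization": f"Bearer {token}"}


# Hypothetical token assembled by hand purely for illustration.
sample_header = {"alg": "RS256", "typ": "JWT"}
sample_payload = {"sub": "example-user", "exp": 1700000000}
sample_token = ".".join([
    b64url_encode(json.dumps(sample_header).encode("utf-8")),
    b64url_encode(json.dumps(sample_payload).encode("utf-8")),
    b64url_encode(b"placeholder-signature"),
])

header, payload = decode_jwt_unverified(sample_token)
```

Because the token is a single string, a client only has to attach it to each request, which is one plausible reading of the "simplified authentication" described above.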

DataONE Webinar Series

Upcoming Webinar

Reproducible Science with Jupyter: Changing our publication models

Fernando Pérez

Tuesday Feb 14th 0900 Pacific

https://attendee.gotowebinar.com/register/8259955039642712066