
Databases & Data Infrastructure

Kerstin Lehnert

Access to Data is Needed

to allow verification of research results

to allow re-use of data

“The road to reuse is perilous” (1)

Accessibility

Discovery, long-term access, permissions

Usability

understand what was measured and how (materials and methods), computations that were applied, presentation of data (units, symbols, etc.)

ability to apply standard tools to all file formats

Motivation

Professional benefits vs effort and economic burden of publication; policies

(1) Rees, Jonathan (2010): “Recommendations for independent scholarly publication of data sets.” Creative Commons Working Paper

From Databases to Data Infrastructure

Data-driven science creates new requirements for data:

Data need to be discoverable.

Data need to be persistently and reliably accessible.

Data need to be curated and reviewed for quality assurance.

Data need to be unambiguously identifiable & locatable.

Data need to be citable.

Data need to be interoperable.

This requires the development of a data infrastructure.

Trusted repositories instead of informal databases.

September 10, 2013

From Databases to Data Infrastructure

Technological Infrastructure

Workforce

Management Models

Distributed versus centralized databases

Control, oversight

Financial Support

Legal & Policy Framework

Open Access policies

Policy enforcement

Cultural & Behavioral Changes

Data sharing

Data citation

From: Arzberger et al., Science, 2004

Where Are We Now?

Few data repositories fulfill the requirements.

National data centers (NCDC, NGDC, NSIDC, etc.)

Domain-specific data facilities: IRIS, BCO-DMO, IEDA (MGDS, EarthChem), etc.

Most databases don’t, but at least provide access to data.

local, no standards

single point of failure

not persistent

Many gaps in coverage

Too much dark legacy data


From: http://www.elsevier.com/about/content-innovation/database-linking

How Can We Advance?

Infrastructure

Sustained and comprehensive repositories

Tools and workflows for data management, including data publication

Best practices, standards

Incentives for data sharing

Credit (data citation, bibliometrics for data)

Better science

Policy enforcement

Funding agencies

Publications

Advancing Data Infrastructure

CIF21 & EarthCube

Community building

Building Blocks development

Interoperability

Growing number of initiatives & organizations to develop and implement best practices and policies

Research Data Alliance

CODATA/World Data Systems

BRDI

New approaches to data publication & citation

EarthCube

Transform the conduct of research in geosciences by supporting the development of community-guided cyberinfrastructure to integrate data and information for knowledge management across the Geosciences.

source: B. Ransom, NSF, 2012

EarthCube Progress

2-year planning and community building phase

charrettes

community & concept awards

domain end-user workshops

Started initial developments:

Research Coordination Networks funded in summer 2013

Building Blocks & Conceptual Designs funded in summer 2013

CINERGI – Inventory of CI resources

Test Enterprise Governance starting Sept 15

Closer collaboration of data facilities

Web Services Interop Building Block project

Consortium of Data Facilities (workshop coming up)

The ‘D8’ and/or ‘D20’ concept

Guidelines & Best Practices

“The Research Data Alliance aims to accelerate and facilitate research data sharing and exchange.”

“… ensuring the long-term stewardship and provision of quality-assessed data and data services to the international science community and other stakeholders.”

• data publication

• open access

• data attribution & citation

• data standards

• trustworthiness of repositories

Data Publication: Options

Data Paper

Institutional Repositories

Disciplinary Repositories

Conventional publication

Role of Data Repositories

Ensure long-term preservation

Data documentation (catalog metadata)

Persistent & unique identification

Sustainable infrastructure & business models

Ensure Usability (Disciplinary Repositories!)

Adopt and/or develop community-based standards for documenting:

Provenance of data (collection strategies, procedures and underlying assumptions)

Data precision, errors, workflows for data quality assurance

Comply with standards for data representation (formats, semantics, etc.)

QA/QC of datasets and metadata

Science-driven tools for data search & access

Standards-based interfaces for programmatic access

Integrate data for analysis

Domain-specific Repositories: Linking Stakeholders

Domain-specific Resource Collections

Long-term archiving

Tool development

Interoperability

Links to publications & bibliometrics

Data policies

Enforcement

Science Community

Computer Sciences

Libraries, data centers

Publishers, editors

Funding agencies

Norms & needs

Systems & services

Data Publication Process Example: IEDA

Synthesis databases

Journal Portal

[XML]

IEDA Metadata Catalog

Data

Manuscript

DOI linking

Review

IEDA Data Managers

Editors

Submission Publication Integration

IEDA Data Managers

EarthChem Standards for Data Publication

Following recommendations of the Editors Roundtable (Policy Statement released in 2009: www.earthchem.org/editors)

complete disclosure of data used in a publication

full documentation of data provenance & quality (uncertainty)

unique identification of samples

geospatial & taxonomic information about samples

Currently reviewing policy statement to align with emerging best practices and new publication capabilities

submission of data to repositories as part of editorial process

‘data review’ by repositories
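A repository-side 'data review' of the points listed above (unique sample identification, geospatial information, documented provenance and uncertainty) could be partly automated. The following is only a minimal sketch under stated assumptions: the field names (`sample_id`, `latitude`, `longitude`, `method`, `uncertainty`) and the sample values are hypothetical and do not reflect the EarthChem submission schema.

```python
def check_samples(samples):
    """Return a list of human-readable problems found in sample records.

    Each record is a dict; the keys used here are illustrative assumptions.
    """
    problems = []
    seen_ids = set()
    for i, s in enumerate(samples):
        # Unique identification of samples
        sid = s.get("sample_id")
        if not sid:
            problems.append(f"record {i}: missing sample identifier")
        elif sid in seen_ids:
            problems.append(f"record {i}: duplicate sample identifier {sid}")
        else:
            seen_ids.add(sid)

        # Geospatial information about samples
        lat, lon = s.get("latitude"), s.get("longitude")
        if lat is None or lon is None:
            problems.append(f"record {i}: missing geospatial information")
        elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append(f"record {i}: coordinates out of range")

        # Documentation of data quality (uncertainty) and provenance (method)
        if s.get("uncertainty") is None:
            problems.append(f"record {i}: no stated uncertainty")
        if not s.get("method"):
            problems.append(f"record {i}: no analytical method (provenance)")
    return problems


# Hypothetical records: the second one repeats an identifier and lacks documentation.
samples = [
    {"sample_id": "S-001", "latitude": -77.5, "longitude": 167.2,
     "method": "XRF", "uncertainty": 0.05},
    {"sample_id": "S-001", "latitude": -95.0, "longitude": 10.0,
     "method": "", "uncertainty": None},
]

for problem in check_samples(samples):
    print(problem)
```

Checks of this kind only cover the formal part of a review; judging whether the documented provenance is adequate still needs a data manager with domain knowledge.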

DOI to allow proper citation

Link to publications

Link to funding source
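As a minimal sketch of how a DOI can tie a dataset to its citation, related publications, and funding source, the record below uses hypothetical values throughout: the DOIs, names, repository, and award number are placeholders, and the citation layout is one common pattern rather than a format prescribed in these slides.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry; field names are illustrative only."""
    creators: list[str]
    year: int
    title: str
    repository: str
    doi: str  # persistent identifier, without the resolver prefix
    related_publications: list[str] = field(default_factory=list)  # DOIs of papers
    funding_awards: list[str] = field(default_factory=list)        # award numbers

    def doi_url(self) -> str:
        # DOIs are resolved through the doi.org resolver
        return f"https://doi.org/{self.doi}"

    def citation(self) -> str:
        authors = "; ".join(self.creators)
        return (f"{authors} ({self.year}): {self.title}. "
                f"{self.repository}. {self.doi_url()}")


# All values below are placeholders, not real identifiers.
record = DatasetRecord(
    creators=["Doe, J.", "Roe, R."],
    year=2013,
    title="Example geochemical dataset",
    repository="Example Repository",
    doi="10.1234/example.5678",
    related_publications=["10.1234/example.paper"],
    funding_awards=["EXAMPLE-0000000"],
)

print(record.citation())
```

Keeping the publication and award links on the same record as the DOI is what lets a repository report data use back to authors and funders.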

Linking Data & Publications


Multi-Disciplinary Data Science Implementation

Schematic for Deep Carbon Virtual Observatory and Interoperability

Integrated Applications: discovery visualizations; analytics and mining; Global Census, Virtual Mineral Laboratory, ...

Application-level mediation: vocabulary, mapping to science and data terms

Software, Tools & Apps: Deep Energy/Life Applications; Physics/Chemistry Models; Res/Flux Applications

Emission/Compositions

Semantic mediation: physics, chemistry, mineral, emission data - ChemML, ...

Data Repositories: GVP, MINDAT, EOS, EarthChem, ...

Semantic interoperability: semantic query, hypothesis and inference; query, access and use of data; metadata, schema, data

Slide: Courtesy of Peter Fox, RPI (July 2012)

What makes data useful?

“Knowing that I can trust the numbers.”

Data Quality Standards

Science Technology

Norms

Standards

Tools

Data Citation

Polar Data Infrastructure

Don’t reinvent

Many data types already have well-established repositories, standards, best practices, community governance

Integration of polar data into appropriate disciplinary repositories will augment their quality and usage

Polar data are diverse; it is difficult to cover all data types with the appropriate level of expertise

Fill obvious gaps

Leverage existing data infrastructure

Follow standards for data publication, metadata, and repository trustworthiness

The CI Vision

Enable new forms of scholarship that are

information-intensive

data-intensive

distributed

collaborative

multi-disciplinary

From Elmagarmid et al. (2008): “Community-Cyberinfrastructure-Enabled Discovery in Science and Engineering”

Opportunities for Polar CI

Leverage EarthCube developments

Build capabilities that can be used by other communities

Adopt & adapt developments that are useful

Use EarthCube Resources

Stakeholder Alignment survey

Lists of existing resources

Gap analyses

Use cases / science scenarios

Social network (EarthCube MatchMaker)

Experiences of community building & governance

Big Data

Long Tail Data

Sample-based

Geospatial Grids & Vectors

Categorical

Recurring Themes of CI Gaps

Data (& samples): access, coverage, integration, standards

Models: dynamic, shared, linked

Interdisciplinary conceptual frameworks

Data analysis tools: visualization, multivariate analysis, statistics

Data management support: workflows, software, education

Knowledge: limitations and uses of data and models across, within and between disciplines

Community: collaboration, shared knowledge of existing resources

Enriched Links (under development)

CZOData II Architecture

Local CZOs

Local CZO Web Site

CZO Display Files

Standards-based Web Service Clients

EarthChem

CZO Main Web Portal

CZO Main Web Site

OpenTopography (LiDAR)

CUAHSI HIS

CZO Central Data Management System

CZO Central Harvester

CZO Central Coordination Functions

CZO Central Data Repositories

Non-CZO Integrated Data & Discovery Sites

Clients

CZO-IGSN Registration System

Shared Vocabulary System

CZchemDB System (w/ EarthChem Service Interface)

CZO Central Hydro System (w/ WFS & CUAHSI HIS Service Interface)

DataONE

DataONE Interface

CZO Metadata Catalog (w/ CSW Service Interface)

SESAR

Data Management Tools

CZO Data Discovery Portal

Time Series Data Display & Access Tool

CZchemDB Data Access Tool

Leverage existing systems & capabilities

Build integrative components

Data Infrastructure for Polar Sciences

Data Diversity: A unique situation?

Many disciplines

Big Data versus Long Tail

International

Disciplinary Repositories

Ensure Usability

Develop/promote community-based data reporting standards

Provenance of data, data precision, errors, etc.

Work with publishers, editors, professional societies

Align with other data & interoperability standards

Provide services for persistent data identification (DOI), data attribution and citation, long-term archiving, etc.

Advance Access

provide science-driven tools for data search & access

provide programmatic interfaces for cross-disciplinary use

links to publications

Data Infrastructure

Acquisition

Access Analysis

Archiving

The Foundation: Data

Open access to a global, distributed knowledge base of scientific data and information, including legacy data

Seamless integration of data

within disciplines

across disciplines

with tools (visualization, analysis) and models
