

Page 1: Data Management – A Scientific Tour

Shaun de Witt (0, 1), Rob Akers (1), Hannes Thiemann (2), Margareta Hellström (3), Alberto Michelini (4), Massimo Farres (4), Peter Danecek (4)

0 – Corresponding author [email protected]
1 – Culham Centre for Fusion Energy
2 – Deutsches Klimarechenzentrum
3 – University of Lund
4 – Istituto Nazionale di Geofisica e Vulcanologia

CCFE is the fusion research arm of the United Kingdom Atomic Energy Authority. This work was part-funded by the RCUK Energy Programme [grant number EP/I501045] and the European Union’s Horizon 2020 research and innovation programme.

Page 2: Agenda

• What is Data Management
• Data Management in History
• Who cares about data management?
• What is this data management stuff anyway?
• What’s wrong with the way we do it now?
• OK – so how do other people do it?
• So how can we improve (aka the paradigm shift)…

Page 3: Data Management – Definition and Reasons

• ‘Data Resource Management is the development and execution of architectures, policies, practices and procedures that properly manage the full data lifecycle needs of an enterprise.’ – Data Management Association
  – ‘…deliver the scientific discoveries and major scientific tools that transform our understanding of nature and advance the energy, economic, and national security of the United States.’ – DoE Office of Science
  – ‘Good research data management is not a goal in itself, but rather the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse.’ – EC
  – ‘…maximising the benefits from ARC-funded research, including by ensuring greater access to research data.’ – Australian Research Council

Page 4: A Pictorial History

• 40,000-5,500 BC
  – Slow write rates
  – OK long term preservation (needs right conditions)
  – Subject to deliberate and accidental corruption/overwrite
  – Mainly used for recording hunting parties
• 5000 BC – 75 AD
  – Very slow write rates
  – Excellent long term archival
  – Corruption unlikely
  – Mainly used for tax recording
• 2500 BC – present
  – OK write and read rates
  – Subject to loss and corruption
  – Low data density
  – Many uses
• 1600’s – present
  – Good archival properties under right conditions
  – Slow write rates, low data density
  – Easily corrupted/destroyed
  – Read is very subjective
  – Lacks standardisation
• 1890’s – 2000’s
  – High data density
  – High write rate, slow read rate
  – Good archival material
  – Not subject to deliberate corruption
  – Needs good archival techniques
• 1950’s – 1970’s (2002 for tape)
  – Rise of magnetic media
  – Winchester hard disk (60-140MB)
  – Reel tapes (160MB@57kbps)
• PC’s, 1981 – present
  – HDD –
  – Floppy disk – 0.36-1.44MB
  – Exabyte tapes – 20GB@3MB/s
  – Tape libraries – 7 tapes
• Rise of the supercomputer
  – Huge disk arrays (PB’s)
  – Even huger tape libraries (500PB@>140MB/s)

Page 5: Why the Explosion?

Research infrastructure trends (timeline: Middle Ages, 19th century, 20th century, 21st century): internationalisation, diversification

Large scale projects: SKA, ITER, LHC, Human Brain Project, IPCC/CMIP

Page 6: Mirrored in Fusion

[Figure: Daily Data Volumes. Volume of data (GB/day) on a logarithmic axis from 1 to 100,000,000, plotted for MAST, JET 2007, JET 2017, ITER and DEMO.]

Page 7: Data Management and the Data Lifecycle

UK Data Archive
Digital Curation Centre

Page 8: Data Management – The FAIR Principle

• Promoted by EC and RDA
  – Findable
    • Descriptive metadata, human and machine readable, defined ontology
  – Accessible
    • Unique and persistent identifier, metadata, security, licensing
      – DOI, PID, ARK, PURL, URN, …
      – ‘Open’ data
  – Interoperable
    • Well defined file format, machine (and human) readable, metadata
    • Within a community as well as outside
  – Reusable
    • Bit preservation, format migration, metadata, provenance
    • Within a community as well as outside
• 4-star competing ‘principle’
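
As a rough illustration of the four FAIR headings, the sketch below models a dataset record and reports which headings it does not yet appear to satisfy. The `DatasetRecord` fields, the `fair_gaps` helper and the example identifier are all hypothetical, chosen for illustration only; they are not taken from any EC or RDA specification.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; field names are illustrative assumptions."""
    title: str = ""
    keywords: list[str] = field(default_factory=list)    # Findable: descriptive metadata
    persistent_id: str = ""                               # Accessible: DOI, PID, ARK, PURL, URN, ...
    licence: str = ""                                     # Accessible: licensing
    file_format: str = ""                                 # Interoperable: well defined format
    provenance: list[str] = field(default_factory=list)  # Reusable: how the data was produced


def fair_gaps(record: DatasetRecord) -> list[str]:
    """Return the FAIR headings for which this record is missing information."""
    checks = {
        "Findable": bool(record.title and record.keywords),
        "Accessible": bool(record.persistent_id and record.licence),
        "Interoperable": bool(record.file_format),
        "Reusable": bool(record.provenance),
    }
    return [heading for heading, ok in checks.items() if not ok]


record = DatasetRecord(
    title="Example plasma current signal",
    keywords=["fusion", "plasma current"],
    persistent_id="doi:10.0000/example",  # made-up identifier
    licence="CC-BY-4.0",
    file_format="HDF5",
)
print(fair_gaps(record))  # the record has no provenance, so "Reusable" is reported
```

A check like this only tests that fields are present; whether the metadata is actually good enough for reuse still needs human judgement.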

Page 9: Fusion Data Management

• Fusion Experiments operate as Islands
  – Each site does an excellent job of curation
  – Getting remote access is a bit like making a phone call in the 1930’s
  – Sometimes it’s easier to go to the data – but that can be difficult and expensive
  – Local Standards (at worst) but two main ways of accessing data
    • HDF5 (open standard, widely used, self describing)
    • MDSPlus (open source, optimised for fusion, bridges for reading HDF5 files)
    • UDA provides abstraction (another standard?)
• ITER Data may be federated geographically
  – Islands won’t work any more

Page 10: Future of Fusion DM?

• Data still stored at islands
  – Automated Data Replication
  – Federated and Optimised Data Access (bridging the islands)
  – Ability to ‘prestage’ data to HPC
• Data Abstraction
  – Access any data in a well defined format
  – Without the need for code recompilation
  – Using something like OpenDDL (bridging the formats), http://openddl.org/
• Why?
  – Analyse any data, anywhere, anytime (AAA)
  – Test UDA at scale
• What’s Missing?
  – Federated AAI
  – Metadata Standards
  – Data placement rules
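
The Data Abstraction bullets describe reading any data through one well defined interface without recompiling analysis code. A minimal sketch of that idea, assuming a plugin registry keyed by URI scheme, is given below. The `register_reader` and `read_signal` names, the `mem://` scheme and the in-memory backend are hypothetical; this is not the UDA, MDSPlus, HDF5 or OpenDDL API.

```python
from typing import Callable
from urllib.parse import urlparse

# A reader takes the part of the URI after the scheme and returns samples.
Reader = Callable[[str], list[float]]

_READERS: dict[str, Reader] = {}


def register_reader(scheme: str, reader: Reader) -> None:
    """Register a backend at run time, so new formats need no recompilation."""
    _READERS[scheme] = reader


def read_signal(uri: str) -> list[float]:
    """Dispatch to whichever backend handles the URI scheme."""
    parsed = urlparse(uri)
    try:
        reader = _READERS[parsed.scheme]
    except KeyError:
        raise ValueError(f"no reader registered for scheme {parsed.scheme!r}") from None
    return reader(parsed.netloc + parsed.path)


# Stand-in backend holding data in memory; a real one would wrap a site's store.
_FAKE_STORE = {"siteA/shot42/ip": [0.0, 0.5, 1.0]}
register_reader("mem", lambda key: _FAKE_STORE[key])

print(read_signal("mem://siteA/shot42/ip"))
```

Analysis code calls only `read_signal`, so adding a backend for another site or format is a matter of registering one more reader.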

Page 11: Federated Authentication is HARD???

BUT IT CAN AND HAS BEEN DONE

• Problem is one of politics and perception, not technical

Page 12: Examples of Federated Data Management

Michael Loughlin

Page 13: European Plate Observing System (EPOS)

https://www.epos-ip.org/

• 25 Countries
• 141 Institutions
• 128 Laboratories
• 7224 Sensors
• 256 Research Infrastructures
• >1PB of data

Page 14: Metadata

Page 15

The ESGF peer-to-peer enterprise system provides services and resources essential for global-scale Earth system science. This system is developed, deployed and maintained by an international multi-agency federation. ESGF’s open source, operational code base disseminates petabytes of data including model simulations, observational, and reanalysis data for research assessments.

1 CEDA, UK
2 DKRZ, Germany
3 ANU NCI, Australia
4 NOAA GFDL, U.S.
5 NASA GSFC, U.S.
6 IPSL, France
7 NASA JPL, U.S.
8 DOE LLNL, U.S.
9 LiU, Sweden

Page 16

ESGF ensures equal access to large disparate datasets. It enables scientists to evaluate models, understand their differences, and explore the impacts of geophysical disturbances through a common interface, regardless of data location.

[Figure: ESGF capabilities, each marked in the original slide as a Current Capability or a Future Capability: Data Management, Distributed Search, Federation, Analysis and Visualization, Provenance Capture, Security, Network, Compute Facilities, Dynamic Resources, Data Transfer, Long-Tail Publication, Data Citation, Machine Learning.]

Page 17: ESGF: Digital Footprint

Supports >700,000 datasets from universities as well as national and international laboratories. ~4 million datasets downloaded.

Manages >5 PB of data in the total ESGF federated archive, which is expected to expand to >40 PB of uncompressed data, distributed across >25 projects and ~70 model intercomparison projects (MIPs).

Services 18 highly visible national and international geophysical data products, including CMIP3, CMIP5, and soon CMIP6.

Page 18

[Figure: ICOS data flow. Labels recoverable from the diagram:]
• Ecosystem, atmosphere and ocean measurement stations: sensor data (Near Real-Time); measurement & station metadata
• Ecosystem Thematic Centre, Atmospheric Thematic Centre, Oceanic Thematic Centre, Central Analytical Laboratory
• Flows: raw data + metadata; Near Real-Time data & metadata; QA/QC:ed data & metadata; data products & metadata; elaborated data products
• ICOS Carbon Portal, ICOS repository (data & metadata)
  – Data ingestion
  – Metadata services
  – Data discovery & access
  – Usage tracking
  – Data management
  – Repository administration
  – Preservation planning
  – User community support
• Metadata registry & catalogue services
• External HPC & HTC services: (user-initiated) on-demand computing; compute request, results
• Wide range of end user communities (User 1, User 2, User 3, User 4)

Data volumes
• 50-100TB/yr (RAW)
• 2 GB/year (processed)
• 100GB/year (high level)

Access
• Anyone can ACCESS (after registration)
• More formal procedures for upload
• SPARQL end-point for machine searchability
• Custom AAI CPAuth including local accounts and social ID
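
A SPARQL end-point lets a program search the metadata over HTTP. The sketch below only builds the request URL, using the `query` parameter defined by the SPARQL protocol; it does not contact any server. The endpoint URL and the Dublin Core title predicate are placeholders, not the ICOS Carbon Portal’s actual address or vocabulary.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ENDPOINT = "https://example.org/sparql"  # placeholder, not a real endpoint

QUERY = """
SELECT ?dataset ?title
WHERE { ?dataset <http://purl.org/dc/terms/title> ?title }
LIMIT 10
""".strip()


def build_query_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as an HTTP GET URL via the 'query' parameter."""
    return f"{endpoint}?{urlencode({'query': query})}"


url = build_query_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL again should give back the original query text.
decoded = parse_qs(urlparse(url).query)["query"][0]
print(decoded == QUERY)
```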

Page 19

Page 20

[Figure: tiered data distribution hierarchy showing Tier 0 and Tier 0R, multiple Tier 1 sites and multiple Tier 2 sites.]

Made possible through the use of xrootd (http://xrootd.org/)

Data accessed via GUID or algorithmic data placement

X509 Security with Virtual Organisations and Roles
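
Algorithmic data placement means a file’s location is computed from its identifier rather than looked up in a central catalogue. A minimal sketch follows, assuming a simple hash-modulo rule over a fixed list of sites. The site names and the `place` function are hypothetical, and this is not a description of how xrootd itself resolves files.

```python
import hashlib
import uuid
from typing import Sequence

SITES = ("tier1-a", "tier1-b", "tier1-c")  # hypothetical site names


def place(guid: str, sites: Sequence[str] = SITES) -> str:
    """Map a GUID to a site deterministically, with no catalogue lookup."""
    digest = hashlib.sha256(guid.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(sites)
    return sites[index]


guid = str(uuid.uuid4())
# Any client that computes place(guid) arrives at the same site.
print(guid, "->", place(guid))
```

One known weakness of a plain modulo rule is that adding or removing a site changes the placement of most files; schemes such as consistent hashing are a common way to reduce that reshuffling.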

Page 21: European Open Science Cloud

• Data Management Services
• Global searchable metadata service
• Long Term Archival
• Repository Services
• Data Transfer Services
• Federated AAI

• Federated Compute Services (HTC)
• Cloud Computing Infrastructure
• Application on Demand
• Data Transfer Services
• Federated AAI

• Execution Frameworks
• PaaS
• Data Analytics Services
• Federated AAI

Page 22: Key Challenges to Federated Data Management

• Agreements are Key
  – AAI and Metadata
• Make sure users are involved in decision making process
  – Good governance
• Make a function centric architecture
• Communication
  – Semantics and syntax

• Loss of control
• What’s in it for me
• Lack of funding
• Data models don’t replace the need for information architecture
• Lack of infrastructure
• Legacy systems

Derived from Enterprise Architecture Shared Interest Group 2003

Page 23: Conclusion

• Scaling up current solutions is not an option
  – Federation and ease of access are key
• It takes longer to set up an infrastructure than planned
  – Start designing and limited scale testing now – 5 years before first ignition is too late
• Start with the hard things at current scale
  – Metadata, federated authentication, provenance, interoperability, efficient data movement
  – TRUST is KEY
• Make sure ALL stakeholders are involved
  – But drive has to come from the top

Page 24: BACKUP

Page 25: EUDAT Authentication (SAML/X509)

Page 26: Integrating an existing identity provider

Page 27: ESGF AAI

Page 28: Accessible Attributes (EGI – X509/openID/SAML)

Attribute friendly name     Attribute OID                      Example value
eduPersonUniqueId           urn:oid:1.3.6.1.4.1.5923.1.1.1.13  ef72285491ffe53c39b75bdcef46689f5d26ddfa00312365cc4fb5ce97e9ca87@egi.eu
mail                        urn:oid:0.9.2342.19200300.100.1.3  [email protected]
displayName                 urn:oid:2.16.840.1.113730.3.1.241  John Doe
givenName                   urn:oid:2.5.4.42                   John
sn                          urn:oid:2.5.4.4                    Doe
eduPersonAssurance          urn:oid:1.3.6.1.4.1.5923.1.1.1.11  https://aai.egi.eu/LoA#Substantial
distinguishedName           urn:oid:2.5.4.49                   /C=NL/O=Example.org/CN=John Doe
eduPersonScopedAffiliation  urn:oid:1.3.6.1.4.1.5923.1.1.1.9   [email protected]
eduPersonEntitlement        urn:oid:1.3.6.1.4.1.5923.1.1.1.7   urn:mace:egi.eu:www.egi.eu:wiki-editors:[email protected]
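
A service receiving these attributes may see them keyed by OID rather than by friendly name. The sketch below uses the OID and friendly-name pairs listed on this slide to rename incoming attributes. The `to_friendly` helper and the sample input are illustrative assumptions, not part of any EGI library.

```python
# OID -> friendly name, copied from the attribute list on this slide.
OID_TO_NAME = {
    "urn:oid:1.3.6.1.4.1.5923.1.1.1.13": "eduPersonUniqueId",
    "urn:oid:0.9.2342.19200300.100.1.3": "mail",
    "urn:oid:2.16.840.1.113730.3.1.241": "displayName",
    "urn:oid:2.5.4.42": "givenName",
    "urn:oid:2.5.4.4": "sn",
    "urn:oid:1.3.6.1.4.1.5923.1.1.1.11": "eduPersonAssurance",
    "urn:oid:2.5.4.49": "distinguishedName",
    "urn:oid:1.3.6.1.4.1.5923.1.1.1.9": "eduPersonScopedAffiliation",
    "urn:oid:1.3.6.1.4.1.5923.1.1.1.7": "eduPersonEntitlement",
}


def to_friendly(attributes: dict[str, str]) -> dict[str, str]:
    """Rename known OIDs to friendly names; unknown keys pass through unchanged."""
    return {OID_TO_NAME.get(key, key): value for key, value in attributes.items()}


# Sample input using the example values from the slide.
incoming = {
    "urn:oid:2.5.4.42": "John",
    "urn:oid:2.5.4.4": "Doe",
    "urn:oid:2.16.840.1.113730.3.1.241": "John Doe",
}
print(to_friendly(incoming))
```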

Page 29: EGI Application on Demand