CCFE is the fusion research arm of the United Kingdom Atomic Energy Authority.
This work was part-funded by the RCUK Energy Programme [grant number EP/I501045]
and the European Union’s Horizon 2020 research and innovation programme.
Data Management
A Scientific Tour
Shaun de Witt0,1, Rob Akers1, Hannes
Thiemann2, Margareta Hellström3, Alberto
Michelini4, Massimo Farres4, Peter Danecek4
0. -Corresponding author [email protected]
1 – Culham Centre for Fusion Energy
2 - Deutsches Klimarechenzentrum
3 – University of Lund
4 - Istituto Nazionale di Geofisica e Vulcanologia
• What is Data Management
• Data Management in History
• Who cares about data management?
• What is this data management stuff anyway?
• What’s wrong with the way we do it now?
• OK – so how do other people do it?
• So how can we improve (aka the paradigm shift)…
Agenda
• “Data Resource Management is the development and execution of
architectures, policies, practices and procedures that properly manage
the full data lifecycle needs of an enterprise.” – Data Management
Association
– ‘…deliver the scientific discoveries and major scientific tools that transform
our understanding of nature and advance the energy, economic, and
national security of the United States.’ – DoE Office of Science
– ‘Good research data management is not a goal in itself, but rather the key
conduit leading to knowledge discovery and innovation, and to
subsequent data and knowledge integration and reuse.’ – EC
– ‘…maximising the benefits from ARC-funded research, including by
ensuring greater access to research data.’ – Australian Research Council
Data Management – Definition and Reasons
A Pictorial History
• 40,000-5,500 BC
– Slow write rates
– OK long term preservation (needs right conditions)
– Subject to deliberate and accidental corruption/overwrite
– Mainly used for recording hunting parties
• 5000 BC – 75 AD
– Very slow write rates
– Excellent long term archival
– Corruption unlikely
– Mainly used for tax recording
• 2500 BC – present
– OK write and read rates
– Subject to loss and corruption
– Low data density
– Many uses
• 1600’s – present
– Good archival properties under right conditions
– Slow write rates, low data density
– Easily corrupted/destroyed
– Read is very subjective
– Lacks standardisation
• 1890’s – 2000’s
– High data density
– High write rate, slow read rate
– Good archival material
– Not subject to deliberate corruption
– Needs good archival techniques
• 1950’s – 1970’s
– 2002 for tape
– Rise of magnetic media
– Winchester hard disk (60-140MB)
– Reel tapes (160MB@57kbps)
• PC’s, 1981 – present
– HDD –
– Floppy disk – 0.36-1.44MB
– Exabyte tapes – 20GB@3MB/s
– Tape libraries – 7 tapes
• Rise of the Supercomputer
– Huge disk arrays (PB’s)
– Even huger tape libraries (500PB@>140MB/s)
Why the Explosion?
Research Infrastructure trends: Internationalisation, Diversification
[Timeline: Middle Ages, 19th century, 20th century, 21st century]
Large Scale Projects: SKA
ITER
LHC
Human Brain Project
IPCC/CMIP
ITER
Mirrored in Fusion
[Figure: Daily Data Volumes. Volume of data (GB/day) on a logarithmic axis from 1 to 100,000,000, for MAST, JET 2007, JET 2017, ITER and DEMO.]
Data Management and the Data Lifecycle
UK Data Archive
Digital Curation Centre
• Promoted by EC and RDA
– Findable
• Descriptive Metadata, Human and Machine Readable, Defined Ontology
– Accessible
• Unique and persistent identifier, metadata, security, licensing
– DOI, PID, ARK, PURL, URN, …
– ‘Open’ Data
– Interoperable
• Well defined file format, machine (and human) readable, metadata
• Within a community as well as outside
– Reusable
• Bit preservation, format migration, metadata, provenance
• Within a community as well as outside
• 4-star competing ‘principle’
Data Management – The FAIR Principle
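As a rough illustration of the FAIR bullets above, the sketch below checks a dataset record for the items the slide names under each letter. The record layout, field names and the mapping of fields to principles are assumptions made for this example only; they are not part of any FAIR specification or of the authors' tooling.

```python
from dataclasses import dataclass, field

# Hypothetical minimal record; field names are illustrative, not a standard.
@dataclass
class DatasetRecord:
    identifier: str = ""    # unique and persistent identifier (DOI, PID, ...)
    description: str = ""   # descriptive metadata
    licence: str = ""       # licensing information
    file_format: str = ""   # well defined file format
    provenance: list = field(default_factory=list)  # processing history

# Assumed mapping from each principle to the fields checked in this sketch.
FAIR_FIELDS = {
    "Findable": ["description"],
    "Accessible": ["identifier", "licence"],
    "Interoperable": ["file_format"],
    "Reusable": ["provenance"],
}

def fair_gaps(record):
    """Return {principle: [names of empty fields]} for principles with gaps."""
    gaps = {}
    for principle, names in FAIR_FIELDS.items():
        missing = [name for name in names if not getattr(record, name)]
        if missing:
            gaps[principle] = missing
    return gaps
```

A record with an identifier, description and file format but no licence or provenance would be reported as having gaps under Accessible and Reusable.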
Fusion Data Management
• Fusion Experiments operate as Islands
– Each site does an excellent job of curation
– Getting remote access is a bit like making a phone call in the 1930’s
– Sometimes it’s easier to go to the data – but that can be difficult and expensive
– Local Standards (at worst) but two main ways of accessing data
• HDF5 (open standard, widely used, self describing)
• MDSplus (open source, optimised for fusion, bridges for reading HDF5 files)
• UDA provides abstraction (another standard?)
• ITER Data may be federated geographically
– Islands won’t work any more
Future of Fusion DM?
• Data still stored at islands
– Automated Data Replication
– Federated and Optimised Data Access (bridging the islands)
– Ability to ‘prestage’ data to HPC
• Data Abstraction
– Access any data in a well defined format
– Without the need for code recompilation
– Using something like OpenDDL (bridging the formats), http://openddl.org/
• Why?
– Analyse any data, anywhere, anytime (AAA)
– Test UDA at scale
• What’s Missing?
– Federated AAI,
– Metadata Standards,
– data placement rules
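The data abstraction point (access any data in a well defined format, without recompiling analysis code) can be pictured as a small reader registry. This is only a sketch: the `scheme://path` addressing, the function names and the stand-in backends are hypothetical, and it does not reproduce the UDA, MDSplus or OpenDDL interfaces.

```python
from typing import Callable, Dict, List

Reader = Callable[[str], List[float]]
_READERS: Dict[str, Reader] = {}   # storage format name -> reader function

def register_reader(scheme: str) -> Callable[[Reader], Reader]:
    """Decorator that registers a reader for one storage format."""
    def decorator(fn: Reader) -> Reader:
        _READERS[scheme] = fn
        return fn
    return decorator

def read_signal(uri: str) -> List[float]:
    """Single entry point for analysis code, whatever the storage format."""
    scheme, sep, path = uri.partition("://")
    if not sep or scheme not in _READERS:
        raise ValueError(f"no reader registered for {uri!r}")
    return _READERS[scheme](path)

@register_reader("hdf5")
def _read_hdf5(path: str) -> List[float]:
    # Stand-in data: a real backend would open the file with an HDF5 library.
    return [0.0, 1.0, 2.0]

@register_reader("mdsplus")
def _read_mdsplus(path: str) -> List[float]:
    # Stand-in data: a real backend would query an MDSplus server.
    return [0.0, 1.0, 2.0]
```

Analysis code calls only `read_signal`, so supporting a new format means registering one more reader rather than changing or rebuilding the analysis code.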
Federated Authentication is HARD???
BUT IT CAN AND HAS BEEN DONE
• Problem is one of politics and perception, not technical
Examples of Federated Data Management
Michael Loughlin
European Plate Observing System (EPOS)
https://www.epos-ip.org/
• 25 Countries
• 141 Institutions
• 128 Laboratories
• 7224 Sensors
• 256 Research Infrastructures
• >1PB of data
Metadata
The ESGF peer-to-peer enterprise system provides services and resources essential for global-scale Earth system science. This system is developed, deployed and maintained by an international multi-agency federation. ESGF’s open source, operational code base disseminates petabytes of data including model simulations, observational, and reanalysis data for research assessments.
1 CEDA UK
2 DKRZ Germany
3 ANU NCI Australia
4 NOAA GFDL U.S.
5 NASA GSFC U.S.
6 IPSL France
7 NASA JPL U.S.
8 DOE LLNL U.S.
9 LiU Sweden
ESGF ensures equal access to large disparate datasets. It enables scientists to evaluate models, understand their differences, and explore the impacts of geophysical disturbances through a common interface, regardless of data location.
Data Management
Distributed Search
Federation
Analysis and Visualization
Provenance Capture
Security
Network
Compute Facilities
Dynamic Resources
Data Transfer
Long-Tail Publication
Data Citation
Machine Learning
Current Capabilities
Future Capabilities
ESGF: Digital Footprint
Supports >700,000 datasets from
universities as well as national and
international laboratories. ~4 million
datasets downloaded.
Manages >5 PB of data in the total
ESGF federated archive, which is
expected to expand to >40 PB of
uncompressed data, distributed across
>25 projects and ~70 model
intercomparison projects (MIPs).
Services 18 highly visible national and
international geophysical data products,
including CMIP3, CMIP5, and soon
CMIP6.
• 50-100TB/yr (RAW)
• 2 GB/year (processed)
• 100GB/year (high level)
Raw data
Ecosystem Thematic Centre
Atmospheric Thematic Centre
Oceanic Thematic Centre
• Data ingestion
• Metadata services
• Data discovery & access
• Usage tracking
• Data management
• Repository administration
• Preservation planning
• User community support
Central Analytical Laboratory
ICOS Carbon Portal
ICOS repository (data & metadata)
Ecosystem, atmosphere and ocean measurement stations
sensor data (Near Real-Time)
Measurement & station metadata
+ metadata
External HPC & HTC services
Metadata registry & catalogue services
Data products & metadata
Wide range of end user communities
User 1 User 2 User 3 User 4
Elaborated data products
Near Real-Time data & metadata QA/QC:ed data & metadata
(User-initiated) On-demand computing
Compute request Results
• Anyone can ACCESS (after registration)
• More formal procedures for upload
• SPARQL end-point for machine searchability
• Custom AAI CPAuth including local accounts and social ID
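To make the machine searchability point concrete, here is a minimal sketch of how a client might prepare a query for a SPARQL end-point and flatten the JSON results. The endpoint URL is a placeholder and the query is a generic one, not taken from the portal; the request shape (a `query` parameter on a GET, with a JSON results media type) follows the general SPARQL 1.1 conventions. Nothing is sent over the network here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder address for illustration; not the real portal end-point.
ENDPOINT = "https://example.org/sparql"

# Generic example query: list a few triples.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 5
"""

def build_request(endpoint, query):
    """Build (but do not send) a GET request carrying the query."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

def rows(results_json):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(results_json)
    names = doc["head"]["vars"]
    return [
        {name: binding[name]["value"] for name in names if name in binding}
        for binding in doc["results"]["bindings"]
    ]
```

Passing the request to `urllib.request.urlopen` and the response body to `rows` would complete the round trip against a real end-point.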
[Figure: tiered site hierarchy with Tier 0, Tier 0R, Tier 1 and Tier 2 sites]
Made possible through the use of xrootd (http://xrootd.org/)
Data accessed via GUID or algorithmic data placement
X509 Security with Virtual Organisations and Roles
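The slide does not say how the algorithmic data placement is computed. One common way to get that property, shown here purely as an assumption, is to derive the location from a hash of the identifier, so any client can work out where data lives without a catalogue lookup. The scope, file and site names used with it would be hypothetical.

```python
import hashlib

def placement_path(scope, name):
    """Derive a storage path from the data identifier alone."""
    digest = hashlib.sha256(f"{scope}:{name}".encode("utf-8")).hexdigest()
    # Two hash-derived directory levels spread files evenly across the tree.
    return f"/{scope}/{digest[0:2]}/{digest[2:4]}/{name}"

def placement_site(guid, sites):
    """Pick a site for a GUID by highest hash score (rendezvous hashing)."""
    def score(site):
        return hashlib.sha256(f"{site}|{guid}".encode("utf-8")).hexdigest()
    return max(sites, key=score)
```

Both functions are deterministic, so two clients given the same identifier and the same site list compute the same location independently.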
European Open Science Cloud
• Data Management Services
• Global searchable metadata service
• Long Term Archival
• Repository Services
• Data Transfer Services
• Federated AAI
• Federated Compute Services (HTC)
• Cloud Computing Infrastructure
• Application on Demand
• Data Transfer Services
• Federated AAI
• Execution Frameworks
• PaaS
• Data Analytics Services
• Federated AAI
• Agreements are Key
– AAI and Metadata
• Make sure users are
involved in decision
making process
– Good governance
• Make a function-centric
architecture
• Communication
– Semantics and syntax
Key Challenges to Federated Data Management
• Loss of control
• What’s in it for me
• Lack of funding
• Data models don’t replace
the need for information
architecture
• Lack of infrastructure
• Legacy systems
Derived from Enterprise Architecture Shared Interest Group 2003
• Scaling up current solutions is not an option
– Federation and ease of access are key
• It takes longer to set up an infrastructure than planned
– Start design and limited-scale testing now – 5 years before first ignition is too late
• Start with the hard things at current scale
– Metadata, federated authentication, provenance, interoperability, efficient data movement
– TRUST is KEY
• Make sure ALL stakeholders are involved
– But drive has to come from the top
Conclusion
BACKUP
EUDAT Authentication (SAML/X509)
Integrating an existing identity provider
ESGF AAI
Accessible Attributes (EGI – X509/openID/SAML)
Attribute friendly name     Attribute OID                       Example value
eduPersonUniqueId           urn:oid:1.3.6.1.4.1.5923.1.1.1.13   ef72285491ffe53c39b75bdcef46689f5d26ddfa00312365cc4fb5ce97e9ca87@egi.eu
mail                        urn:oid:0.9.2342.19200300.100.1.3   [email protected]
displayName                 urn:oid:2.16.840.1.113730.3.1.241   John Doe
givenName                   urn:oid:2.5.4.42                    John
sn                          urn:oid:2.5.4.4                     Doe
eduPersonAssurance          urn:oid:1.3.6.1.4.1.5923.1.1.1.11   https://aai.egi.eu/LoA#Substantial
distinguishedName           urn:oid:2.5.4.49                    /C=NL/O=Example.org/CN=John Doe
eduPersonScopedAffiliation  urn:oid:1.3.6.1.4.1.5923.1.1.1.9    [email protected]
eduPersonEntitlement        urn:oid:1.3.6.1.4.1.5923.1.1.1.7    urn:mace:egi.eu:www.egi.eu:wiki-editors:[email protected]
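A service receiving these attributes typically sees them keyed by OID and has to map them to friendly names. The sketch below shows one way to do that with a plain lookup table built from the rows above; the function name and the input shape (a dict of OID to value) are assumptions for illustration, not an EGI API.

```python
# OID to friendly name, copied from the attribute table above.
OID_TO_NAME = {
    "urn:oid:1.3.6.1.4.1.5923.1.1.1.13": "eduPersonUniqueId",
    "urn:oid:0.9.2342.19200300.100.1.3": "mail",
    "urn:oid:2.16.840.1.113730.3.1.241": "displayName",
    "urn:oid:2.5.4.42": "givenName",
    "urn:oid:2.5.4.4": "sn",
    "urn:oid:1.3.6.1.4.1.5923.1.1.1.11": "eduPersonAssurance",
    "urn:oid:2.5.4.49": "distinguishedName",
    "urn:oid:1.3.6.1.4.1.5923.1.1.1.9": "eduPersonScopedAffiliation",
    "urn:oid:1.3.6.1.4.1.5923.1.1.1.7": "eduPersonEntitlement",
}

def to_friendly(attributes):
    """Rename OID keys to friendly names; unknown OIDs are kept unchanged."""
    return {OID_TO_NAME.get(oid, oid): value for oid, value in attributes.items()}
```

Keeping unknown OIDs rather than dropping them means attributes added by an identity provider later are still visible to downstream code.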
EGI Application on Demand