
Page 1: Preserving Scientific Data

Preserving Scientific Data

Jamie Shiers, Information Technology Department, CERN, Geneva, Switzerland

Page 2: Preserving Scientific Data

UNESCO Information Preservation debate, April 2007 - [email protected]

Agenda

• Motivation for preserving scientific data – examples from a range of sciences

• Volume of data involved and related issues

• Some concrete archiving examples from Particle Physics

• Remaining challenges

• Conclusions

Page 3: Preserving Scientific Data


Motivation

• Climate data: in an era when climate change is hotly debated, the motivations appear clear…

• Medical data: important for understanding issues such as historical pandemics and cross-species diseases (avian flu, HIV, …)

• Cosmological data: plays a vital role in our evolving understanding of the Universe – astrophysics community has an explicit policy (data is made public after 1 year – data volume doubles each year)

• Particle Physics data: similar arguments apply. Will we ever be able to build accelerators similar to those of today? If we ‘lose’ this data, what of our scientific heritage? Old data sometimes has to be re-examined for a signal that should have been seen (this has happened several times)

Page 4: Preserving Scientific Data


http://www.damtp.cam.ac.uk/user/gr/public/bb_history.html

Standard Cosmology

Good model from 0.01 sec after Big Bang

Supported by considerable observational evidence

Elementary Particle Physics

From the Standard Model into the unknown: towards energies of 1 TeV and beyond: the Terascale

Towards Quantum Gravity

From the unknown into the unknown...

[Figure axes: Time; Energy, Density, Temperature]

Page 5: Preserving Scientific Data


Issues

• How much data is involved?

• Preserving the bits

• Understanding the bits

Page 6: Preserving Scientific Data


How much data is involved?

• In 1998, the following estimates were made regarding the data from LEP (1989 – 2000) that should be kept

Experiment   Analysis dataset   Reconstructable dataset
ALEPH        250GB              1-2TB
DELPHI       2-6TB (single figure given)
L3           500GB              5TB
OPAL         300GB              1-2TB

By today’s standards, these data volumes are trivial

• Even though the total volume of data at the LHC is very much higher, the data that must be kept beyond the life of the machine (2007 to ~2020) will be easily handled by then

The LHC will generate some 15PB of data per year!

[Figure: height comparison. Balloon (30 km); CD stack with 1 year of LHC data (~20 km); Concorde (15 km); Mt. Blanc (4.8 km)]
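As a rough cross-check of the CD-stack comparison, the height can be estimated from the 15PB per year figure. The CD capacity (700 MB) and disc thickness (1.2 mm) used below are assumptions, not values from the talk; with them the stack comes out at a few tens of kilometres, the same order of magnitude as the ~20 km shown.

```python
# Rough estimate of the height of a stack of CDs holding one year of LHC data.
# Assumptions (not from the talk): 700 MB per CD, 1.2 mm per disc, decimal units.
LHC_BYTES_PER_YEAR = 15e15   # 15 PB per year, as quoted on the slide
CD_CAPACITY_BYTES = 700e6    # assumed capacity of one CD
CD_THICKNESS_M = 1.2e-3      # assumed thickness of one bare disc


def cd_stack_height_km(total_bytes, capacity=CD_CAPACITY_BYTES,
                       thickness=CD_THICKNESS_M):
    """Return the height in km of the CD stack needed to hold total_bytes."""
    discs = total_bytes / capacity
    return discs * thickness / 1000.0


height_km = cd_stack_height_km(LHC_BYTES_PER_YEAR)
print(f"Estimated stack height: {height_km:.1f} km")
```

A slightly smaller per-disc thickness or a larger capacity moves the result closer to 20 km, so the sketch only confirms the scale of the comparison.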

Page 7: Preserving Scientific Data


The LHC machine - Overview

LHC: 27 km ring, 100 m underground

• ATLAS: general purpose, pp, heavy ions

• CMS + TOTEM: general purpose, pp, heavy ions

• ALICE: heavy ions, pp

• LHCb: pp, B-physics, CP violation

Page 8: Preserving Scientific Data


The size of HEP detectors

[Figure: size comparison. Labels: CMS, ATLAS, Bld. 40]

Page 9: Preserving Scientific Data


Understanding the bits

• In the mid-1990s, a successful re-analysis was made of 10-year-old data from the JADE collaboration at the PETRA accelerator at DESY

• A subset of the data was found abandoned in an office corner. The programs to read the data were written in an obsolete language and were unusable. The data format was proprietary (but decodable).

This provided valuable input into the LEP data archive

• Data format: will this be readable in 5 / 10 / 100 years? 1000?

• Programs: languages / operating systems / hardware platforms have very short life-spans wrt an archive

• Metadata: essential to understand what the data means
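One way to picture the metadata point is a small, human-readable sidecar file stored next to each data file. The sketch below is only an illustration under assumed names: the manifest fields, the file name and the record layout are hypothetical and are not taken from any of the archives discussed in the talk.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_manifest(data_path, description, record_layout, software):
    """Return a dict describing a data file well enough to decode it later."""
    data = Path(data_path).read_bytes()
    return {
        "file": Path(data_path).name,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),  # detects later corruption
        "description": description,                  # what the data means
        "record_layout": record_layout,              # how to decode the bits
        "software": software,                        # what was used to read it
    }


def write_manifest(data_path, manifest):
    """Write the manifest as plain JSON next to the data file."""
    sidecar = Path(str(data_path) + ".manifest.json")
    sidecar.write_text(json.dumps(manifest, indent=2, sort_keys=True),
                       encoding="utf-8")
    return sidecar


# Demonstration with a tiny stand-in file (all values are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "run0001.dat"
    data_file.write_bytes(b"\x00\x01\x02\x03")
    manifest = build_manifest(
        data_file,
        description="Example event data, one record per event",
        record_layout=[{"name": "event_id", "type": "uint32", "endian": "little"}],
        software={"language": "Fortran 77", "notes": "reader source archived separately"},
    )
    sidecar = write_manifest(data_file, manifest)
    reloaded = json.loads(sidecar.read_text(encoding="utf-8"))
    print(json.dumps(reloaded, indent=2))
```

Plain text is chosen for the sidecar because it is more likely to remain readable than any binary container, but it does not by itself capture detailed knowledge of the apparatus.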

The best solution to date is a so-called ‘Museum system’, but this is still a very short-term solution with respect to even Einstein, let alone Tycho Brahe, Kepler and Newton…

Page 10: Preserving Scientific Data


Preserving the bits

• Lifetimes of Particle Physics experiments are extremely long! Currently measured in decades…

• Ironically, one of the solutions proposed for the LEP data archive (the then-current proposal for the LHC) was later abandoned (for technical / commercial reasons)

• This necessitated a ‘triple migration’: of 300TB of data between storage media; of the same data from one data format to another; and of the accompanying processing codes.

• In the end, the exercise took around 2 months per 100TB of data migrated, as well as a significant amount of effort (~1 FTE / 100TB) and hardware resources
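The quoted rates can be turned into a rough estimate for the full 300TB. The sketch below assumes that time and effort scale linearly with volume, that the work is done sequentially, and that a month has 30 days; none of these assumptions comes from the talk, and the effort figure is simply the quoted ~1 FTE per 100TB multiplied up.

```python
# Figures quoted on the slide: ~2 months and ~1 FTE per 100 TB migrated.
MONTHS_PER_100TB = 2.0
FTE_PER_100TB = 1.0
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day months


def migration_estimate(volume_tb):
    """Return (elapsed months, effort in the slide's FTE units, MB/s sustained).

    Assumes linear scaling with volume and decimal terabytes.
    """
    units = volume_tb / 100.0
    months = units * MONTHS_PER_100TB
    effort_fte = units * FTE_PER_100TB
    rate_mb_s = (volume_tb * 1e12) / (months * SECONDS_PER_MONTH) / 1e6
    return months, effort_fte, rate_mb_s


months, effort_fte, rate_mb_s = migration_estimate(300)
print(f"{months:.0f} months, {effort_fte:.0f} FTE units, "
      f"{rate_mb_s:.1f} MB/s sustained")
```

Under these assumptions the sustained rate is modest (around 19 MB/s), which suggests that the cost lay in the format and code migration rather than in raw copying.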

Page 11: Preserving Scientific Data


Outstanding Issues

• There are no data formats, programming languages, computing hardware or operating systems with lifetimes that can be guaranteed beyond the short term

• Virtual machine technology may extend an environment’s (see above) natural life – perhaps doubling it

• Reducing the data into a much simplified and widely-used format can have significant advantages, but only allows restricted analyses to be performed

• Preserving the detailed knowledge of the experimental apparatus is beyond current technology – it would require extreme discipline on the part of the researchers as well as major advances in the understanding and description of metadata
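The trade-off of reducing data to a simplified, widely-used format can be illustrated with a small sketch that exports binary records to CSV. The record layout (event id, energy, angle) and the column names are hypothetical, chosen only for the example; they do not describe any real experiment's format.

```python
import csv
import io
import struct

# Hypothetical binary record: little-endian uint32 event id,
# float32 energy in GeV, float32 angle in radians.
RECORD = struct.Struct("<Iff")
ALL_FIELDS = ("event_id", "energy_gev", "angle_rad")


def binary_to_csv(raw, columns=("event_id", "energy_gev")):
    """Export the selected columns of packed records as CSV text."""
    keep = [ALL_FIELDS.index(name) for name in columns]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for values in RECORD.iter_unpack(raw):
        writer.writerow([values[i] for i in keep])
    return out.getvalue()


raw = RECORD.pack(1, 45.5, 0.25) + RECORD.pack(2, 91.0, 1.5)
csv_text = binary_to_csv(raw)
print(csv_text)
```

The CSV is easy to read with almost any tool, but the angle column was dropped in the reduction, so any analysis that needs it is no longer possible from the simplified file alone.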

Page 12: Preserving Scientific Data


Conclusions

• As long as advances in storage capacity continue, there are no significant issues related to the volume of scientific data that must be kept

• Periodic migration between different types of storage media must be foreseen

• Specific storage formats must also be catered for – this can require much more significant (time consuming and expensive) migrations

By far the biggest problem concerns understanding the data – there is currently no clear solution in this domain
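For the periodic migration between storage media, a minimal sketch of the bit-preservation step is to copy each file and confirm that its checksum is unchanged. This is an illustration only, not the procedure used for the migrations described in the talk; the file names are hypothetical and SHA-256 is an assumed choice of checksum.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(source, destination):
    """Copy source to destination and verify that the bits are identical."""
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"checksum mismatch while migrating {source}")
    return after


# Demonstration with stand-in files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old_medium.dat"
    new = Path(tmp) / "new_medium.dat"
    old.write_bytes(b"event data" * 1000)
    checksum = migrate_file(old, new)
    copied_ok = new.read_bytes() == old.read_bytes()
    print(checksum, copied_ok)
```

A check of this kind only guarantees that the bits survived the move; it says nothing about whether the format or the meaning of the data can still be understood.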

Page 13: Preserving Scientific Data


References

• LEP Data archive
  • 1997: http://s.web.cern.ch/s/sticklan/www/archive/
  • 2002: http://mgt-focus.web.cern.ch/mgt-focus/Focus25/maggim.pdf
  • 2003: http://cern.ch/pfeiffer/LEP-Data-Archive/proposal/ProposalForTheLEPDataArchive.html
  • http://tenchini.home.cern.ch/tenchini/Status_Archiving_6_Mar_2003.pdf

• Lisbon workshop
  • http://cern.ch/knobloch/talks/CernCodataLisbon.ppt
  • http://www.erpanet.org/events/2003/lisbon/LisbonReportFinal.pdf

• COMPASS / HARP data migrations
  • http://storageconference.org/2003/papers/06-Lubeck-Overview.pdf
  • http://www.slac.stanford.edu/econf/C0303241/proc/papers/THKT001.PDF
  • http://indico.cern.ch/getFile.py/access?contribId=448&sessionId=24&resId=1&materialId=paper&confId=0

Page 14: Preserving Scientific Data


Acknowledgements

The following people provided material and / or pointers for this talk (knowingly or otherwise):

• LEP Data Archive coordinators:
  • David Stickland, [email protected] (L3)
  • Andreas Pfeiffer, [email protected]
  • Marcello Maggi, [email protected] (ALEPH)

• COMPASS / HARP migrations:
  • Andrea Valassi, [email protected]

• ERPANET/CODATA Workshop
  • Jürgen Knobloch, [email protected]

Page 15: Preserving Scientific Data

The End