Preserving Scientific Data
Jamie Shiers, Information Technology Department, CERN, Geneva, Switzerland
UNESCO Information Preservation debate, April 2007 - [email protected]
Agenda
• Motivation for preserving scientific data – examples from a range of sciences
• Volume of data involved and related issues
• Some concrete archiving examples from Particle Physics
• Remaining challenges
• Conclusions
Motivation
• Climate data: in an era when climate change is hotly debated, the motivations appear clear…
• Medical data: important for understanding issues such as historical pandemics, cross-species diseases etc. Avian flu, HIV, …
• Cosmological data: plays a vital role in our evolving understanding of the Universe – the astrophysics community has an explicit policy (data is made public after 1 year; data volume doubles each year)
• Particle Physics data: Similar arguments – will we ever be able to build similar accelerators to those of today? If we ‘lose’ this data, what of our scientific heritage? Need to look at old data for a signal that should have been seen (has happened several times)
http://www.damtp.cam.ac.uk/user/gr/public/bb_history.html
Standard Cosmology
Good model from 0.01 sec after Big Bang
Supported by considerable observational evidence
Elementary Particle Physics
From the Standard Model into the unknown: towards energies of 1 TeV and beyond: the Terascale
Towards Quantum Gravity
From the unknown into the unknown...
[Figure axes: Time; Energy, Density, Temperature]
Issues
• How much data is involved?
• Preserving the bits
• Understanding the bits
How much data is involved?
• In 1998, the following estimates were made regarding the data from LEP (1989 – 2000) that should be kept
Experiment | Analysis dataset | Reconstructable dataset
ALEPH | 250 GB | 1-2 TB
DELPHI | 2-6 TB
L3 | 500 GB | 5 TB
OPAL | 300 GB | 1-2 TB
By today’s standards, these data volumes are trivial
• Even though the total volume of data at the LHC is far higher, the data that must be kept beyond the life of the machine (2007 to ~2020) will be easily handled by then
The LHC will generate some 15 PB of data per year!
Concorde (15 km)
Balloon (30 km)
CD stack with 1 year LHC data! (~20 km)
Mt. Blanc (4.8 km)
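The CD-stack comparison above can be checked with a rough calculation. The sketch below is only a back-of-envelope estimate: the 15 PB per year figure comes from the slide, while the 700 MB disc capacity, the 1.2 mm disc thickness and the use of decimal units are assumptions that do not appear in the talk.

```python
# Back-of-envelope check of the "CD stack" comparison.
# From the slide: 15 PB of LHC data per year.
# Assumptions (not from the slides): 700 MB per CD, 1.2 mm per disc,
# decimal units, bare discs stacked with no cases or gaps.
LHC_BYTES_PER_YEAR = 15e15
CD_CAPACITY_BYTES = 700e6
CD_THICKNESS_M = 1.2e-3


def cd_stack_height_km(total_bytes,
                       cd_bytes=CD_CAPACITY_BYTES,
                       thickness_m=CD_THICKNESS_M):
    """Return the height in km of a stack of CDs holding total_bytes."""
    n_discs = total_bytes / cd_bytes
    return n_discs * thickness_m / 1000.0


height_km = cd_stack_height_km(LHC_BYTES_PER_YEAR)
print(f"{height_km:.1f} km")
```

Under these assumptions the stack comes out at roughly 26 km, the same order of magnitude as the ~20 km quoted on the slide; a different assumed capacity or thickness shifts the result accordingly.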
The LHC machine - Overview
LHC: 27 km ring, 100 m underground
ATLAS: General Purpose, pp, heavy ions
CMS+TOTEM: General Purpose, pp, heavy ions
ALICE: Heavy ions, pp
LHCb: pp, B-Physics, CP Violation
The size of HEP detectors
[Figure labels: CMS, ATLAS, Bld. 40]
Understanding the bits
• In the mid-1990s, 10-year-old data from the JADE collaboration at the PETRA accelerator at DESY was successfully re-analysed
• A sub-set of the data was found abandoned in an office corner. The programs to read the data were in an obsolete language and were unusable. The data format was proprietary (but de-codable).
This provided valuable input into the LEP data archive
• Data format: will this be readable in 5 / 10 / 100 years? 1000?
• Programs: languages / operating systems / hardware platforms have very short life-spans wrt an archive
• Metadata: essential to understand what the data means
The best solution to date is a so-called ‘Museum system’, but this is still a very short-term solution with respect to even Einstein, let alone Tycho Brahe, Kepler and Newton…
Preserving the bits
• Lifetimes of Particle Physics experiments are extremely long! Currently measured in decades…
• Ironically, one of the solutions proposed for the LEP data archive (the then-current proposal for the LHC) was later abandoned (technical / commercial reasons)
• This necessitated a ‘triple migration’: of 300 TB of data between storage media; of the same data from one data format to another; and of the accompanying processing codes.
• In the end, the exercise took around 2 months per 100 TB of data migrated, as well as a significant amount of effort (~1 FTE / 100 TB) and hardware resources
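As a rough illustration of what these rates imply, the sketch below scales them to the full 300 TB. The two rates are the ones quoted above; the assumption that time and effort scale linearly with volume, and that the work is done serially, is ours and is not stated in the talk. The function name is hypothetical.

```python
# Rates quoted on the slide.
MONTHS_PER_100TB = 2.0   # elapsed time per 100 TB migrated
FTE_PER_100TB = 1.0      # effort per 100 TB, in the slide's own FTE units


def migration_estimate(volume_tb):
    """Scale the quoted rates linearly to volume_tb (linear scaling is an assumption)."""
    units = volume_tb / 100.0
    return {"months": units * MONTHS_PER_100TB, "fte": units * FTE_PER_100TB}


estimate = migration_estimate(300)
print(estimate)
```

For 300 TB this gives about 6 months of elapsed time and about 3 FTE of effort, if the linear assumption holds.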
Outstanding Issues
• There are no data formats, programming languages, computing hardware or operating systems with lifetimes that can be guaranteed beyond the short term
• Virtual machine technology may extend the natural life of such an environment (format, language, hardware, operating system), perhaps doubling it
• Reducing the data into a much simplified and widely-used format can have significant advantages, but only allows restricted analyses to be performed
• Preserving the detailed knowledge of the experimental apparatus is beyond current technology – it would require extreme discipline on the part of the researchers as well as major advances in the understanding and description of metadata
Conclusions
• As long as advances in storage capacity continue, there are no significant issues related to the volume of scientific data that must be kept
• Periodic migration between different types of storage media must be foreseen
• Specific storage formats must also be catered for – this can require much more significant (time consuming and expensive) migrations
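One common way to check that a media migration has preserved the bits is to compare checksums of every file before and after the copy. The talk does not describe how the migrations were verified, so the following is only a minimal sketch of that general technique; the function names, the file name and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def verify_migration(source_root, target_root):
    """Return the relative paths that are missing or differ on the target."""
    source = build_manifest(source_root)
    target = build_manifest(target_root)
    return sorted(p for p, h in source.items() if target.get(p) != h)


# Small demonstration with temporary directories standing in for two media.
with tempfile.TemporaryDirectory() as old_media, \
        tempfile.TemporaryDirectory() as new_media:
    original = os.path.join(old_media, "run001.dat")  # hypothetical file name
    with open(original, "wb") as f:
        f.write(b"example event data")
    shutil.copy(original, new_media)
    clean = verify_migration(old_media, new_media)

    # Simulate a corrupted copy on the new medium.
    with open(os.path.join(new_media, "run001.dat"), "wb") as f:
        f.write(b"example event dat4")
    damaged = verify_migration(old_media, new_media)

print(clean, damaged)
```

A check of this kind only covers the bits; it does nothing for the format and software migrations, which is where the larger cost lies.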
By far the biggest problem concerns understanding the data – there is currently no clear solution in this domain
References
• LEP Data archive
  • 1997: http://s.web.cern.ch/s/sticklan/www/archive/
  • 2002: http://mgt-focus.web.cern.ch/mgt-focus/Focus25/maggim.pdf
  • 2003: http://cern.ch/pfeiffer/LEP-Data-Archive/proposal/ProposalForTheLEPDataArchive.html
  • http://tenchini.home.cern.ch/tenchini/Status_Archiving_6_Mar_2003.pdf
• Lisbon workshop
  • http://cern.ch/knobloch/talks/CernCodataLisbon.ppt
  • http://www.erpanet.org/events/2003/lisbon/LisbonReportFinal.pdf
• COMPASS / HARP data migrations
  • http://storageconference.org/2003/papers/06-Lubeck-Overview.pdf
  • http://www.slac.stanford.edu/econf/C0303241/proc/papers/THKT001.PDF
  • http://indico.cern.ch/getFile.py/access?contribId=448&sessionId=24&resId=1&materialId=paper&confId=0
Acknowledgements
The following people provided material and / or pointers for this talk (knowingly or otherwise):
• LEP Data Archive coordinators:
  • David Stickland, [email protected] (L3)
  • Andreas Pfeiffer, [email protected]
  • Marcello Maggi, [email protected] (ALEPH)
• COMPASS / HARP migrations:
  • Andrea Valassi, [email protected]
• ERPANET/CODATA Workshop:
  • Jürgen Knobloch, [email protected]
The End