Afternoon session: The archival problem and infrastructure for solutions Prof John R Helliwell [email protected] Interactive Publications and the Record of Science ICSTI Winter Workshop Paris, Monday, February 8, 2010

Page 1: Afternoon session: The archival problem and infrastructure for solutions

Afternoon session: The archival problem and infrastructure for solutions

Prof John R Helliwell [email protected]

Interactive Publications and the Record of Science

ICSTI Winter Workshop

Paris, Monday, February 8, 2010

Page 2

JRH research, publications background

• Professor of Structural Chemistry
• DSc Physics
• Approx 200 research papers; 5 books (2 as monographs)
• Editor-in-Chief of journals published by IUCr 1996-2005 (Acta Crystallographica, Journal of Applied Crystallography, Journal of Synchrotron Radiation)
• IUCr Representative to ICSTI

Page 3

What needs to be in place for interactive content to be available in the future?

• Emulation of legacy software environments?
• How to package, identify and interlink the independent components of a complex article?
• Can we handle distributed articles?
• Can we identify and retrieve slices through large archived data sets?
• How to work with changing data sets?
• What is worth keeping anyway?

Page 4

The importance of data for publication

• Interactive figures depend on data
• Semantic value is added to data, or forms additional (meta)data
• Fundamental principle of research publication: the work is reproducible
  – exact experimental conditions are given
  – data are preserved/accessible
  – in a recent case of animal clones, ‘samples’ also had to be made available upon request
• Increasing requirement to archive primary data

Page 5

Data and publication in crystallography

• A reasonable state of affairs ...
  – molecular models archived by journals (CIFs: interactive figures)
  – reduced diffraction data preserved by databases or some journals (data validation; retracted papers)
• ... but with room for improvement
  – molecular dynamics for the crystalline state difficult to interpret; whole diffraction images preferable for archiving
  – scientific fraud in structural biology/chemistry: archiving of diffraction images provides better security against such frauds
  – but diffraction images from crystal diffraction experiments are uncompressed and file sizes are large, so there is limited appetite (and resources) to preserve them

Page 6

Crystals, diffraction spots and smears, molecules and dynamics


Page 7

Some archive technical details

• Protein Data Bank: 60,000 macromolecular structures
  – 80% derived from crystal structure analysis
  – archive doubling in size every 2 to 3 years
  – coordinate file for a typical protein ~0.25 MB; derived from core diffraction data of 1 MB; extracted from ~1 GB of diffraction image data
  – data sets need to be archived in quintuplicate (EBI Director to JRH, Jan 12 2010)
  – thus 60,000 x 1 GB x 5 = 300 Terabytes of primary data for PDB currently
  – cost estimate for PDB to be the sole primary archive provider ca GBP 200,000 per annum: unable to take on this responsibility on
• Currently the researcher agrees to hold project diffraction images for at least 5 years and release them upon request; no archiving commitment from research sponsor
• Solution in distributed or federated archives (experimental facilities / laboratories / data repositories)?
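
The storage arithmetic on this slide can be checked with a short sketch. It uses only the figures quoted above (60,000 structures, about 1 GB of diffraction images per structure, five copies) and assumes decimal units (1 TB = 1000 GB). The function names, the six-year horizon and the assumption that the image volume per structure stays constant are illustrative choices, not part of the talk.

```python
# Figures quoted on the slide; decimal units assumed (1 TB = 1000 GB).
STRUCTURES = 60_000           # macromolecular structures in the PDB
IMAGES_GB_PER_STRUCTURE = 1   # ~1 GB of diffraction images per structure
COPIES = 5                    # data sets archived in quintuplicate


def primary_archive_tb(structures, gb_per_structure=IMAGES_GB_PER_STRUCTURE,
                       copies=COPIES):
    """Total primary-data volume in terabytes (hypothetical helper)."""
    return structures * gb_per_structure * copies / 1000


def projected_structures(structures, years, doubling_years):
    """Structure count after `years`, assuming steady exponential doubling."""
    return structures * 2 ** (years / doubling_years)


if __name__ == "__main__":
    print(f"Current primary data: {primary_archive_tb(STRUCTURES):.0f} TB")
    # The slide gives a doubling time of 2 to 3 years; try both ends.
    # The 6-year horizon is an arbitrary illustration.
    for doubling in (2, 3):
        future = projected_structures(STRUCTURES, years=6, doubling_years=doubling)
        print(f"Doubling every {doubling} years: "
              f"{primary_archive_tb(future):.0f} TB after 6 years")
```

Under these assumptions the volume grows by a factor of four to eight within six years, which is one way to read the slide's closing question about distributed or federated archives.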