create, curate, re-use: the expanding life course of digital research data

Post on 30-Nov-2014


DESCRIPTION

Presentation to Educause Australasia 2007

TRANSCRIPT

a centre of expertise in data curation and preservation

Funded by:
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 UK: Scotland License, excluding content property of others. To view a copy of this license, (a) visit http://creativecommons.org/licenses/by-nc-sa/2.5/scotland/; or (b) send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

Create, curate, re-use: the expanding life course of digital research data

Chris Rusbridge

EDUCAUSE Australasia May 2007


EDUCAUSE Australasia 2007

Contents
• Science and digital curation
• Why are data important?
• What kinds of data?
• What to do with your data: frontiers of practice
• Repository frontiers
• Changing practice


Digital Curation Centre Mission
“The over-riding purpose of the DCC is to support and promote continuing improvement in the quality of data curation, and of associated digital preservation”


Science and curation
• Creating and managing data suitable for re-use
• Good curation supports good science (managing your data properly)
  – Poor curation allows sloppy science?
• Data curation should save money
  – Murray-Rust/Frey on interesting but fruitless experiments!
• Some science impossible without curation…
  – QCD strong coupling constant prediction (Bethke)
  – Viscosity of earth mantle from Shang Dynasty eclipse records (Pang et al)
  – Science depending on past baselines (eg environmental, social sciences)


Records of science
• Data increasingly important as evidence
  – Key part of the scholarly record (public good)
  – Unrepeatable observations & experiments
• Experimental verifiability (the basis of science)
  – Would Chang retractions have been reduced if his first data were available?
• Allows additional interpretations
• Legal and compliance
  – See APSR/AERES report for good examples


What kinds of data?
• Observations
  – eg UARS (Upper Atmosphere) Level 0: telemetry
  – UARS Level 1: measured physical parameters (post calibration?)
• Derived data
  – UARS Level 2: calculated geophysical? profiles
  – UARS Level 3: gridded, interpolated?
• Combined data
• Crafted data
  – Eg annotated gene/protein databases
• Descriptive (meta)data


Retaining research data means…
• Data secure against loss (within group)
• Communal repository (secure bit dump)
• Re-usable, sharable information
• As above, plus active curation (eg bio-informatics)
• Long term preservation of information
• Be clear what you are trying to do!


… or the data trajectory is…
• Hard drive → lost (crash)
• Hard drive → DVD → Cardboard box → Loft → Skip/dumpster → lost
• Sometimes this is a very bad thing
• Sometimes these are the right options!


Long term bit storage…
• A solved problem? Just requires well-understood good data management practices?
• Wrong! For very large datasets over a very long time, there are significant problems…

BAKER, M., SHAH, M., ROSENTHAL, D. S. H., ROUSSOPOULOS, M., MANIATIS, P., GIULI, T. J. & BUNGALE, P. (2006) A Fresh Look at the Reliability of Long-term Digital Storage. EuroSys '06. Leuven, Belgium, ACM.


How Well Must We Preserve?

Keep a petabyte for a century

– With 50% chance of remaining completely undamaged

Consider each bit decaying independently

– Analogy with radioactive decay

That's a bit half life of 10**18 years

– One hundred million times the age of the universe

That's a very demanding requirement

– Hard to measure

– Even very unlikely faults will matter a lot

• Slide from David Rosenthal, LOCKSS
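The arithmetic behind the half-life figure can be reproduced in a few lines. This is a minimal sketch, assuming a decimal petabyte (10**15 bytes) and a rough age of the universe of 1.4e10 years; neither figure is stated on the slide, and the function name is illustrative. If each bit independently survives a period of T years with probability 2**(-T/h), then N bits all survive with probability 2**(-N*T/h), so a 50% target gives h = N*T.

```python
import math

BITS_PER_PETABYTE = 8 * 10**15   # assumption: decimal petabyte, 8 bits per byte
AGE_OF_UNIVERSE_YEARS = 1.4e10   # assumption: rough figure, used only for scale


def required_bit_half_life(n_bits, years, survival_probability):
    """Half-life each bit needs so that ALL n_bits survive `years` intact.

    One bit survives with probability 2 ** (-years / half_life); the bits are
    treated as independent, so all of them survive with probability
    2 ** (-n_bits * years / half_life). Solving for half_life gives the
    expression below.
    """
    return -n_bits * years / math.log2(survival_probability)


half_life = required_bit_half_life(BITS_PER_PETABYTE, 100, 0.5)
ratio = half_life / AGE_OF_UNIVERSE_YEARS

print(f"required bit half-life: {half_life:.1e} years")
print(f"multiple of the age of the universe: {ratio:.1e}")
```

Under these assumptions the result is 8e17 years, which rounds to the slide's 10**18, and the ratio is a few times 10**7, the same order of magnitude as the "one hundred million" quoted.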


What to do about curation
• Build curation/reusability into your workflow
  – Curation begins before creation
  – What’s easy at first becomes (impossibly) hard later
• Describe your data (metadata schemas, “representation info”, etc)
• Keep experimental parameters (technical, who, what, when, where)
• Keep ability to process
• Keep data!
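One lightweight way to act on "describe your data" and "keep experimental parameters" is to write a small descriptive record next to each dataset at creation time. The sketch below is only an illustration: the `DatasetRecord` class, its field names and the sample values are hypothetical and do not follow any particular metadata schema.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class DatasetRecord:
    """Hypothetical minimal description of one dataset (illustrative fields only)."""
    title: str
    creator: str                 # who
    description: str             # what
    collected_on: str            # when, as an ISO 8601 date
    location: str                # where
    technical_parameters: dict = field(default_factory=dict)  # instrument settings etc
    data_format: str = ""        # ideally a standard or agreed format
    processing_software: str = ""  # what is needed to process the data again


def to_sidecar_json(record):
    """Serialise the record so it can be stored alongside the data file."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


record = DatasetRecord(
    title="Example temperature readings",
    creator="A. Researcher",
    description="Hourly temperature readings from one sensor",
    collected_on="2007-05-01",
    location="Example field site",
    technical_parameters={"sampling_interval_s": 3600},
    data_format="CSV",
    processing_software="example-analysis-script v1",
)
sidecar = to_sidecar_json(record)
print(sidecar)
```

Capturing the record when the data are created is cheap; reconstructing the same information years later is usually much harder.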


What to do about curation - 2
• Use standard/agreed formats for data
• Make ownership & restrictions clear, & explain how to cite your data
• Offer for deposit in institutional or discipline repository
  – Appraisal and selection essential
  – Possible time-limited embargos
• “Publish” data in support of articles


Internet Archaeology: publication with data


Database as book…
• Buneman (early pilot) work on IUPHAR database
• MySQL to XML database
  – Historic to logical schema
• XML via XSLT to LaTeX
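The shape of such a pipeline (database rows to XML, XML to LaTeX) can be sketched with the Python standard library. This is not the pilot's actual tooling: an in-memory list stands in for the MySQL query result, Python's `xml.etree.ElementTree` is used in place of XSLT for the transformation step, and the element names and sample rows are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows standing in for a database query result.
rows = [
    {"name": "Receptor A", "family": "Family 1"},
    {"name": "Receptor B & C", "family": "Family 2"},
]


def rows_to_xml(rows):
    """Step 1: wrap each row in an <entry> element under a <database> root."""
    root = ET.Element("database")
    for row in rows:
        entry = ET.SubElement(root, "entry")
        for key, value in row.items():
            ET.SubElement(entry, key).text = value
    return root


# A few characters that LaTeX treats specially (not an exhaustive list).
LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "_": r"\_", "#": r"\#", "$": r"\$"}


def escape_latex(text):
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def xml_to_latex(root):
    """Step 2: render each <entry> as an item of a LaTeX description list."""
    lines = [r"\begin{description}"]
    for entry in root.findall("entry"):
        name = escape_latex(entry.findtext("name", default=""))
        family = escape_latex(entry.findtext("family", default=""))
        lines.append(rf"\item[{name}] {family}")
    lines.append(r"\end{description}")
    return "\n".join(lines)


root = rows_to_xml(rows)
latex = xml_to_latex(root)
print(latex)
```

Keeping XML as the intermediate form means the same export could feed other renderings besides the printed "book".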


The StORe vision
• Seamless transport from research data to research publications and vice versa
• Bi-directional links proven in social science e-research but capable of export to other disciplines
[Diagram labels: Source, Middleware, Output]
• Slide from Graham Pryor
• http://jiscstore.jot.com/WikiHome/


What are the reusability issues?
• Data not neutral to hypothesis
• Hard to know the risks & pitfalls of a particular dataset
• Data not self-describing: hard to find appropriate data (but see Murray-Rust on Googling InChi etc)
• Hard to “understand” data once found
  – Really need information, not data!
• Hard to use data once understood


Context
• Data meaningless without context
  – Metadata of many kinds
  – Representation information… from data to information
  – Linkage and connection between datasets
  – Use your workflow!
• Provenance
  – Authenticity/integrity
  – Computational lineage
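The integrity part of provenance is often supported by recording a checksum for each file and re-checking it later. The sketch below shows one common approach, assuming SHA-256 digests and a simple name-to-digest manifest; the function names and the sample file are illustrative, and nothing here is prescribed by the slides.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file name in the directory to its current digest."""
    return {p.name: sha256_of(p) for p in sorted(directory.iterdir()) if p.is_file()}


def changed_files(directory, manifest):
    """List files whose digest no longer matches the manifest (or that are missing)."""
    current = build_manifest(directory)
    return sorted(name for name in manifest if current.get(name) != manifest[name])


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "readings.csv").write_text("t,value\n0,1.5\n")
    manifest = build_manifest(data_dir)           # recorded at deposit time
    unchanged = changed_files(data_dir, manifest)
    (data_dir / "readings.csv").write_text("t,value\n0,1.6\n")  # simulate damage
    damaged = changed_files(data_dir, manifest)

print(unchanged, damaged)
```

A manifest like this detects that something changed; it does not say why, which is where the descriptive provenance and lineage records come in.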


[Figure: diagram with box labels Csat, E0, SST, 8-day composite and subscene, PAR subscene, HRPT, Pbopt calc, Ctot calc, Zeu calc, PPeu calc; organisations shown: NASA, University research group 1, University research group 2, research group 3, local decision-making body]
• Slide from Rajendra Bose


Access and re-use
• Ethics and rights control access
  – Weak in expressing this long-term
• Collaboration tools
  – Annotation, discussion, review (see DART…)
  – Re-use leading to change and development
• “Publication”
  – Not just in “print”
  – Underlying data should be “published”, too


Who does curation?
• Individuals
• Departments or groups
• Institutions, maybe through libraries
• Communities
• Disciplines
• Publishers
• National services
• Other 3rd parties…


Curation: Individual
• “Small science 2-3 times more data than Big science”, but much more at risk
• PhD student? RA? PI? Administrator? IT support?
• Data potentially on local hard drives, or at best shared network drives
  – May be inadequately protected
  – Liable for policy-led deletion on resignation
• Individual “knows” too much (tacit knowledge)
  – Documentation/metadata unlikely to be adequate
• Future: gone!


Curation: Individual

•© Marita Bushell


Department: eCrystals
• Partnership with Institutional Repository
• Specialist department archive (& national service)
• Workflow recording of lab parameters (R4L)
• Public & private elements
• Trying to build eCrystals federation (eBank 3)
• Future: likely to continue


Data in institutional repositories


Institution: Cambridge Chemistry
• 175,000 small molecule structures in CML
• Alongside Archaeology, Manuscripts, Learning Materials, etc
• No library curation skills; dependent on research group enthusiast
• Collection isolated from other Chemistry
• (Only 5 UK institutional repositories claim to hold data)
• Future: assured…


Community: LOCKSS?
• Self-selected group of collectors: closest to genuine open activity (despite Alliance)?
• Traditionally libraries collecting eJournals
• Model respects IPR
• No domain expertise; rely on origins
• Data limitations…
• Future: potentially very persistent (low cost, high reliability, attack resistance, distributed)


Discipline: Atmospheric Science
• Strong believer in need for domain scientists as curators
• Significant participant in “community proxy” agenda-setting activities
• Internationally fragmented resources
• Future: mostly dependent on grant funding (but strong commitment)


Bio-informatics: Nature article 23 June 05
• Databases in Peril
  – 51 out of 89 biological databases contacted reported they were struggling financially
  – 7 have closed
  – Several being updated in owner’s spare time
  – (Notes that not all deserve long term support)
• [Nucleic Acids Research reports 968 databases in 2007!]
• Major issue: money


Publisher: Crystallography
• Publisher and Scientific Union
• Created key domain crystallographic standard (CIF)
• Strong motivator for deposit of structure data
• Consistent quality checks
• DOIs used for structure data
• Future: publishing business model
• Slide from IUCr


National bodies: British Library
• Serious and robust approach
• Legal deposit powers & responsibilities as driver
• Oriented primarily towards “cultural heritage” (broadly interpreted)
• Little data, no science domain experience
• Future: strong future commitment


National bodies: TNA/NDAD
• Specialist archive for government datasets
• Understand government regulations, dynamics & requirements
• Subject generalists; disconnected from associated science
• Technology specialists (understand databases)
• Future: likely to pass eventually to The National Archives


3rd parties: Portico
• Specific area: eJournals
• Depends on publisher agreements
• No data or domain science expertise
• Future: commitment from Mellon + publishers + subscriptions, good funding mix


3rd Parties: Iron Mountain?
• Records management IS a curation problem
• Organisations like this very likely to branch out
• No domain science expertise
• Future: business case, viability, stock market…


3rd parties: Web 2.0 style, Swivel.com??


Institutions & the network
• Institutions have fundamental sustainability
• Disciplines have domain knowledge advantage but sustainability is an issue
• Can we get the best of both?
• Needs serious work to examine!

Inst’n 1

Inst’n 2

Inst’n 3

Discipline 1 X X

Discipline 2 X X

Discipline 3 X X

etc


Who are the curation players?


Cultural change
• If we build it, will they come? NO!!
• Outreach important: communication with scientists and researchers is hard graft
• Cultural change to new approach requires more:
  – Incentives, rewards and mandates
  – Successful exemplars (well publicised)
  – Discipline-oriented approach (one size does not fit all)


Australian context?
• In the emerging context of the Research Quality Framework, and the expected National Collaborative Research Infrastructure Strategy, curation can only increase in importance!


Thank you

•(Citations in paper in proceedings)
