Dr Liz Lyon, Associate Director Outreach
UK Digital Curation Centre An Introduction
Digital Curation Centre: a centre of support for data curation and preservation
Grand Challenge Meeting, Bath June 2005
2
Repositories and digital curation
Data preservation: static. For later use?
Data curation: dynamic. In use now (and the future)?
“maintaining and adding value to a trusted body of digital information for current and future use”
3
Assuring permanent access to the records of science & the humanities?
Long term access to primary data
• Increasing data volumes from eScience and Grid-enabled / cyberinfrastructure applications
• Changing research paradigm: data-driven science, “big science”
• Observational data, simulations, large-scale experimentation
• Multi-media resources, statistical data, surveys, geo-spatial data……
4
5
Facilitate “post-processing” and knowledge extraction
Enable the acquisition of newly-derived information and knowledge
• Run complex algorithms over primary datasets
• Mining (data, text, structures)
• Modelling (economic, climate, mathematical, biological)
• Analysis (statistical, lexical, pattern matching, gene)
• Presentation (visualisation, rendering)
6
7
Provide additional functionality beyond digital preservation processes
Annotations
• Gene and protein sequences
• e-Lab books (Smart Tea Project in chemistry)
8
Research & e-Science workflows
Aggregator services: national, commercial
Repositories: institutional, e-prints, subject, data, learning objects
Data curation: databases & databanks
Validation
Harvesting metadata
Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
Deposit / self-archiving
Peer-reviewed publications: journals, conference proceedings
Publication
Validation
Data analysis, transformation, mining, modelling
Searching, harvesting, embedding
Presentation services: subject, media-specific, data, commercial portals
Resource discovery, linking, embedding
Linking
The scholarly knowledge cycle : linking research data to publications
eBank UK Project: http://www.ukoln.ac.uk/projects/ebank-uk/
Emerging policy on open access to data
9
DCC people (some of them…)
• Management & Co-ordination
– Director Chris Rusbridge (University of Edinburgh)
• Community Support & Outreach
– Led by Dr Liz Lyon (UKOLN, University of Bath)
• Service Definition & Delivery
– Led by Professor Seamus Ross (HATII [ERPANET], University of Glasgow)
• Development
– Led by Dr David Giaretta (Astronomical Software & Services, CCLRC)
• Research
– Led by Professor Peter Buneman (Informatics, University of Edinburgh)
10
(Some of) the challenges we face
Standards: Interoperability issues: technical & ??soluble
Scale: Volume and diversity of datasets
Culture: Bringing communities together
• Library/information science/archives “document tradition”
• Domain research (chemists, astronomers, biologists)
• Computer science (databases)
• Commercial suppliers (storage technology)
Process & Skills: Highly-distributed organisation
• Use collaborative tools, combined skills
Engagement: Existing work & key players
11
User requirements analysis: some sound bites…
R&D issues: Annotation services, Ontology development, Automating metadata creation, Tools and toolkits, Data Format Description Language, Identifiers, Registries, Economic and cost-benefits studies
Advisory services: “Ask-a-Curator”, FAQs, reports, briefings, awareness-raising materials, best practice guidance, Storage media, “Like Erpanet”, advise Government, Research Councils, funding bodies
Professional development: Short courses, conferences, seminars, workshops, secondments to DCC and to working repository services
Outreach: Leadership for the future, case studies, sharing solutions, collaboration with other partners, international peers, industry links
Taxonomy of “Users”
12
Outline Taxonomy of digital curation users by role
1. Data Creators
2. Data Curators
3. Data Re-users
4. Policy makers
-funding bodies
-other leaders
Data Preservers
Data publishers
13
Outline Taxonomy by significant function of organisational entity
1. Research
2. Service provision
3. Learning & teaching
4. Funders
5. Policy / strategy makers
“Designated communities”
Commercial
14
Advisory services
• Responses to queries, from legal to technical guidance [email protected]
• FAQs constructed
• Informing workshops and information services
• Monthly site visits (National Institute of Environmental eScience)
15
Professional development workshops
• 2005 Programme
– Persistent identifiers: June, Glasgow
– Institutional repositories: July, University of Cambridge, with DSpace
– Cost models: July, British Library, London, with the Digital Preservation Coalition
– Preservation of medical databases: October, Gulbenkian Institute, Lisbon, with ERPANET & the Wellcome Trust
16
Standards Watch
• Covering existing and emerging standards
• Working with community and standards bodies (e.g. ISO)
• Organising associates groups around new standards developments
• Initiating standardisation definitions where gaps identified
• Currently re-purposing Diffuse database of standards materials
17
Digital Curation Manual
• A world class resource
• Constructed from topic-specific chapters
– written by international experts
– editorial board comprising leading researchers and practitioners
• 45 initial topics including
– Appraisal and Selection; Costs; Freedom of Information; Interoperability; the OAIS Reference Model; Preservation Strategies; and Open Source
• Less in-depth insight offered by DCC Briefing Papers, aimed at needs of senior managers
18
OAIS Reference Model – Functional Model
[Diagram: OAIS functional entities (Ingest, Archival Storage, Data Management, Administration, Preservation Planning, Access) placed between PRODUCER, CONSUMER and MANAGEMENT; flows labelled SIP, AIP, DIP, Descriptive Info, queries, result sets, orders]
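The package flow named in the functional model (a Producer submits a SIP, the archive stores an AIP, a Consumer queries and orders a DIP) can be sketched roughly as below. This is only an illustrative sketch: the class names, the `aip-N` identifier scheme and the SHA-256 fixity check are assumptions made for the example and are not taken from the OAIS standard or from any DCC software.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SIP:
    """Submission Information Package handed over by a Producer."""
    content: bytes
    description: str


@dataclass
class AIP:
    """Archival Information Package held in Archival Storage."""
    aip_id: str
    content: bytes
    fixity_sha256: str  # assumption: fixity recorded as a SHA-256 digest


@dataclass
class DIP:
    """Dissemination Information Package delivered to a Consumer."""
    aip_id: str
    content: bytes


@dataclass
class Archive:
    # Archival Storage: aip_id -> AIP
    storage: dict = field(default_factory=dict)
    # Data Management: aip_id -> descriptive info
    catalogue: dict = field(default_factory=dict)

    def ingest(self, sip: SIP) -> str:
        """Ingest: turn a SIP into an AIP and record its descriptive info."""
        aip_id = f"aip-{len(self.storage) + 1}"  # hypothetical id scheme
        digest = hashlib.sha256(sip.content).hexdigest()
        self.storage[aip_id] = AIP(aip_id, sip.content, digest)
        self.catalogue[aip_id] = sip.description
        return aip_id

    def query(self, term: str) -> list:
        """Access: return the result set of AIP ids whose description matches."""
        term = term.lower()
        return sorted(i for i, d in self.catalogue.items() if term in d.lower())

    def order(self, aip_id: str) -> DIP:
        """Access: check fixity, then package the stored content as a DIP."""
        aip = self.storage[aip_id]
        if hashlib.sha256(aip.content).hexdigest() != aip.fixity_sha256:
            raise ValueError(f"fixity check failed for {aip_id}")
        return DIP(aip.aip_id, aip.content)


archive = Archive()
new_id = archive.ingest(SIP(b"1,2,3", "Example survey dataset"))
dip = archive.order(archive.query("survey")[0])
```

Administration and Preservation Planning are left out of the sketch; in a fuller model they would govern policies and format migration rather than move packages.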
19
Audit and Certification (1)
• How can people know who to entrust with their information?
• There is a demand for a certification process for
– Repositories and components e.g. archive storage
– Software
• Certification standards (ISO 9000 and ISO 17799) do not do the job
• OCLC/RLG Trusted Digital Repositories: Attributes and Responsibilities
– high level model for design, delivery and maintenance of digital repositories
20
Audit and Certification (2)
• International expert group led by RLG and NARA is drafting a Certification standard
• DCC is participating: aiming for international consensus
• Draft goes to Technical Editor end of June
• DCC testbeds to support development of audit and certification standards
• Commitment to
– offer guidance on self-audit and self-certification
– carry out independent audits
– issue certificates to qualifying repositories
21
Tools and Technologies
• Accumulate and Maintain Registry and online Repository of relevant tools
– Repository Implementations
– Packaging Tools
– Rendering Software
– Format Converters
– Device Drivers
22
Representation Registry development
• Simple PHP prototype
• Scoping study
– Formats, standards, tools
• More robust prototype in development
– Based on ebXML & JAXR
– Potentially distributed, cooperative maintenance model
– Representation information: describe CCLRC (science) data using EAST
• Links to PRONOM, GDFR and other pilots
• Aim to hand over to services
Development info: see http://dev.dcc.ac.uk for details of Wiki and email list open to all
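As a rough illustration of what a representation registry holds (formats, the standards that define them, and the tools that can render them), a minimal in-memory sketch might look like this. It is not the DCC prototype, which the slide describes as PHP and later ebXML/JAXR based; the class names, the `example/table-v1` identifier and the tool name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepresentationInfo:
    """What is needed to interpret data held in a given format."""
    format_id: str      # hypothetical identifier, not a PRONOM or GDFR id
    description: str
    standards: tuple    # standards that define the format
    tools: tuple        # tools able to render or convert the format


class RepresentationRegistry:
    """In-memory registry keyed by format identifier."""

    def __init__(self):
        self._entries = {}

    def register(self, info: RepresentationInfo) -> None:
        # Refuse silent overwrites so that existing entries stay trustworthy.
        if info.format_id in self._entries:
            raise KeyError(f"{info.format_id} is already registered")
        self._entries[info.format_id] = info

    def lookup(self, format_id: str):
        """Return the entry for a format, or None if it is unknown."""
        return self._entries.get(format_id)

    def tools_for(self, format_id: str) -> list:
        """List the tools registered for a format (empty if unknown)."""
        info = self.lookup(format_id)
        return list(info.tools) if info else []


registry = RepresentationRegistry()
registry.register(RepresentationInfo(
    format_id="example/table-v1",            # made-up entry for illustration
    description="Comma-separated observation table",
    standards=("hypothetical-table-spec",),
    tools=("hypothetical-table-viewer",),
))
```

A distributed, cooperatively maintained registry of the kind the slide mentions would add replication and editorial control on top of this lookup structure.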
23
Research agenda (1)
• Publishing & integrating scientific databases
• ‘Archiving’ past states of volatile databases
• Database provenance and annotation
• Organisational dynamics of trusted repositories
• Automating metadata extraction
• Cost-benefit analysis of data curation
• Rights and responsibilities
24
The database picture
Source data Curated data: classified, cleaned, annotated, integrated, cross-linked
25
Curated databases – some issues
• Integrating, publishing and citing data so that someone else can use it.
• Annotating existing data and moving annotations to other databases
• Provenance: where did this data come from?
• Archiving: how do you preserve something that is constantly changing?
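One simple way to think about the archiving and provenance questions above is to keep a per-record history and store a new entry only when a record changes, so that any past state can be rebuilt. The sketch below illustrates that general idea only; it is not presented as the DCC research group's method, and the class name, record keys and source labels are invented for the example.

```python
class VersionedArchive:
    """Keeps every past state of a keyed database, storing a record only when it changes.

    Limitation of this sketch: None marks a deleted record, so None cannot be
    used as a real record value.
    """

    def __init__(self):
        self._history = {}  # key -> list of (version, value, source)
        self.version = 0

    def commit(self, snapshot: dict, source: str) -> int:
        """Archive the current database state and note where it came from."""
        self.version += 1
        for key in set(self._history) | set(snapshot):
            new_value = snapshot.get(key)  # None if the record was removed
            entries = self._history.setdefault(key, [])
            last_value = entries[-1][1] if entries else None
            if new_value != last_value:
                entries.append((self.version, new_value, source))
        return self.version

    def state_at(self, version: int) -> dict:
        """Rebuild the whole database as it stood at the given version."""
        state = {}
        for key, entries in self._history.items():
            value = None
            for entry_version, entry_value, _ in entries:
                if entry_version > version:
                    break
                value = entry_value
            if value is not None:
                state[key] = value
        return state

    def provenance(self, key: str) -> list:
        """List (version, source) for every recorded change to one record."""
        return [(v, src) for v, _, src in self._history.get(key, [])]


db = VersionedArchive()
db.commit({"rec-1": "draft annotation"}, source="lab-A")
db.commit({"rec-1": "draft annotation", "rec-2": "new entry"}, source="lab-B")
db.commit({"rec-2": "new entry (revised)"}, source="curator")
```

Because unchanged records add nothing to the history, storage grows with the number of changes rather than with the number of snapshots, which matters for databases that change constantly.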
26
Research agenda (2)
• Publishing & integrating scientific databases
• ‘Archiving’ past states of volatile databases
• Database provenance and annotation
• Organisational dynamics of trusted repositories
• Automating metadata extraction
• Cost-benefit analysis of data curation
• Rights and responsibilities
– “Public domain, public interest, public funding” paper Waelde & McGinley
27
www.dcc.ac.uk
28
• www.ijdc.net
• Launch planned July
• Peer-review Editorial Board
• Peter Buneman Editor (research)
• Production editor Philip Hunter
• Papers for submission are very welcome!
29
1st DCC International Conference
• Location - Bath UK
• 29-30 September 2005
• Keynote speakers
Clifford Lynch CNI
Graham Cameron European Bio-informatics Institute
• DCC Research update
• Social highlights
30
Associates Network
Goals
Develop understanding, share best practice, advance research, promote recognition, develop consensus
Membership
International groups, national bodies, industry partners, funders, research groups, HEIs, FEIs, individuals……
Benefits
Early access to R&D outputs, advisory services, training, input to definition and design, community participation
Discussion Forum www.dcc.ac.uk Please join us!