Digital Challenges – Bridging the gap between publication and data Adam Farquhar Head of Digital Library Technology The British Library IASSIST, Tampere, 27 May 2009


Digital Challenges – Bridging the gap between publication and data

Adam Farquhar

Head of Digital Library Technology

The British Library

IASSIST, Tampere, 27 May 2009

The British Library:

‘This is the life blood of research and innovation’

GIA Funding 08/09: £94.8m operational, £12m capital

Other funding secured 07/08: c.£33m

Helping people advance knowledge

to enrich lives

National library of the UK.

Serves researchers, business, libraries, education & the general public

Collection includes over 2m sound recordings, 5m reports, theses and conference papers, the world’s largest patents collection (c.50m)

The largest document supply service in the world. Secure e-delivery and ‘just in time’ digitisation enable desktop delivery within 2 hours

2 main sites in London and Yorkshire. Circa 2,000 staff

Business and IP Centre: Providing inspiration, and enabling protection of creative capital and business development

Generates value to the UK economy each year of 4.4 times public funding

Collection fills over 600km of shelving and grows at 11km per year

30 TB of digital material, growing rapidly

Science and Innovation Investment Framework 2004-2014, H.M. Treasury (2004)

Information infrastructure

2.23 The growing UK research base must have ready and efficient access to information of all kinds – such as experimental data sets, journals, theses, conference proceedings and patents. This is the life blood of research and innovation.

3

Supporting research

Social Sciences

Science, Technology & Medicine

Arts & Humanities

Document Supply service provides 1.4m articles/year primarily to scientists

Renewed engagement with researchers using digital content and online services

In-depth focus on biomedicine and energy/environment

Collection includes journals, patents, theses and more, and is updated by some 9,000 articles every day

A significant international collection of books, journals, reports, theses, official publications and other materials

A unique collection of grey literature, of special interest to practitioners and theoreticians

Research collaboration with ESRC

Greatest research collection of its kind in the world

World-class curatorial expertise by subject, medium and geographical area

BL has been developing world-leading e-innovations for the past decade (e.g. International Dunhuang Project) and building a significant corpus of digitised texts

Research collaboration with AHRC, British Academy and HEIs

4

Building the Digital Research Infrastructure

BL Digital library system

Large scale, highly resilient digital store

Continuous validation & correction

Long term digital storage for BL content & eLegal deposit/distribution

Long term access (digital preservation)

Leading EU-funded digital preservation project ‘Planets’ (16 partners)

Developing cost models and case studies with UCL (‘Life’ projects)

Addressing root causes of digital obsolescence

Edinburgh -2009

Aberystwyth

Boston Spa

St. Pancras

Cambridge Univ.

Oxford Univ.

5

Digital Library

Live Content Streams

Sound Archives

Voluntary Digital Donations

Nineteenth Century Digitised Books

Born Digital Newspapers

Storage

>440,000 Digital Items

>30 Terabytes of Content

Coming soon

eJournals

Digitised Newspapers

6

Role of the British Library in Science, Technology and Medicine

Long history of collecting scientific and technical literature

Serves business & industry, researchers, academics and students

Dedicated reading rooms in London

The Library operates the world’s largest document delivery service - millions of items each year to customers all over the world predominantly in the STM disciplines

Indexing the UK input into Medline/PubMed

Creation of AMED (Allied and Complementary Medicine A&I Database): research articles on complementary medicine and allied health

Lead Partner in UK PubMed Central

7

WorldWideScience.org

Global science gateway based on US Department of Energy’s Science.Gov service

Multilateral partnership to enable federated searching of national and international scientific databases and portals.

Launched in 2008

Large number of countries already providing access to publicly funded research outputs - latest addition is China

Chaired by British Library

8

UK PubMed Central

Number of articles: 1.4 million

Over 2,500 manuscripts submitted by grant holders

Information held on 20,000 research grants awarded to 9,000 PIs by UKPMC Funders

Downloads have grown strongly with over 300,000 in March 2009

UKPMC users are predominantly UK based (70%) but service is accessed across the world

Working with the Bioscience community and Funders to develop the service based on UK research community needs

Launched in January 2007

9

Research Information Centre – the research lifecycle

Based on Microsoft’s SharePoint product

Developed with Microsoft External Research Team

DOI:10.1109/ADVCOMP.2007.14

Beta tested by 25 bioscience research teams (academia & commercial) in UK & US

Supports full research life-cycle

Accessible by web browser

Configured for biosciences but flexible

Designed for collaboration

10

Social Science Collection and Research

New team established in 2006

Priorities: define and develop the collection, improve accessibility, raise awareness, build networks, build capacity

Strong focus on researcher needs

Develop strategies for grey literature and data access

Build the collection of government publications

Recent and historic print collections with LSE and Oxford Social Science Library, …

Digital and web collections with TNA and UK e-OP ‘digital continuity’

Managing Access to Government Information Collaboratively (MAGIC) with LSE

©Clive Sherlock

11

Social Science Collection and Research

Research collaborations:

Voices of the UK; Children’s play in the media age

Knowledge exchange, awareness and capacity building:

Corporate and Social Responsibility seminars

Multi-modal PhD seminars

ESRC Festival of Social Science

ESRC Interns

Postgraduate training days, thematic study days, ESDS seminars

Public events - Census 2011 to explain the role of quantitative and qualitative social surveys

©Clive Sherlock

12

Books and data – a parable

A scientist measured environmental conditions to determine their impact on leather bindings

When the project was complete, he printed the data, bound it, and submitted it to UK copyright libraries

Thirty years later, a scientist took it off the shelf and started to reuse the data, and collect anew

When his project was complete, he had 30,000 images and megabytes of data

Too big for any shelf

Not interesting for a data centre

Is the project web site enough?

13

Journals and data – a problem

In 2003, Legal Deposit Legislation in the UK was extended to cover digital material

Building on the 1911 Legal Deposit Act

Electronic journal articles are covered – they will be collected and archived for the long term

… But supplementary material is not covered

For now, it remains on the publisher web sites

14

Long-term access is critical

According to a PARSE.Insight survey:

50% needed research data gathered by other researchers that was not available

Within High Energy Physics:

More than 90% think that data preservation is important - crucial

Benefits include:

Verify scientific results independently (60%)

Combine past and future data (60%)

Re-analyze in the light of new theories and future results (75%)

45% - old data could have improved their scientific results

40% - important HEP data have been lost in the past

Many are willing to share:

80% would provide data behind tables and figures

45% would provide “raw” data

But 50% believe costs to repackage for sharing are high

15

Widening gap

A widening gap in the scientific record between published research and the data that underlies it

Published work held by libraries

Datasets held by data centres

No effective way to link between datasets and articles

No widely used method to identify datasets

No widely used method to cite datasets

As a result, datasets are:

Difficult to discover

Difficult to access

Second-class citizens in the scientific record

16

Datasets in the scholarly record (OECD White Paper)

45% of journal publishers provide access to datasets associated with journal articles they publish (ALPSP)

But there are no rules about how to publish, present, cite, or otherwise catalogue datasets

Citation

Main mortality estimate: Estimated settler mortality. Settler mortality is calculated from the mortality rates of European-born soldiers, sailors, and bishops when stationed in colonies. It measures the effects of local diseases on people without inherited or acquired immunities. Source: Acemoglu et al. (2001), based on Curtin (1989) and other sources.

Citation

Tertiary school enrollment: School enrollment, tertiary (% of gross). Source: Barro and Lee (2000) and their databases

17

Datasets – first class citizens?

Datasets

Data is difficult to manage after project funding ceases

Informal networks provide the primary means of sharing

Only 21% use a national or international facility

Datasets are not included in impact analysis

Good luck finding it (your discipline may vary)!

UKRDS Study

Published articles

Libraries ensure long-term storage and management

Established funded services provide the primary means of access

Nearly all published articles are held in multiple national libraries

Articles and citations form the backbone of impact analysis

Catalogues and full-text search support discovery

18

Global responses to the challenge

Research council mandates:

Data management plans

Data retention plans

Funded initiatives:

Australian National Data Service

UK Research Data Service

UK Digital Curation Centre

US DataNet programme

JISC Data programme

EU Science Data Infrastructure, …

STM publishers:

Brussels Declaration: Raw research data should be made freely available to all researchers

19

Make visible

Find

Access

Track impact

Verify

Reuse

Cite

?

Persistent identification

A key component for many goals

20

Dataset citation using Digital Object Identifiers (DOIs)

The DOI system offers an easy way to connect the article with the underlying data

Several organisations have started to assign DOIs to datasets

IUCR, ICPSR, OECD through CrossRef

PANGAEA, Mare, and others through TIB (German Science Library)

Dataset

G. Yancheva, N. R. Nowaczyk et al (2007) Rock magnetism and X-ray flourescence spectrometry analyses on sediment cores of the Lake Huguang Maar, Southeast China, PANGAEA doi:10.1594/PANGAEA.587840

Article

G. Yancheva, N. R. Nowaczyk et al (2007) Influence of the intertropical convergence zone on the East Asian monsoon, Nature 445, 74-77 doi:10.1038/nature05431

Cites
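
As a rough illustration of the pairing above, the sketch below formats the two citations and derives a resolvable link for each. The `Citation` class and its field names are assumptions made for this example and are not part of any DOI or DataCite specification; `https://doi.org/` is the public DOI proxy.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """Hypothetical record for a citable object (article or dataset)."""
    authors: str
    year: int
    title: str
    source: str
    doi: str

    def resolver_url(self) -> str:
        # Prefixing the DOI name with the public proxy gives a resolvable link.
        return f"https://doi.org/{self.doi}"

    def format(self) -> str:
        # Same citation shape for articles and datasets.
        return f"{self.authors} ({self.year}) {self.title}. {self.source}. doi:{self.doi}"


dataset = Citation(
    authors="G. Yancheva, N. R. Nowaczyk et al",
    year=2007,
    title=("Rock magnetism and X-ray flourescence spectrometry analyses on "
           "sediment cores of the Lake Huguang Maar, Southeast China"),
    source="PANGAEA",
    doi="10.1594/PANGAEA.587840",
)
article = Citation(
    authors="G. Yancheva, N. R. Nowaczyk et al",
    year=2007,
    title="Influence of the intertropical convergence zone on the East Asian monsoon",
    source="Nature 445, 74-77",
    doi="10.1038/nature05431",
)

for item in (dataset, article):
    print(item.format())
    print(item.resolver_url())
```

Because both objects carry a DOI, the dataset can be cited and linked in exactly the same way as the article.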

21

It looks so easy

Organisational challenges:

Data centres, funders have regional or disciplinary scope

Universities have teaching and research mission and competitive relationships

Publishers do not cover un-published material

Consortia of the above require large and fragile coalitions

We need a consortium of national institutions with a long-term stewardship role

Social challenges:

Acceptance by key stakeholders including funders, data centres, universities, researchers, publishers

Use by data creators and authors

Technical challenges:

Robust infrastructure

Identifying the right thing

Ensuring longevity

22

DataCite

Organisations with the national science library role are working together to establish a European and global infrastructure to support researchers by providing methods for them to locate, identify, and cite research datasets with confidence

Publishing agents (data centres, research institutes) are responsible for:

Quality assurance

Content storage and access

Creating the identifier

Creating and updating metadata

The DataCite registration agency:

Maintains the resolution infrastructure

Maintains a searchable database of metadata

Manages the identifiers over the long term

Establishes and shares best practice

23

Memorandum of Understanding

Paris, March 2, 2009

Recognizing the importance of research datasets as the foundation of knowledge and sharing a common commitment to promote and establish persistent access to such datasets, we, the signed parties, hereby express our interest to work together to promote global access to research data.

Our long term vision is to support researchers by providing methods for them to locate, identify, and cite research datasets with confidence.

24

Initial Signatories

Technische Informationsbibliothek (TIB), Germany

Library of the ETH Zürich, Switzerland

L’Institut de l’Information Scientifique et Technique (INIST), France

Library of TU Delft, The Netherlands

Technical Information Center of Denmark

The British Library

25

Key facts about DOI

Usage:

>35m DOIs have been assigned

>2m resolutions each month

Organizational:

Not-for-profit International DOI Foundation (IDF)

Provides social infrastructure

Includes registration agencies

Registration done in co-operation with a publication agent

Publication agents are responsible for the content

Technical:

A DOI Name is a persistent identifier used to cite and link resources

Linked to an object – not to a location

The location may change, but the DOI remains the same

The DOI System holds metadata about objects including their URL

Resolution redirects the user from a DOI name to the URL
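
The technical points above can be sketched with a toy in-memory resolver. This is an illustration under stated assumptions, not the Handle System or any IDF interface: `ToyResolver`, its methods, the DOI name `10.1234/example.1` and the example.org URLs are all hypothetical.

```python
class ToyResolver:
    """Hypothetical in-memory stand-in for a DOI resolution service."""

    def __init__(self):
        # Maps a DOI name to the metadata held about the object, including its URL.
        self._records = {}

    def register(self, doi, url, **metadata):
        # The identifier is bound to the object, with its current location as metadata.
        self._records[doi] = {"url": url, **metadata}

    def update_url(self, doi, new_url):
        # The object moved: only the stored location changes, never the DOI name.
        self._records[doi]["url"] = new_url

    def resolve(self, doi):
        # Resolution maps the DOI name to whatever URL is currently on record.
        return self._records[doi]["url"]


resolver = ToyResolver()
doi = "10.1234/example.1"  # placeholder DOI name
resolver.register(doi, "https://old.example.org/data/1", title="Example dataset")
before_move = resolver.resolve(doi)

resolver.update_url(doi, "https://new.example.org/archive/1")
after_move = resolver.resolve(doi)

print(before_move)
print(after_move)
```

A citation printed before the move still works afterwards, because it records the DOI name rather than the URL.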

26

Strengths and weaknesses of DOI

DOIs have some strong advantages:

Accepted by researchers and scientists

Mature infrastructure

Put datasets on the same playing field as articles

But perceived as:

Expensive: the current IDF business model favours larger registration agencies

Publisher oriented: the largest registration agency is the publisher-oriented CrossRef

27

DataCite Structure

DataCite

National Institution

Data Centre, Data Centre, Data Centre

National Institution

Data Centre, Data Centre, Data Centre

Carries

Works with

International DOI Foundation

Global Handle System

28

Typical workflow (Data Centre)

Data Centre registers with DataCite

Data Centre ingests a dataset and assigns an identifier

Data Centre registers the dataset by submitting an XML file containing relevant bibliographic metadata and the URL for the dataset’s access page

Metadata drawn from ISO 690-2 for referencing electronic information:

• language

• publisher

• publishing date

• publishing place

• author

• title

• size

• edition
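
The registration step above might look roughly like the following. The element names are hypothetical, chosen to mirror the metadata fields listed above; they are not the DataCite schema, and the DOI, URL and field values are placeholders.

```python
import xml.etree.ElementTree as ET

# Field names mirror the list above; the XML vocabulary itself is hypothetical.
FIELDS = (
    "author", "title", "publisher", "publishing_date",
    "publishing_place", "language", "size", "edition",
)


def build_record(doi, landing_page_url, fields):
    """Serialise one dataset's bibliographic metadata and access-page URL as XML."""
    root = ET.Element("dataset")
    ET.SubElement(root, "identifier", {"type": "DOI"}).text = doi
    ET.SubElement(root, "url").text = landing_page_url
    for name in FIELDS:
        if name in fields:  # fields the data centre did not supply are left out
            ET.SubElement(root, name).text = str(fields[name])
    return ET.tostring(root, encoding="unicode")


record_xml = build_record(
    "10.1234/example.dataset.1",                  # placeholder DOI
    "https://datacentre.example.org/datasets/1",  # placeholder access page
    {
        "author": "A. Researcher",
        "title": "Example dataset",
        "publisher": "Example Data Centre",
        "publishing_date": "2009",
        "language": "en",
    },
)
print(record_xml)
```

Keeping the access-page URL in the record, rather than in the citation, is what lets the data centre move the dataset later without breaking existing references.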

29

Typical workflow (2)

Author:

Includes citation using the DOI, just like an article

Reader:

Follows the resolvable link that includes the DOI (or searches for it), just like an article

Reaches a unique landing page at the Data Centre for the dataset

The landing page is open to every reader

The landing page includes the DOI and metadata to help the reader decide if the dataset will help

May need to take additional steps to access the dataset

30

Research Data in Articles

31

Thanks!

The British Library has a duty of care for the scientific record

Renewed engagement in STM and Social Sciences

Actively partnering to achieve goals

There is a widening gap between published research and the data that underlies it

DataCite will support researchers by enabling them to locate, identify, and cite research datasets with confidence

This is the start of a long and open dialogue

There are many open issues to address

We welcome your comments, questions, and ideas!

Email: adam.farquhar {@} bl.uk