David Shotton - OpenCon Oxford, 1st Dec 2017


Post on 21-Jan-2018




  • Oxford e-Research Centre

    University of Oxford, UK

    OpenCon 2017 Oxford, Weston Library

    1 December 2017

    © David Shotton 2017. Published under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 Licence

    [email protected]

    David Shotton

    Open publication - current progress

    Bibliographic resources and bibliographic citations

  • What proportion of academic papers are open?

    Heather Piwowar et al. recently estimated that at least 28% of the scholarly

    literature is now Open Access, and for 2015, the most recent year they analyzed,

    the proportion is 47% Open Access

    Piwowar, H. et al. The State of OA: A large-scale analysis of the prevalence

    and impact of Open Access articles. PeerJ Prepr. (2017).


    They determined the so-called open-access citation advantage:

    Accounting for age and discipline, Open Access articles receive

    18% more citations than average

    Many funders, including the US National Institutes of Health, the US National

    Science Foundation, the European Commission, the Wellcome Trust, and the Bill and

    Melinda Gates Foundation, make Open Access mandatory for grantees

    For biomedical papers, this is achieved by putting articles in PubMed Central

    PMC (https://www.ncbi.nlm.nih.gov/pmc/) holds ~4.5 million full text articles

    Of these, over 1.6 million comprise the PMC open access subset

  • How to find open content: use Google Scholar

    Use Google Scholar

    Links on the right take you to Open Access copies; this works well

  • On the publisher's web site

  • Open Access copy at Academia.edu

  • How to find open content: use Unpaywall.org

    Unpaywall is a browser plugin that uses oaDOI behind the scenes to search for

    OA versions of journal articles you may be viewing on publishers' web sites

  • Unpaywall uses oaDOI

    oaDOI is a service that finds Open Access copies of articles

    identified by a DOI (Digital Object Identifier)

    If it finds one, it puts an Unpaywall Open Access logo on

    the right of the article page

    Clicking on that takes you to the Open Access version of the paper

    A cool idea

    However, in my experience it fails to catch many OA copies

    because it uses BASE, the Bielefeld Academic Search Engine

    (https://www.base-search.net/), which only searches official Green

    Open Access repositories
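
A minimal sketch of the DOI-based lookup described above. It assumes the present-day Unpaywall REST endpoint (`api.unpaywall.org/v2/`, which expects an `email` parameter) and its `is_oa` and `best_oa_location` response fields; the DOI, email address and sample record are hypothetical, and no network call is made.

```python
from typing import Optional
from urllib.parse import quote

# Assumption: the current Unpaywall REST endpoint, not necessarily the 2017 oaDOI one.
API_ROOT = "https://api.unpaywall.org/v2/"


def lookup_url(doi: str, email: str) -> str:
    """Build the lookup URL for one DOI; the service asks callers to supply an email."""
    return f"{API_ROOT}{quote(doi, safe='/')}?email={quote(email)}"


def best_oa_url(record: dict) -> Optional[str]:
    """Return the URL of the best Open Access copy in a response record, if any."""
    if not record.get("is_oa"):
        return None
    location = record.get("best_oa_location") or {}
    return location.get("url")


# Hand-made sample shaped like a response; not real data.
sample = {
    "doi": "10.1234/example",
    "is_oa": True,
    "best_oa_location": {"url": "https://repository.example.org/paper.pdf"},
}

print(lookup_url("10.1234/example", "you@example.org"))
print(best_oa_url(sample))
```

A browser plugin would fetch the lookup URL for the DOI on the page being viewed and show its logo only when `best_oa_url` returns a link.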

  • Sci-Hub provides illegal access to subscription content

    The pirate website Sci-Hub provides access to scholarly literature via full text

    PDFs illegally downloaded from behind publishers' paywalls

    Its stated goal is to make research papers free, to aid academia

    But several science journals have taken it to court for breach of copyright

    Daniel Himmelstein et al. determined that Sci-Hub contains 70% of all

    ~81.6 million scholarly articles

    This rises to 85% for those published in subscription-access journals, and

    97% for articles published by Elsevier, the largest and least open publisher

    Himmelstein, D. S., Romero, A. R., McLaughlin, S. R., Tzovaras, B. G. & Greene, C. S.

    Sci-Hub provides access to nearly all scholarly literature. PeerJ Prepr. (2017).


  • The fight for open access can get nasty

    Sci-Hub's founder, Alexandra Elbakyan, is a fugitive from justice

    Elsevier won a $15 million court order against her in June

    Sci-Hub's domain names sci-hub.io, sci-hub.ac and sci-hub.cc have recently

    been blocked, following another court order earlier this month

    https://www.theregister.co.uk/2017/11/23/scihubs_become_inactive_following_court_order/

    But http://sci-hub.bz/, and still work

    Martin Eve, Professor of Literature, Technology and Publishing at Birkbeck

    College, University of London, said:

    I think domain blocking is going to prove an ineffective technique to

    shutdown Sci-Hub permanently.

    Academic publishers would do better to reroute their efforts into

    developing business models for scholarly communications that allow open

    dissemination of educational research content and that are, therefore,

    immune to initiatives such as SciHub.

  • Research publishing has changed very little over 350 years

    We still have a linear narrative, with references

    While the article has moved online, and may indeed be Open Access, the norm is to publish a static PDF file, mimicking the printed page

    This is totally antithetical to the spirit of the Web, and ignores its great potential

    Rather, we need lively journal content

    Semantic mark-up of text

    Interactive figures

    Links between papers and datasets

    Actionable numerical data

    . . . what I have called Semantic Publishing

  • Our experiment in semantic enhancement of articles

    To provide a compelling existence proof of the possibilities of semantic

    publication, we took an ordinary research article from PLoS Neglected Tropical

    Diseases and enhanced it as an exemplar

    The results can be seen at http://dx.doi.org/10.1371/journal.pntd.0000228.x001

  • The Five Stars of Online Journal Articles

    Shotton D (2012). The Five Stars of Online Journal Articles - a Framework for Article Evaluation.

    D-Lib Magazine 18 (1/2) (January/February 2012 issue). http://dx.doi.org/10.1045/january2012-shotton

  • The Reis et al. PLoS article, before and after enhancement

    [Figure: five-star ratings of the article, before and after enhancement]

    The article already scored well for open access (O) and peer review (P)

    Our semantic enhancements gave considerable improvement in enhanced

    content (E), available datasets (A) and machine-readable metadata (M)



  • Citations - Crossref provides the fundamental infrastructure


    Crossref is the registration agency for Digital Object Identifiers (DOIs) for

    scholarly publications (journal articles, conference papers, books, etc.)

    Its head office is here in Oxford, at the Oxford Centre for Innovation

    Most scholarly publishers are members, paying annual fees

    For all scholarly publications that have DOIs

    Crossref holds metadata (record of authors, title, publication year, etc.)

    and also reference lists, if these are submitted by the publishers

    Crossref presently holds over one billion references!

    But the records at Crossref are raw data, not organized or structured so

    that non-experts can query them in useful ways, such as asking for the

    highest-cited paper published by a particular university in a particular year
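
To illustrate the gap between raw reference lists and a usable query, here is a minimal sketch that turns per-article reference lists into citation counts. The records are hypothetical; they only imitate the shape of Crossref work metadata, where a `reference` list may hold entries with or without a `DOI` key.

```python
from collections import Counter

# Hypothetical records shaped like Crossref work metadata ("DOI", "reference").
works = [
    {"DOI": "10.1234/a", "reference": [{"DOI": "10.1234/c"}, {"unstructured": "Smith 1999"}]},
    {"DOI": "10.1234/b", "reference": [{"DOI": "10.1234/c"}, {"DOI": "10.1234/a"}]},
    {"DOI": "10.1234/c"},  # no reference list deposited by the publisher
]


def cited_dois(work: dict) -> list:
    """DOIs of the references in one record; references without a DOI are skipped."""
    return [ref["DOI"].lower() for ref in work.get("reference", []) if "DOI" in ref]


def citation_counts(records: list) -> Counter:
    """Count how often each DOI is cited across the given records."""
    counts = Counter()
    for work in records:
        counts.update(cited_dois(work))
    return counts


counts = citation_counts(works)
print(counts.most_common(1))  # the most-cited DOI in this tiny sample
```

Answering "highest-cited paper from one university in one year" would additionally need affiliation and date filters applied across the full set of records, which is the organization the raw data lack.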

  • The importance of citations

    A citation permits an author to give credit to another person's endeavours

    Direct citation is a key indicator of a cited publication's significance

    Citations also integrate our independent acts of scholarship into a global

    knowledge network

    Bibliometric analysis of the citation network can reveal patterns of

    communication between scholars and the development and demise of

    academic disciplines

    But aggregated citation data have been hidden behind subscription firewalls

    In this Open Access age, it is a scandal that all citation data are not freely

    available for use by the scholars who created them

    Citations now need to be recognized as a part of the Commons: basic facts

    that should be freely and legally available for sharing and reuse by all

    The Initiative for Open Citations (I4OC) is working to achieve this

  • The Initiative for Open Citations

    The Initiative for Open Citations is a collaboration between scholarly publishers,

    researchers and others to promote unrestricted availability of scholarly citations

    Launched on April 6, 2017. Web site: https://i4oc.org

    Within a short space of time, I4OC has persuaded most of the major scholarly

    publishers to make their reference lists submitted to Crossref open

    Before I4OC, only 1% of these were open

    By the I4OC launch last April, that proportion was 40%

    By September 2017, more than 50% of the almost one billion journal

    article references stored in Crossref were open

    However, there is much more that publishers could do

    52% of the journal articles documented at Crossref lack references

    And of those that are submitted, almost 50% are not yet open

    See https://opencitations.wordpress.com/2017/11/24/


  • The problem with Elsevier

    The largest scholarly publisher is Elsevier

    It has about 15 million journal article records in Crossref

    References from journal articles published by Elsevier constitute 32% of all

    journal article references stored at Crossref

    While 75% of such references from other publishers are open

    NONE of the ~300 million references from Elsevier articles are open

    As a consequence, of all journal article references deposited at Crossref that

    are not yet open, 65% are from journals published by Elsevier
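
The 65% figure follows from the two percentages quoted on this slide. A short check, using only those two numbers and the statement that no Elsevier references are open:

```python
elsevier_share = 0.32      # Elsevier's share of all deposited references (from the slide)
others_open_rate = 0.75    # open fraction among other publishers' references (from the slide)

others_share = 1 - elsevier_share
open_overall = others_share * others_open_rate         # Elsevier contributes no open references
closed_others = others_share * (1 - others_open_rate)  # closed references from everyone else
closed_overall = elsevier_share + closed_others        # all references that are not yet open
elsevier_share_of_closed = elsevier_share / closed_overall

print(f"open overall: {open_overall:.0%}")
print(f"Elsevier share of closed references: {elsevier_share_of_closed:.0%}")
```

The same arithmetic gives about 51% of references open overall, consistent with the "more than 50%" reported for September 2017.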

    I have just submitted an article to Nature that discusses this problem, entitled

    Open Citations - The Elephant in the Room

    to be published in the New Year

    See https://opencitations.wordpress.com/2017/11/24/


  • Enhancing citation data - the OpenCitations Corpus

    OpenCitations (http://opencitations.net) is an infrastructure organization directed

    by myself and by Silvio Peroni of the University of Bologna

    Its primary purpose is to host and develop the OpenCitations Corpus (OCC), a

    Linked Open Data repository of bibliographic citation data covering all disciplines

    The first OCC prototype was created here in Oxford in 2011

    A new instance of the OCC, based on our revised OpenCitations Metadata Model,

    was then set up with my colleague Silvio Peroni at the University of Bologna

    It has been ingesting scholarly references continuously since early July 2016

    OCC provides the largest Linked Open Data collection of citations on the Web

    Currently holds references from ~285,000 citing bibliographic resources

    Provides >12 million citation links to over 6 million cited resources

    These data are freely available under a CC0 public domain waiver

  • The SPAR (Semantic Publishing and Referencing) Ontologies

    OCC data are described in RDF (JSON-LD) using, with other standard

    vocabularies, the SPAR (Semantic Publishing and Referencing) ontologies

    These SPAR ontologies include

    FaBiO, the FRBR-aligned Bibliographic Ontology - an ontology for

    describing bibliographic entities (books, articles, etc.)

    CiTO, the Citation Typing Ontology - an ontology that enables the

    characterization of citations, both factually and rhetorically

    BiRO, the Bibliographic Reference Ontology - an ontology to define

    bibliographic records and references, and their compilation into

    bibliographic collections and reference lists, respectively
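
As an illustration of what a citation expressed in JSON-LD with SPAR terms can look like, here is a minimal sketch using the CiTO property `cito:cites` and the FaBiO class `fabio:JournalArticle`. It is not the OpenCitations Corpus data model itself: the document shape, the use of DOI URLs as identifiers and the placeholder DOIs are assumptions made for the example.

```python
import json

# Namespace IRIs of two SPAR ontologies.
CONTEXT = {
    "cito": "http://purl.org/spar/cito/",
    "fabio": "http://purl.org/spar/fabio/",
}


def citation_jsonld(citing_doi: str, cited_doi: str) -> dict:
    """Describe 'citing article cites cited article' as a small JSON-LD document."""
    return {
        "@context": CONTEXT,
        "@id": f"https://doi.org/{citing_doi}",
        "@type": "fabio:JournalArticle",
        "cito:cites": {"@id": f"https://doi.org/{cited_doi}"},
    }


# Placeholder DOIs, not real articles.
doc = citation_jsonld("10.1234/citing", "10.1234/cited")
print(json.dumps(doc, indent=2))
```

CiTO also offers more specific properties than `cito:cites` for characterizing why one work cites another, which is the factual and rhetorical typing the slide refers to.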

  • The OpenCitations ingestion rate

    The OpenCitations Corpus is currently ingesting ~8 million new citations per year

    With new hardware funded by the Sloan Foundation OpenCitations

    Enhancement Project, this rate will increase thirty-fold early in 2018 to

    ~240 million new citations per year

    By the end of 2018, the OpenCitations Corpus should hold

    ~250 million citations, compared to Web of Knowledge's ~1.25 billion

    Even this partial coverage will include citations of all important papers

    A further five-fold increase in ingest rate - significant but achievable with

    additional resources (and funding!) - would enable us to reach parity by 2020
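
The projection on this slide can be reproduced with a few lines of arithmetic. This is a rough sketch under stated assumptions: the starting stock is taken as the ">12 million" citation links quoted earlier, the thirty-fold rate is assumed to apply for the whole of 2018, and the Web of Knowledge total is treated as fixed.

```python
current_rate = 8e6   # citations ingested per year now (from the slide)
speedup = 30         # expected gain from the new hardware (from the slide)
held_now = 12e6      # assumption: the ">12 million" citation links quoted earlier
target = 1.25e9      # Web of Knowledge's citation count (from the slide)

rate_2018 = current_rate * speedup      # ~240 million citations per year
held_end_2018 = held_now + rate_2018    # ~250 million citations
rate_after = rate_2018 * 5              # the further five-fold increase
years_to_parity = (target - held_end_2018) / rate_after

print(f"rate in 2018: {rate_2018 / 1e6:.0f} million per year")
print(f"held at end of 2018: {held_end_2018 / 1e6:.0f} million")
print(f"years to parity after that: {years_to_parity:.2f}")
```

Under these assumptions the five-fold rate closes the remaining gap in under a year, which is why parity by 2020 is plausible if the extra resources arrive during 2019.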

  • Where will the references come from?

    We will quickly consume all 1.6 million OA articles in PubMed Central

    We will then start harvesting the half-billion references from the ~18 million

    articles already made open at Crossref in response to The Initiative for Open

    Citations, of which OpenCitations is a founding member

    Other possible sources of open citation data include

    arXiv (1.3 million preprints, mainly in physics and the hard sciences)

    CiteSeerX (>120 million references from >6 million documents)

    CitEc (11 million references from a million Economics papers)

    References from pre-digital publications extracted by text mining, e.g.

    From Bodleian catalogues of its holdings of illuminated manuscripts

    In the Social Sciences, from the LOC-DB at the University of Mannheim

    In Biological Taxonomy, mined into BioStor from the Biodiversity

    Heritage Library, e.g. http://biostor.org/reference/105357

  • Adopting the OpenCitations Data Model

    The OpenCitations data model provides the possibility of interoperability between

    independent citation collections

    Several other organizations and projects have adopted, or are considering

    adoption of the OpenCitations data model

    This will provide immediate interoperability of RDF citation data

    and will enable seamless import into the OpenCitations Corpus

    In this way, we hope that OpenCitations can become a global hub for open

    citation data structured in RDF
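
One way to picture this interoperability: when two collections describe citations with the same vocabulary and identifiers, merging them is a set union of their RDF triples, and shared citations collapse automatically. A minimal sketch with hypothetical collections, using plain Python tuples in place of an RDF store and DOI URLs as stand-in identifiers:

```python
CITES = "http://purl.org/spar/cito/cites"  # the CiTO 'cites' property

# Two hypothetical collections, each a set of (citing, predicate, cited) triples.
collection_a = {
    ("https://doi.org/10.1234/a", CITES, "https://doi.org/10.1234/c"),
    ("https://doi.org/10.1234/b", CITES, "https://doi.org/10.1234/c"),
}
collection_b = {
    ("https://doi.org/10.1234/b", CITES, "https://doi.org/10.1234/c"),  # also in A
    ("https://doi.org/10.1234/d", CITES, "https://doi.org/10.1234/a"),
}

# Because both collections share one model, the union needs no field mapping.
merged = collection_a | collection_b
print(len(merged))
```

If the two collections used different predicates or identifier schemes, the duplicate citation would not be recognized as the same triple, and a mapping step would be needed before import.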

  • 2017: The year of success - citation data are freed!

    Two fantastic success stories

    The Initiative for Open Citations https://i4oc.org/

    The OpenCitations Corpus http://opencitations.net

    Two Italian heroes: Dario Taraborelli and Silvio Peroni


  • Thank you!

    [email protected]

    David Shotton

    Website: http://opencitations.net

    Email: [email protected]

    Twitter: @opencitations

    Blog: https://opencitations.wordpress.com