2013 crossref workshops text data mining geoffrey bilder

Upload: crossref

Post on 24-Jun-2015

DESCRIPTION

2013 CrossRef Workshops presentation on Text and Data Mining by Geoffrey Bilder.

TRANSCRIPT

Page 1: 2013 CrossRef Workshops Text Data Mining Geoffrey Bilder

Geoffrey Bilder Director of Strategic Initiatives

Cambridge, MA 2013

Introducing CrossRef Prospect

Taking the tedium out of TDM….

Text & Data Mining

Gold

Diamond

Text & Data ?

The Problem

Ceramic Society Of Japan * Cevre Koruma Ve Arastirma Vakfi * Cfa Institute * Channel View Publications, Ltd * Chartered Institution Of Building Service Engineers * Chattagram Maa-O-Shishu Hospital Medical College * Chelonian Conservation And Biology Journal * Chelonian Research Foundation * Chem-Bio Informatics Society * Chemical Engineering Diponegoro University * Chemical Science Transactions * Chemical Society Of Japan * Chiang Mai University * Children, Youth And Environments Center * Chimera Innova Group * China Agricultural University * China Communications Magazine, Co., Ltd. * China Journal Of Chinese Materia Medica * China Petroleum Industry Press * China Science Publishing & Media Ltd. * Chinese Astronomical Society * Chinese Birds * Chinese Birds (Press) * Chinese Civilisation Centre * Chinese Geoscience Union * Chinese Institute Of Automation Engineers (Ciae) * Chinese Journal Of Mechanical Engineering * Chinese Mathematical Society * Chinese Physical Society * Chinese Physiological Society * Chinese Society Of Theoretical And Applied Mechanics * Chonnam National University Medical School (Kamje) * Christ University Bangalore *

• All parties would benefit from support of standard APIs and data representations in order to enable TDM across both open access and subscription-based publishers.

• Subscription-based publishers find it impractical to negotiate multiple bilateral agreements with thousands of researchers and institutions in order to authorize TDM of subscribed content.

• Researchers find it impractical to negotiate multiple bilateral agreements with hundreds of subscription-based publishers in order to authorize TDM of subscribed content.

Common API

DOI Content Negotiation

http://dx.doi.org/10.5555/12345678

(Accept: text/html)

http://dx.doi.org/10.5555/12345678

(Accept: application/bibjson+json)
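
The two requests above differ only in their Accept header: the same DOI URL yields a human-readable landing page or machine-readable metadata depending on what the client asks for. A minimal Python sketch of that idea follows. It only builds the requests and does not send them; the helper name build_cn_request is hypothetical, the DOI is written in the usual prefix/suffix form, and the media types are simply the ones shown on the slides.

```python
from urllib.request import Request

# Resolver used on the slides.
DOI_RESOLVER = "http://dx.doi.org/"


def build_cn_request(doi: str, accept: str) -> Request:
    """Build (but do not send) a content-negotiated request for a DOI.

    Hypothetical helper for illustration only.
    """
    return Request(DOI_RESOLVER + doi, headers={"Accept": accept})


# Same DOI, two representations, selected by the Accept header alone.
html_req = build_cn_request("10.5555/12345678", "text/html")
data_req = build_cn_request("10.5555/12345678", "application/bibjson+json")

print(html_req.full_url, html_req.get_header("Accept"))
print(data_req.full_url, data_req.get_header("Accept"))
```

Passing either request to urllib.request.urlopen would perform the lookup, following redirects much as curl does with -L.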

New Metadata

Full Text Link

License Information

Rate Limiting (Optional)

Prospect HTTP Headers

CR-Prospect-Rate-Limit: 1500 (the rate limit ceiling per window on Prospect requests)

CR-Prospect-Rate-Limit-Remaining: 1387 (number of requests left for the current window)

CR-Prospect-Rate-Limit-Reset: 1378072800 (the time, in UTC epoch seconds, at which the rate limit resets and a new window is started)

*this is a technique used by many APIs, including Twitter’s
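
As a rough illustration of how a TDM client might honor these headers, the sketch below computes how long to pause before the next request. It assumes the Reset value is an absolute UTC epoch timestamp (the example value 1378072800 corresponds to 2013-09-01T22:00:00Z); the function name seconds_to_wait is hypothetical and not part of Prospect.

```python
from datetime import datetime, timezone


def seconds_to_wait(headers: dict, now: int) -> int:
    """Return how many seconds a client should pause before its next request.

    Assumption: CR-Prospect-Rate-Limit-Reset is an absolute epoch timestamp.
    """
    remaining = int(headers["CR-Prospect-Rate-Limit-Remaining"])
    reset_at = int(headers["CR-Prospect-Rate-Limit-Reset"])
    if remaining > 0:
        return 0  # requests are still available in the current window
    return max(0, reset_at - now)  # wait until the new window starts


# Header values taken from the slide.
example = {
    "CR-Prospect-Rate-Limit": "1500",
    "CR-Prospect-Rate-Limit-Remaining": "1387",
    "CR-Prospect-Rate-Limit-Reset": "1378072800",
}

print(seconds_to_wait(example, now=1378072700))

# The same window once it has been exhausted, 100 seconds before the reset.
exhausted = dict(example, **{"CR-Prospect-Rate-Limit-Remaining": "0"})
print(seconds_to_wait(exhausted, now=1378072700))

# Human-readable form of the reset time.
reset_time = datetime.fromtimestamp(1378072800, tz=timezone.utc).isoformat()
print(reset_time)
```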

Common API Summary

• Content Negotiation (Required)

• New Metadata (Required)

• Full text URIs

• License URIs

• Rate Limiting Headers (optional)

Stop here if

• You are an open access publisher

• You include TDM as a part of your subscription license/T&Cs.

Click-Through License Service

(Optional)

Researcher queries DOI using CN + API token

Publisher verifies API token with Prospect

If token verified AND access control allows, publisher returns full text

(frequency at publisher discretion)

Researcher queries DOI using CN + API token

curl -H "Accept: text/turtle" "http://dx.doi.org/10.5555/515151" -D - -L

Link: <http://data.crossref.org/full-text/10.5555/515151>; rel="http://id.crossref.org/schema/full-text"; anchor="http://annalsofpsychoceramics.labs.crossref.org/fulltext/515151/515151.pdf"
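
A TDM tool would need to pull the full-text location out of a header like the one above. The following is a minimal sketch of splitting a single Link header value into its target and parameters; parse_link_header is a hypothetical helper that handles only one link per value, not a complete parser for the Link header syntax.

```python
import re


def parse_link_header(value: str) -> dict:
    """Split one Link header value into its <target> and quoted parameters.

    Hypothetical helper: assumes a single link with double-quoted parameters.
    """
    match = re.match(r'\s*<([^>]*)>(.*)', value)
    if match is None:
        raise ValueError("not a Link header value")
    target, rest = match.groups()
    params = dict(re.findall(r';\s*([\w-]+)="([^"]*)"', rest))
    return {"target": target, **params}


# Header value as shown on the slide.
header = (
    '<http://data.crossref.org/full-text/10.5555/515151>; '
    'rel="http://id.crossref.org/schema/full-text"; '
    'anchor="http://annalsofpsychoceramics.labs.crossref.org/fulltext/515151/515151.pdf"'
)

link = parse_link_header(header)
for key, val in link.items():
    print(key, "=", val)
```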

Publisher verifies API token with Prospect

curl -H "CR-Prospect-Publisher-Token: MdvA59fGn8ukykYlSxJL6g" "https://prospect.crossref.org/licenses/hZqJDbcbKSSRgRG_PJxSBA" -D - -L

{
  "result": "ok",
  "message": "licenses",
  "orcid": "0000-0002-1825-0097",
  "given_names": "Josiah",
  "family_name": "Carberry",
  "licenses": [
    {
      "uri": "http://www.crossref.org/tdm_license",
      "status": "rejected",
      "reviewed_at": "2013-05-28T17:09:36+00:00"
    },
    {
      "uri": "http://www.oxygenxml.com/",
      "status": "read",
      "reviewed_at": "2013-05-29T12:08:59+00:00"
    }
  ]
}
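
On the publisher side, the decision "if token verified AND access control allows, return full text" could be sketched roughly as below. The response shown only contains the statuses "rejected" and "read"; the status value "accepted" used here is an assumption, as are the function names, so treat this as an outline of the logic rather than the actual Prospect contract.

```python
import json

# Response body as shown on the slide.
SAMPLE = """
{ "result": "ok", "message": "licenses", "orcid": "0000-0002-1825-0097",
  "given_names": "Josiah", "family_name": "Carberry",
  "licenses": [
    { "uri": "http://www.crossref.org/tdm_license", "status": "rejected",
      "reviewed_at": "2013-05-28T17:09:36+00:00" },
    { "uri": "http://www.oxygenxml.com/", "status": "read",
      "reviewed_at": "2013-05-29T12:08:59+00:00" } ] }
"""


def has_accepted(response_text: str, license_uri: str,
                 accepted_status: str = "accepted") -> bool:
    """True if the researcher has accepted the given license.

    Assumption: acceptance is reported with the status string "accepted".
    """
    data = json.loads(response_text)
    if data.get("result") != "ok":
        return False  # token could not be verified
    return any(
        lic.get("uri") == license_uri and lic.get("status") == accepted_status
        for lic in data.get("licenses", [])
    )


def may_return_full_text(response_text: str, license_uri: str,
                         access_allowed: bool) -> bool:
    """Combine license acceptance with the publisher's own access control."""
    return access_allowed and has_accepted(response_text, license_uri)


# Josiah Carberry rejected this license, so full text is withheld.
print(may_return_full_text(SAMPLE, "http://www.crossref.org/tdm_license", True))
```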

Sustainability Model

• New initiatives are always optional to our members. Members who do not participate in our new initiatives will not be charged for them.

• We do not charge end-users (e.g. researchers, librarians) for access to metadata and APIs

• We sometimes charge intermediaries for access to our services (to cover the cost of administration, maintaining SLAs, etc.)

• We do not charge our members for depositing extra metadata into our services

• We sometimes charge our members for the cost of administering our services, maintaining SLAs, development, etc.

• We eschew charging mechanisms that involve complex administrative overhead. The cost of developing and running them generally negates the revenue raised by implementing them.

• We try to tie any charges as directly as possible to where costs are incurred.

Current State

Prospect Working Group

• AAAS: Walter Jones, Stewart Wills, Deborah Rivera-Wienhold

• American Institute of Physics: Evan Owens

• American Physical Society: Mark Doyle

• Elsevier: Chris Shillum, Ale de Vries

• HighWire: John Sack, Craig Jurney

• Institute of Physics Publishing: Graham McCann, James Walker

• Springer: Chinchu Ann Belarmin, Michiel van der Heyden

• Taylor & Francis: Gillian Howcroft

• Walter de Gruyter: Bettina de Keijzer

• Wiley: Edward Wates, Alan Bacon

• CrossRef: Geoffrey Bilder, Chuck Koscher, Ed Pentz, Carol Meyer, Kirsty Meddings.

• DOI Content Negotiation: Exists

• CrossRef support for recording links to full text: Exists

• CrossRef metadata Search for Discovery: Exists

• CrossRef metadata support for license URIs: Exists

• Click-through TDM license registry: Exists

• Prospect publisher API for verifying, managing tokens: Exists

• Sample publisher code: Exists

• Sample researcher code: Exists

✻ being extended to support mime-types

We are using CrossRef's Prospect text mining API in the context of the Hiberlink project, which investigates reference rot in scholarly papers at a very large scale. The API is really straightforward and based on common technical approaches; it can easily be integrated in a broader workflow. In our case, we have a work bench that monitors newly published papers, obtains their XML version via the API, extracts all HTTP URIs, and then crawls and archives the referenced content. Currently, we can only access Elsevier papers via the API but as more publishers join Prospect, it will become a powerful, uniform one-stop-shop for text mining scholarly literature.

--Martin Klein and Herbert Van de Sompel, Los Alamos National Laboratory

I think this is a big step in the right direction and makes retrieving full text file a lot easier, I hope that publishers support it. --Maximilian Haeussler, UCSD

What do I need to do?

Publishers (required)

• Register full-text URLs with CrossRef

• Register well-known license URIs (<lic_ref>) with CrossRef

Publishers (optional)

• Register click-through proprietary licenses with Prospect click-through service

• Adapt platform APIs to handle Prospect API tokens

Researchers

• Register with Prospect and accept/decline licenses

• Modify TDM tools to look for <lic_ref> elements

• Modify TDM tools to make use of Prospect API token
