Surviving the Deluge: Lessons from DOIs and Electronic Publishing at John Wiley & Sons By Matthew Larson

Upload: crossref

Post on 24-Jun-2015

DESCRIPTION

Matt Larson's presentation at the 2009 CrossRef Technical Working Group

TRANSCRIPT

Page 1: Surviving the Deluge: Lessons from DOIs and Electronic Publishing at John Wiley & Sons

Surviving the Deluge: Lessons from DOIs and Electronic Publishing at John Wiley & Sons

By Matthew Larson

Who I Am

• Developer, tech lead, and system architect at Wiley for the last 9 years

• Responsible for DOI registration and other systems that support electronic publishing activities

The Challenge

• A lot of content from a lot of places

• “A lot of content”: more than 1,000 new DOIs per day from journals, books, and major reference works

• “A lot of places”: Offices in New Jersey, Boston, San Francisco, Oxford, Germany, Singapore, and elsewhere

• How to handle DOI registration for all this content?

Surviving the Deluge: Some Principles

• Here are five system design principles we’ve used for handling the constant stream of content

• We’ll look at how we employ them in our CrossRef registration application, XIRS (eXternal Identifier Registration System)

– XIRS was built and launched in 2008 to help handle the combined Wiley/Blackwell content load

– XIRS receives CrossRef XML, submits it to CrossRef, and tracks the CrossRef responses

– Written in Java (1.6) and deployed on Tomcat 6

XIRS Data Flows

Principle 1: Centralization

• Content from multiple sources must go through a single system to be manageable.

– It might be tempting to offer a client library instead, but that invites trouble

• XIRS handles all Wiley DOI registrations

– All error handling, reporting and support in one place

• Any system that needs to store or process content must be on the critical path for publication

– Otherwise you will always be chasing synchronization

Principle 2: Transparency

• Make the system as transparent as possible

– It’s easy to miss errors when batch processing this much content

• How?

– Always make real-time system views available

• Provides quick assurance that everything is alright and shows when it isn’t

– Email daily error reports with the most critical data in the subject line

– Show as much information as possible to make support easier

– Provide easy-to-use reporting so anyone can query the system
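As an illustration of putting the most critical data in the subject line, here is a minimal Python sketch. The function name, the choice of counts, and the subject format are all assumptions made for this example; the slides do not show the actual XIRS report format.

```python
from datetime import date


def error_report_subject(report_date, submitted, failed, pending):
    """Build a subject line that leads with the numbers support staff need first.

    The wording and field order are hypothetical, not the real XIRS format.
    """
    return (
        f"XIRS daily report {report_date.isoformat()}: "
        f"{failed} failed, {pending} pending, {submitted} submitted"
    )


subject = error_report_subject(date(2009, 11, 10), submitted=1200, failed=3, pending=7)
print(subject)
```

Leading with the failure count means a reader can tell from the inbox listing alone whether the report needs to be opened.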

Real-time System Views

Notifications

As Much Info As Possible

Simple Reporting

Principle 3: Heal Thyself

• Systems that batch process this much content must be self-healing

• Lowers support load and increases accuracy

• One good method: queue everything

– Assume that nothing will work on the first try

– Assume that networks, disks, and external services will fail.

– The system that sends data to XIRS queues that data. If XIRS is down, the data flow will start as soon as XIRS is available.

– XIRS itself queues the data it receives. If CrossRef is down, it will try until the data submission is successful.
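The "queue everything" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `RetryQueue`, `drain_once`, and the use of `ConnectionError` to signal an unavailable service are hypothetical names chosen for the example, and XIRS itself is a Java application whose internals are not shown in the slides.

```python
from collections import deque


class RetryQueue:
    """Holds deposits until a submit function accepts them (hypothetical helper)."""

    def __init__(self, submit):
        self.submit = submit      # callable that raises ConnectionError on failure
        self.pending = deque()

    def enqueue(self, deposit):
        self.pending.append(deposit)

    def drain_once(self):
        """Attempt each pending deposit once; failures go back on the queue."""
        delivered = []
        for _ in range(len(self.pending)):
            deposit = self.pending.popleft()
            try:
                self.submit(deposit)
                delivered.append(deposit)
            except ConnectionError:
                self.pending.append(deposit)  # keep it for the next pass
        return delivered
```

A scheduler would call `drain_once()` periodically, so nothing is lost while the downstream service is unavailable and delivery resumes without manual intervention once it returns.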

Principle 3: Heal Thyself

• Another good technique: belt and braces (have backup methods)

– We process the CrossRef emails coming back

– But if we get no CrossRef response (for whatever reason), we call CrossRef directly to get the response

– Lets us avoid extra support work if there are email problems; the system will find those responses itself
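A minimal Python sketch of the belt-and-braces pattern follows. The names `find_response`, `inbox`, and `poll`, and the one-hour grace period before falling back, are assumptions for illustration; the slides do not say how long XIRS waits or how it calls CrossRef.

```python
def find_response(submission_id, inbox, poll, age_seconds, grace_seconds=3600):
    """Return the response for a submission, preferring the email channel.

    inbox: dict of submission id -> response parsed from incoming email
    poll:  callable that asks the service directly (the backup method)
    """
    if submission_id in inbox:
        return inbox[submission_id]      # primary path: the email arrived
    if age_seconds >= grace_seconds:
        return poll(submission_id)       # backup path: ask the service directly
    return None                          # still inside the grace period; keep waiting
```

The caller never needs to know which channel produced the response, so an email outage shows up as a delay rather than as a support ticket.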

Principle 4: Trust No One

• Assume that bad data will show up sooner rather than later

• QA everything

– We parse the CrossRef XML before it is sent to XIRS

– And then XIRS parses it again and rejects it if it doesn’t parse

• Proactively check and fix data

– We insert the timestamp in XIRS so it’s always the latest

– We shorten fields to fit within the size limits

– We check every file for multiple prefixes in the same file and split the file up automatically

– All XSLT must check for string-length(.) > 0 before inserting any elements to avoid empty elements
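Three of these checks can be sketched in Python using only the standard library. This is an illustration, not the XIRS code: the function names, the `record` and `title` element names, and the sample DOIs are assumptions, and `add_if_present` is a Python analogue of the XSLT `string-length(.) > 0` guard rather than XSLT itself.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def group_by_prefix(dois):
    """Group DOIs by prefix (the part before the first slash)."""
    groups = defaultdict(list)
    for doi in dois:
        groups[doi.split("/", 1)[0]].append(doi)
    return dict(groups)


def add_if_present(parent, tag, value):
    """Append a child element only when there is text to put in it."""
    if not value:
        return None
    child = ET.SubElement(parent, tag)
    child.text = value
    return child
```

Each group returned by `group_by_prefix` could then be written to its own file, which is one way to split a mixed-prefix file automatically.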

Principle 4: Trust No One

• Assume files will be too big

– Some reference articles can have 500+ citations

– We limit by DOI count and size on disk

• Assume files will be too small

– Files can come very quickly with the same container-level (journal) DOI

– Causes database contention

– We have to remove container-level DOIs that have already been registered to avoid this
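One way to enforce limits on both DOI count and size is to cut batches whenever either threshold would be exceeded. The Python sketch below assumes records arrive as `(doi, xml_fragment)` pairs and measures size as UTF-8 byte length; the function names and the thresholds are hypothetical, and the slides do not say whether XIRS splits oversized files or rejects them.

```python
def split_batches(records, max_count, max_bytes):
    """Split (doi, xml_fragment) pairs into batches within both limits.

    A single fragment larger than max_bytes still gets a batch of its own.
    """
    batches, current, current_bytes = [], [], 0
    for doi, fragment in records:
        size = len(fragment.encode("utf-8"))
        too_many = len(current) >= max_count
        too_big = current_bytes + size > max_bytes
        if current and (too_many or too_big):
            batches.append(current)          # close the batch before it overflows
            current, current_bytes = [], 0
        current.append((doi, fragment))
        current_bytes += size
    if current:
        batches.append(current)
    return batches


def drop_registered(container_dois, already_registered):
    """Filter out container-level DOIs that have been registered before."""
    return [doi for doi in container_dois if doi not in already_registered]
```

Passing `already_registered` as a set keeps the membership check cheap when many small files arrive in quick succession for the same journal.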

Principle 5: Synchronicity

• All system communications should be synchronous wherever possible

– Delayed/asynchronous responses will be lost when batch processing large amounts of data and support will be much tougher

– Long waits and timeout values are acceptable to ensure that the calling system knows whether the call was successful or not