Counting on OpenDOAR Peter Millington SHERPA Technical Development Officer CRC, University of Nottingham peter.millington@nottingham.ac.uk

Post on 27-Mar-2015


Page 1

Counting on OpenDOAR

Peter Millington
SHERPA Technical Development Officer

CRC, University of Nottingham
peter.millington@nottingham.ac.uk

Page 2

http://www.opendoar.org/

Background to OpenDOAR

• Created in 2005
  – Lists over 2320 repositories (2013-07-02)

• Manually validated
  – High quality…
  – …but we didn’t like to talk about the record counts

• Counts not updated after the initial entry
  – Unless prompted by users

• Fixed in 2012
  – Record counts updated about every 2 weeks

Page 3

Established counting methods

• Manual inspection
  – Labour-intensive

• Counting OAI-PMH record identifiers
  – Inefficient
    • Handling big files
    • Iterative
  – Unreliable
    • File size limits and timeouts
  – Inaccurate
    • Need to account for deleted records

Page 4

How difficult can it be?

• SELECT COUNT(*) FROM repository;
  – Still fast even with added complexity
  – Statuses, breakdown by date, etc.

• The number is often there on the web page
  – Headline number, or
  – “x to y of z” tally, or
  – Adding up numbers on a “Browse by year” page
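The point about SELECT COUNT(*) can be illustrated with a small in-memory database. This is only a sketch: the `repository` table name comes from the slide, but the `status` and `year` columns and the sample rows are assumptions invented for the illustration, not the schema of any real repository platform.

```python
import sqlite3

# Hypothetical schema: only the table name "repository" comes from the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repository (id INTEGER PRIMARY KEY, status TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO repository (status, year) VALUES (?, ?)",
    [("live", 2011), ("live", 2012), ("deleted", 2012), ("live", 2013)],
)

# Plain count of every row.
total = conn.execute("SELECT COUNT(*) FROM repository").fetchone()[0]

# "Added complexity": filter by status and break the count down by date.
live = conn.execute(
    "SELECT COUNT(*) FROM repository WHERE status = 'live'"
).fetchone()[0]
by_year = dict(
    conn.execute(
        "SELECT year, COUNT(*) FROM repository "
        "WHERE status = 'live' GROUP BY year ORDER BY year"
    ).fetchall()
)

print(total, live, by_year)
```

The repository software can answer all three questions from its own database, which is why the number is so often already shown on the web page.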

Page 5

OpenDOAR’s Strategy

• Avoid OAI-PMH whenever possible
• Use other m2m interfaces, if available/suitable
• Screen scrape numbers from web pages
• If all else fails, use manual methods

• Counts for “full texts” as well, where possible

Page 6

Some examples…

Page 7

Generic n records

Documents avec texte intégral 229181 (French for “Documents with full text”)
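A "generic n records" scrape like the one above might look roughly like this. It is a sketch, not OpenDOAR's actual harvester: the function name `count_after_label` and the 40-character gap limit are assumptions chosen for the example.

```python
import re
from typing import Optional


def count_after_label(page_text: str, label: str) -> Optional[int]:
    """Return the first integer that follows `label`, or None if none is found.

    Up to 40 non-digit characters (an arbitrary limit) may sit between the
    label and the number, e.g. ": " or closing/opening HTML tags.
    """
    pattern = re.escape(label) + r"\D{0,40}?(\d[\d,.]*)"
    match = re.search(pattern, page_text)
    if match is None:
        return None
    # Drop thousands separators such as the comma in "229,181".
    return int(re.sub(r"\D", "", match.group(1)))


sample = "<p>Documents avec texte intégral 229181</p>"
print(count_after_label(sample, "Documents avec texte intégral"))
```

The label has to be configured per repository, which is why distinctive wording around the number makes scraping much easier.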

Page 8

Generic x to y of z counters
DSpace Browse Counter is a special case

Showing results 1 to 20 of 6727
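For an "x to y of z" tally, only the z is needed. A minimal sketch, assuming English wording with "to" or a dash between x and y; the pattern and the name `total_from_tally` are illustrative, and a real page may need its own variant.

```python
import re
from typing import Optional

# Matches "1 to 20 of 6727", "1-20 of 6,727", and similar tallies.
TALLY = re.compile(
    r"(\d[\d,]*)\s*(?:to|-|–)\s*(\d[\d,]*)\s+of\s+(\d[\d,]*)",
    re.IGNORECASE,
)


def total_from_tally(page_text: str) -> Optional[int]:
    """Return z from the first 'x to y of z' tally on the page, or None."""
    match = TALLY.search(page_text)
    if match is None:
        return None
    return int(match.group(3).replace(",", ""))


print(total_from_tally("Showing results 1 to 20 of 6727"))
```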

Page 9

DSpace totalCnt Add-on

NCKUR 中的社群 [40782/74662] [ 全文筆數 / 總筆數 ] (Chinese for “Communities in NCKUR [40782/74662] [ full-text count / total count ]”)
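The bracketed pair gives both the full-text count and the total in one place. The sketch below pulls out the first `[n/m]` pair and reads it in the order given by the page's own legend (full texts first, then total); the function name is an assumption made for the example.

```python
import re
from typing import Optional, Tuple

# First "[digits/digits]" pair on the page; the legend "[ 全文筆數 / 總筆數 ]"
# contains no digits, so it is not matched.
PAIR = re.compile(r"\[\s*(\d+)\s*/\s*(\d+)\s*\]")


def full_text_and_total(page_text: str) -> Optional[Tuple[int, int]]:
    """Return (full_text_count, total_count), or None if no pair is found."""
    match = PAIR.search(page_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = "NCKUR 中的社群 [40782/74662] [ 全文筆數 / 總筆數 ]"
print(full_text_and_total(sample))
```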

Page 10

Generic Sum of List Counters
EPrints count Browse List is a special case

Add up the numbers in brackets
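Adding up the bracketed numbers on a browse page can be sketched as follows. The sample browse list is invented for the illustration; only numbers inside round brackets are summed, so the years themselves are ignored.

```python
import re


def sum_bracketed_counts(page_text: str) -> int:
    """Sum every number that appears inside round brackets, e.g. '(120)'."""
    counts = re.findall(r"\((\d[\d,]*)\)", page_text)
    return sum(int(c.replace(",", "")) for c in counts)


# Hypothetical "Browse by year" listing.
sample = "2013 (120)  2012 (95)  2011 (3)"
print(sum_bracketed_counts(sample))
```

A caveat worth keeping in mind: if an item can appear under more than one heading (for example, several subjects), summing the brackets will overcount.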

Page 11

EPrints V.3 Counter
http://eprints.nonesuch.ac.uk/cgi/counter

Number of items

Page 12

Generic Sum of Numbers

Add up the numbers

Page 13

Generic HTML tag counting

Count item tags in HTML source code
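Counting item tags in the HTML source can be done with Python's standard `html.parser`. In this sketch the choice of `<li class="item">` as the per-record tag is an assumption; the real tag and class would have to be worked out for each repository.

```python
from html.parser import HTMLParser
from typing import Optional


class TagCounter(HTMLParser):
    """Count start tags with a given name and, optionally, a given CSS class."""

    def __init__(self, tag: str, css_class: Optional[str] = None):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class is None or self.css_class in classes:
            self.count += 1


def count_tags(html: str, tag: str, css_class: Optional[str] = None) -> int:
    counter = TagCounter(tag, css_class)
    counter.feed(html)
    return counter.count


# Hypothetical listing: two records plus one navigation entry.
sample = (
    '<ul><li class="item">A</li><li class="item">B</li>'
    '<li class="nav">Next</li></ul>'
)
print(count_tags(sample, "li", "item"))
```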

Page 14

Counting multiple pages

• Separate pages per letter, document type, etc.

• Issues with Greenstone – lack of predictability

Page 15

OAI-PMH ListIdentifiers: Simple
http:// ... /oai?verb=ListIdentifiers&metadataPrefix=oai_dc

Count these

No resumptionToken
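When the whole list arrives in one response, counting comes down to counting `<header>` elements. The sketch below works on an already-downloaded response and skips headers marked `status="deleted"`, in line with the earlier point about deleted records. The sample identifiers are invented.

```python
import xml.etree.ElementTree as ET

# OAI-PMH 2.0 namespace, in ElementTree's "{uri}" notation.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def count_live_identifiers(xml_text: str) -> int:
    """Count <header> elements in a ListIdentifiers response, skipping deleted ones."""
    root = ET.fromstring(xml_text)
    return sum(
        1 for header in root.iter(OAI + "header")
        if header.get("status") != "deleted"
    )


# Hypothetical single-response list: two live records and one deleted record.
sample = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
    "<header><identifier>oai:example.org:1</identifier></header>"
    '<header status="deleted"><identifier>oai:example.org:2</identifier></header>'
    "<header><identifier>oai:example.org:3</identifier></header>"
    "</ListIdentifiers></OAI-PMH>"
)
print(count_live_identifiers(sample))
```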

Page 16

OAI-PMH ListIdentifiers: Iterative

resumptionToken for blocks of identifiers

<resumptionToken>193114FUS</resumptionToken>
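The iterative case means requesting block after block until the resumptionToken comes back empty or missing. The sketch below keeps the network out of the picture by taking a `fetch` callable (an assumption of this example): it receives `None` for the first request and the previous token afterwards, and returns the response XML. In a real harvester, follow-up requests would send `verb=ListIdentifiers&resumptionToken=...` without `metadataPrefix`.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def count_all_identifiers(
    fetch: Callable[[Optional[str]], str], max_requests: int = 10000
) -> int:
    """Follow resumptionTokens and count non-deleted headers across all blocks."""
    total, token = 0, None
    for _ in range(max_requests):
        root = ET.fromstring(fetch(token))
        total += sum(
            1 for h in root.iter(OAI + "header") if h.get("status") != "deleted"
        )
        token_el = root.find(".//" + OAI + "resumptionToken")
        token = (token_el.text or "").strip() if token_el is not None else ""
        if not token:  # empty or absent token: this was the last block
            return total
    raise RuntimeError("gave up: too many requests")


def make_page(headers: str, token_xml: str) -> str:
    """Build a fake ListIdentifiers response for the demonstration."""
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
        + headers + token_xml + "</ListIdentifiers></OAI-PMH>"
    )


# Two fake blocks standing in for a live repository.
PAGES = {
    None: make_page(
        "<header><identifier>a</identifier></header>"
        "<header><identifier>b</identifier></header>",
        "<resumptionToken>193114FUS</resumptionToken>",
    ),
    "193114FUS": make_page(
        "<header><identifier>c</identifier></header>"
        '<header status="deleted"><identifier>d</identifier></header>',
        "<resumptionToken/>",
    ),
}

result = count_all_identifiers(lambda token: PAGES[token])
print(result)
```

Each block costs a full HTTP round trip, which is where the inefficiency and the timeout risk come from.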

Page 17

OAI-PMH completeListSize

<resumptionToken completeListSize="89805"

Bingo!
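When `completeListSize` is present, one request is enough. A minimal sketch that reads the attribute from an already-downloaded response; the sample combines the token and size shown on the slides purely for illustration. Note that the attribute is optional in OAI-PMH and its value may be only an estimate, so a `None` result means falling back to another method.

```python
import xml.etree.ElementTree as ET
from typing import Optional

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def complete_list_size(xml_text: str) -> Optional[int]:
    """Return the completeListSize attribute of the resumptionToken, if any."""
    root = ET.fromstring(xml_text)
    token_el = root.find(".//" + OAI + "resumptionToken")
    if token_el is None:
        return None
    size = token_el.get("completeListSize")
    return int(size) if size is not None and size.isdigit() else None


sample = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
    "<header><identifier>oai:example.org:1</identifier></header>"
    '<resumptionToken completeListSize="89805">193114FUS</resumptionToken>'
    "</ListIdentifiers></OAI-PMH>"
)
print(complete_list_size(sample))
```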

Page 18

Twelve count harvesting methods

• Generic
  – Generic n records
  – Generic x to y of z counters
  – Generic Sum of List Counters
  – Generic HTML tag counting
  – Generic Sum of Numbers

• DSpace
  – DSpace Browse Counter
  – DSpace totalCnt Add-on

• EPrints
  – EPrints count Browse List
  – EPrints V.3 Counter

• OAI-PMH ListIdentifiers
  – Simple
  – Iterative
  – completeListSize

• Manual counting

Page 19

Efficiency of the methods

Iterative OAI-PMH: so much slower

Page 20

Relative Frequency of Methods

Page 21

Ugent: Numbers galore

DSpace and EPrints: Easily scrapeable counts

Page 22

Count harvesting issues

• No counts visible or harvestable
• Static counts – often approx. – e.g. “over 2m items”
• Connectivity issues
  – Infrastructure limitations – e.g. heavy internet traffic
  – HTTP 401 (unauthorised) & 403 (forbidden) errors
• Data hidden in include files (e.g. JavaScript)
  – Not visible in View Source code
• No direct URL known for the pages with counts
  – Only accessible to human navigators
• Remodelled websites – requiring updated settings

Page 23

Help OpenDOAR count your repository

• Display record counts on your home page
  – Using distinctive wording & highlighting
  – Ideally in <div id="[ID]"> or <span id="[ID]"> tags
• Ensure numbers can be seen in View Source code
• Ensure pages & files are not blocked to robots
  – Grant read-only access if necessary
• Implement OAI-PMH properly
  – Return ListIdentifiers in chunks – not one big file
  – Include completeListSize in the resumptionToken
• Tell us about any changes, so we can update settings
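To show why an `id` on the count helps, here is a sketch of a scraper that reads the number straight out of the element with a known id. The id `record-count` is a made-up stand-in for the `[ID]` placeholder above, and the parser assumes well-nested markup with no unclosed tags (such as a bare `<br>`) inside the target element.

```python
import re
from html.parser import HTMLParser
from typing import Optional


class IdTextExtractor(HTMLParser):
    """Collect the text inside the first element carrying a given id."""

    def __init__(self, element_id: str):
        super().__init__()
        self.element_id = element_id
        self.depth = 0      # > 0 while inside the target element
        self.done = False   # True once the target element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif not self.done and dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def count_from_id(html: str, element_id: str) -> Optional[int]:
    """Return the digits found inside the element with `element_id`, as an int."""
    extractor = IdTextExtractor(element_id)
    extractor.feed(html)
    digits = re.sub(r"\D", "", "".join(extractor.chunks))
    return int(digits) if digits else None


sample = '<p>We hold <span id="record-count"><b>12,345</b></span> items.</p>'
print(count_from_id(sample, "record-count"))
```

Because the lookup keys on the id rather than on surrounding wording, it keeps working when the page text or layout around the number changes.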

Page 24

Ideas for the Future

• Comparing counts from OpenDOAR & ROAR
  – E.g. Nottm ePrints: 1,239 < 1,277
  – E.g. HAL-Inserm: 7,498 > 2,773
• OpenDOAR
  – Growth charts
  – Full text counts
• Extending OAI-PMH
  – Statistical features
  – Trial PSH