Counting on OpenDOAR
Peter Millington
SHERPA Technical Development Officer
CRC, University of Nottingham
http://www.opendoar.org/
Background to OpenDOAR
• Created in 2005
  – Lists over 2320 repositories (2013-07-02)
• Manually validated
  – High quality…
  – …but we didn’t like to talk about the record counts
• Counts not updated after the initial entry
  – Unless prompted by users
• Fixed in 2012
  – Record counts updated about every 2 weeks
Established counting methods
• Manual inspection
  – Labour-intensive
• Counting OAI-PMH record identifiers
  – Inefficient
    • Handling big files
    • Iterative
  – Unreliable
    • File size limits and timeouts
  – Inaccurate
    • Need to account for deleted records
How difficult can it be?
• SELECT COUNT(*) FROM repository;
  – Still fast even with added complexity
  – Statuses, breakdown by date, etc.
• The number is often there on the web page
  – Headline number, or
  – “x to y of z” tally, or
  – Adding up numbers on a “Browse by year” page
OpenDOAR’s Strategy
• Avoid OAI-PMH whenever possible
• Use other m2m interfaces, if available/suitable
• Screen scrape numbers from web pages
• If all else fails, use manual methods
• Counts for “full texts” as well, where possible
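The order of preference above can be read as a fallback chain. The sketch below is only an illustration of that idea: `first_available_count` and the three stand-in harvesters are hypothetical names, not part of OpenDOAR's software, and the returned numbers are made up.

```python
from typing import Callable, Optional, Sequence, Tuple

# A counting method returns an int, or None when it cannot produce a count.
CountMethod = Callable[[], Optional[int]]


def first_available_count(
    methods: Sequence[Tuple[str, CountMethod]],
) -> Tuple[Optional[str], Optional[int]]:
    """Try each (name, method) pair in order; return the first count found."""
    for name, method in methods:
        try:
            count = method()
        except Exception:
            # Treat any failure (timeout, parse error, ...) as "no count here".
            count = None
        if count is not None:
            return name, count
    return None, None


# Hypothetical stand-ins for real harvesters, in order of preference.
def m2m_interface() -> Optional[int]:
    return None  # pretend no machine-to-machine interface is available


def screen_scrape() -> Optional[int]:
    return 6727  # pretend this number was scraped from a web page


def oai_pmh() -> Optional[int]:
    raise TimeoutError("harvest timed out")


methods = [("m2m", m2m_interface), ("scrape", screen_scrape), ("oai-pmh", oai_pmh)]
print(first_available_count(methods))  # ('scrape', 6727)
```

If every method fails, the function returns `(None, None)`, which is the point at which manual counting would take over.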
Some examples…
Generic n records
Documents avec texte intégral 229181
(French: “Documents with full text 229181”)
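A minimal sketch of the “n records” idea: find a known label on the page and read the number next to it. The function name `count_after_label` and the regular expression are assumptions for illustration; the label would be configured per repository.

```python
import re
from typing import Optional


def count_after_label(text: str, label: str) -> Optional[int]:
    """Return the first integer that follows `label` in `text`, or None."""
    # Allow a little punctuation/whitespace between label and number, and
    # thousands separators (comma, dot or space) inside the number.
    pattern = re.escape(label) + r"\D{0,20}?(\d+(?:[,. ]\d{3})*)"
    match = re.search(pattern, text)
    if match is None:
        return None
    return int(re.sub(r"\D", "", match.group(1)))


sample = "Documents avec texte intégral 229181"
print(count_after_label(sample, "Documents avec texte intégral"))  # 229181
```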
Generic x to y of z counters
DSpace Browse Counter is a special case
Showing results 1 to 20 of 6727
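A tally like the one above can be picked up with a single regular expression. This is a sketch only: the pattern and the name `total_from_tally` are assumptions, and real pages will need per-site wording.

```python
import re
from typing import Optional

# Matches tallies such as "Showing results 1 to 20 of 6727" or "21-40 of 1,234".
TALLY = re.compile(r"(\d+)\s*(?:to|-|–)\s*(\d+)\s+of\s+(\d[\d,]*)", re.IGNORECASE)


def total_from_tally(text: str) -> Optional[int]:
    """Return z from an 'x to y of z' tally, or None if no tally is found."""
    match = TALLY.search(text)
    if match is None:
        return None
    return int(match.group(3).replace(",", ""))


print(total_from_tally("Showing results 1 to 20 of 6727"))  # 6727
```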
DSpace totalCnt Add-on
NCKUR 中的社群 [40782/74662] [ 全文筆數 / 總筆數 ]
(Chinese: “Communities in NCKUR [40782/74662] [ full-text records / total records ]”)
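The bracketed pair gives both the full-text count and the total in one place. A rough sketch of reading it follows; the function name `full_text_and_total` is hypothetical, and the assumption that the first number is the full-text count comes from the page's own legend.

```python
import re
from typing import Optional, Tuple

# Two integers separated by a slash inside square brackets, e.g. [40782/74662].
PAIR = re.compile(r"\[\s*(\d+)\s*/\s*(\d+)\s*\]")


def full_text_and_total(text: str) -> Optional[Tuple[int, int]]:
    """Return (full_text_count, total_count) from the first [a/b] pair found."""
    match = PAIR.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = "NCKUR 中的社群 [40782/74662] [ 全文筆數 / 總筆數 ]"
print(full_text_and_total(line))  # (40782, 74662)
```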
Generic Sum of List Counters
EPrints count Browse List is a special case
Add up the numbers in brackets
Number of items
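Adding up the bracketed numbers on a browse page might look like the sketch below. The sample browse list is invented for illustration, and `sum_bracketed_counts` is a hypothetical name.

```python
import re

# A number in round brackets, e.g. "(340)" or "(1,204)".
BRACKETED = re.compile(r"\((\d[\d,]*)\)")


def sum_bracketed_counts(text: str) -> int:
    """Sum every bracketed number in the text; years outside brackets are ignored."""
    return sum(int(n.replace(",", "")) for n in BRACKETED.findall(text))


# Invented "Browse by year" listing: year followed by the item count in brackets.
browse_page = """
2013 (120)
2012 (340)
2011 (95)
"""
print(sum_bracketed_counts(browse_page))  # 555
```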
EPrints V.3 Counter
http://eprints.nonesuch.ac.uk/cgi/counter
Generic Sum of Numbers
Add up the numbers
Generic HTML tag counting
Count item tags in HTML source code
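A sketch of tag counting with Python's standard `html.parser`. The tag name `li` and class `item` are assumptions chosen for the example; in practice the marker would be whatever tag the repository uses for each record.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags with a given name and CSS class."""

    def __init__(self, tag: str, css_class: str) -> None:
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated class names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.count += 1


def count_item_tags(html: str, tag: str, css_class: str) -> int:
    counter = TagCounter(tag, css_class)
    counter.feed(html)
    counter.close()
    return counter.count


# Invented page fragment: two record entries and one navigation entry.
page = """
<ul>
  <li class="item">Paper A</li>
  <li class="item featured">Paper B</li>
  <li class="nav">Next page</li>
</ul>
"""
print(count_item_tags(page, "li", "item"))  # 2
```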
Counting multiple pages
• Separate pages per letter, document type, etc
• Issues with Greenstone – lack of predictability
OAI-PMH ListIdentifiers: Simple
http:// ... /oai?verb=ListIdentifiers&metadataPrefix=oai_dc
Count these
No resumptionToken
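When the whole list arrives in one response, counting is a matter of counting `header` elements. The sketch below skips headers marked `status="deleted"`, in line with the earlier point about deleted records. The sample response is invented and `count_identifiers` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def count_identifiers(xml_text: str) -> int:
    """Count record headers in a ListIdentifiers response, excluding deleted ones."""
    root = ET.fromstring(xml_text)
    headers = root.findall(".//oai:header", OAI_NS)
    return sum(1 for h in headers if h.get("status") != "deleted")


# Invented response with three headers, one of them deleted.
response = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier></header>
    <header status="deleted"><identifier>oai:example.org:2</identifier></header>
    <header><identifier>oai:example.org:3</identifier></header>
  </ListIdentifiers>
</OAI-PMH>
"""
print(count_identifiers(response))  # 2
```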
OAI-PMH ListIdentifiers: Iterative
resumptionToken for blocks of identifiers
<resumptionToken>193114FUS</resumptionToken>
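A sketch of the iterative loop: keep requesting the next block until the resumptionToken comes back empty or missing. To keep the example self-contained, `fetch` is a hypothetical callable standing in for the HTTP request, and the two pages are invented (only the token value is taken from the slide).

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def count_iteratively(fetch: Callable[[Optional[str]], str], max_pages: int = 10000) -> int:
    """Sum non-deleted headers over all blocks. fetch(None) returns the first block."""
    total = 0
    token: Optional[str] = None
    for _ in range(max_pages):
        root = ET.fromstring(fetch(token))
        headers = root.findall(".//oai:header", OAI_NS)
        total += sum(1 for h in headers if h.get("status") != "deleted")
        token_el = root.find(".//oai:resumptionToken", OAI_NS)
        token = (token_el.text or "").strip() if token_el is not None else ""
        if not token:  # an empty or absent token marks the last block
            return total
    raise RuntimeError("too many pages; possible resumptionToken loop")


def page(ids, token):
    """Build an invented ListIdentifiers block for the demonstration."""
    headers = "".join(
        f"<header><identifier>oai:example.org:{i}</identifier></header>" for i in ids
    )
    token_xml = f"<resumptionToken>{token}</resumptionToken>"
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
        f"<ListIdentifiers>{headers}{token_xml}</ListIdentifiers></OAI-PMH>"
    )


PAGES = {None: page([1, 2], "193114FUS"), "193114FUS": page([3], "")}


def fake_fetch(token: Optional[str]) -> str:
    return PAGES[token]


print(count_iteratively(fake_fetch))  # 3
```

Each block costs a full request and parse, which is why this route is the slowest and the most exposed to timeouts.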
OAI-PMH completeListSize
<resumptionToken completeListSize="89805"
Bingo!
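When the attribute is present, the total can be read from the first response alone. A minimal sketch follows; `complete_list_size` is a hypothetical name and the sample response, including its token value, is invented apart from the size shown on the slide. The attribute is optional in OAI-PMH, so the function returns None when it is missing.

```python
import xml.etree.ElementTree as ET
from typing import Optional

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def complete_list_size(xml_text: str) -> Optional[int]:
    """Return the completeListSize attribute of the resumptionToken, if any."""
    root = ET.fromstring(xml_text)
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    if token_el is None:
        return None
    size = token_el.get("completeListSize")
    return int(size) if size is not None and size.isdigit() else None


# Invented first block; only the size value comes from the slide.
first_block = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier></header>
    <resumptionToken completeListSize="89805">example-token</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>
"""
print(complete_list_size(first_block))  # 89805
```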
Twelve count harvesting methods
• Generic
  – Generic n records
  – Generic x to y of z counters
  – Generic Sum of List Counters
  – Generic HTML tag counting
  – Generic Sum of Numbers
• DSpace
  – DSpace Browse Counter
  – DSpace totalCnt Add-on
• EPrints
  – EPrints count Browse List
  – EPrints V.3 Counter
• OAI-PMH ListIdentifiers
  – Simple
  – Iterative
  – completeListSize
• Manual counting
Efficiency of the methods
[Bar chart: microseconds per item for each method (axis 0 to 25000), with separate series for big files and small files. Methods in the order listed on the chart: Generic Sum of Numbers; Generic n records; OAI-PMH completeListSize; EPrints count Browse List; Generic Sum of List Counters; EPrints V.3 Counter; DSpace totalCnt Add-on; Generic x to y of z counter; DSpace Browse Counter; OAI-PMH Simple count; Generic HTML tag counting; OAI-PMH Iterative count]
Iterative OAI-PMH is so much slower
Relative Frequency of Methods
[Pie chart of the relative frequency of each method]
Legend, in order: DSpace Browse Counter; DSpace totalCnt Add-on; EPrints V.3 Counter; EPrints count Browse List; OAI-PMH completeListSize; OAI-PMH Simple count; OAI-PMH Iterative count; Generic n records; Generic Sum of List Counters; Generic HTML tag counting; Generic x to y of z counter; Generic Sum of Numbers; Manual counting
Percentage labels, in the order extracted: 41%, 3%, 11%, 4%, 6%, 1%, 0%, 18%, 8%, 3%, 0%, 0%, 5%
Ugent
Numbers galore
DSpace and EPrints
Easily scrapeable counts
Count harvesting issues
• No counts visible or harvestable
• Static counts – often approx. – e.g. “over 2m items”
• Connectivity issues
  – Infrastructure limitations – e.g. heavy internet traffic
  – HTTP 401 (unauthorised) & 403 (forbidden) errors
• Data hidden in include files (e.g. JavaScript)
  – Not visible in View Source code
• No direct URL known for the pages with counts
  – Only accessible to human navigators
• Remodelled websites – requiring updated settings
Help OpenDOAR count your repository
• Display record counts on your home page
  – Using distinctive wording & highlighting
  – Ideally in <div id="[ID]"> or <span id="[ID]"> tags
• Ensure numbers can be seen in View Source code
• Ensure pages & files are not blocked to robots
  – Grant read-only access if necessary
• Implement OAI-PMH properly
  – Return ListIdentifiers in chunks – not one big file
  – Include completeListSize in the resumptionToken
• Tell us about any changes, so we can update settings
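To show why an id on the count helps, here is a sketch of how a harvester could read a number from a tagged element. The id `record-count`, the sample markup, and the names `IdTextExtractor` and `count_from_id` are all assumptions for illustration. The sketch does not handle void tags such as `<br>` inside the target element.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional


class IdTextExtractor(HTMLParser):
    """Collects the text inside the element whose id matches element_id."""

    def __init__(self, element_id: str) -> None:
        super().__init__()
        self.element_id = element_id
        self.depth = 0  # greater than 0 while inside the target element
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1  # nested tag inside the target element
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def count_from_id(html: str, element_id: str) -> Optional[int]:
    parser = IdTextExtractor(element_id)
    parser.feed(html)
    parser.close()
    match = re.search(r"\d[\d,]*", "".join(parser.chunks))
    return int(match.group().replace(",", "")) if match else None


home_page = '<p>This repository holds <span id="record-count">6,727</span> items.</p>'
print(count_from_id(home_page, "record-count"))  # 6727
```

Because the lookup depends only on the id, the surrounding wording and layout can change without breaking the harvest.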
Ideas for the Future
• Comparing counts from OpenDOAR & ROAR
  – E.g. Nottm ePrints: 1,239 < 1,277
  – E.g. HAL-Inserm: 7,498 > 2,773
• OpenDOAR
  – Growth charts
  – Full text counts
• Extending OAI-PMH
  – Statistical features
  – Trial PSH