![Page 1: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/1.jpg)
Metadata Harvesting
The Hague, 13 & 14 January 2009
Julie Verleyen
Scientific Coordinator, Europeana Office
EuropeanaLocal Knowledge Sharing Workshop
![Page 2: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/2.jpg)
• Harvesting in Europeana: workflow and requirements
• Best-practices
• Recommendations
• Common issues
• Tools / Software
• Resources
• Documentation
Table of Contents
![Page 3: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/3.jpg)
1. Determine collections to be contributed
• Questionnaire
Harvesting in Europeana
![Page 4: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/4.jpg)
2. Obtain OAI-PMH repository parameters:
– Absolute minimum (enough for fully implemented, tested and documented OAI repositories)
• Server base URL
– Very useful to have:
• Mapping between described collection(s) and OAI-PMH set(s)
• Prefix of the metadata format to use preferably for Europeana (if not described in the ListMetadataFormats response), e.g. oai_dc, mods, tel, ese
Harvesting in Europeana
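As a rough illustration of how these parameters are used, the sketch below builds OAI-PMH request URLs from a server base URL, an optional set and a metadata prefix. The base URL and the set name `maps` are hypothetical placeholders, and the helper `oai_request_url` is an illustrative name, not part of any Europeana tool.

```python
from urllib.parse import urlencode

# Hypothetical base URL, used only for illustration.
BASE_URL = "http://repository.example.org/oai"


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    params = {"verb": verb}
    # Drop arguments left as None so optional ones (e.g. set) can be omitted.
    params.update({k: v for k, v in arguments.items() if v is not None})
    return base_url + "?" + urlencode(params)


# Requests a harvester can send to learn about a repository.
identify_url = oai_request_url(BASE_URL, "Identify")
formats_url = oai_request_url(BASE_URL, "ListMetadataFormats")
sets_url = oai_request_url(BASE_URL, "ListSets")

# Selective harvest of one set in one metadata format (both values assumed).
records_url = oai_request_url(
    BASE_URL, "ListRecords", metadataPrefix="oai_dc", set="maps"
)
print(records_url)
```

With only the base URL, a harvester can still call Identify, ListSets and ListMetadataFormats; the set mapping and preferred prefix save it from guessing which values to pass to ListRecords.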
![Page 5: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/5.jpg)
3. Configuration of harvester
4. Full harvest with ListRecords request
– Records collected in XML files ≤ 10 MB
– Harvest stored in SVN
Harvesting in Europeana
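A full ListRecords harvest can be sketched as a loop that follows resumptionTokens until the repository returns none (or an empty one). This is a minimal sketch using only the Python standard library, not the Europeana harvester itself; the function names are illustrative, and splitting the output into files or storing it in SVN is left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def fetch(url):
    """Download one OAI-PMH response as bytes."""
    with urlopen(url) as response:
        return response.read()


def harvest_records(base_url, metadata_prefix, set_spec=None, fetch=fetch):
    """Yield every <record> element of a full ListRecords harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        list_records = root.find(OAI_NS + "ListRecords")
        if list_records is None:
            # The repository answered with <error> elements instead of records.
            codes = [e.get("code") for e in root.findall(OAI_NS + "error")]
            raise RuntimeError(f"OAI-PMH error response: {codes}")
        yield from list_records.findall(OAI_NS + "record")
        token = list_records.find(OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            # No token, or an empty one, marks the end of the list.
            return
        # Follow-up requests carry only the verb and the resumptionToken.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

The `fetch` parameter is there so the loop can be exercised against canned responses before it is pointed at a live repository.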
![Page 6: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/6.jpg)
![Page 7: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/7.jpg)
• Compliance with the OAI-PMH 2.0 protocol specification:
http://www.openarchives.org/OAI/openarchivesprotocol.html
• Follow the OAI-PMH v2 implementation guidelines for repository implementers:
http://www.openarchives.org/OAI/2.0/guidelines-repository.htm
• Full functional tests!!
Best-practices: implementation
![Page 8: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/8.jpg)
OAI validation = Your OAI repository correctly implements the OAI-PMH!
Correct response to all OAI-PMH requests: with arguments, under various error conditions, and with every OAI response valid against its XML schema.
Best-practices: OAI validation
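One of the error conditions a repository has to handle is a request with an illegal verb, which the protocol answers with a `badVerb` error. The sketch below only shows how a test could read the error codes out of a response; the sample response is shortened and hand-written for illustration, and `oai_error_codes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def oai_error_codes(response_xml):
    """Return the list of error codes found in an OAI-PMH response."""
    root = ET.fromstring(response_xml)
    return [error.get("code") for error in root.findall(OAI_NS + "error")]


# Shortened, hand-written example of a reply to a request with an illegal verb.
BAD_VERB_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2009-01-13T10:00:00Z</responseDate>
  <request>http://repository.example.org/oai</request>
  <error code="badVerb">Illegal OAI verb</error>
</OAI-PMH>"""

print(oai_error_codes(BAD_VERB_RESPONSE))
```

A functional test suite would send deliberately wrong requests (unknown verb, missing metadataPrefix, invalid resumptionToken) and compare the returned codes with those the specification prescribes.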
![Page 9: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/9.jpg)
• Follow the Open Archives Initiative protocol testing.
• Validate your server using the validator supplied by the OAI:
http://www.openarchives.org/data/registerasprovider.html
To validate without registering, tick the checkbox "only validate and do not register (you may then register later)."
Recommended approach to OAI validation
![Page 10: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/10.jpg)
http://www.openarchives.org/data/registerasprovider.html
![Page 11: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/11.jpg)
#Protocol_Conformance_Testing
![Page 12: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/12.jpg)
http://www.openarchives.org/data/registerasprovider.html => bottom of the page
![Page 13: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/13.jpg)
![Page 14: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/14.jpg)
• Set = "an optional construct for grouping items for the purpose of selective harvesting."
Issues and recommendations: sets
![Page 15: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/15.jpg)
Number of obstacles related to sets:
• Interpreting how a repository has organized sets and determining which sets to harvest
– Issue: setName is not human-understandable and/or no setDescription is provided.
– Issue: Large number of sets to sort through.
• Knowing when there are records that belong to no sets
– Issue: Items that belong to no sets are included in the OAI repository.
• Knowing when there are empty sets
– Issue: Data provider exposes sets with no records.
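To make the first obstacle concrete, a harvester can list the sets of a repository and flag those without a setDescription. The sketch below parses a hand-written ListSets response; the set names are invented, and `summarize_sets` is an illustrative helper. Note that a ListSets response alone does not reveal whether a set is empty.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def summarize_sets(list_sets_xml):
    """Return (setSpec, setName, has_description) for each set in a ListSets response."""
    root = ET.fromstring(list_sets_xml)
    summary = []
    for set_element in root.iter(OAI_NS + "set"):
        spec = set_element.findtext(OAI_NS + "setSpec")
        name = set_element.findtext(OAI_NS + "setName")
        has_description = set_element.find(OAI_NS + "setDescription") is not None
        summary.append((spec, name, has_description))
    return summary


# Hand-written sample: one cryptic set, one described set (names are invented).
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set>
      <setSpec>col1</setSpec>
      <setName>col1</setName>
    </set>
    <set>
      <setSpec>maps</setSpec>
      <setName>Historical maps</setName>
      <setDescription>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:description>Digitised maps</dc:description>
        </oai_dc:dc>
      </setDescription>
    </set>
  </ListSets>
</OAI-PMH>"""

for spec, name, described in summarize_sets(SAMPLE_LIST_SETS):
    print(spec, name, "described" if described else "NO DESCRIPTION")
```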
![Page 16: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/16.jpg)
Number of obstacles related to sets:
• Understanding relationships between sets
– Issue: Relationships between sets are not expressed.
• There is a mechanism to express relationships between hierarchical sets
• But no mechanism to express relationships between overlapping sets!
• The only way to know: harvest the identifiers or records, whose headers list the sets each record belongs to
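Because overlaps are only visible in record headers, a harvester can rebuild set membership from a ListIdentifiers (or ListRecords) response and then look for sets that share records. The following is a minimal sketch over a single response page with invented identifiers and set names; a real harvest would feed every page into the same structure.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import combinations

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def set_membership(list_identifiers_xml):
    """Map each setSpec to the identifiers whose header lists it."""
    members = defaultdict(set)
    root = ET.fromstring(list_identifiers_xml)
    for header in root.iter(OAI_NS + "header"):
        identifier = header.findtext(OAI_NS + "identifier")
        for set_spec in header.findall(OAI_NS + "setSpec"):
            members[set_spec.text].add(identifier)
    return members


def overlapping_sets(members):
    """Return pairs of sets that share at least one record."""
    return [
        (a, b)
        for a, b in combinations(sorted(members), 2)
        if members[a] & members[b]
    ]


# Hand-written sample page (identifiers and set names are invented).
SAMPLE_HEADERS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>id1</identifier><setSpec>maps</setSpec><setSpec>photos</setSpec></header>
    <header><identifier>id2</identifier><setSpec>maps</setSpec></header>
    <header><identifier>id3</identifier><setSpec>books</setSpec></header>
  </ListIdentifiers>
</OAI-PMH>"""

print(overlapping_sets(set_membership(SAMPLE_HEADERS)))
```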
![Page 17: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/17.jpg)
Number of obstacles related to sets:
• Knowing how many records there are within a set before harvesting
– Issue: The number of records within a set is not expressed, although it can be given via the completeListSize attribute of a resumptionToken or within the set description.
• Knowing when a set structure has been substantially changed
– Issue: Changes in a set structure have not been communicated.
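When a repository does provide completeListSize, a harvester can read it from the first response of a list request. The sketch below assumes a hand-written response fragment; the attribute is optional in the protocol, so the function returns None when it is missing. `announced_list_size` is an illustrative name.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def announced_list_size(response_xml):
    """Return completeListSize from the resumptionToken, or None if absent."""
    root = ET.fromstring(response_xml)
    token = next(root.iter(OAI_NS + "resumptionToken"), None)
    if token is None:
        return None
    size = token.get("completeListSize")
    return int(size) if size is not None else None


# Hand-written sample; the token value and numbers are invented.
SAMPLE_WITH_SIZE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>id1</identifier></header>
    <resumptionToken completeListSize="12500" cursor="0">abc123</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

print(announced_list_size(SAMPLE_WITH_SIZE))
```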
![Page 18: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/18.jpg)
• No single best practice for the organization of sets.
• Realistically: data providers organize sets in a way which best meets the needs of their primary service provider and can be easily done within their own internal workflows.
• Useful to organize the metadata items into sets according to the collections of resources they represent.
– Concept of collections varies and is not completely clear in Europeana.
– Useful for the harvester to understand the data providers' notion of collection.
Sets: recommendations
![Page 19: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/19.jpg)
• Repository implementation following OAI-PMH v2.0 + tested
• Inform the person responsible for Europeana harvesting of any repository changes / maintenance
• No regular harvesting schedule determined yet
• “SLA” between data providers and harvesters
Basic requirements
![Page 20: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/20.jpg)
• Unavailability / unreliability of repository server
• Implementation of OAI-PMH v2 incomplete
– resumptionToken not supported
– Only ListIdentifiers
• XML syntax errors
• Character encoding errors
• Short lifetime of resumptionToken
Common issues
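Two of these issues, XML syntax errors and character encoding errors, can be caught before a response is stored. The sketch below checks a raw response for valid UTF-8 (the encoding the protocol requires) and for XML well-formedness; it does not validate against the OAI-PMH schema, and `check_response` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET


def check_response(raw_bytes):
    """Return a list of problems found in one raw OAI-PMH response."""
    problems = []
    try:
        # OAI-PMH responses must be encoded as UTF-8.
        raw_bytes.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"character encoding error: {exc}")
    try:
        # Parsing fails on anything that is not well-formed XML.
        ET.fromstring(raw_bytes)
    except ET.ParseError as exc:
        problems.append(f"XML syntax error: {exc}")
    return problems


# An unclosed element is reported as a syntax error.
print(check_response(b"<OAI-PMH><ListRecords></OAI-PMH>"))
```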
![Page 21: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/21.jpg)
TEL/Europeana OAI-PMH Harvester – Offline documentation
– Harvester
– Java standalone application with GUI
– Multiple harvesting jobs
– Resuming unfinished jobs
– Logging
– No scheduling, No configuration interface
Tools / Software
![Page 22: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/22.jpg)
REPOX - http://repox.ist.utl.pt/
• Repository + Harvester
• Java standalone application with web GUI
• Multiple harvesting jobs, Scheduler
• Statistics
• Management of XML metadata repository
– Versioning and identification of records
– Different metadata formats
– User interface to create metadata crosswalks: Schema mapper
Tools / Software
![Page 23: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/23.jpg)
OAIcat from OCLC - http://www.oclc.org/research/software/oai/cat.htm
• Framework conforming to the OAI-PMH v2.0
• Repository + Harvesting
• Java web application
• Scheduling, logging
• Limited scalability (~2M records)
Tools / Software
![Page 24: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/24.jpg)
Other implementations in different languages to plug into a Library Management System:
– PHP: OAIbiblio, a data provider implementation of the OAI-PMH, version 2.0. This toolkit can be easily customized to communicate with an already existing, multi-table MySQL database.
– Perl: CelestialOAI, an aggregator/cache application that imports OAI metadata from version 1.0, 1.1 and 2.0 OAI-compliant repositories and re-exposes that metadata through either an aggregated or a per-repository OAI-compliant 2.0 interface. Celestial requires oai-perl v2, MySQL, Perl 5.6.x and a CGI-capable web server.
– Ruby: ruby-oai, which includes a client library, a server/provider library and an interactive harvesting shell.
– Python: pyoai, a package that enables high-level access to an OAI-PMH metadata repository and also implements a framework for quickly creating OAI-PMH compliant servers.
Tools / Software (TELplus D2.1)
![Page 25: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/25.jpg)
• ESE XML validation schemas developed by partners
Tools / Software
![Page 26: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/26.jpg)
• The Open Archives Initiative Protocol for Metadata Harvesting v2.0 http://www.openarchives.org/OAI/openarchivesprotocol.html
• TELplus D2.1, “OAI-PMH implementation and tools guidelines”, 21 pages
– Protocol overview and description of main concepts
– OAI-PMH implementation in libraries
– References
Resources
![Page 27: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/27.jpg)
![Page 28: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/28.jpg)
• Wiki “Best Practices for OAI Data Provider Implementations and Shareable Metadata”: excellent source of guidelines, tutorials, recommendations, implementation software and tools, references, etc.
http://webservices.itcs.umich.edu/mediawiki/oaibp/index.php/Main_Page
Resources
![Page 29: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/29.jpg)
![Page 30: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/30.jpg)
![Page 31: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/31.jpg)
![Page 32: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/32.jpg)
![Page 33: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/33.jpg)
![Page 34: Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56649e415503460f94b32f4d/html5/thumbnails/34.jpg)
• Requirements:
– Europeana OAI-PMH Harvesting
– Europeana OAI-PMH Repositories
• ESE XML validation schema
• Europeana OAI-PMH data providers registry & forum/mailing list
– Local systems
– OAI-PMH repository solution
– Contact
Documentation in Europeana context