introduction to the oai metadata harvesting protocol hussein suleman, [email protected]@vt.edu...

55
Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, [email protected] Digital Library Research Laboratory Virginia Tech

Upload: shawn-mason

Post on 26-Dec-2015

226 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

Introduction to the OAI

Metadata Harvesting Protocol

Hussein Suleman, [email protected]

Digital Library Research Laboratory

Virginia Tech

Page 2: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 2

1. Introduction

What is the OAI-MHP?

General System Strategy

Case study: NDLTD

Page 3: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 3

1.1. What is the OAI-MHP ?

What is the Metadata Harvesting Protocol?Protocol to transfer metadata from a source archive to a destination archive

• Any metadata

• In a continuous stream

• As simply as possible

Page 4: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 4

1.2. General System Strategy

Services

Metadata Harvesting

Document Model

Page 5: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 5

1.3. Case Study: NDLTD

Networked Digital Library of Theses and Dissertations

Multiple independent university-based collections of electronic documents

Virginia Tech

Rhodes U.

U.Waterloo

International

ETD

Library

OAI

Metadata

Harvesting

Protocol

Page 6: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 6

2. Definitions / Concepts

Basic PrinciplesWhat is an Open Archive?

Harvesting vs. Federation

Data and Service Providers

Underlying TechnologyHTTP and XML

Protocol PoliciesWhat is a record?

Multiplicity of Metadata

Sets

Datestamp, Harvesting and Flow Control

Page 7: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 7

2.1. What is an Open Archive ?

Any WWW-based system that can be accessed through the well-defined interface of the Open Archives Protocol for Metadata Harvesting… aka OAI-Compliant RepositoryNo implications for:

Physical storage of dataCost of dataMetadata and data formatsAccess control to server

Page 8: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 8

2.2. Harvesting vs Federation

Competing approaches to interoperabilityFederation is when services are run remotely on remote data (e.g. Federated searching)

Harvesting is when data/metadata is transferred from the remote source to the destination where the services are located (e.g. Union catalogues)

Federation requires more effort at each remote source but is easier for the local system and vice versa for harvesting

OAI currently focuses on harvesting

Page 9: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 9

2.3. Data and Service Providers

Data Providers refer to entities who possess data/metadata and are willing to share this with others (internally or externally) via well-defined OAI protocols (e.g. database servers)Service Providers are entities who harvest data from Data Providers in order to provide higher-level services to users (e.g. search engines)OAI uses these denotations for its client/server model (data=server, service=client)

Page 10: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 10

2.4. HTTP and XML

Metadata Harvesting Protocol is an almost stateless request/response protocol

Requests and responses are sent via the HTTP protocol

Requests are encoded as GET/POST operations

Responses are well-formed XML documents

Page 11: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 11

2.5. What is a record ?

A record refers to an independent XML structure that may be associated with digital or physical objects

Records are usually associated with metadata, not data

OAI advocates harvesting of records, which contain metadata and additional fields to support the harvesting operation

Page 12: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 12

2.6. Sample OAI Record

<record> <header> <identifier>oai:sigir:ws3</identifier> <datestamp>2001-08-13</datestamp> </header> <metadata> <dc> <title>OAI Workshop at SIGIR</title> <creator>Hussein Suleman</creator> <language>English</language> </dc> </metadata> <about> <metadataID>oai:sigir:ws3md</metadataID> </about></record>

Page 13: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 13

2.7. Multiplicity of Metadata

Multiple formats of metadata allowed

Dublin Core is mandatory

Any other format allowed as long as it has an XML encoding

E.g. MARC (Libraries), IMS (Education), ETDMS (Theses/Dissertations), RFC1807 (Bibliographies)

Page 14: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 14

2.8. Sets

Protocol mechanism to allow for harvesting of sub-collections

No well-defined semantics – depends completely on local data providers

May be defined by arrangement between data providers and service providers

E.g. Subject areas, years, author names, search queries

Page 15: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 15

2.9. Datestamps & Harvesting

Each record needs a datestamp that indicates its date of creation or modification

Dates are used to allow for harvesting by date range, thus allowing incremental and continuous transfer of metadata from a data provider to a service provider

Page 16: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 16

2.10. Flow Control

HTTP “retry-after” mechanism can be leveraged to support server-side delaying of a client’s request

Resumption Tokens can be used to return partial results – the client is issued with a token which may be presented to the server to receive more results

Page 17: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 17

3. Metadata Harvesting Protocol

Service RequestsIdentifyListMetadataFormatsListSetsGetRecordListIdentifiersListRecords

Metadata MultiplicityDate RangesResumption Tokens

Page 18: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 18

3.1. Identify

PurposeReturn general information about the archive and its policies

ParametersNone

Sample URLhttp://www.anarchive.org/cgi-bin/OAI?verb=Identify

Page 19: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 19

3.2. Identify - Response

Page 20: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 20

3.3. ListMetadataFormats

PurposeList metadata formats supported by the archive as well as their schema locations and namespaces

Parametersidentifier – for a specific record (O)

Sample URLhttp://www.anarchive.org/cgi-bin/OAI?verb=ListMetadataFormats

Page 21: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 21

3.4. ListMetadataFormats - Response

Page 22: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 22

3.5. ListSets

PurposeProvide a hierarchical listing of sets in which records may be organized

ParametersNone

Sample URLhttp://www.anarchive.org/cgi-bin/OAI?verb=ListSets

Page 23: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 23

3.6. ListSets – Response

Page 24: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 24

3.7. GetRecord

PurposeReturns the metadata for a single identifier in the form of an OAI record

Parametersidentifier – unique id for record (R)

metadataPrefix – metadata format (R)

Sample URLhttp://www.anarchive.org/cgi-bin/OAI?verb=GetRecord&identifier=oai:test:123&metadataPrefix=oai_dc

Page 25: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 25

3.8. GetRecord - Response

Page 26: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 26

3.9. ListIdentifiers

PurposeList all unique identifiers corresponding to records in the repository

Parametersfrom – start date (O)until – end date (O)set – set to harvest from (O)resumptionToken – flow control mechanism (X)

Sample URLhttp://www.anarchive.org/cgi-bin/OAI?verb=ListIdentifiers&set=All

Page 27: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 27

3.10. ListIdentifiers - Response

Page 28: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 28

3.11. ListRecords

PurposeRetrieves metadata for multiple records

Parametersfrom – start date (O)until – end date (O)set – set to harvest from (O)resumptionToken – flow control mechanism (X)metadataPrefix – metadata format (R)

Sample URLhttp://www.anarchive.org/cgi-bin/OAI?verb=ListRecord&metadataprefix=oai_dc&from=2001-01-01

Page 29: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 29

3.12. ListRecords - Response

Page 30: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 30

3.13. Metadata Multiplicity

Page 31: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 31

3.14. Date Ranges

Page 32: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 32

3.15. Resumption Token

Page 33: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 33

That’s All Folks !

Page 34: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

The OAI Metadata Harvesting Protocol -

Communities and Services

Hussein Suleman, [email protected]

Digital Library Research Laboratory

Virginia Tech

Page 35: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 35

4. Service Providers

Harvesting 101/102/103

Scheduling

Tools

Repository Explorer

Case Study: ARC

Case Study: NDLTD

VTLS Virtua

Page 36: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 36

4.1. Harvesting 101

ListRecords (from=2000-09-12)

Response

ListRecords (resumptionToken=1)

Response

ListRecords (from=2000-09-13)

Response

resumption

Token=1

SERVICE

PROVIDER

DATA

PROVIDER

Set date=09-13

DAY ONE

DAY TWO

Set date=09-14...

Page 37: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 37

4.2. Harvesting 102

ListMetadataFormats

Response

ListIdentifiersResponse

GetRecord (id=1, prefix=oai_dc)

Response

Identifier:1

Identifier:2

Identifier:3

SERVICE

PROVIDER

DATA

PROVIDER

oai_dc

oai_rfc1807

GetRecord (id=2, prefix=oai_dc)

Response

...

record1

record2

Page 38: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 38

4.3. Harvesting 103

ListMetadataFormats (id=1)

Response

ListIdentifiersResponse

GetRecord (id=1, prefix=oai_dc)

Response

Identifier:1

Identifier:2

Identifier:3

SERVICE

PROVIDER

DATA

PROVIDER

oai_dc

oai_rfc1807

...

record1

Page 39: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 39

4.4. Scheduling

Problems:Granularity is coarse

Timezones are local for each site

Solutions:Overlap one day to compensate for granularity

Overlap one day or use remote times to compensate for timezones

Page 40: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 40

4.5. Tools

Check OAI website for sample codeXML parsers – depending on platform – check W3CXML Schema validators

Very few available – the reference version works but may not be easy to installIgnore validation if you can trust the source

Sample data providers – check the OAI website for a list of conformant public archives

Page 41: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 41

4.6. Repository Explorer

Page 42: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 42

4.7. Case Study: ARC

Page 43: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 43

4.8. Case Study: NDLTD

Virginia Tech U. OldenbergHumboldt U.

NDLTD ETD Union Catalog

VTLS Virtua MARIAN

Search/Browse Engines

Recommender Cross-Ref.

Other Services

Page 44: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 44

4.9. VTLS Virtua

Page 45: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 45

5. OAI Communities

Shared Metadata Formats

Shared semantics

Layering over OAI

Closed OAI networks

OAI within the DL

Page 46: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 46

5.1. Shared Metadata Formats

Use metadata formats accepted within a community to convey more specific information

ExamplesE-Print format (under development)

ETD-MS for theses and dissertations

VRA Core for multimedia

IMS Metadata for educational material

Page 47: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 47

5.2. Shared Semantics

Develop a shared understanding for the meanings of fields

ExamplesDeveloping controlled vocabularies for fields

Using specific fields for external links (OAI recommends using identifier in DC for this)

Choosing from among existing standards (like language names)

Page 48: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 48

5.3. Layering over OAI

Convert OAI records into more standard formats like MARC communications format

Collapse multiple requests into one to make harvesting easier

Name authority system (developed at OCLC) piggybacks name resolution over the OAI protocol

Page 49: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 49

5.4. Closed OAI networks

Data providers need not go public !

Within an organization, OAI can be used for data transfer among heterogeneous systems

More control over use, making global optimizations possible (like harvesting schedules and choice of metadata formats)

Page 50: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 50

5.5. OAI within the DL

Use the OAI protocol as the basis for components to communicate

ExamplesSearch Engines could use dynamic sets to correspond to search results

Browsing can be directed by sets

Reviews and Annotations can each be independent OAI data providers

Page 51: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 51

6. Now What ?

Reality Check

Links

More Links

Page 52: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 52

6.1. Reality Check

DO I REALLY WANT TO DO THIS?

Can I satisfy the requirements to be a data provider?

Do I want to be a service provider ?

Do I want to adopt and support this within my community ?

Page 53: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 53

6.2. Links

Open Archives Initiativehttp://www.openarchives.org

OAI Metadata Harvesting Protocolhttp://www.openarchives.org/OAI/openarchivesprotocol.htm

Virginia Tech DLRL OAI Projectshttp://www.dlib.vt.edu/projects/OAI/

Repository Explorerhttp://purl.org/net/oai_explorer

NDLTDhttp://www.ndltd.org

Page 54: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 54

6.3. More Links

ARC Cross-Archive Search Servicehttp://arc.cs.odu.edu/

XML Schema Validatorhttp://www.w3.org/2001/03/webdata/xsv

Dublin Core Metadata Initiativehttp://www.dublincore.org

E-Prints DL-in-a-boxhttp://www.eprints.org

XML Tools at W3Chttp://www.w3.org/XML/#software

Page 55: Introduction to the OAI Metadata Harvesting Protocol Hussein Suleman, hussein@vt.eduhussein@vt.edu Digital Library Research Laboratory Virginia Tech

SIGIR 2001 Slide 55

That’s All Folks !