Upload: cheyenne-turner

Post on 01-Jan-2016

Metadata Harvesting

Interoperable digital collections

Distributed libraries

• The reality in most digital libraries is that no one location has all the materials that may be of interest.

• It is often more efficient to allow a number of sites each to retain some of the materials.

• How can we assure clients that they will see all relevant resources, regardless of which library they search?

Two basic approaches

• One service provider with access to resources stored in multiple locations
– Information about all the resources located at the service provider.
– Services (DL scenarios) use the information to provide connections to resources at multiple locations

• Distributed services
– Information kept with the resources
– Services, local to each collection, interact with other collection sites

Two protocols

• Z39.50
– Developed before the web
– Protocol for communicating with collection holders in order to provide services.

• Open Archives Initiative
– Relatively recent innovation
– Central service provider gathers information from collection holders

Z39.50 - briefly

• Information Retrieval Service Definition and Protocol Specifications for Library Applications

• Initially developed over the OSI network standards

• Protocol for information exchange
– Free the information seeker from the need to know the details of the target database configuration

• Each site provides services
– Each service queries remote sites for needed information
    • Information requests mapped to database queries at the collection site.
    • Some inconsistency in the interpretation of queries.

Distributed Resources, Multiple Services

[Figure: one service provider (search, browse, compare, etc.) connected to five data providers]

Approach 1: One service provider gathers information about data and uses it to provide services.

Distributed data and services

[Figure: data providers that also offer their own services (search, browse; search, browse, compare)]

Approach 2: Each system is both a data repository and a service provider. Services query other data providers as needed.

Hybrid systems

[Figure: one service provider (search, browse, compare, etc.) and five data providers]

Each server is likely to have its own clients. The difference is whether the information exchange is periodic or ad hoc.

Open Archives Initiative (OAI)

• Web-based
– Uses HTTP to communicate between sites

• Centralized server
– Services provided from a site that has already gathered the information it needs for those services from a distributed collection of sites.

OAI-PMH

• Interoperability through Metadata Exchange

• The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. Data Providers are repositories that expose structured metadata via OAI-PMH. Service Providers then make OAI-PMH service requests to harvest that metadata. OAI-PMH is a set of six verbs or services that are invoked within HTTP.

http://www.openarchives.org/pmh/
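
Since each of the six verbs is invoked as an ordinary HTTP request, a request URL can be sketched as a base URL plus a `verb` argument. This is a minimal illustration only: the helper name `build_request_url` is hypothetical, and the base URL is the OLAC aggregator address used in the examples elsewhere in these slides.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs.
OAI_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_request_url(base_url, verb, arguments=None):
    """Return the URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    # Extra arguments such as metadataPrefix, from, until, set.
    params.update(arguments or {})
    return base_url + "?" + urlencode(params)


print(build_request_url(
    "http://www.language-archives.org/cgi-bin/olaca3.pl", "Identify"))
```

Arguments are passed as a dictionary because `from` is a reserved word in Python and cannot be used as a keyword argument.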

OAI - ORE

• Aggregations of Web Resources

• Open Archives Initiative Object Reuse and Exchange (OAI-ORE) defines standards for the description and exchange of aggregations of Web resources. These aggregations, sometimes called compound digital objects, may combine distributed resources with multiple media types including text, images, data, and video. The goal of these standards is to expose the rich content in these aggregations to applications that support authoring, deposit, exchange, visualization, reuse, and preservation. Although a motivating use case for the work is the changing nature of scholarship and scholarly communication, and the need for cyberinfrastructure to support that scholarship, the intent of the effort is to develop standards that generalize across all web-based information including the increasingly popular social networks of “web 2.0”.

http://www.openarchives.org/ore/

OAI-ORE example

1. The URI http://arxiv.org/abs/astro-ph/0601007 of the human start page.
2. The formats in which the document is available, i.e. PostScript, PDF, etc. These are effectively the constituents of the aggregation that is the arXiv document. For the remainder of this example we will consider this human start page, the splash page, as also a constituent of the aggregation.
3. The title of the arXiv document.
4. The authors of the arXiv document.
5. The creation and last modification date of the arXiv document.
6. Identifiers of entities that are in some manner comparable to this arXiv document. For example, a version of this document was later published as an article in a peer-reviewed journal, and the Digital Object Identifier of that article is shown.
7. The versions of this document.
8. Links to other arXiv documents in the same collection (i.e., astro-ph).
9. Citations made by this arXiv document, and citations it received from other documents.

http://www.openarchives.org/ore/1.0/primer

The problem is that this URI does not really represent the resource, although this is the human readable landing page.

OAI - ORE

• ORE allows aggregation of related web pages to form a logical unit
– The representation allows access to all of the components of a resource at once.

http://www.openarchives.org/ore/1.0/primer.html#Example

Our focus

• We will concentrate on OAI-PMH
– Allowing us to know about other resources of interest to our societies
– Allowing others to know about the resources we have available

Spot check

• What sort of resources are handled by your site? Are the resources well represented by the landing page? Do you have complex resources that need structural description as well as the usual Dublin Core fields?

• Spend a few minutes talking to someone not on your team about the resources you have and what it takes to describe them. Then switch and listen to the other person’s analysis of their resources.

• Report your conclusions

Older approaches - 1

• Z39.50
– Special purpose protocol (machine to machine, not web interface)
– Gathers information when it is requested, not on a scheduled basis.

OAI Compared to Z39.50

Aspect | Z39.50 | OAI
Content (Objects) | Distributed | Distributed
World View | Bibliographic | Bibliographic
Object Presentation | Data provider | Data provider
Searching is | Distributed | Centralized
Search done by | Data provider | Service provider
Metadata searched is | Up to date | Stale
Semantic Mapping | When searching | Metadata delivery

Source: oai.grainger.uiuc.edu/FinalReport/JCDL_2003_OAI_Intro.ppt

Open Archives Initiative Protocol for Metadata Harvesting -- OAI-PMH

[Figure: a Harvester (Service Provider) sends an HTTP request carrying an OAI verb to a Repository (Metadata Provider), which returns an HTTP response in XML]

• OAI-PMH defines an interface between the Harvester and any number of Repositories
• Implemented as CGI, ASP, PHP, or other
• Any system may serve as a harvester, repository, or both

OAI - PMH components

Service Providers and Data Providers

Requests and Responses

http://www.oaforum.org/tutorial/english/page3.htm#section3

Records

• Metadata of a resource.
• Three parts
– Header (required)
    • Identifier (required: 1 only)
    • Datestamp (required: 1 only)
    • setSpec elements (optional: 0, 1, or more)
    • Status attribute for deleted item
– Metadata (required)
    • XML encoded metadata with root tag, namespace
    • Repositories must support Dublin Core, other formats optional
– “About” statement (optional)
    • Rights statements
    • Provenance statements
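
The header fields listed above can be read out of a record with a standard XML parser. The sketch below is illustrative only: the sample record is made up for this example (the identifier is borrowed from the Identifiers slide; the set name `theses` and the title are hypothetical), and `parse_header` is not part of any OAI library.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical record, built only to show the three-part structure.
SAMPLE_RECORD = """\
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:etd.vt.edu:etd-1234567890</identifier>
    <datestamp>2012-03-28</datestamp>
    <setSpec>theses</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Sample title</dc:title>
    </oai_dc:dc>
  </metadata>
</record>
"""


def parse_header(record_xml):
    """Extract identifier, datestamp, sets, and deleted status from a record."""
    record = ET.fromstring(record_xml)
    header = record.find("oai:header", OAI_NS)
    return {
        "identifier": header.findtext("oai:identifier", namespaces=OAI_NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=OAI_NS),
        # setSpec may appear zero, one, or many times.
        "sets": [s.text for s in header.findall("oai:setSpec", OAI_NS)],
        # A deleted item carries status="deleted" on its header.
        "deleted": header.get("status") == "deleted",
    }


print(parse_header(SAMPLE_RECORD))
```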

Identifiers

• Globally unique identifier

• Valid URI
– Examples
    • oai:<archiveId>:<recordId>
    • oai:etd.vt.edu:etd-1234567890
– Must resolve to one item
    • No duplicates
    • No reuse of previously used identifiers
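
A quick check of the `oai:<archiveId>:<recordId>` pattern shown above might look like the following. This is a rough sketch, not a full URI validator; the function name `parse_oai_identifier` is hypothetical.

```python
def parse_oai_identifier(identifier):
    """Split oai:<archiveId>:<recordId> into (archiveId, recordId).

    Only the first two colons are treated as delimiters, so a record id
    that itself contains a colon is kept whole.
    """
    scheme, _, rest = identifier.partition(":")
    archive_id, _, record_id = rest.partition(":")
    if scheme != "oai" or not archive_id or not record_id:
        raise ValueError(f"not an oai identifier: {identifier!r}")
    return archive_id, record_id


print(parse_oai_identifier("oai:etd.vt.edu:etd-1234567890"))
# prints ('etd.vt.edu', 'etd-1234567890')
```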

Datestamps

• Date of last modification of a record
– Used only for harvesting (meta metadata?)

• Mandatory for each item in the repository
• Two levels of granularity possible
– YYYY-MM-DD
– YYYY-MM-DDThh:mm:ssZ
    • T separates the date from the time; Z marks the time zone, which must be GMT (UTC)

• Allows harvesting incrementally -- get only what is new since last visit
– Accessed by arguments from and until
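
The two granularities can be produced from any time-zone-aware timestamp by converting to UTC first. The sketch below assumes the harvester keeps the time of its last visit as a Python `datetime`; the function name `to_datestamp` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_datestamp(moment, granularity="YYYY-MM-DD"):
    """Format an aware datetime as an OAI-PMH datestamp in UTC."""
    if moment.tzinfo is None:
        raise ValueError("moment must carry a time zone")
    utc = moment.astimezone(timezone.utc)
    if granularity == "YYYY-MM-DD":
        return utc.strftime("%Y-%m-%d")
    if granularity == "YYYY-MM-DDThh:mm:ssZ":
        return utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError(f"unsupported granularity: {granularity}")


# A last-visit time recorded two hours ahead of UTC (illustrative value).
last_visit = datetime(2012, 3, 28, 23, 30, 33,
                      tzinfo=timezone(timedelta(hours=2)))

# Arguments for an incremental harvest: only records changed since then.
print({"from": to_datestamp(last_visit, "YYYY-MM-DDThh:mm:ssZ")})
# prints {'from': '2012-03-28T21:30:33Z'}
```

The granularity used in `from` and `until` should be one the repository reports in its Identify response.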

The question of time

• What time is it?
– How do you represent this moment in time in a message that goes to people in several different places around the world?

• There is a standard for that.
– Look up (Wikipedia will do) the ISO 8601 standard for unambiguous specification of time.
– Write down what time it is right now (use minutes, but not seconds). Yes, the time will change during our discussion.

The OAI-PMH verbs

• Each requests a specific response from a data repository

Identify

• Function: Description of the archive
• Example: http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify
• Parameters: none
• Errors/exceptions:
– badArgument (there should not be any)
• Response format:

Element | Example | Ordinality ‡
repositoryName | My Archive | 1
baseURL | http://archive.org/oai | 1
protocolVersion | 2.0 | 1
earliestDatestamp | 1999-01-01 | 1
deletedRecord | no, transient, persistent | 1
granularity | YYYY-MM-DD, YYYY-MM-DDThh:mm:ssZ | 1
adminEmail | [email protected] | +
compression | deflate, compress | *
description | oai-identifier, eprints, friends, … | *

‡ Ordinality: 1 = mandatory, 1 only; + = mandatory, 1 or more; * = optional, 0 or more

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">

<responseDate>2012-03-28T21:30:33Z</responseDate>

<request verb="Identify">http://www.language-archives.org/cgi-bin/olaca3.pl</request>

<Identify>

<repositoryName>OLAC Aggregator</repositoryName>

<baseURL>http://www.language-archives.org/cgi-bin/olaca3.pl</baseURL>

<protocolVersion>2.0</protocolVersion>

<adminEmail>[email protected]</adminEmail>

<earliestDatestamp>1873-04-18</earliestDatestamp>

<deletedRecord>no</deletedRecord>

<granularity>YYYY-MM-DD</granularity>

<!-- maybe later <compression>identity</compression> -->

<description>...</description>

<description>...</description>

</Identify>

</OAI-PMH>

Actual response from http://www.language-archives.org/cgi-bin/olaca3.pl?verb=Identify

Continued: the two description elements expand as follows.

<description>
<oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
<scheme>oai</scheme>
<repositoryIdentifier>OLACA.language-archives.org</repositoryIdentifier>
<delimiter>:</delimiter>
<sampleIdentifier>oai:ethnologue.com:aaa</sampleIdentifier>
</oai-identifier>
</description>

<description><olac-archive xmlns="http://www.language-archives.org/OLAC/1.1/olac-archive" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" type="institutional" xsi:schemaLocation="http://www.language-archives.org/OLAC/1.1/olac-archive http://www.language-archives.org/OLAC/1.1/olac-archive.xsd" currentAsOf="2012-03-28"><archiveURL>http://www.language-archives.org/archive_records/</archiveURL><participant name="Steven Bird" role="Curator" email="[email protected]"/><participant name="Gary Simons" role="Curator" email="[email protected]"/><participant name="Haejoong Lee" role="Administrator" email="[email protected]"/><institution>Open Language Archives Community</institution><institutionURL>http://www.language-archives.org/</institutionURL><shortLocation>Philadelphia, U.S.A.</shortLocation><location/><synopsis>This repository contains all records from OLAC-registered archives. It is intended to be used by services which do not want to harvest individual OLAC archives.</synopsis><access>Metadata may be used only subject to the access permissions given by the individual archives.</access></olac-archive></description>
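
A harvester would typically read a few of these Identify fields before doing anything else. The sketch below parses a trimmed excerpt of the response shown above (adminEmail and the description elements are left out of the excerpt); `summarize_identify` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Trimmed excerpt of the Identify response from the slides.
IDENTIFY_EXCERPT = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2012-03-28T21:30:33Z</responseDate>
  <request verb="Identify">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
  <Identify>
    <repositoryName>OLAC Aggregator</repositoryName>
    <baseURL>http://www.language-archives.org/cgi-bin/olaca3.pl</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <earliestDatestamp>1873-04-18</earliestDatestamp>
    <deletedRecord>no</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>
"""


def summarize_identify(xml_text):
    """Return the single-valued Identify fields as a dictionary."""
    root = ET.fromstring(xml_text)
    identify = root.find(OAI + "Identify")
    if identify is None:
        raise ValueError("not an Identify response")
    fields = ("repositoryName", "baseURL", "protocolVersion",
              "earliestDatestamp", "deletedRecord", "granularity")
    return {name: identify.findtext(OAI + name) for name in fields}


summary = summarize_identify(IDENTIFY_EXCERPT)
print(summary["repositoryName"], summary["granularity"])
# prints OLAC Aggregator YYYY-MM-DD
```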

ListMetadataFormats

• Function: retrieve available metadata formats from archive

• Parameters: identifier (optional)
• Errors/exceptions:
– badArgument
– idDoesNotExist
– noMetadataFormats

Response to http://www.language-archives.org/cgi-bin/olaca3.pl?verb=ListMetadataFormats

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2012-03-28T21:38:46Z</responseDate>
<request verb="ListMetadataFormats">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
<ListMetadataFormats>
<metadataFormat><metadataPrefix>olac</metadataPrefix><schema>http://www.language-archives.org/OLAC/1.1/olac.xsd</schema><metadataNamespace>http://www.language-archives.org/OLAC/1.1/</metadataNamespace></metadataFormat>
<metadataFormat><metadataPrefix>olac_display</metadataPrefix><schema>http://www.language-archives.org/OLAC/1.1/olac.xsd</schema><metadataNamespace>http://www.language-archives.org/OLAC/1.1/</metadataNamespace></metadataFormat>
<metadataFormat><metadataPrefix>olac_dla</metadataPrefix><schema>http://www.language-archives.org/OLAC/1.1/olac.xsd</schema><metadataNamespace>http://www.language-archives.org/OLAC/1.1/</metadataNamespace></metadataFormat>
<metadataFormat><metadataPrefix>oai_dc</metadataPrefix><schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema><metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace></metadataFormat>
</ListMetadataFormats>
</OAI-PMH>
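
One use of this response is to pick a metadataPrefix before harvesting. The sketch below works on a shortened excerpt of the response above (two of the four formats) and assumes, purely for illustration, that the harvester prefers `olac` and falls back to `oai_dc`; `choose_prefix` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Shortened excerpt of the ListMetadataFormats response from the slides.
FORMATS_EXCERPT = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>olac</metadataPrefix>
      <schema>http://www.language-archives.org/OLAC/1.1/olac.xsd</schema>
      <metadataNamespace>http://www.language-archives.org/OLAC/1.1/</metadataNamespace>
    </metadataFormat>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>
"""


def available_prefixes(xml_text):
    """List the metadataPrefix values a repository advertises."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(OAI + "metadataPrefix")]


def choose_prefix(xml_text, preferred=("olac", "oai_dc")):
    """Return the first preferred prefix the repository supports, else None."""
    offered = available_prefixes(xml_text)
    for prefix in preferred:
        if prefix in offered:
            return prefix
    return None


print(available_prefixes(FORMATS_EXCERPT))  # prints ['olac', 'oai_dc']
print(choose_prefix(FORMATS_EXCERPT))       # prints olac
```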

ListSets

• Function: retrieve set structure of a repository

• Example: archive.org/oai-script?verb=ListSets

• Parameters: resumptionToken (exclusive)
• Errors/exceptions:
– badArgument
– badResumptionToken
– noSetHierarchy

Sets are optional and are used to divide a repository into separate units that will be of interest to different harvesters.

ListIdentifiers

• Function: abbreviated form of ListRecords, retrieve only headers
• Parameters:
– from (optional)
– until (optional)
– metadataPrefix (required)
– set (optional)
– resumptionToken (exclusive)

• Errors/exceptions:
– badArgument
– badResumptionToken
– cannotDisseminateFormat
– noRecordsMatch
– noSetHierarchy

ListRecords

• Function: harvest records from a repository
• Parameters:
– from (optional)
– until (optional)
– metadataPrefix (required)
– set (optional)
– resumptionToken (exclusive)

• Errors/exceptions:
– badArgument
– badResumptionToken
– cannotDisseminateFormat
– noRecordsMatch
– noSetHierarchy
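
Because resumptionToken is exclusive, a harvester sends the full argument set once and then only the token until the repository stops returning one. A minimal sketch of that loop follows; the names `harvest_identifiers` and `http_fetch` are hypothetical, and the `fetch` parameter is an assumption made here so the loop can be tried against canned XML instead of a live repository.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def http_fetch(url):
    """Fetch one response body over HTTP."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def harvest_identifiers(base_url, metadata_prefix, fetch=http_fetch,
                        from_date=None):
    """Yield the identifier of every record returned by ListRecords."""
    arguments = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        arguments["from"] = from_date
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(arguments)))
        error = root.find(OAI + "error")
        if error is not None:
            if error.get("code") == "noRecordsMatch":
                return  # nothing new since the last visit
            raise RuntimeError(f"OAI-PMH error: {error.get('code')}")
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.findtext(OAI + "ListRecords/" + OAI + "resumptionToken")
        if not token:
            return  # missing or empty token: the list is complete
        # The token replaces all other arguments on follow-up requests.
        arguments = {"verb": "ListRecords", "resumptionToken": token}
```

Passing a stub in place of `http_fetch` lets the paging logic be checked without network access.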

GetRecord

• Function: retrieve an individual metadata record from a repository

• Parameters:
– identifier (required)
– metadataPrefix (required)

• Errors/exceptions:
– badArgument
– cannotDisseminateFormat
– idDoesNotExist
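
The error conditions listed for each verb come back inside the XML response as `error` elements carrying a `code` attribute. A harvester might turn them into exceptions, as sketched below; the sample response text and the `OAIError` class are made up for this illustration.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical response to a GetRecord request for an unknown identifier.
ERROR_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2012-03-28T21:45:00Z</responseDate>
  <request verb="GetRecord">http://www.language-archives.org/cgi-bin/olaca3.pl</request>
  <error code="idDoesNotExist">No matching identifier</error>
</OAI-PMH>
"""


class OAIError(Exception):
    """Raised when a repository answers with an OAI-PMH error element."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def raise_for_oai_error(xml_text):
    """Parse a response; raise OAIError if it reports an error."""
    root = ET.fromstring(xml_text)
    error = root.find(OAI + "error")
    if error is not None:
        raise OAIError(error.get("code"), (error.text or "").strip())
    return root


try:
    raise_for_oai_error(ERROR_RESPONSE)
except OAIError as exc:
    print(exc.code)  # prints idDoesNotExist
```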

Spot Check

• Use the site from which we retrieved some information and use the other PMH verbs there.

Interoperability

• The goal: communication, without human intervention, between information sources
– Books that “talk to each other”
    • Live links for references
    • Knowledge of how to find relevant resources when needed
    • Ability to query other information locations

Protocols

• Precise rules for interactions between independent processes
– Format of the messages
    • Both structure and content
– Specified behavior in response to specific messages

• Many ways to accomplish the same result, but both sides must have the same understanding of the rules of engagement.

Protocol Types

• RPC model
– Point to point
– Completely open to definition by developer
    • Verbs (methods)
    • Nouns (objects, resources)
– Useful to closed community or group who know about the availability of the resource.

SOAP

• Originally an acronym for Simple Object Access Protocol; the expansion has since been dropped.

• Initially developed as part of the Microsoft .NET paradigm
– Now in W3C committee

• Stateless, one-way message exchange paradigm
• XML encoded
• Flexibility of RPC, but more constrained in the way communication is formatted.

SOAP is a lightweight protocol intended for exchanging structured information in a decentralized, distributed environment. SOAP uses XML technologies to define an extensible messaging framework, which provides a message construct that can be exchanged over a variety of underlying protocols. The framework has been designed to be independent of any particular programming model and other implementation specific semantics.

http://msdn.microsoft.com/en-us/library/ms995800.aspx

REST

• REpresentational State Transfer
• An after-the-fact definition of the architecture of the World Wide Web
• The model is
– Client/server
– Stateless
– Cacheable
– Layered

• Resource interface constrained
– Restricted verbs
– Restricted content types

• RESTful applications use HTTP requests to post data (create and/or update), read data (e.g., make queries), and delete data. Thus, REST uses HTTP for all four CRUD (Create/Read/Update/Delete) operations.

• REST is a lightweight alternative to mechanisms like RPC (Remote Procedure Calls) and Web Services (SOAP, WSDL, et al.). Later, we will see how much more simple REST is.

• Despite being simple, REST is fully-featured; there's basically nothing you can do in Web Services that can't be done with a RESTful architecture.

http://rest.elkstein.org/

REST and RPC

• RPC provides flexibility for any type of interaction between any type of resources

• REST provides consistency to allow interaction among resources without prior discovery of accepted actions and responses.

SOAP and REST

• Debate in the Web community about which is the better paradigm for application development

• REST -- restricted, but simple extension of existing Web processes

• SOAP -- added flexibility with cost in terms of bandwidth, security, complexity for development

References

• Giving SOAP a REST http://www.devx.com/DevX/Article/8155

• SOAP Version 1.2 Part 0: Primer http://www.w3.org/TR/2003/REC-soap12-part0-20030624/#L1153

• OAI For Beginners - The Open Archives Forum online tutorial: http://www.oaforum.org/tutorial/index.php

• Z39.50 Resource Page: http://www.niso.org/standards/resources/Z3950_Resources.html

• Z39.50 An Overview of Development and the Future (1995): http://www.cqs.washington.edu/~camel/z/z.html

Plus a few other sites as noted in the slides