Automatic Metadata Discovery from Non-cooperative Digital Libraries By Ron Shi, Kurt Maly, Mohammad Zubair IADIS International Conference May 2003


Automatic Metadata Discovery from Non-cooperative Digital Libraries

By

Ron Shi, Kurt Maly, Mohammad Zubair

IADIS International Conference

May 2003


July 6, 2010 Automatic Metadata Discovery and Retrieval 2

Table of Contents
• Introduction
• Motivation
• Problem
• Solution
• Approach
• Challenges
• Automatic Metadata Discovery and Retrieval
• Future Work
• Conclusion
• Questions
• References


Introduction

• What is a digital library?


Motivation
• Growing number of digital libraries on the Internet
• Each implementation done independently from the others
• Provide interoperable service across heterogeneous systems


Problems
• Data providers are independent and follow no common protocol
• A digital library does not provide its metadata or a way to obtain it
• Each digital library defines metadata in its own way
• Each digital library can display any subset of its metadata at its own discretion
• Each digital library has its own rules as to which metadata to display and in what form


Sample Search results of ACM DL


Sample result list page and record page of the CogPrints DL


Proposed Solutions
• Lightweight Federated Digital Library (LFDL)
• Provide a metadata retrieval mechanism for non-cooperating digital libraries
• Post-processing techniques based on general web search engines


Approaches
• Metadata Harvesting
  – Collect data at a central location from different digital libraries
  – Unified search interface
• Distributed Search
  – Metadata resides at its original location
  – Only retrieve relevant metadata when needed


Challenges
• Flexible integration
• Transparent relocation and/or deletion of digital libraries
• Performance requires post-processing of data


Automatic Metadata Discovery and Retrieval


Approach
• Generic universal search interface based on Dublin Core
  – Dublin Core is a set of metadata descriptions about resources on the Internet
  – The Simple Dublin Core Metadata Element Set (DCMES) consists of 15 metadata elements
• Develop a search engine that retrieves pages with metadata
• Define rules to extract metadata from these pages
• Develop a metadata parser
• Use the Dublin Core metadata set as a common set
  – Each individual DL's metadata fields are mapped to the closest Dublin Core field
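The mapping of each library's native fields onto Dublin Core can be pictured as a lookup table. The sketch below is only an illustration, not the authors' implementation: the native field names (`Paper Title`, `Writers`, and so on), the sample record, and the choice to send unmapped fields to `description` are all assumptions. Only the 15 element names come from the DCMES standard.

```python
# The 15 elements of the Simple Dublin Core Metadata Element Set.
DCMES = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Hypothetical mapping for one library: native field name -> Dublin Core element.
FIELD_MAP = {
    "Paper Title": "title",
    "Writers": "creator",
    "Keywords": "subject",
    "Abstract": "description",
    "Year": "date",
    "URL": "identifier",
}


def to_dublin_core(record, field_map=FIELD_MAP, fallback="description"):
    """Map one native record onto Dublin Core.

    Values are collected in lists because Dublin Core elements may repeat.
    """
    mapped = {}
    for field, value in record.items():
        # Assumption: fields without an explicit mapping go to the fallback element.
        element = field_map.get(field, fallback)
        if element not in DCMES:
            raise ValueError(f"{element!r} is not a Dublin Core element")
        mapped.setdefault(element, []).append(value)
    return mapped


# Made-up record in a library's native vocabulary.
native = {"Paper Title": "Federated Search", "Writers": "R. Shi", "Venue": "e-Society"}
dc_record = to_dublin_core(native)
print(dc_record)
```

Keeping the table as data rather than code means a new library can be added by writing a new mapping, which fits the idea of per-library specifications.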


Architecture

[Figure: LFDL architecture diagram. Labeled layers: Presentation Layer, Data Processing Layer, LFDL Core. Labeled components: Universal Search Interface, Filter/XSLT, Controller, Search Engine, Rules Engine, Query Mapping Rules, Metadata Parsing Rules, DL agent, Result Process Engine, Results.xml, Processed Search Results, Local Repository, Intelligent Engine, Consistency Engine, DL 1/2/3 Spec., and the remote libraries DL1, DL2, DL3.]


Architecture (cont.)

[Figure: architecture detail. Labeled components: Filter/XSLT, Controller, Search Engine, Rules Engine, Query Mapping Rules, DL agent, Metadata Parsing Rules, Result Process Engine.]


Retrieval and Parsing
• The Result Process Engine checks the DL specification for parsing rules
• The Process Engine applies the parsing rules and generates metadata, which is stored in a cache
• If the DL specification also defines lower-level metadata parsing rules, all record HTML pages are retrieved from the remote DL and parsed
• Cached metadata is processed further so that it is ready to be displayed
• Results are merged and then displayed to end users
• Periodically, cached metadata is saved to persistent storage such as a database
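The retrieval and parsing flow can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the LFDL code: the specification is modeled as a dict of regular expressions (the real system uses an XML specification), the HTML snippets are made up, `fetch` is a hypothetical stand-in for an HTTP request, and the cache is a plain dict. Periodic persistence to a database is left out.

```python
import re
from itertools import chain

# Hypothetical DL specification: one rule for the result list page and,
# optionally, one lower-level rule for the single-record page.
SPEC = {
    "list_rule": re.compile(r'<a href="(?P<identifier>[^"]+)">(?P<title>[^<]+)</a>'),
    "record_rule": re.compile(r"<b>Author:</b>\s*(?P<creator>[^<]+)"),
}


def process_results(list_html, spec, fetch, cache):
    """Parse a result list page, enrich from record pages if rules exist, and cache."""
    records = []
    for match in spec["list_rule"].finditer(list_html):
        record = match.groupdict()
        key = record["identifier"]
        if key not in cache:
            record_rule = spec.get("record_rule")
            if record_rule is not None:
                # Lower-level rules are defined: retrieve and parse the record page.
                detail = record_rule.search(fetch(key))
                if detail:
                    record.update({k: v.strip() for k, v in detail.groupdict().items()})
            cache[key] = record
        records.append(cache[key])
    return records


def merge(*result_lists):
    """Merge per-library result lists by simple concatenation."""
    return list(chain.from_iterable(result_lists))


# Stand-in for the remote DL (no network access in this sketch).
PAGES = {"/rec/1": "<b>Author:</b> R. Shi <br>", "/rec/2": "<p>no author line</p>"}
list_page = '<li><a href="/rec/1">Paper A</a></li><li><a href="/rec/2">Paper B</a></li>'

cache = {}
results = merge(process_results(list_page, SPEC, PAGES.__getitem__, cache))
print(results)
```

Passing `fetch` in as a parameter keeps the parsing logic testable without contacting a remote library.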


Metadata Parsing Rules Definition
• Metadata parsing rules use the same DL XML specification as query mapping and metadata retrieval
• The Digital Library Definition Language is extended to two levels:
  – Result list page level
  – Single record document level
• The raw string is separated into several segments; each segment holds one or several metadata fields
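The segment-based parsing of a raw string might look like the sketch below. The rule format here is an assumption made for illustration: a separator pattern plus one regular expression per segment, expressed as a Python dict rather than in the XML-based definition language the slides describe. The sample string is made up.

```python
import re

# Hypothetical parsing rule: a separator that splits the raw string into
# segments, plus one pattern per segment naming the fields it holds.
RULE = {
    "separator": r"\s*\|\s*",
    "segments": [
        r"(?P<title>.+)",                           # segment 1: one field
        r"(?P<creator>.+?)\s*\((?P<date>\d{4})\)",  # segment 2: two fields
        r"(?P<publisher>.+)",                       # segment 3: one field
    ],
}


def parse_raw(raw, rule):
    """Split a raw result string into segments and extract the fields of each."""
    segments = re.split(rule["separator"], raw.strip())
    metadata = {}
    for segment, pattern in zip(segments, rule["segments"]):
        match = re.fullmatch(pattern, segment)
        if match:  # a segment that does not fit its pattern is skipped
            metadata.update(match.groupdict())
    return metadata


raw = "Federated Search in Practice | J. Doe (2003) | Example Press"
parsed = parse_raw(raw, RULE)
print(parsed)
```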


Local Repository – Intelligent Cache
• Parsed metadata is stored in a local database
• Improved search performance
• Improved service reliability
• Cache grouped by metadata group provides service quality as good as the search service provided by the individual DL
• The Consistency Engine maintains consistency between local storage and remote digital libraries
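One way to picture a local cache with a consistency pass is sketched below. The slides do not say how the Consistency Engine decides that an entry is out of date, so the age-based rule here is purely an assumption, as are the class name, the in-memory dict standing in for the local database, and the `fetch` callable that returns `None` for records deleted remotely.

```python
import time


class MetadataCache:
    """In-memory stand-in for the local repository, keyed by record identifier."""

    def __init__(self, max_age_seconds, clock=time.time):
        self.max_age = max_age_seconds
        self.clock = clock
        self._entries = {}  # identifier -> (stored_at, metadata)

    def put(self, identifier, metadata):
        self._entries[identifier] = (self.clock(), metadata)

    def get(self, identifier):
        """Return cached metadata, or None if it is missing or older than max_age."""
        entry = self._entries.get(identifier)
        if entry is None:
            return None
        stored_at, metadata = entry
        if self.clock() - stored_at > self.max_age:
            return None
        return metadata

    def refresh_stale(self, fetch):
        """Consistency pass: re-fetch stale entries, drop those gone remotely."""
        now = self.clock()
        for identifier, (stored_at, _) in list(self._entries.items()):
            if now - stored_at > self.max_age:
                fresh = fetch(identifier)  # hypothetical; None means deleted remotely
                if fresh is None:
                    del self._entries[identifier]
                else:
                    self.put(identifier, fresh)


# A fake clock makes the example deterministic.
now = [0.0]
cache = MetadataCache(max_age_seconds=60, clock=lambda: now[0])
cache.put("rec-1", {"title": "Old title"})
cache.put("rec-2", {"title": "Will vanish"})

now[0] = 120.0                      # both entries are now stale
stale_lookup = cache.get("rec-1")   # None, because the entry is too old
remote = {"rec-1": {"title": "New title"}}
cache.refresh_stale(remote.get)
print(cache.get("rec-1"), cache.get("rec-2"))
```

Serving lookups from the cache when entries are fresh is what gives the performance and reliability benefits listed above; the refresh pass is the price paid for not trusting cached data forever.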


Post-processed results in LFDL after metadata parsing


Future Work
• Improve performance through intelligent caching
• Improve service quality through better navigation tool sets


Conclusions
• Pros
  – Easy to follow
  – Comprehensive background information on the problem
  – Detailed explanation of the design architecture
• Cons
  – Incomplete on caching and service
  – Unclear how to deduplicate similar information
  – Repetitive information throughout the paper


Conclusions (cont.)
• Improvements
  – Combine crawling with LFDL
  – Define the scope clearly
  – Use open source architecture such as Hadoop and/or Solr
  – Use the Internet cloud for better availability
  – Demonstrate the financial incentives of this subject


Questions


References
• R. Shi, K. Maly, M. Zubair, “Automatic Metadata Discovery from Non-cooperative Digital Libraries,” IADIS International Conference e-Society, Lisbon, Portugal, Nov 2003
• Fotosearch, http://www.fotosearch.com/bigcomp.asp?path=UNN/UNN501/u14104684.jpg
• Wikipedia, http://en.wikipedia.org/wiki/Dublin_Core
• Answers, http://www.answers.com/topic/dublin-core
• R. Shi, “Lightweight Federation of Non-Cooperative Digital Libraries,” Ph.D. dissertation, Old Dominion University, 2005
• W. Arms, Digital Libraries. Cambridge, MA: MIT Press, 1999
• S. M. Griffin, “Taking the initiative for Digital Libraries,” The Electronic Library, vol. 16, no. 1, pp. 24-27, Feb. 1998
• A. Paepcke, C. K. Chang, T. Winograd, and H. Garcia-Molina, “Interoperability for digital libraries worldwide,” Communications of the ACM, vol. 41, no. 4, pp. 33-43, April 1998