Distributed Information Discovery CS 430 Carl Lagoze 2001-03-08 Lecture 14

Page 1: Distributed Information Discovery

Distributed Information Discovery

CS 430

Carl Lagoze 2001-03-08

Lecture 14

Page 2: Distributed Information Discovery

Goals and Motivation

• Lesson from the Web: relevant and valuable information is “everywhere”

• Rethinking the “library” in the digital age:
– Not as collector of information
– Rather as access point to distributed information

• Perfect scenario: uniform access to all information with rich functionality

Page 3: Distributed Information Discovery

Problems with the Perfect Scenario

• Heterogeneity – what is the structure of the information we wish to discover?

• Reliability – machines, networks, and organizations are sometimes (often) flaky

• Complexity – cost vs. functionality tradeoff

Page 4: Distributed Information Discovery

Function versus cost of acceptance

[Figure: plot with axes "Function" and "Cost of acceptance"; labeled points: Metadata Harvesting, SDLIP, Z39.50]

Page 5: Distributed Information Discovery

Z39.50

http://www.loc.gov/z3950/agency/

Page 6: Distributed Information Discovery

Aims of Z39.50

• Permits one computer, the client, to search and retrieve information on another, the database server

• Important both technically and for its wide use in library systems

• Most development has concentrated on bibliographic data

• Most implementations emphasize searches that use a bibliographic set of attributes to search databases of MARC records

Page 7: Distributed Information Discovery

Technical history

Z39.50

• Developed for X.25 networks (connection-oriented); support for running over TCP was fitted later

• Original concept dates from the days (about 1980) when repeating a search was an expensive computation

• WAIS is a stateless derivative of an early version of Z39.50

Page 8: Distributed Information Discovery

Z39.50 principles

Abstract view of database searching.

• Server stores a set of databases with searchable indexes

• Interactions are based on a session

• The client opens a connection with the server, carries out a sequence of interactions and then closes the connection.

• During the course of the session, both the server and the client remember the state of their interaction.

Page 9: Distributed Information Discovery

State

Z39.50

• The server carries out the search and builds a results set.

• The server saves the results set.

• Subsequent messages from the client can reference the results set.

• Thus the client can modify a large set by increasingly precise requests, or can request a presentation of any record in the set, without searching the entire database again.
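The effect of server-side state can be illustrated with a small in-memory sketch. This is not the Z39.50 wire protocol or any real library: the class `ToySession`, its methods, and the sample records are all hypothetical names chosen for illustration. It only shows how a saved results set lets a later request refine or present records without rescanning the whole database.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToySession:
    """Hypothetical stand-in for one client/server session with saved results sets."""
    records: list                                   # the searchable database
    result_sets: dict = field(default_factory=dict)  # name -> list of record positions

    def search(self, name: str, field_name: str, term: str,
               within: Optional[str] = None) -> int:
        """Run a search, save the hits under `name`, and return only the hit count."""
        # Refining an earlier results set scans that set only, not the whole database.
        if within is not None:
            candidates = self.result_sets[within]
        else:
            candidates = range(len(self.records))
        hits = [i for i in candidates
                if term in self.records[i][field_name].lower()]
        self.result_sets[name] = hits
        return len(hits)

    def present(self, name: str, start: int, count: int) -> list:
        """Return `count` records from a saved results set, from 1-based `start`."""
        positions = self.result_sets[name][start - 1:start - 1 + count]
        return [self.records[i] for i in positions]


# Made-up sample data for the demonstration.
session = ToySession(records=[
    {"title": "Evangeline", "author": "Longfellow"},
    {"title": "Hiawatha", "author": "Longfellow"},
    {"title": "Walden", "author": "Thoreau"},
])
broad = session.search("by_author", "author", "longfellow")
narrow = session.search("narrowed", "title", "evangeline", within="by_author")
first = session.present("narrowed", start=1, count=1)
```

The client sees only hit counts until it asks for records, which mirrors the split between searching and presenting.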

Page 10: Distributed Information Discovery

Z39.50 services

init -- client connects to the server and exchanges initial information, e.g., preferred message size

explain -- client inquires of the server what databases are available for searching, the fields that are available, the syntax and formats supported, and other options

search -- client presents a query to a database

• choices of syntax for specifying searches

• only Boolean queries widely implemented

• one or more records may be returned to the client

Page 11: Distributed Information Discovery

Z39.50 services

manipulation of results sets -- e.g., sort or delete

present -- requests the server to send specified records from the results set to the client in a specified format

• options:
– for controlling content and formats
– for managing large records or large results sets

Page 12: Distributed Information Discovery

Sample query

In the database named "Books" find all records for which the access point title contains the value "evangeline" and the access point author contains the value "longfellow".

Z39.50 defines a rich variety of search access points that can be extended by implementers
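The sample query is a Boolean AND of two access-point terms. A minimal sketch of that structure follows; the classes `Term` and `And`, the function `matches`, and the sample records are hypothetical illustrations, not Z39.50 query syntax or attribute sets.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Term:
    access_point: str  # e.g. "title" or "author"
    value: str         # value the access point must contain


@dataclass(frozen=True)
class And:
    left: "Query"
    right: "Query"


Query = Union[Term, And]


def matches(query: Query, record: dict) -> bool:
    """Evaluate a Boolean query tree against one record (case-insensitive contains)."""
    if isinstance(query, And):
        return matches(query.left, record) and matches(query.right, record)
    return query.value in record.get(query.access_point, "").lower()


# Made-up contents for the database named "Books".
books = [
    {"title": "Evangeline", "author": "Henry Wadsworth Longfellow"},
    {"title": "The Song of Hiawatha", "author": "Henry Wadsworth Longfellow"},
    {"title": "Evangeline's Garden", "author": "A. N. Other"},
]

query = And(Term("title", "evangeline"), Term("author", "longfellow"))
hits = [record for record in books if matches(query, record)]
```

Other Boolean operators could be added as further node types in the same way.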

Page 13: Distributed Information Discovery

Simple Digital Library Interoperability Protocol

http://www-diglib.stanford.edu/~testbed/doc2/SDLIP/

Page 14: Distributed Information Discovery

SDLIP

• Compromise between a full-scale, all-encompassing search middleware design such as Z39.50 and the “anything goes” approach typical of ad-hoc search interface design on the web

• Developed jointly by Stanford, Berkeley, and UC Santa Barbara

• Heavily influenced by DASL from IETF

Page 15: Distributed Information Discovery

SDLIP – search middleware

Page 16: Distributed Information Discovery

Managing complexity through separate interfaces

                                                          

Page 17: Distributed Information Discovery

SDLIP Interfaces

• Search Interface – defines a simple query language; the protocol can then include other languages

• Result Interface – parking meter metaphor supports varying notions of results sets

• Source Metadata Interface – provides an extension mechanism through discovery of server capabilities

Page 18: Distributed Information Discovery

Open Archives Initiative Metadata Harvesting Protocol

http://www.openarchives.org

Page 19: Distributed Information Discovery

OAI Metadata Harvesting Protocol

• Low-barrier framework for repository interoperability

• Minimal burden for data providers

• Plug-in concept to allow community and service specialization

Page 20: Distributed Information Discovery

Metadata Harvesting

[Figure: diagram with five boxes labeled "e-print" and one label "metadata"]

Page 21: Distributed Information Discovery

Metadata Harvesting

[Figure: diagram with five boxes labeled "e-print" and one label "metadata" listing Author, Title, Abstract, Identifier]

Page 22: Distributed Information Discovery

OAI core concepts

• low-barrier interoperability

• data-provider & service-provider model

• metadata harvesting model

• shared metadata format and parallel, community-specific metadata formats

[Figure labels: OAI 1.0 protocol; Dublin Core; Community specific; HTTP based; Reply: XML Schema, self contained]
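Because the protocol is HTTP based, a harvest request is just a URL with a few parameters. The sketch below builds one without making any network call. The verb `ListRecords`, the parameters `metadataPrefix` and `from`, and the prefix `oai_dc` for Dublin Core are written as I recall them from the published OAI harvesting protocol and should be checked against the protocol version in use; the base URL and the function name `list_records_url` are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None) -> str:
    """Build a harvesting request URL for a data provider (no request is sent)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Ask only for records added or changed since this date.
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical data provider address.
url = list_records_url("http://archive.example.org/oai", from_date="2001-01-01")
```

A service provider would fetch this URL on a schedule and parse the XML reply, which keeps the burden on the data provider small.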

Page 23: Distributed Information Discovery

Some thoughts

• There is not (and never will be) one right solution (technical vs. cost vs. complexity vs. ??)

• Distributed technical solutions have organizational ramifications

• Distributed resource discovery (as with any distributed computer solution) entails various tradeoffs

Page 24: Distributed Information Discovery

Distributed Searching Issues

Global Distribution

Page 25: Distributed Information Discovery

Broadcast Distributed Search

Page 26: Distributed Information Discovery

Backup Index server

• replicates all query servers

• used when primary is down

[Figure: backup index]

Page 27: Distributed Information Discovery

Deploying Collection Globally

• Internet connectivity varies considerably

• Good connectivity between nodes often does not correspond to geographic proximity

• Connectivity Region – a group of nodes on the network that have good connectivity among themselves, relative to nodes outside the region.

Page 28: Distributed Information Discovery

Connectivity Regions

• When possible, route queries within the region

• In case of failure, use an alternate either within the region or in a “nearby” region
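The two rules above amount to an ordered search over candidate servers. A minimal sketch, assuming each region has a list of servers and a hand-maintained list of "nearby" regions; the function `pick_server`, the region names, and the server names are all hypothetical.

```python
from typing import Callable, Optional


def pick_server(client_region: str,
                servers: dict,
                nearby: dict,
                is_up: Callable[[str], bool]) -> Optional[str]:
    """Prefer a live server in the client's region, then fall back to nearby regions."""
    # The client's own region is tried first, then nearby regions in preference order.
    for region in [client_region, *nearby.get(client_region, [])]:
        for server in servers.get(region, []):
            if is_up(server):
                return server
    return None  # nothing reachable in the region or its neighbours


# Made-up deployment: two servers in one region, one in a nearby region.
servers = {"region-a": ["a1", "a2"], "region-b": ["b1"]}
nearby = {"region-a": ["region-b"]}

normal = pick_server("region-a", servers, nearby, lambda s: True)
one_down = pick_server("region-a", servers, nearby, lambda s: s != "a1")
region_down = pick_server("region-a", servers, nearby, lambda s: s == "b1")
```

How "nearby" is measured is left open here; the slides define regions by connectivity rather than geography.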

Page 29: Distributed Information Discovery

Distributed Searching Issues

Query Routing

Page 30: Distributed Information Discovery

Routing Problem: Disjoint Indexes

Query: author=Hopcroft?

Content Summary
Hopcroft: I1, I3
Hartmanis: I3
Tarjan: I1, I2
Wilensky: I2

Indexes
I1: Hopcroft doc8; Tarjan doc9
I2: Tarjan doc6; Wilensky doc7
I3: Hopcroft doc1, doc2; Hartmanis doc3, doc4

Query routed to: I1, I3
Results: doc8 (from I1); doc1, doc2 (from I3)
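The content summary works as a routing table: it names the indexes worth querying, so the query skips the rest. A minimal sketch using the data from this slide; the function name `route_author_query` and the dictionary layout are assumptions made for illustration.

```python
# Content summary: author -> indexes that hold entries for that author.
content_summary = {
    "Hopcroft": ["I1", "I3"],
    "Hartmanis": ["I3"],
    "Tarjan": ["I1", "I2"],
    "Wilensky": ["I2"],
}

# The three disjoint indexes: author -> documents.
indexes = {
    "I1": {"Hopcroft": ["doc8"], "Tarjan": ["doc9"]},
    "I2": {"Tarjan": ["doc6"], "Wilensky": ["doc7"]},
    "I3": {"Hopcroft": ["doc1", "doc2"], "Hartmanis": ["doc3", "doc4"]},
}


def route_author_query(author: str) -> dict:
    """Send the query only to indexes the content summary lists for this author."""
    targets = content_summary.get(author, [])
    return {name: indexes[name].get(author, []) for name in targets}


hopcroft = route_author_query("Hopcroft")
```

For author=Hopcroft the query goes to I1 and I3 only, and I2 is never contacted.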

Page 31: Distributed Information Discovery

Routing Problem: Replicated Distributed Indexes

Query: author=Hopcroft?

[Figure: each index appears twice, as a replica]
• Hopcroft doc8; Tarjan doc9
• Tarjan doc6; Wilensky doc7

Page 32: Distributed Information Discovery

Routing Issues

• Choice of primary?, secondary?, etc.

• Fault-tolerance

• Routing Factors
– Performance-based
– Freshness-based
– Cost-based
– weighted mix based on user preference
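A weighted mix can be sketched as a penalty score per replica, with the user's preferences as weights. Everything here is an assumption for illustration: the function names, the replica names, and the idea that each factor has already been normalized to a 0 to 1 penalty where lower is better.

```python
def routing_score(penalties: dict, weights: dict) -> float:
    """Weighted sum of per-factor penalties; lower means a better choice."""
    return sum(weights[factor] * penalties[factor] for factor in weights)


def choose_replica(replicas: dict, weights: dict) -> str:
    """Pick the replica with the lowest weighted penalty."""
    return min(replicas, key=lambda name: routing_score(replicas[name], weights))


# Made-up, pre-normalized penalties for two replicas of the same index.
replicas = {
    "primary": {"performance": 0.2, "freshness": 0.9, "cost": 1.0},
    "mirror": {"performance": 0.8, "freshness": 0.1, "cost": 0.1},
}

speed_first = choose_replica(
    replicas, {"performance": 1.0, "freshness": 0.0, "cost": 0.0})
balanced = choose_replica(
    replicas, {"performance": 0.4, "freshness": 0.3, "cost": 0.3})
```

Sorting the replicas by score instead of taking the minimum would also give an order for primary, secondary, and so on.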

Page 33: Distributed Information Discovery

Components of Replicated Routing Problem

• Metadata Issue: metadata made available by indexer to aid in routing

• Metadata Distribution Issue: topology of metadata repositories

• Decision Issue: routing decision algorithms

• Fault-tolerance: use of backup indexers

Page 34: Distributed Information Discovery

Distributed Metadata for Query Routing

[Figure: central metadata store]

Page 35: Distributed Information Discovery

Performance-based Routing

[Figure: timed low pass filter; labels: present-T, average response time, predicted response time]

New = low pass filter(T, actual response time, old)
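The slide does not give the form of the filter. One common choice is exponential smoothing in which the weight of a new sample grows with the time elapsed since the last update, relative to the window T. The sketch below assumes that form; the extra `elapsed` argument and the exponential weighting are assumptions, not taken from the slide.

```python
import math


def low_pass_filter(T: float, actual: float, old: float, elapsed: float) -> float:
    """Blend the old prediction with a new measurement.

    T        -- time constant of the filter, in seconds
    actual   -- newly measured response time
    old      -- previous predicted response time
    elapsed  -- seconds since the previous update (assumed extra input)
    """
    # The longer since the last update, the less the old prediction is trusted.
    alpha = 1.0 - math.exp(-elapsed / T)
    return old + alpha * (actual - old)


# Made-up numbers: the old prediction was 1.0 s, the new measurement is 2.0 s.
unchanged = low_pass_filter(60.0, 2.0, 1.0, elapsed=0.0)
blended = low_pass_filter(60.0, 2.0, 1.0, elapsed=60.0)
replaced = low_pass_filter(60.0, 2.0, 1.0, elapsed=6000.0)
```

With no time elapsed the prediction stays put, and after many multiples of T the new measurement takes over almost entirely.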