Digital Library Architecture: A Service-Based Approach
Sandra Payette
Department of Computer Science
Cornell University
Mo i Rana, Norway
November 10, 1998
http://www2.cs.cornell.edu/payette/presentations/DL-architecture.ppt
Overview
• Why talk about DL architecture?
• Digital Libraries - the architectural perspective
• Review of service-based architecture
• NCSTRL - a working example
• Dienst - existing service-oriented architecture
• Cornell next generation (component-oriented)
• Conclusion
Why Talk about Digital Library Architecture?
• Web alone is not a digital library
• Commercial packages limited
– limited flexibility
– standards issues
– network-enabled applications, not DL architecture
• Must position for broader DL opportunities
Web by itself not a DL Architecture
• Documents - Files, CGI, MIME-Types
• Naming - URLs
• Document Servers - HTTP servers
• Resource Discovery - web crawlers
• Collections - web pages, ad-hoc
• IP - Access Control List, passwords, ad-hoc
WWW Infrastructure Evolving
• Resource Description Framework (RDF)
– will allow rich metadata semantics for documents
– http://www.w3.org/RDF/
• Extensible Markup Language (XML)
– will allow highly structured documents and rich linking (relationship) capabilities
– http://www.w3.org/XML/
• Uniform Resource Names (URNs)
– will allow for persistent, globally unique identifiers
But still need Digital Library Architecture
• Richer document model - digital objects
• Persistent, unique naming - URNs
• Well-defined digital library services
• Better facilities for resource discovery
• Flexible definition of collections
• Management of distributed content & services
• Rights management for intellectual property
Digital Library Interoperability
[Diagram labels: Nordic Digital Library, Cornell Digital Library]
Digital Library Architecture: Key Principles
• Open Architecture
– functionality partitioned into set of well-defined services
– services accessible via well-defined protocol
• Modularization
– promotes interoperability
– scalable to different clientele (research library, informal web)
• Federation
– enable aggregations into logical collections
• Distribution
– of content (collections) and services
– of administration and management of DL
Component-Ware Digital Libraries
[Diagram labels: Repository Services, Collection Services, Index Services, Name Service, User Interface Gateway, Digital Objects, Persistent Names]
NCSTRL: A Working Example
120+ Institutions in US, Europe, and Asia
A Globally Distributed Digital Library
NCSTRL Participants: collections federated
• 120+ institutions
– Universities/labs - research reports
– European Research Consortium for Informatics and Mathematics (ERCIM)
– Los Alamos (Physics pre-prints, ACM)
– D-Lib Magazine
• 40+ independent servers
NCSTRL: Real-world testbed for ...
• modular system based on a standard open architecture
• study of hard, real-world problems: policy issues, quality of service, federation of publishers
• creation of a self-sustaining international federated digital collection
[Diagram labels: Federation of Collections, Documents in Distributed Repositories, Multi-Format Document Model]
Dienst: NCSTRL technical base
• Implements a service-based architecture for distributed digital libraries
• Protocol and reference implementation
• Network of services
• WWW browser access
• Uniform search over distributed indexes
• Access to documents in distributed repositories
• Access to multi-formatted documents
Dienst: Service-Based Architecture
• Document model
• Naming service (CNRI’s Handle System)
• Repository service
• Indexer service
• Collection service
• User Interface service
Dienst Document Model
[Diagram labels: Handle (URN), metadata, decompositions, representations, physical, logical, underlying formats (ASCII, TIFF, PostScript)]
Dienst: Document Protocol
• Documents addressable through their URNs
• Document service requests
– get document metadata
– get document formats
– get document in format
– get document partition (page) in format
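The four document requests listed above can be sketched as a small in-memory service. This is only an illustration of the request shapes, not the Dienst protocol itself: the class names, method names, and page contents are hypothetical, and the identifier is borrowed from the further-reading link at the end of the talk.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A document addressed by its URN (hypothetical model, not Dienst's)."""
    urn: str
    metadata: dict
    # format name -> list of pages, each page stored as bytes in that format
    formats: dict = field(default_factory=dict)


class DocumentService:
    """In-memory stand-in for a repository answering the four requests."""

    def __init__(self):
        self._docs = {}

    def add(self, doc):
        self._docs[doc.urn] = doc

    def get_metadata(self, urn):
        return self._docs[urn].metadata

    def get_formats(self, urn):
        return sorted(self._docs[urn].formats)

    def get_document(self, urn, fmt):
        # The whole document is the concatenation of its pages.
        return b"".join(self._docs[urn].formats[fmt])

    def get_page(self, urn, fmt, page):
        # Pages are numbered from 1.
        return self._docs[urn].formats[fmt][page - 1]


service = DocumentService()
service.add(Document(
    urn="ncstrl.cornell/TR98-1690",
    metadata={"title": "An Infrastructure for Open-Architecture Digital Libraries"},
    formats={"ascii": [b"page one\n", b"page two\n"]},  # placeholder content
))
print(service.get_formats("ncstrl.cornell/TR98-1690"))
print(service.get_page("ncstrl.cornell/TR98-1690", "ascii", 2))
```

Every request takes the URN first, so a client never needs to know where or how the document is stored.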
Dienst 5.0 : Document Protocol
• More complex document model:
– versions
– hierarchical part specification
– binders (multi-part documents)
• “Structure” service request
– Reveal, in XML, full or collapsed structure of a document (e.g., chapters, sections, figures, etc.)
– Describe multiple views of a document (e.g., bibliography, content, thumbnails)
Dienst: Core Services
[Diagram components: WWW browser, Dienst User Interface, Index (three instances), Repository (three instances)]
[Diagram message flows:]
– send search request / receive unified hit list
– send site-specific search request / receive hit list
– send document request / receive MIME-typed document (shown twice)
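The search flow in this diagram (one request in, one unified hit list out) can be sketched as follows. This is a minimal illustration under assumptions: the `Index` class, the site names, and the document identifiers are hypothetical, and a real Dienst user interface would talk to index servers over the network rather than call them in-process.

```python
class Index:
    """Hypothetical index server: maps a query term to document identifiers."""

    def __init__(self, site, postings):
        self.site = site
        self.postings = postings  # term -> list of document ids

    def search(self, term):
        # Answer a site-specific search request with a hit list.
        return [(doc_id, self.site) for doc_id in self.postings.get(term, [])]


def unified_search(indexes, term):
    """Send the request to every index and merge the hit lists into one."""
    seen = set()
    merged = []
    for index in indexes:
        for doc_id, site in index.search(term):
            if doc_id not in seen:  # drop duplicates reported by several sites
                seen.add(doc_id)
                merged.append((doc_id, site))
    return merged


indexes = [
    Index("site-a", {"library": ["a/1", "a/2"]}),
    Index("site-b", {"library": ["b/7", "a/1"], "naming": ["b/9"]}),
]
hits = unified_search(indexes, "library")
print(hits)
```

The caller sees a single list and never learns how many index servers were consulted.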
Dienst Protocol: Building Gateways to non-Conforming Sites
FTP/HTTP “Repositories”
Standard Servers
User Interface Gateway Server
Dienst: Collection Service
Naming Service
• Documents identified by globally unique names
• Names are persistent, permanent
• Registered names resolve to specific location (URL)
Persistent Identifier (e.g., URN): cnri.dlib/april97-payette
– Naming Authority: cnri.dlib
– Item Name: april97-payette
Location (URL): http://www.somewebserver.org/somedirectory/somefile
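The idea on this slide, a persistent name that resolves to whatever the current location is, can be sketched with a lookup table. This is an illustration only and does not use the Handle System's actual interface: `NameService`, `split_name`, and the second URL (with `newdirectory`) are hypothetical.

```python
def split_name(name):
    """Split 'naming-authority/item-name' at the first slash."""
    authority, sep, item = name.partition("/")
    if not sep or not authority or not item:
        raise ValueError(f"not a valid name: {name!r}")
    return authority, item


class NameService:
    """Toy registry mapping persistent names to their current URL."""

    def __init__(self):
        self._table = {}  # (authority, item) -> current URL

    def register(self, name, url):
        self._table[split_name(name)] = url

    def resolve(self, name):
        return self._table[split_name(name)]


names = NameService()
names.register("cnri.dlib/april97-payette",
               "http://www.somewebserver.org/somedirectory/somefile")
# The content moves: only the registration changes, the name stays the same.
names.register("cnri.dlib/april97-payette",
               "http://www.somewebserver.org/newdirectory/somefile")
print(names.resolve("cnri.dlib/april97-payette"))
```

Citations that use the name keep working after the move, which is the point of separating the identifier from the location.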
Identifiers: Current Initiatives
• IETF Uniform Resource Names (URN)
– specification of URN framework
– requirements for resolution systems
– syntax definition
• Existing Systems
– CNRI’s Handle System (used by NCSTRL)
– OCLC PURLs
– DOI Initiative
Looking Ahead: Current Research at Cornell
• Digital Objects and Repository
– FEDORA
– Joint work in Interoperability with CNRI
– Access Management
• Resource Discovery
– STARTS (Cornell/Stanford collaboration)
– Intelligent Distributed Searching
• Collection Definition
Digital Object is...
recognizable by what it can do
[Diagram labels: getChapter, getPage, getTrack, getLabel, getSection, getArticle, getFrame, getLength, Structure, Mechanism, Content-Type Interfaces, Book, MARC]
What the client sees vs. What the object is
[Diagram labels: FEDORA Digital Object; DS1 (application/MARC); DS2 (application/postscript); Generic Disseminator; ListContentTypes; Book, DublinCore; Book Disseminator; DublinCore Disseminator; GetChapter, GetIndex, GetPage; Get(Book.getPage(1))]
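The split between what the client sees (content-type interfaces) and what the object is (typed data streams) can be sketched as below. This is an illustrative model, not the FEDORA interface definition: the Python class and method names are hypothetical, the stream contents are placeholders, and the Dublin Core fields are simply supplied by hand rather than derived from the MARC stream.

```python
class BookDisseminator:
    """Hypothetical Book content type: page-level access."""

    def __init__(self, pages):
        self._pages = pages

    def get_page(self, number):
        return self._pages[number - 1]  # pages are numbered from 1


class DublinCoreDisseminator:
    """Hypothetical DublinCore content type: descriptive fields."""

    def __init__(self, record):
        self._record = record

    def get_field(self, name):
        return self._record[name]


class DigitalObject:
    def __init__(self, datastreams):
        self._datastreams = datastreams  # internal: what the object is
        self._disseminators = {}         # external: what the client sees

    def attach(self, content_type, disseminator):
        self._disseminators[content_type] = disseminator

    def list_content_types(self):
        return sorted(self._disseminators)

    def get(self, content_type, method, *args):
        # Generic entry point: dispatch a request to one content type.
        return getattr(self._disseminators[content_type], method)(*args)


pages = ["page 1 text", "page 2 text"]
obj = DigitalObject({
    "DS1": ("application/MARC", b"raw MARC record"),
    "DS2": ("application/postscript", pages),
})
obj.attach("Book", BookDisseminator(pages))
obj.attach("DublinCore", DublinCoreDisseminator({"title": "Sample title"}))
print(obj.list_content_types())
print(obj.get("Book", "get_page", 1))
```

A client first asks which content types the object supports, then calls methods of one of them; it never touches the MIME-typed streams directly.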
FEDORA: Extensibility for Content Types
• Simple, familiar content types
• Complex, compound, dynamic content types
Resource Discovery
• Meta-Searching for Resource Discovery
– query multiple document sources
– choose best sources to evaluate a query
– evaluate the query at these sources
– merge the query results from these sources
• Stanford Protocol Proposal for Internet Retrieval and Search (STARTS)
– www-db.stanford.edu/~gravano/starts.html
– www.cs.cornell.edu/NCSTRL/STARTS/STARTShome.htm
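The meta-searching steps above (choose sources, evaluate, merge) can be sketched in a few lines. This is a toy illustration and not the STARTS protocol: the `Source` class, the summary score (number of documents mentioning the term), and the ranking by term count are assumptions made for the example.

```python
class Source:
    """Hypothetical document source that can summarize and search itself."""

    def __init__(self, name, docs):
        self.name = name
        self.docs = docs  # doc id -> text

    def summary_score(self, term):
        # Number of documents at this source that mention the term.
        return sum(term in text.split() for text in self.docs.values())

    def evaluate(self, term):
        # Score each matching document by how often the term occurs.
        return [(text.split().count(term), doc_id, self.name)
                for doc_id, text in self.docs.items()
                if term in text.split()]


def meta_search(sources, term, max_sources=2):
    # 1. Choose the best sources for this query.
    ranked = sorted(sources, key=lambda s: s.summary_score(term), reverse=True)
    chosen = [s for s in ranked[:max_sources] if s.summary_score(term) > 0]
    # 2. Evaluate the query at those sources only.
    results = [hit for s in chosen for hit in s.evaluate(term)]
    # 3. Merge into one list, highest score first, ties broken by document id.
    return sorted(results, key=lambda hit: (-hit[0], hit[1]))


sources = [
    Source("a", {"a1": "digital library digital", "a2": "web crawler"}),
    Source("b", {"b1": "digital object"}),
    Source("c", {"c1": "handle system"}),
]
print(meta_search(sources, "digital"))
```

Source `c` is never queried because its summary shows no match; avoiding that cost is the reason for choosing sources before evaluating the query.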
Distributed Collection Service Definition and Access
[Diagram labels: Central Collection Server, Collection Query Router (three instances), User Interface]
Intelligent routing based on regional conditions
Conclusions: Design with an Eye Toward the Future
• Know limitations of ad-hoc web development and commercial packages
• Embrace a service-based approach
– modular designs increase flexibility, extensibility, plug-in/plug-out
– well-defined services with protocols to enable federation and interoperability
– can utilize various technologies or commercial software underneath the service layers
• Watch Web developments in XML and RDF
Further reading
• Lagoze and Payette: An Infrastructure for Open-Architecture Digital Libraries. http://ncstrl.cs.cornell.edu/Dienst/UI/1.0/Display/ncstrl.cornell/TR98-1690
• Davis and Lagoze: NCSTRL: Design and Deployment of a Globally Distributed Digital Library. Draft of submission to IEEE Computer Special Issue on Digital Libraries, February 1999. http://www2.cs.cornell.edu/lagoze/papers/NCSTRL-IEEE3.doc
• Payette: Persistent Identifiers, RLG DigiNews. http://www.rlg.org/preserv/diginews/diginews22.html
• Payette and Lagoze: Flexible and Extensible Digital Object and Repository Architecture (FEDORA). http://www2.cs.cornell.edu/NCSTRL/CDLRG/FEDORA.html