smart objects and dumb (but open!) archives

Smart Objects and Dumb (but Open!) Archives Michael L. Nelson NASA Langley Research Center & University of North Carolina [email protected] http://www.ils.unc.edu/~mln/ Cornell University CS 502 – Computing Methods for DLs Guest Lecture April 20, 2001


Page 1: Smart Objects and  Dumb (but Open!) Archives

Smart Objects and Dumb (but Open!) Archives

Michael L. Nelson
NASA Langley Research Center & University of North Carolina
[email protected]
http://www.ils.unc.edu/~mln/

Cornell University

CS 502 – Computing Methods for DLs

Guest Lecture

April 20, 2001

Page 2: Smart Objects and  Dumb (but Open!) Archives

Outline

• History / problem statement / motivation
• Buckets: smart objects
• Bucket implementation
• Smart objects, dumb archives (SODA)
• Open Archives Initiative (OAI)
• Bucket Communication Space (BCS)
• Future work
• Conclusions

Page 3: Smart Objects and  Dumb (but Open!) Archives

NASA Scientific and Technical Information

• Formal publications cover a decreasing percentage of NASA’s STI output
  – most DLs focus only on formal publications
• Informal STI is maintained only by a network of collegial distribution
  – an aging and shrinking workforce weakens this network
• Customers want much more than formal publications
  – rather than stretch the meaning of “report” or “document”, define a new object for DL transactions

Page 4: Smart Objects and  Dumb (but Open!) Archives

NASA LaRC Publications 1991-1999

[Bar chart: publications per year; vertical axis 0 to 2000]

Year          1991  1992  1993  1994  1995  1996  1997  1998  1999
Publications  1730  1736  1472  1333  1109  1053   909   875   954

Page 5: Smart Objects and  Dumb (but Open!) Archives

STI Observations

• Media formats are instantiations of a more general class of information
• Most DLs are uni-format, following the obsolete media boundaries of their non-digital predecessors
• “Separate but equal” DLs considered harmful
  – customer should not have to re-integrate what should never have been de-integrated...
  – institutional knowledge being lost because we don’t have a publishing vector established

Page 6: Smart Objects and  Dumb (but Open!) Archives

Information Lost Over Time

[Figure 7: STI Lost in Project / Archival / Reuse Process. Diagram labels: Project; manuscript, software, raw data, images; library, ftp site, filing cabinet, thrown away; User; New Project]

Page 7: Smart Objects and  Dumb (but Open!) Archives

Pyramid of Scientific and Technical Information (STI)

[Figure: pyramid of STI. Labels, top to bottom: Journal Articles; Conference Papers; Technical Reports; software, raw data, notes, video / images. Axis label: time]

Information is created in a variety of formats. Formal publications, the focus of most DL projects, are supported by a pyramid of informal information.

Page 8: Smart Objects and  Dumb (but Open!) Archives

The Tyranny of the Archive (Content is King)

The information content is more important than the systems used for its storage, management and retrieval

Objects should not be “locked” in specific DLs or archives

Page 9: Smart Objects and  Dumb (but Open!) Archives

Buckets

• Aggregation + intelligence = buckets
• metadata + data + methods = buckets
• Object-oriented, intelligent agent archival entities
• A collection of all information about a project:
  – manuscripts
  – software
  – data
  – images
  – video
  – etc.
• Customizable, heterogeneous
  – buckets can “learn”, “talk”, and “coordinate”
  – buckets control terms and conditions, display, etc. -- not the archive that holds them

Page 10: Smart Objects and  Dumb (but Open!) Archives

Design Goals

• Aggregation
  – DLs should be shielded from the transient nature of file formats
  – Prevent information hemorrhaging by archiving all data types
• Intelligence
  – Aggregation (above) implies code, why stop at passive objects? Make objects smart...
  – Bucket-bucket & bucket-tool intelligence

Page 11: Smart Objects and  Dumb (but Open!) Archives

Design Goals

• Self-Sufficiency
  – Maximum autonomy & survivability: fully self-sufficient buckets
  – Option to internally store all needed materials
• Mobility
  – Why should an information object be stuck in one place?
  – Mobility for replication, workflow, data collection

Page 12: Smart Objects and  Dumb (but Open!) Archives

Design Goals

• Heterogeneity
  – One size does not fit all...
  – Different buckets for different applications, sites, disciplines, etc.
• Archive Independence
  – Focus is on information, not yet another DL “system”
    • does not require an archive to function
  – “Work with everything; break nothing”

Page 13: Smart Objects and  Dumb (but Open!) Archives

Bucket Architecture

[Figure 8: A Typical Bucket Architecture]

A typical NASA DL bucket (other bucket types possible!):
• Access methods
• CNRI Handle (unique id)
• Terms and conditions
• Metadata (RFC 1807, Dublin Core)
• Packages inside the bucket, each holding elements:
  – manuscript: .ps, .pdf, .tex, .doc
  – software: .tar, .c, .java
  – images: .gif, .jpeg
  – data sets: .xls, .tar
  – . . .

Page 14: Smart Objects and  Dumb (but Open!) Archives

A Sample Bucket

4 packages:
- report (4 elements)
- appendix (2 elements)
- contact information (2 elements)
- translation (1 element)

Page 15: Smart Objects and  Dumb (but Open!) Archives

Another Sample Bucket

2 packages:
- pre-print (2 elements)
- pointer to SFX reference linking service for published and pre-print versions (2 elements)

This bucket display is for the Universal Preprint Service: https://ups.cs.odu.edu/

Page 16: Smart Objects and  Dumb (but Open!) Archives

Heterogeneous Buckets

• Buckets are envisioned to be locally modifiable and extensible
• There is a default set of public methods defined for buckets
  – additional methods can be locally defined
• Buckets can “learn” new methods
  – new “default” methods, or locally defined extensions
  – override default methods

Page 17: Smart Objects and  Dumb (but Open!) Archives

Bucket Messages

• Sample bucket messages:

http://home.larc.nasa.gov/~mln/bucket/
http://home.larc.nasa.gov/~mln/bucket/?method=display
  invokes the default display method

http://home.larc.nasa.gov/~mln/bucket/?method=metadata
  returns the metadata for the bucket

http://home.larc.nasa.gov/~mln/bucket/?method=display&pkg_name=report&element_name=tr1253.pdf
  displays a single element

http://home.larc.nasa.gov/~mln/bucket/?method=list_methods
  lists all the methods that this bucket implements
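The sample messages above are ordinary http GET requests whose query string names a method and its arguments. A minimal Python sketch of how a client might assemble such URLs follows; the helper name `bucket_message` is hypothetical and not part of the bucket software, and only the method and argument names shown on the slide are taken from the source.

```python
from urllib.parse import urlencode


def bucket_message(bucket_url, method=None, **arguments):
    """Build a bucket message URL (hypothetical helper, for illustration only).

    With no method, the bare bucket URL is returned, which the slide says
    invokes the default display method.
    """
    if method is None:
        return bucket_url
    # The method name goes first, followed by any method arguments.
    query = urlencode({"method": method, **arguments})
    return f"{bucket_url}?{query}"


BUCKET = "http://home.larc.nasa.gov/~mln/bucket/"

print(bucket_message(BUCKET))
print(bucket_message(BUCKET, "metadata"))
print(bucket_message(BUCKET, "display",
                     pkg_name="report", element_name="tr1253.pdf"))
```

Because every message is a plain URL, any http client (a browser, a harvester, another bucket) can talk to a bucket without special software.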

Page 18: Smart Objects and  Dumb (but Open!) Archives

Bucket Methods

supersedes Table 1 in NASA TM 1998 208419

add/delete_package              add/delete_element
add/delete/list_principal(s)    metadata/set_metadata
add/delete/list_method(s)       list_source
add/delete/list_tc              delete_bucket
get/set_state                   display
get/delete/list_log(s)          id
lint                            version/set_version
get/set_preference              pack/unpack

most methods take various arguments; see Appendix B in dissertation http://home.larc.nasa.gov/~mln/phd/

BUCKET DEMO

Page 19: Smart Objects and  Dumb (but Open!) Archives

Bucket Metadata

• Due to Dienst heritage, uses RFC-1807 format
  – this is likely to change in the future
• Metadata defines the content and appearance of the bucket
  – bibliographic and control information
• But can store any format of metadata
  – bucket does not need to “understand” all formats
    • special purpose, legacy or obscure formats
      – COSATI, MARC
      – http://foo.edu/bucket-27/?method=metadata&format=cosati

Page 20: Smart Objects and  Dumb (but Open!) Archives

Current Implementation

• File system semantics:
  – 1 bucket = 1 directory
  – 1 package = 1 directory in bucket
  – 1 element = 1 file in package directory
  – index.cgi is the bucket “lid”
• http dependency for access
• index.cgi written in Perl 5.0
• Methods should not change when the implementation changes
  – still use http as transport protocol
  – Oracle, Lotus Notes implementations being developed
    • Java, PHP, Tcl, etc. implementations possible too
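The file system semantics above (bucket = directory, package = subdirectory, element = file) can be illustrated with a short Python sketch. This is not the Perl index.cgi; the function name `list_bucket` is hypothetical, and the sketch only walks a directory laid out as the slide describes.

```python
from pathlib import Path


def list_bucket(bucket_dir):
    """Map each package (a subdirectory of the bucket) to its elements (files).

    Hypothetical helper for illustration: files that sit directly in the
    bucket directory, such as index.cgi, are not packages and are skipped.
    """
    bucket = Path(bucket_dir)
    contents = {}
    for package in sorted(p for p in bucket.iterdir() if p.is_dir()):
        contents[package.name] = sorted(
            element.name for element in package.iterdir() if element.is_file()
        )
    return contents
```

Since the structure is nothing more than directories and files, a bucket can be copied or moved with ordinary file tools, which fits the mobility and self-sufficiency design goals.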

Page 21: Smart Objects and  Dumb (but Open!) Archives

Bucket Structure

[Figure: layout of a bucket]

Bucket
  index.cgi
  default bucket packages:
    _method.pkg   source files for methods
    _http.pkg     http dependency files
    _log.pkg      logs
    _tc.pkg       terms and conditions
    _md.pkg       metadata
    _state.pkg    bucket state
  sample bucket payload:
    report.pkg, appendix.pkg, software.pkg, testdata.pkg

Page 22: Smart Objects and  Dumb (but Open!) Archives

Systems Tested

Architecture   Operating System               Perl                   http server
Sparc          Solaris 2.7                    5.005_03               Apache 1.3.9
Sparc          Solaris 2.7                    5.005_03               NCSA httpd 1.5.2
Sparc          Red Hat 6.0 (Linux 2.2.5-15)   5.005_03               Apache 1.3.6
Intel x86      Windows NT 4.0 (1381 / SP 5)   Active Perl 5.005_03   Apache 1.3.12
Intel x86      Mandrake Linux 6.2             5.005_03               Apache 1.3.6
MIPS R10000    IRIX 6.5                       5.004_04               Apache 1.3.4
RS/6000        AIX 4.2                        5.002                  Apache 1.3.12
PowerPC 604    Linux 2.0.33 (MkLinux)         5.004_01               Apache 1.2.6

Page 23: Smart Objects and  Dumb (but Open!) Archives

SODA: Smart Objects, Dumb Archives

• Objects are more important than the archive that holds them
  – The object should be the authority on its contents, not an archive
• We envision a general shift of intelligence from archives to the objects themselves
  – DL protocols should find, index, and search -- not know about file formats, policy, terms and conditions, etc.

Page 24: Smart Objects and  Dumb (but Open!) Archives

Presentation Responsibility Shifts From Dienst to Buckets

[Figure 10: Buckets, Not Dienst, Handle Display in NCSTRL+]

• Dienst operation in NCSTRL: between the user and the archive, Dienst handles index holdings, search / retrieve holdings, and display holdings
• Dienst / bucket operation in NCSTRL+: Dienst handles index holdings and search / retrieve holdings; the bucket handles display holdings

Page 25: Smart Objects and  Dumb (but Open!) Archives

SODA

• Current DLs have tight integration between the data object, the archive it is in, and the interface used to access it
  – 1-1 model between DL and archive
• By decoupling these functions, we can separate their development and maintenance
  – N-M model between DLs and archives

Page 26: Smart Objects and  Dumb (but Open!) Archives

SODA

[Figure: SODA layers]

• Library users: students and educators, researchers, corporate developers, . . .
• DLSs building from archives and buckets: NASA DLS, Avionics DLS, NCSTRL, . . .
• Archives managing buckets: NASA Archive, CoRR, ACM Archive, . . .
• All known buckets (in archives and out)

Page 27: Smart Objects and  Dumb (but Open!) Archives

“Dumb Archive”

• Archives should be little more than set managers
• Several possible archive candidates
  – LDAP, Dienst, Guildford Protocol, others
• Our implementation: a “modified” bucket, DA:
  – it has all of the regular bucket methods, plus:
    • da_list - list all buckets in the archive
    • da_put - put a bucket in an archive
    • da_delete - delete a bucket from an archive
    • da_info - archive-level metadata
    • da_get - redirect to this bucket
  – all operations modulo appropriate T&C
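To make "little more than a set manager" concrete, here is a minimal in-memory Python sketch of the five DA methods listed above. It is an illustration under stated assumptions, not the DA implementation: the class name, the argument lists, and the fields returned by `da_info` are hypothetical, and the terms-and-conditions checks that the slide says apply to every operation are omitted.

```python
class DumbArchive:
    """Illustrative set manager; the real DA is a modified bucket reached over http."""

    def __init__(self, name):
        self.name = name
        self._holdings = {}  # bucket id -> bucket URL (assumed representation)

    def da_put(self, bucket_id, bucket_url):
        """Put a bucket in the archive."""
        self._holdings[bucket_id] = bucket_url

    def da_delete(self, bucket_id):
        """Delete a bucket from the archive (no error if it is absent)."""
        self._holdings.pop(bucket_id, None)

    def da_list(self):
        """List all buckets in the archive."""
        return sorted(self._holdings)

    def da_info(self):
        """Archive-level metadata (the fields here are assumptions)."""
        return {"name": self.name, "bucket_count": len(self._holdings)}

    def da_get(self, bucket_id):
        """Return the URL a client would be redirected to for this bucket."""
        return self._holdings[bucket_id]


archive = DumbArchive("example archive")
archive.da_put("bucket-27", "http://foo.edu/bucket-27/")
print(archive.da_list())
print(archive.da_get("bucket-27"))
```

Note that the archive never opens a bucket: it knows only identifiers and locations, and everything about content and display stays with the bucket.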

Page 28: Smart Objects and  Dumb (but Open!) Archives

DA Structure

[Figure: layout of a DA bucket]

Bucket
  index.cgi
  default bucket packages:
    _method.pkg   source files for methods
    _http.pkg     http dependency files
    _log.pkg      logs
    _tc.pkg       terms and conditions
    _md.pkg       metadata
    _state.pkg    bucket state
  no bucket payload
  holdings.pkg    DA data structures

• holdings.pkg package for DA
  – does not use packages/elements
    • scalability concerns
  – uses GDBM/NDBM files (hashes)
    • 1 hash per argument to da_put

Page 29: Smart Objects and  Dumb (but Open!) Archives

OAI as a “Dumb Archive”

• Originally used a separate protocol & implementation for the “dumb archive”
• Now using the metadata harvesting protocol defined by the Open Archives Initiative (OAI)
  – OAI evolved from the Universal Preprint Service (UPS)
    • http://www.dlib.org/dlib/february00/vandesompel-ups/02vandesompel-ups.html
    • http://ups.cs.odu.edu/
    • http://www.openarchives.org/
• OAI does not require smart objects, but does create a “dumb archive” layer

Page 30: Smart Objects and  Dumb (but Open!) Archives

OAI Bucket Structure

[Figure: layout of an OAI bucket]

Bucket
  index.cgi
  default bucket packages:
    _method.pkg   source files for methods
    _http.pkg     http dependency files
    _log.pkg      logs
    _tc.pkg       terms and conditions
    _md.pkg       metadata
    _state.pkg    bucket state
  bucket payload is DL specific support library:
    oai           oai.pl element is a support library that defines access for the specific DL

In addition to the ~30 bucket methods, each OAI verb is implemented as a separate method.

Page 31: Smart Objects and  Dumb (but Open!) Archives

Intelligence

• Shift of responsibility into the data objects opens up an entire new class of applications: data objects as intelligent agents
• Premise: instead of having the data objects do nothing while they patiently wait to be accessed, have them do something useful while waiting ...

Page 32: Smart Objects and  Dumb (but Open!) Archives

Bucket Communication Space

• Provides a well known, shared memory model for buckets to communicate
  – communications model: Linda (Javaspace)
• Applications:
  – Bucket matching
    • the same author (separated by publisher, time)
    • different authors (finding similar works)
  – Metadata scrubbing
  – Format translation (metadata, images, documents)
  – Bucket messaging
    • including broadcast & multicast
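The Linda model named above lets independent processes communicate through a shared space of tuples rather than by addressing each other directly. The Python sketch below illustrates that general model only; it is not the BCS implementation. The class, the `ANY` wildcard, and the tuple layout are assumptions made for the example. The method names follow Linda's non-blocking primitives (`out` writes a tuple, `rdp` reads a match, `inp` reads and removes a match).

```python
ANY = object()  # wildcard field used in patterns (an assumption of this sketch)


class TupleSpace:
    """Tiny single-process tuple space, for illustration only."""

    def __init__(self):
        self._tuples = []

    def out(self, *fields):
        """Write a tuple into the shared space."""
        self._tuples.append(tuple(fields))

    @staticmethod
    def _matches(pattern, candidate):
        # Same length, and every non-wildcard field must be equal.
        return len(pattern) == len(candidate) and all(
            p is ANY or p == c for p, c in zip(pattern, candidate)
        )

    def rdp(self, *pattern):
        """Return the first matching tuple without removing it, or None."""
        for candidate in self._tuples:
            if self._matches(pattern, candidate):
                return candidate
        return None

    def inp(self, *pattern):
        """Return and remove the first matching tuple, or None."""
        found = self.rdp(*pattern)
        if found is not None:
            self._tuples.remove(found)
        return found


space = TupleSpace()
space.out("author", "bucket-1", "Nelson")   # hypothetical tuples
space.out("author", "bucket-2", "Nelson")
print(space.rdp("author", ANY, "Nelson"))
```

In this style, a bucket looking for others by the same author does not need to know which buckets exist; it only needs a pattern to match against the shared space.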

Page 33: Smart Objects and  Dumb (but Open!) Archives

BCS Structure

[Figure: layout of a BCS bucket]

Bucket
  index.cgi
  default bucket packages:
    _method.pkg   source files for methods
    _http.pkg     http dependency files
    _log.pkg      logs
    _tc.pkg       terms and conditions
    _md.pkg       metadata
    _state.pkg    bucket state
  no bucket payload
  bcs.pkg         BCS data structures, conversion programs

• bcs.pkg package for BCS
  – uses GDBM/NDBM files (hashes) for registr
  – included programs
    • mdt (metadata conversion)
    • Image Alchemy (image conversion)

Page 34: Smart Objects and  Dumb (but Open!) Archives

BCS Methods

• bcs_list, bcs_register, bcs_unregister
  – set management
• bcs_convert_image
  – wrapper for Image Alchemy program
  – no bucket hooks in 1.6
• bcs_convert_metadata
  – wrapper for “mdt” program
  – bucket hooks in 1.6

Page 35: Smart Objects and  Dumb (but Open!) Archives

BCS Methods

• bcs_message
  – “search”, “search/replace”, “search/mesg” functionality
• bcs_similarity
  – all x all comparison
  – n x all comparison (n=1 .. all)
  – adjustable threshold for “similarity”

BCS DEMO

Page 36: Smart Objects and  Dumb (but Open!) Archives

Similarity Results from UPS

• NACA - 3036 documents
• UPS Math - 3831 documents
  – for 6867 documents, ran for 42 hours (561k comparisons / hour)
  – used default value of 0.85 for similarity
  – NACA - 159 similar documents
  – UPS Math - 35 similar documents
  – No similarity between NACA & UPS Math
• Optimizations:
  – clustering of collection
  – distributed computation of similarity matrix
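The figures above can be checked with a few lines of arithmetic. The sketch assumes that an all x all run compares each unordered pair of distinct documents exactly once; under that assumption the 42-hour run works out to roughly the 561k comparisons per hour quoted on the slide.

```python
naca_documents = 3036
ups_math_documents = 3831
documents = naca_documents + ups_math_documents  # 6867, as on the slide

# Assumption: one comparison per unordered pair of distinct documents.
comparisons = documents * (documents - 1) // 2

hours = 42
comparisons_per_hour = comparisons / hours

print(documents)
print(comparisons)
print(comparisons_per_hour)
```

The quadratic growth in comparisons is why the slide lists clustering and distributed computation as optimizations: doubling the collection roughly quadruples the work.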

Page 37: Smart Objects and  Dumb (but Open!) Archives

Future Work

• Alternate implementations for buckets
  – Java, Oracle, Python, Tcl…
• Alternate API access
  – CORBA, SOAP
• New functionality for buckets
  – Standard packages / elements for: revisions, citations, checksums

Page 38: Smart Objects and  Dumb (but Open!) Archives

Future Work

• Security, authentication, T&C
  – investigate X.509, Kerberos, MD5
  – formalize ACLs
• Specialized buckets
  – discipline- or data-specific buckets
  – computational buckets
    • software reuse, RPC-like support
• Reduce the centralization of the BCS
  – cf. Berkeley’s xFS – serverless file system
    • http://now.cs.berkeley.edu/Xfs/xfs.html
• Passive -> Active objects
  – e.g., LANL’s Active Recommendation Project
    • http://www.c3.lanl.gov/~rocha/lww/

Page 39: Smart Objects and  Dumb (but Open!) Archives

Impact

• SODA
  – significant immediate interoperability benefits
  – frees the object from the tyranny of the archive
• Bucket aggregation: evolutionary concept
  – benefit begins immediately, continues indefinitely
  – no more information hemorrhaging
• Bucket intelligence: revolutionary concept
  – benefit is mid- to long-term
  – full impact unknown; a flexible framework will allow others to innovate
  – make archived objects active, not passive

Page 40: Smart Objects and  Dumb (but Open!) Archives

http://dlib.cs.odu.edu/

if bucket software doesn’t work out, we’ll market products with Phil’s likeness

thanks to Rod Waid for Phil

Page 41: Smart Objects and  Dumb (but Open!) Archives

Emergency backup slides...

Page 42: Smart Objects and  Dumb (but Open!) Archives

Why Digital Libraries?

• “Why not just use the WWW?”
  – WWW by itself has low archival & management characteristics
• “Why not use a RDBMS?”
  – In the same way that a card catalog is not a TL, a RDBMS is not a DL; it is a candidate technology for use in DLs
• DL is the union of the content and services defined on the content

[Figure: components of a digital library]

• Access: WWW (http) access (most common); non-WWW access (now uncommon)
• Digital library services (searching, browsing, citation analysis, usage analysis, alerts)
• Technologies: vector and/or Boolean search engines (traditional IR); RDBMS; file systems; other technologies
• Content

digital library = collection of information both digitized and organized -- M. Lesk, 1997

Page 43: Smart Objects and  Dumb (but Open!) Archives

Digital Libraries?

• Ultimately, the product of a research institution is information
  – information objects (generally publications) are frequently the only tangible measure of research output

(compressing an entire body of literature):

• Traditional libraries (TLs) are expensive, and less and less information is being archived by fewer and fewer TLs

Page 44: Smart Objects and  Dumb (but Open!) Archives

TLs vs. DLs

• DLs clearly better than TLs at:
  – Dissemination, storing information variety
• However, TL objects are more survivable
  – Who will archive the research information?
    • the publishers?
    • the institutions?
    • the authors?
  – Will the average DL object still be accessible in 10 years?

Page 45: Smart Objects and  Dumb (but Open!) Archives

Cosine Correlation With Frequency Term Weighting

$$
\mathrm{similarity}(d_j, d_k) =
\frac{\sum_{i=1}^{n} \left( td_{ij} \times td_{ik} \right)}
     {\sqrt{\sum_{i=1}^{n} td_{ij}^{2} \times \sum_{i=1}^{n} td_{ik}^{2}}}
$$

where
  td_ij = the ith term in the vector for document j
  td_ik = the ith term in the vector for document k
  n = the number of unique terms in the data set

Adapted from Harman (1992), originally from Salton & Lesk (1968)
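A minimal Python sketch of cosine correlation with raw term-frequency weights follows. It is an illustration, not the bcs_similarity code: the whitespace tokenizer, the lowercasing, and the function names are assumptions, and a real system would likely add stop-word removal or stemming.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Term-frequency vector for a document (naive whitespace tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(doc_j, doc_k):
    """Cosine correlation between two documents using raw term frequencies."""
    td_j = term_frequencies(doc_j)
    td_k = term_frequencies(doc_k)
    # Only terms present in both documents contribute to the numerator.
    numerator = sum(td_j[term] * td_k[term] for term in td_j.keys() & td_k.keys())
    denominator = math.sqrt(
        sum(weight * weight for weight in td_j.values())
        * sum(weight * weight for weight in td_k.values())
    )
    return numerator / denominator if denominator else 0.0


print(cosine_similarity("wing flutter analysis", "wing flutter test"))
```

The score ranges from 0 (no shared terms) to 1 (identical term distributions), which is what makes a single adjustable threshold workable.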

Page 46: Smart Objects and  Dumb (but Open!) Archives

Similarity Matrix

        id-1   id-2   id-3   id-4   …      id-n
id-1    1      0.298  0.783  0.267  …      0.459
id-2           1      0.976  0.732  …      0.432
id-3                  1      0.868  …      0.291
id-4                         1      …      0.870
…                                   1      0.904
id-n                                       1

Entries below the diagonal are not computed (same as above the diagonal).
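Because the matrix is symmetric, only the entries above the diagonal need to be computed. The Python sketch below shows one way to visit each pair once and keep those at or above a threshold. The function name is hypothetical, the use of ">=" for the threshold test is an assumption, and the scores are the sample values from the matrix above.

```python
from itertools import combinations


def similar_pairs(ids, similarity, threshold=0.85):
    """Return (id_a, id_b, score) for each pair scoring at or above threshold.

    combinations() yields every unordered pair exactly once, which
    corresponds to the entries above the diagonal of the matrix.
    """
    pairs = []
    for id_a, id_b in combinations(ids, 2):
        score = similarity(id_a, id_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


# Sample scores taken from the matrix above (first four documents only).
scores = {
    ("id-1", "id-2"): 0.298, ("id-1", "id-3"): 0.783, ("id-1", "id-4"): 0.267,
    ("id-2", "id-3"): 0.976, ("id-2", "id-4"): 0.732, ("id-3", "id-4"): 0.868,
}
ids = ["id-1", "id-2", "id-3", "id-4"]
print(similar_pairs(ids, lambda a, b: scores[(a, b)]))
```

With the default threshold of 0.85, only the id-2 / id-3 and id-3 / id-4 pairs from the sample values qualify.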