Archival Storage for Digital Libraries, Arturo Crespo and Hector Garcia-Molina, Stanford University


Post on 21-Dec-2015


1

Archival Storage for Digital Libraries

Arturo Crespo, Hector Garcia-Molina

Stanford University

2

Motivation

Digital information already lost:

– Early NASA records

– U.S. Census Information

– Toxic Waste records

Decay time for common media:

– Magnetic tapes: 10-20 years

– CD-ROM: 5-50 years

– Hard Drive: 3-5 years

Digital media become obsolete even faster than they decay

3

Preservation of Digital Objects

Data Preservation

Meaning Preservation

Our work only addresses Data Preservation

4

A Case Study: Stanford/MIT CSTR

[Figure: the Stanford and MIT sites]

CSTR scenario:

– Need for on-line access to documents

– But also for long-term archival of documents

5

Is This a Solved Database Problem?

Database systems can reliably store objects

However:

– Need same or compatible system

– Migration is problematic

Our architecture coordinates database systems; it does not replace them.

6

Contribution

An architecture and algorithms for:

– Long-term Archival Storage of Digital Objects

– Allowing on-line access to Digital Objects

– Preserving data as technology and organizations evolve

7

Key Concepts

Signatures as Object Handles

Deletions are not allowed

Reliability is achieved through Replication

Layered Architecture

Awareness Everywhere

Disposable Auxiliary Structures

8

Signatures as Object Handles

Object Handles identify objects

– Internal to the Digital Library Repository

– Users may need high level naming facilities

– Traditional approaches

Signatures:

– Checksum or CRC of the object

[Figure: signature = f(object)]
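As a minimal sketch of the idea on this slide, the handle can be computed from nothing but the object's bytes. The slide only says "checksum or CRC"; the use of SHA-256 from Python's `hashlib`, the function name `signature`, and the sample contents are assumptions made for illustration.

```python
import hashlib


def signature(data: bytes) -> str:
    """Return a handle derived only from the object's bytes.

    SHA-256 is an illustrative choice, not the function used in the talk.
    """
    return hashlib.sha256(data).hexdigest()


# Hypothetical object contents.
report = b"Stanford CSTR technical report, example contents"
handle = signature(report)

# Recomputing from the same bytes always yields the same handle.
same_handle = signature(bytes(report))
```

Because the handle depends only on the content, no central naming authority is needed to assign it.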

9

Properties of Signatures as Object handles

Each site can generate handles independently

Handles can be reconstructed from the object

Copies automatically have the same handle

Objects with different content have a high probability of having different handles

Cannot modify objects

[Figure: objects labeled with their signatures s1, s2, s1, s3, s2, and s4 = s1]
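The properties listed on this slide can be illustrated with a small content-addressed store. This is a sketch under assumptions: the class name `ContentStore`, its methods, and the choice of SHA-256 are hypothetical and not taken from the talk.

```python
import hashlib


class ContentStore:
    """Toy store in which an object's handle is the hash of its bytes."""

    def __init__(self):
        self._objects = {}  # handle -> bytes

    def put(self, data: bytes) -> str:
        handle = hashlib.sha256(data).hexdigest()
        # Storing the same content twice keeps a single entry.
        self._objects.setdefault(handle, data)
        return handle

    def get(self, handle: str) -> bytes:
        data = self._objects[handle]
        # The handle can be re-derived from the object, so damage is detectable.
        if hashlib.sha256(data).hexdigest() != handle:
            raise ValueError("object is corrupted")
        return data


stanford = ContentStore()
mit = ContentStore()

s1 = stanford.put(b"report A")           # generated at one site
s4 = mit.put(b"report A")                # generated independently at another
s2 = stanford.put(b"report A, revised")  # a change yields a new object
```

Here `s4 == s1` because the two sites stored identical content, while the revised text receives a different handle and the original object stays untouched.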

10

Signature Collisions

A very rare event if signatures are 128 bits or more.

Assumes uniform distribution of handles and objects bigger than signatures

Collection size    Probability of having collisions    Signature size

10^7               10^-9                               76 bits

10^7               10^-24                              128 bits

10^10              10^-18                              128 bits

10^10              10^-57                              256 bits
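The slide does not say how these probabilities were derived. One plausible reading, offered here only as an assumption, is the standard birthday-bound approximation p ≈ n(n-1) / 2^(b+1) for n objects and b-bit signatures drawn uniformly. The function name below is hypothetical.

```python
import math


def collision_probability(n_objects: int, signature_bits: int) -> float:
    """Birthday-bound approximation: p ~ n(n-1) / 2^(bits+1).

    Valid when the result is much smaller than 1 and handles are uniform.
    """
    return n_objects * (n_objects - 1) / 2 ** (signature_bits + 1)


# The four (collection size, signature size) pairs from the table.
cases = [(10**7, 76), (10**7, 128), (10**10, 128), (10**10, 256)]
exponents = [
    math.ceil(math.log10(collision_probability(n, bits))) for n, bits in cases
]
for (n, bits), exp in zip(cases, exponents):
    print(f"n = {n:.0e}, {bits} bits: p is below 10^{exp}")
```

Under this approximation, each computed probability falls within a factor of ten of the corresponding table entry.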

11

No Deletions

Objects are never (voluntarily) deleted

This simplifies many things:

– Distinguishes between deleted and corrupted objects (improving reliability)

– Easier recovery from failures

However, it complicates others:

– “Wasted” space

– Version management
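To see why the no-deletion rule helps reliability, consider a site that checks its contents against a list of handles it is expected to hold. In this sketch, which is an assumption and not the algorithm from the talk, any handle that is missing or no longer matches its content must be damage, since nothing is ever deleted on purpose. The names `sig` and `damaged_handles` are hypothetical.

```python
import hashlib


def sig(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def damaged_handles(local: dict, expected: set) -> list:
    """Return expected handles that are missing or corrupted locally.

    With no voluntary deletions, "missing" can only mean "lost".
    """
    damaged = []
    for handle in sorted(expected):
        data = local.get(handle)
        if data is None or sig(data) != handle:
            damaged.append(handle)
    return damaged


local = {}
a = sig(b"object A")
b = sig(b"object B")
local[a] = b"object A"
local[b] = b"object B"
expected = {a, b}  # e.g. handles reported by a replica (assumption)

healthy = damaged_handles(local, expected)   # nothing wrong yet

del local[a]                                 # simulated media failure
local[b] = b"object B with bit rot"          # simulated corruption
needs_repair = damaged_handles(local, expected)
```

If deletions were allowed, the missing object `a` would be ambiguous: it could be a failure or a legitimate removal.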

12

No Deletions

The no-deletion rule is natural in Digital Libraries

Wasted space is not critical as:

– Storage cost is low

– Only archiving library objects, not all possible data

13

Version Management

“Natural” structure for versions

[Figure: a Version Object linked to Object O1 and Object O2]

14

Versions

• How can we find the latest version?

[Figure: a Version Object with two tuples; one tuple links it to Object O1 as Version 1, the other links it to Object O2 as Version 2 (latest)]
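One way to answer the question on this slide without ever modifying a stored object is sketched below. It assumes that each tuple is itself an immutable object recording the Version Object's handle, a version number, and the member's handle; the latest version is then found by scanning the tuples. The JSON encoding, the field names, and the functions `put`, `put_tuple`, and `latest_version` are all hypothetical.

```python
import hashlib
import json

store = {}  # handle -> bytes; objects are only ever added


def put(data: bytes) -> str:
    handle = hashlib.sha256(data).hexdigest()
    store[handle] = data
    return handle


def put_tuple(version_obj: str, number: int, member: str) -> str:
    """Store an immutable tuple linking a Version Object to one version."""
    record = {"kind": "version-tuple", "of": version_obj,
              "number": number, "member": member}
    return put(json.dumps(record, sort_keys=True).encode())


def latest_version(version_obj: str):
    """Scan all objects for tuples of this Version Object; pick the highest."""
    best = None
    for data in store.values():
        try:
            rec = json.loads(data)
        except ValueError:  # not a JSON object, so not a tuple
            continue
        if (isinstance(rec, dict) and rec.get("kind") == "version-tuple"
                and rec.get("of") == version_obj):
            if best is None or rec["number"] > best["number"]:
                best = rec
    return best["member"] if best else None


v = put(b"version-object: example report")
o1 = put(b"first draft")
o2 = put(b"second draft")
put_tuple(v, 1, o1)
put_tuple(v, 2, o2)
latest = latest_version(v)
```

The full scan is slow, which is where an index over the tuples would help; such an index could be rebuilt from the objects at any time.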

15

Sets

[Figure: a Set Object with two tuples; one tuple links it to Object O1 as Member 1, the other links it to Object O2 as Member 2]

16

Reliability Service

Long-term persistence is achieved by replication

Sites establish Replication Agreements to maintain copies of objects in a given Replication Group

[Figure: the Stanford and MIT sites]

17

Reliability Service

Initial state:

[Figure: Stanford and MIT sites. One site holds Version Object V with version V1.]

18

Reliability Service

[Figure: Stanford and MIT sites. One site holds Version Object V with version V1; the other now holds a copy of the Version Object.]

19

Reliability Service

[Figure: Stanford and MIT sites. Both hold Version Object V with version V1.]

20

Reliability Service

[Figure: Stanford and MIT sites. One copy of Version Object V has versions V1 and V2; the other has versions V1 and V3.]

21

Reliability Service

[Figure: Stanford and MIT each hold Version Object V with versions V1, V2, and V3.]
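The Reliability Service slides end with both sites holding the same versions. Assuming objects are immutable and never deleted, reconciling two replicas can be pictured as a set union: each site copies whatever the other has and it lacks. The sketch below is an illustration under that assumption, not the replication algorithm from the talk; `put` and `synchronize` are hypothetical names.

```python
import hashlib


def put(site: dict, data: bytes) -> str:
    handle = hashlib.sha256(data).hexdigest()
    site[handle] = data
    return handle


def synchronize(site_a: dict, site_b: dict) -> None:
    """Copy missing objects in both directions; nothing is ever removed."""
    for handle, data in list(site_a.items()):
        site_b.setdefault(handle, data)
    for handle, data in list(site_b.items()):
        site_a.setdefault(handle, data)


stanford, mit = {}, {}

v1 = put(stanford, b"version 1")
put(mit, b"version 1")             # V1 already replicated
v2 = put(stanford, b"version 2")   # added at Stanford only
v3 = put(mit, b"version 3")        # added at MIT only

synchronize(stanford, mit)
```

Since there are no deletions or in-place updates, the two sites cannot disagree about an object's content, only about which objects each has seen so far.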

22

Layered Architecture

User Access

Security and Accounting

Import

Metadata and Indexing

Reliability

Complex Object

Identity

Object Store

Data Store

23

Awareness Everywhere

Awareness services: standing orders, subscriptions, alerts, etc.

Critical for Digital Libraries

Should be part of the interface of every layer.
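A minimal sketch of what "awareness in every layer" could look like: each layer exposes a `subscribe` call and notifies subscribers when something happens. The base class `AwareLayer`, the event name `"created"`, and the callback signature are assumptions made for illustration.

```python
import hashlib
from typing import Callable, List


class AwareLayer:
    """Base class giving any layer a subscription interface."""

    def __init__(self, name: str):
        self.name = name
        self._subscribers: List[Callable[[str, str, str], None]] = []

    def subscribe(self, callback: Callable[[str, str, str], None]) -> None:
        self._subscribers.append(callback)

    def _notify(self, event: str, handle: str) -> None:
        for callback in self._subscribers:
            callback(self.name, event, handle)


class ObjectStore(AwareLayer):
    def __init__(self):
        super().__init__("object-store")
        self._objects = {}

    def put(self, data: bytes) -> str:
        handle = hashlib.sha256(data).hexdigest()
        is_new = handle not in self._objects
        self._objects[handle] = data
        if is_new:
            self._notify("created", handle)  # alert standing orders
        return handle


alerts = []
layer = ObjectStore()
layer.subscribe(lambda name, event, handle: alerts.append((name, event, handle)))

h = layer.put(b"new technical report")
layer.put(b"new technical report")  # duplicate content: no second alert
```

Other layers, such as a reliability layer, could inherit the same interface so that clients subscribe in a uniform way.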

24

Disposable Auxiliary Structures

Auxiliary Structures can be reconstructed from the Digital Objects

– Avoid potential inconsistencies

– Easier to migrate objects

Example: Index of disk-ids to object handles

[Figure: disks D1 and D2 beneath the Identity Layer, with an Index entry mapping a Handle to disk D1]
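The index in this example is disposable because it can be recomputed by reading every object and re-deriving its handle. The sketch below assumes a hypothetical layout in which each disk is a list of raw objects and handles are SHA-256 digests; `rebuild_index` is an illustrative name.

```python
import hashlib


def rebuild_index(disks: dict) -> dict:
    """Recreate the handle -> disk-ids index from the objects alone.

    `disks` maps a disk id to the list of object bytes stored on it.
    """
    index = {}
    for disk_id, objects in disks.items():
        for data in objects:
            handle = hashlib.sha256(data).hexdigest()
            index.setdefault(handle, set()).add(disk_id)
    return index


disks = {
    "D1": [b"object one", b"object two"],
    "D2": [b"object two", b"object three"],
}

index = rebuild_index(disks)      # the index is lost or discarded...
index_again = rebuild_index(disks)  # ...and rebuilt with the same result

two = hashlib.sha256(b"object two").hexdigest()
```

Because the index carries no information that is not already in the objects, it never needs to be migrated or kept consistent across technology changes.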

25

Related Work

Other Digital Library architectures

Report of the task force on preserving digital information

Petal and Frangipani projects

COLD systems

26

Conclusions

Architecture for long-term archiving of digital objects

Allows efficient on-line access

Simple, yet powerful repository built by:

– using signatures as handles

– not allowing deletions

– having awareness services everywhere

– using only disposable auxiliary structures