
1

Stanford Archival Repository Project

Brian Cooper

Arturo Crespo

Hector Garcia-Molina

Department of Computer Science

Stanford University

2

Data does not live forever

Much data is stored digitally (perhaps exclusively)

– Text

– Multimedia (images, sound, etc.)

– Scientific data

But digital storage is currently unreliable

– Magnetic tapes decay, break or lose magnetism

– Disks crash

– Buildings burn down

– Users delete data (accidentally or maliciously)

3

Data does not live forever

Digital information already lost:

– Early NASA records

– U.S. Census Information

– Toxic waste records

Decay time for common media:

– Magnetic Tapes: 10-20 years

– CD-ROM: 5-50 years

– Hard Drive: 3-5 years

4

Digital archiving

Digital archivists need:

– A reliable system to store digital data for long periods without losing it

– Convenient tools to add new data and manage data already archived

– Methods for finding the “best” configuration

» Most reliable

» Most cost effective

» Etc.

5

Archival Repository Project

Goal: Reliably archive digital information for long periods of time (decades or centuries)

– Focus on “preserving bits”

– Preserving meaning: future work

Strategy

– Replicate objects

– Automatically detect and correct errors

Our project

– Stanford Archival Vault (SAV) – reliably archives data

– InfoMonitor – automatically adds newly created data to the archive

– ArchSim – a simulation tool to model archives

6

Architecture

[Diagram, labels only: Users; Filesystem; InfoMonitor; SAV Archive with archived data (local archive); Internet; SAV Archive with archived data (remote archive)]

7

SAV architecture

[Diagram, labels only. Components: Object Store, Reliability Layer, Remote SAV Sites, Upper Layers, User Interface, Data Creation/Import. Tiers: “core” SAV components, upper layers, application/user level.]

8

Write-once repository

Deletions/modifications disallowed

– Any object deleted or modified must have been corrupted, and is replaced

Challenges

– Constructing structures of objects

» Object references constrained to point from new to old objects

– Representing modifications

» Archive new version of objects = version chain

– Finding objects

» Indexes
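The write-once rules above can be sketched in a few lines. This is a minimal in-memory illustration, not the SAV implementation: the class name `WriteOnceStore`, its methods, the use of SHA-256 as the object signature, and the convention that the first reference of a new version points at its predecessor are all assumptions made for this example.

```python
import hashlib


class WriteOnceStore:
    """Toy in-memory write-once object store (hypothetical, not the SAV API)."""

    def __init__(self):
        # signature -> (payload bytes, tuple of referenced signatures)
        self._objects = {}

    def put(self, payload, refs=()):
        refs = tuple(refs)
        # References may only point at objects that already exist (new -> old).
        for ref in refs:
            if ref not in self._objects:
                raise KeyError(f"reference to unknown object {ref}")
        digest = hashlib.sha256()
        digest.update(payload)
        for ref in refs:
            digest.update(ref.encode("ascii"))
        sig = digest.hexdigest()
        # Re-adding identical content is a no-op; nothing is ever overwritten.
        self._objects.setdefault(sig, (payload, refs))
        return sig

    def get(self, sig):
        return self._objects[sig]

    def new_version(self, payload, previous_sig):
        # A "modification" is a new object that points back at the old version.
        return self.put(payload, refs=(previous_sig,))

    def version_chain(self, sig):
        # Follow back-pointers from the newest version to the oldest.
        chain = [sig]
        while True:
            _payload, refs = self.get(chain[-1])
            if not refs:
                return chain
            chain.append(refs[0])


store = WriteOnceStore()
v1 = store.put(b"draft")
v2 = store.new_version(b"final", v1)
chain = store.version_chain(v2)  # newest first: [v2, v1]
```

Because the store exposes no update or delete operation, the old version stays readable after a new one is archived, and the new-to-old reference constraint falls out of the check in `put`.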

9

Write-once repository: Indexes

Key to performance

– Locate an object quickly using its signature, “Who points to me?” problem, etc.

Disposable indexes

– Can be rebuilt at any time from SAV objects

“Bookmarks” used to find collections of objects using indexed name
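One way to picture a disposable index is as a pure function of the stored objects, so that losing it costs only a rebuild. The sketch below answers the “Who points to me?” question; the function name and the `signature -> (payload, refs)` layout are assumptions for illustration, not the SAV data model.

```python
from collections import defaultdict


def build_backpointer_index(objects):
    """Map each signature to the set of signatures that reference it.

    `objects` maps signature -> (payload, refs); this layout is hypothetical.
    """
    index = defaultdict(set)
    for sig, (_payload, refs) in objects.items():
        for target in refs:
            index[target].add(sig)
    return dict(index)


objects = {
    "a": (b"old", ()),
    "b": (b"new", ("a",)),
    "c": (b"newer", ("a", "b")),
}
index = build_backpointer_index(objects)
```

Nothing in the index is authoritative: deleting it and calling the function again over the same objects yields the same result.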

10

Write-once repository: Indexes

[Figure, labels only: Bookmark (with well-known name); SAV]

11

Replication: Site networks

Sites form “replication agreements”

– Agree to replicate data

– Specify data to replicate in agreement

» May be a subset of all of the data in the archive

– Periodically connect and compare data, looking for errors

[Figure: example site networks, one strongly connected and one weakly connected]
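The periodic compare-and-repair step might look like the following sketch. It assumes that each object is named by a SHA-256 signature of its payload, so a site can tell on its own whether its copy is intact; that naming scheme and the function names are assumptions for this example, not the protocol SAV uses.

```python
import hashlib


def signature(payload):
    # Assumed naming scheme: an object's name is the hash of its content.
    return hashlib.sha256(payload).hexdigest()


def is_intact(site, sig):
    return sig in site and signature(site[sig]) == sig


def reconcile(site_a, site_b, agreed):
    """Repair both sites in place for the agreed signatures.

    Returns the signatures that neither site could supply intact.
    """
    lost = []
    for sig in agreed:
        ok_a = is_intact(site_a, sig)
        ok_b = is_intact(site_b, sig)
        if ok_a and not ok_b:
            site_b[sig] = site_a[sig]  # B's copy is missing or corrupted
        elif ok_b and not ok_a:
            site_a[sig] = site_b[sig]  # A's copy is missing or corrupted
        elif not ok_a and not ok_b:
            lost.append(sig)
    return lost


report, photo = b"report", b"photo"
s, t = signature(report), signature(photo)
site_a = {s: report, t: b"ph0to"}  # copy of t is corrupted at site A
site_b = {t: photo}                # copy of s is missing at site B
lost = reconcile(site_a, site_b, [s, t])
```

After the call both sites hold intact copies of both objects. An object is unrecoverable only when every replica fails at once, which is why the agreement can safely cover just a subset of the archive.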

12

Replication: Data sets

SAV replicates different data sets separately

– E.g., web pages under agreement A, Usenet articles under agreement B

– “Replication sets” should grow without human intervention

– Traverse link structure to find objects in set
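Finding the members of a replication set by traversing links is a graph reachability problem. The sketch below assumes the traversal starts from one designated object and that each object lists the signatures it references; both the starting-point convention and the `links` layout are assumptions for illustration.

```python
def replication_set(links, start):
    """Return every signature reachable from `start`.

    `links` maps signature -> list of referenced signatures (hypothetical layout).
    """
    seen = set()
    stack = [start]
    while stack:
        sig = stack.pop()
        # Skip objects already visited and references to objects not held locally.
        if sig in seen or sig not in links:
            continue
        seen.add(sig)
        stack.extend(links[sig])
    return seen


links = {
    "root": ["page1", "page2"],
    "page1": ["image"],
    "page2": ["image"],
    "image": [],
    "unrelated": [],
}
members = replication_set(links, "root")
```

Here `members` contains `root`, both pages, and the shared image, but not `unrelated`. Re-running the traversal from a newer starting object picks up whatever it links to, so the set can grow without anyone editing the agreement.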

13

Replication: Data sets

[Figure, labels only: Start traversal; SAV]

14

User interface

15

User interface

16

Object store performance

[Figure: time (seconds) versus repository size (gigabytes, 0 to 12). Series: Read from disk, Compute CRC, Build indexes.]

17

Reliability layer performance

[Figure: time (seconds) versus repository size (gigabytes, 0 to 12). Series: Compute set, Compare sets, Transfer set over network.]

18

The InfoMonitor

Goal

– Create a convenient, transparent mechanism for getting data from existing stores into the archive

Architecture

[Diagram, labels only: Users; Filesystem; InfoMonitor; SAV Archive]

19

Detecting new data

Must find and archive new data

– Filesystem will not signal data writes

– Users should not have to explicitly “check-in” data

Scanning

– Quick scan: detect changes using timestamps

– Slow scan: detect changes using file contents

Filtering

– Automatically decide what to archive

– Use filtering rules
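The two scan modes and the filtering step can be sketched as follows. This is an illustration under assumptions, not the InfoMonitor code: the function names, the use of SHA-256 digests to compare file contents, and glob-style include patterns as the filtering rules are all choices made for this example.

```python
import fnmatch
import hashlib
import os


def _walk_files(root):
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            yield os.path.join(dirpath, name)


def quick_scan(root, known_mtimes):
    """Paths whose modification timestamp differs from the recorded one."""
    return sorted(
        path for path in _walk_files(root)
        if known_mtimes.get(path) != os.stat(path).st_mtime_ns
    )


def slow_scan(root, known_digests):
    """Paths whose content digest differs from the recorded one."""
    changed = []
    for path in _walk_files(root):
        with open(path, "rb") as handle:
            digest = hashlib.sha256(handle.read()).hexdigest()
        if known_digests.get(path) != digest:
            changed.append(path)
    return sorted(changed)


def apply_filters(paths, include_patterns):
    """Keep only paths whose file name matches at least one include pattern."""
    return [
        path for path in paths
        if any(fnmatch.fnmatch(os.path.basename(path), pattern)
               for pattern in include_patterns)
    ]
```

The quick scan only calls `stat` on each file, so it is cheap but trusts the timestamps. The slow scan reads every byte, which costs more but catches changes that left the timestamp untouched. A monitor could, for example, run `apply_filters(quick_scan(root, known_mtimes), ["*.txt"])` frequently and the slow scan only occasionally.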

20

User interface

21

User interface

22

InfoMonitor performance

[Figure: total time (seconds) versus data migrated (megabytes, 0 to 3000). Series: Initial load, Slow scan, Quick scan.]

23

Designing Archival Repositories

Designer needs to answer questions like:

– What is the minimum number of copies of a document needed to ensure its preservation?

– Which is more cost-efficient: storing the information on one expensive disk with a low failure rate, or on two inexpensive disks with high failure rates?

– Are two sites enough to guarantee preservation?

– How often should we scan the repositories for errors?

– What is the mean time to failure (MTTF) of this design?
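For a rough feel of the one-disk versus two-disk question, the textbook Markov model of a mirrored pair gives a closed form. This is not ArchSim's model: it assumes exponentially distributed failures and repairs, immediate detection of failures, and independent disks, and it ignores site failures, aging, and scan intervals. With per-disk failure rate λ and repair rate μ, the mirrored pair has MTTF = (3λ + μ) / (2λ²). The function names and the sample numbers below are illustrative only.

```python
def mttf_single(disk_mttf_years):
    # One copy: the data is lost as soon as the disk fails.
    return disk_mttf_years


def mttf_mirrored(disk_mttf_years, repair_days):
    """MTTF in years of two mirrored disks with repair (textbook Markov model)."""
    lam = 1.0 / disk_mttf_years  # failures per disk per year
    mu = 365.0 / repair_days     # repairs per year
    return (3 * lam + mu) / (2 * lam ** 2)


one_good_disk = mttf_single(20)           # one disk with a 20-year MTTF
two_cheap_disks = mttf_mirrored(3, 60)    # two 3-year disks, 60-day repair
```

Under these assumptions the two cheap disks last about 31.9 years on average, which beats the single 20-year disk. The answer depends heavily on the repair time, and a simulation is needed once the simplifying assumptions are dropped.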

24

Contributions

A comprehensive model for an Archival Repository

A powerful simulation tool, ArchSim, for evaluating Archival Repositories and the available strategies

A detailed case study for a hypothetical TR Repository operated between Stanford and MIT

25

How important is having good disks?

[Figure: MTTF of the MIT/Stanford Archival System with r_sto of 60 days. X-axis: f_sto (years) at 3, 5, 10, 20. Y-axis: MTTF of system (years), 0 to 1600.]

26

Preventive maintenance

[Figure: Preventive Maintenance and Aging. X-axis: start of aging (years) at 1, 3, 5, 10, Never. Y-axis: MTTF (years), 0 to 80. Series: preventive maintenance period (years) of 1, 3, 5, 10, Never.]

27

Current and future work

New models for replication agreements and “data trading”

Archiving the World Wide Web

Modeling cost

Managing “meaning”

Security

Alternative object naming schemes

Other “upper layers,” e.g. user access, metadata, etc.

28

Conclusion

Digital librarians need tools to preserve data

Our project addresses this need

– Reliable storage: SAV

– Convenient access: InfoMonitor

– Finding the best configuration: ArchSim

More work must be done to refine these models

– More automation

– More flexibility

– Answer a wider range of design questions

29

For more information

http://www-db.stanford.edu/archivalrep

Brian Cooper: [email protected]

Arturo Crespo: [email protected]

Hector Garcia-Molina: [email protected]
