1 Stanford Archival Repository Project Brian Cooper Arturo Crespo Hector Garcia-Molina Department of Computer Science Stanford University

Post on 22-Dec-2015

Stanford Archival Repository Project

Brian Cooper

Arturo Crespo

Hector Garcia-Molina

Department of Computer Science

Stanford University

Data does not live forever

Much data is stored digitally (perhaps exclusively)

– Text

– Multimedia (images, sound, etc.)

– Scientific data

But digital storage is currently unreliable

– Magnetic tapes decay, break or lose magnetism

– Disks crash

– Buildings burn down

– Users delete data (accidentally or maliciously)

Data does not live forever

Digital information already lost:

– Early NASA records

– U.S. Census Information

– Toxic waste records

Decay time for common media:

– Magnetic Tapes: 10-20 years

– CD-ROM: 5-50 years

– Hard Drive: 3-5 years

Digital archiving

Digital archivists need:

– A reliable system to store digital data for long periods without losing it

– Convenient tools to add new data and manage data already archived

– Methods for finding the “best” configuration

» Most reliable

» Most cost effective

» Etc.

Archival Repository Project

Goal: Reliably archive digital information for long periods of time (decades or centuries)

– Focus on “preserving bits”

– Preserving meaning: future work

Strategy

– Replicate objects

– Automatically detect and correct errors

Our project

– Stanford Archival Vault (SAV): reliably archives data

– InfoMonitor: automatically adds newly created data to the archive

– ArchSim: a simulation tool to model archives

Architecture

[Figure: architecture diagram. Labels: Users, Filesystem, InfoMonitor, SAV Archive (local archive and remote archive), Archived data, Internet.]

SAV architecture

[Figure: layered SAV architecture diagram. Component labels: Object Store, Reliability Layer, Remote SAV Sites, Upper Layers, User Interface, Data Creation/Import. Tier labels: “Core” SAV components, Upper layers, Application/user level.]

Write-once repository

Deletions/modifications disallowed

– Any object deleted or modified must have been corrupted, and is replaced

Challenges

– Constructing structures of objects

» Object references constrained to point from new to old objects

– Representing modifications

» Archive each new version of an object, forming a version chain

– Finding objects

» Indexes
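
A minimal sketch of the write-once rules above, in Python. It is an illustration, not SAV's implementation: the class and method names (`WriteOnceStore`, `put`, `history`) are hypothetical, SHA-256 stands in for whatever signature SAV computes, and the convention that a new version lists its predecessor as its first reference is an assumption.

```python
import hashlib


class WriteOnceStore:
    """Toy in-memory store: objects can be added but never changed or removed."""

    def __init__(self):
        # signature -> (data, refs); hypothetical layout, not SAV's.
        self._objects = {}

    @staticmethod
    def signature(data, refs):
        # Assumption: SHA-256 over the references and the payload.
        return hashlib.sha256(repr((tuple(refs), data)).encode()).hexdigest()

    def put(self, data, refs=()):
        refs = tuple(refs)
        # References may only point at objects that already exist (new -> old).
        for ref in refs:
            if ref not in self._objects:
                raise KeyError(f"reference to unknown object {ref}")
        sig = self.signature(data, refs)
        # Re-adding identical content is a no-op; nothing is overwritten.
        self._objects.setdefault(sig, (data, refs))
        return sig

    def get(self, sig):
        return self._objects[sig][0]

    def history(self, sig):
        """Walk a version chain from newest to oldest via the first reference."""
        versions = []
        while sig is not None:
            data, refs = self._objects[sig]
            versions.append(data)
            sig = refs[0] if refs else None
        return versions


store = WriteOnceStore()
v1 = store.put(b"draft")
v2 = store.put(b"final", refs=[v1])  # the new version points back at the old one
print(store.history(v2))
```

Because a reference must name an object that is already stored, and an object's signature depends on its references, cycles cannot be built in this sketch; that mirrors the "new to old" constraint on the slide.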

Write-once repository: Indexes

Key to performance

– Locate an object quickly using its signature, solve the “Who points to me?” problem, etc.

Disposable indexes

– Can be rebuilt at any time from SAV objects

“Bookmarks” are used to find collections of objects by indexed name
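
The idea of a disposable index can be sketched as a pure function of the stored objects: if the index is lost, it is simply recomputed. The object layout below (a dict with `name` and `refs` fields) and the function name `build_indexes` are assumptions made for illustration; the slides do not describe SAV's actual index structures.

```python
from collections import defaultdict

# Hypothetical stored objects: signature -> {"name": optional bookmark name,
# "refs": signatures of the older objects this one points to}.
objects = {
    "sig-a": {"name": None, "refs": []},
    "sig-b": {"name": None, "refs": ["sig-a"]},
    "sig-bm": {"name": "tech-reports", "refs": ["sig-b"]},
}


def build_indexes(objects):
    """Rebuild both indexes from scratch using only the stored objects."""
    pointed_by = defaultdict(set)  # answers "who points to me?"
    bookmarks = {}                 # well-known name -> bookmark signature
    for sig, obj in objects.items():
        for target in obj["refs"]:
            pointed_by[target].add(sig)
        if obj.get("name"):
            bookmarks[obj["name"]] = sig
    return dict(pointed_by), bookmarks


pointed_by, bookmarks = build_indexes(objects)
print(pointed_by["sig-a"])        # objects that reference sig-a
print(bookmarks["tech-reports"])  # entry point for a named collection
```

Since nothing in the indexes is authoritative, they can live on less reliable storage than the objects themselves.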

Write-once repository: Indexes

[Figure: diagram labeled “Bookmark (with well-known name)” and “SAV”.]

Replication: Site networks

Sites form “replication agreements”

– Agree to replicate data

– Specify data to replicate in agreement

» May be a subset of all of the data in the archive

– Periodically connect and compare data, looking for errors

[Figure: two example site networks, labeled “Strongly connected” and “Weakly connected”.]
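
The periodic "connect and compare" step can be sketched as follows. This is an assumption-laden illustration rather than SAV's protocol: each site is modeled as a dict keyed by a SHA-256 signature of the content, and the function names `audit` and `reconcile` are hypothetical. In a write-once archive, any stored object that no longer matches its signature must be corrupt, so it can be replaced from a healthy replica.

```python
import hashlib


def sig(data):
    # Assumption: the signature is a SHA-256 digest of the content.
    return hashlib.sha256(data).hexdigest()


def audit(site):
    """Return signatures whose stored bytes no longer hash to their key."""
    return {s for s, data in site.items() if sig(data) != s}


def reconcile(local, remote):
    """Repair missing or corrupt objects on either site from the other.

    Returns (repaired, lost): signatures fixed, and signatures for which
    neither site holds an intact copy.
    """
    bad_local, bad_remote = audit(local), audit(remote)
    repaired, lost = set(), set()
    for s in set(local) | set(remote):
        local_ok = s in local and s not in bad_local
        remote_ok = s in remote and s not in bad_remote
        if local_ok and not remote_ok:
            remote[s] = local[s]
            repaired.add(s)
        elif remote_ok and not local_ok:
            local[s] = remote[s]
            repaired.add(s)
        elif not local_ok and not remote_ok:
            lost.add(s)
    return repaired, lost


a, b, c = b"alpha", b"beta", b"gamma"
local = {sig(a): a, sig(b): b"bXta"}        # b is corrupted, c is missing
remote = {sig(a): a, sig(b): b, sig(c): c}
repaired, lost = reconcile(local, remote)
print(len(repaired), len(lost), local == remote)
```

A real agreement would compare only the agreed subset of the archive and would exchange signatures rather than whole objects; both refinements are omitted here.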

Replication: Data sets

SAV replicates different data sets separately

– E.g., web pages under agreement A, Usenet articles under agreement B

– “Replication sets” should grow without human intervention

– Traverse link structure to find objects in set
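
Finding the members of a replication set by traversing the link structure can be sketched as a breadth-first walk over object references. The graph, the object names, and the choice of a bookmark as the starting point are all assumptions made for the example.

```python
from collections import deque


def replication_set(refs, start):
    """Collect every object reachable from `start` by following references.

    `refs` maps an object's signature to the signatures it points to.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in refs.get(current, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical archive: a web collection and an unrelated Usenet article.
refs = {
    "web-bookmark": ["page-3"],
    "page-3": ["page-2", "logo"],
    "page-2": ["page-1"],
    "page-1": [],
    "logo": [],
    "usenet-1": [],
}
print(sorted(replication_set(refs, "web-bookmark")))
```

Only objects reachable from the starting point are included, so data covered by a different agreement (here `usenet-1`) stays out of the set.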

Replication: Data sets

[Figure: diagram labeled “Start traversal” and “SAV”.]

User interface

Object store performance

[Figure: time (seconds, axis 0 to 3000) versus repository size (gigabytes, axis 0 to 12). Series: Read from disk, Compute CRC, Build indexes.]

Reliability layer performance

[Figure: time (seconds, axis 0 to 14) versus repository size (gigabytes, axis 0 to 12). Series: Compute set, Compare sets, Transfer set over network.]

The InfoMonitor

Goal

– Create a convenient, transparent mechanism for getting data from existing stores into the archive

Architecture

[Figure: diagram with labels Users, Filesystem, InfoMonitor, SAV Archive.]

Detecting new data

Must find and archive new data

– Filesystem will not signal data writes

– Users should not have to explicitly “check-in” data

Scanning

– Quick scan: detect changes using timestamps

– Slow scan: detect changes using file contents

Filtering

– Automatically decide what to archive

– Use filtering rules
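
The two scan modes and the filtering step can be sketched with the Python standard library. This is not InfoMonitor's code: the function names, the in-memory `seen` dictionaries, the use of SHA-256 for content comparison, and the glob-style rule format are all assumptions. The sketch shows why both scans are useful: a timestamp check is cheap but misses a change that leaves the modification time untouched, while a content check catches it at the cost of reading every file.

```python
import fnmatch
import hashlib
import os


def _files(root):
    for dirpath, _dirs, names in os.walk(root):
        for name in sorted(names):
            yield os.path.join(dirpath, name)


def mtime_fingerprint(path):
    # Quick scan: only the modification timestamp is consulted.
    return os.stat(path).st_mtime_ns


def content_fingerprint(path):
    # Slow scan: the whole file is read and hashed.
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan(root, seen, fingerprint):
    """Return paths whose fingerprint differs from the one recorded in `seen`."""
    changed = []
    for path in _files(root):
        current = fingerprint(path)
        if seen.get(path) != current:
            seen[path] = current
            changed.append(path)
    return changed


def should_archive(path, rules, default=True):
    """Apply (glob pattern, archive?) rules in order; the first match wins."""
    for pattern, archive in rules:
        if fnmatch.fnmatch(os.path.basename(path), pattern):
            return archive
    return default
```

A possible driver would run `scan(root, seen, mtime_fingerprint)` frequently and `scan(root, seen_contents, content_fingerprint)` occasionally, passing each changed path through `should_archive` with rules such as `[("*.tmp", False)]` before handing it to the archive.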

User interface

InfoMonitor performance

[Figure: total time (seconds, axis 0 to 16000) versus data migrated (megabytes, axis 0 to 3000). Series: Initial load, Slow scan, Quick scan.]

Designing Archival Repositories

Designer needs to answer questions like:

– What is the minimum number of copies of a document needed to ensure its preservation?

– Which is more cost-efficient: storing the information on one expensive disk with a low failure rate, or on two inexpensive disks with high failure rates?

– Are two sites enough to guarantee preservation?

– How often should we scan the repositories for errors?

– What is the mean time to failure (MTTF) of this design?
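
Questions like these can be explored with a small Monte Carlo simulation. The sketch below is a toy model under stated assumptions, not ArchSim: it assumes exactly two copies, independent exponentially distributed failures with the same mean for each copy, and a fixed delay to detect and repair a failed copy. The function names and all parameter values are hypothetical. Data is counted as lost when the surviving copy fails before the repair completes.

```python
import random


def time_to_loss(copy_mttf, repair_time, rng):
    """Simulate years until both copies are failed at the same time."""
    elapsed = 0.0
    while True:
        # Both copies healthy: the first failure arrives at rate 2 / copy_mttf.
        elapsed += rng.expovariate(2.0 / copy_mttf)
        # One copy is down for repair_time years; does the survivor fail first?
        survivor_fails_in = rng.expovariate(1.0 / copy_mttf)
        if survivor_fails_in < repair_time:
            return elapsed + survivor_fails_in
        # Repair finished in time; both copies are healthy again.
        elapsed += repair_time


def estimate_mttf(copy_mttf, repair_time, trials=2000, seed=0):
    """Average simulated time to data loss over many independent runs."""
    rng = random.Random(seed)
    total = sum(time_to_loss(copy_mttf, repair_time, rng) for _ in range(trials))
    return total / trials


# Illustrative values only: a 60-day repair delay, several per-copy MTTFs.
for copy_mttf in (3, 5, 10, 20):
    estimate = estimate_mttf(copy_mttf, repair_time=60 / 365)
    print(f"per-copy MTTF {copy_mttf:>2} years -> system MTTF ~{estimate:.0f} years")
```

Restarting the clock after each repair is valid here only because exponential failures are memoryless; a model with aging media, which the deck also considers, would need to track each copy's age.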

Contributions

A comprehensive model for an archival repository

A powerful simulation tool, ArchSim, for evaluating archival repositories and the available strategies

A detailed case study for a hypothetical TR repository operated between Stanford and MIT

How important is having good disks?

[Figure: “MTTF of the MIT/Stanford Archival System with r_sto of 60 days”. MTTF of system (years, axis 0 to 1600) versus f_sto (years: 3, 5, 10, 20).]

Preventive maintenance

[Figure: “Preventive Maintenance and Aging”. MTTF (years, axis 0 to 80) versus start of aging (years: 1, 3, 5, 10, Never), with one series per preventive maintenance period (years: 1, 3, 5, 10, Never).]

Current and future work

New models for replication agreements and “data trading”

Archiving the World Wide Web

Modeling cost

Managing “meaning”

Security

Alternative object naming schemes

Other “upper layers,” e.g. user access, metadata, etc.

Conclusion

Digital librarians need tools to preserve data

Our project addresses this need

– Reliable storage: SAV

– Convenient access: InfoMonitor

– Finding the best configuration: ArchSim

More work must be done to refine these models

– More automation

– More flexibility

– Answer a wider range of design questions

For more information

http://www-db.stanford.edu/archivalrep

Brian Cooper: [email protected]

Arturo Crespo: [email protected]

Hector Garcia-Molina: [email protected]