1
Minerva
The Web Preservation Project
2
Team Members
Library of Congress
Roger Adkins, Cassy Ammen, Allene Hayes, Melissa Levine, Diane Kresh, Jane Mandelbaum, Barbara Tillett
Cornell University
William Arms
Internet Archive
Brewster Kahle, Scott Kirkpatrick
Main Reading Room
3
1. Open Access Materials on the Web
4
5
Approaches to Collecting and Preservation of the Web
[Diagram: a spectrum from CLOSED ACCESS to OPEN ACCESS]
• Closed access: partnership with publishers; publishers and libraries as partners
• Open access, selective: selective collection of the open access web; librarianship in a new domain
• Open access, bulk: bulk collection of the open access web; automated processes
6
Web Preservation Project Pilot
• Small number of web sites nominated by selection officers. Three chosen for close study.
http://www.whitehouse.gov/ http://www.algore2000.com/ http://www.georgewbush.com/
• Copies downloaded using the HTTrack mirroring program and inspected for errors, anomalies, etc.
• Catalog records created using OCLC's CORC software, then loaded into the Library of Congress's ILS system.
• Trial web site developed to evaluate user access.
• Discussions with Copyright Office on legal issues.
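After a site has been mirrored, the downloaded pages must be inspected for errors and anomalies. One common anomaly is a link that still points at the live site rather than the local copy, indicating an incomplete mirror. A minimal Python sketch of that inspection step (the sample page content is invented for illustration):

```python
from html.parser import HTMLParser

class LinkChecker(HTMLParser):
    """Collect href targets from a downloaded page so they can be
    inspected for anomalies (e.g., links escaping the mirror)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A page as it might appear in a mirrored copy (illustrative content).
page = '<a href="index.html">Home</a> <a href="http://www.whitehouse.gov/">Live site</a>'

checker = LinkChecker()
checker.feed(page)

# Absolute links that still point at the live site suggest an incomplete mirror.
external = [link for link in checker.links if link.startswith("http")]
print(external)
```

In practice HTTrack rewrites most links itself; a check like this is a quality-control pass over its output.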
7
Example: The Internet Archive
8
Example: National Library of Australia
9
Example: National Library of Sweden
10
2. Selection and Collection
11
Collecting: Making a Snapshot
[Diagram: Web site → Download → Snapshot → Archive]
A web site is downloaded using a mirroring program; the snapshot is stored in an archive.
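The archive's layout must keep each snapshot distinct. A minimal sketch of storing one snapshot as a dated directory (the directory scheme and the site name are illustrative, not the Library's actual layout):

```python
import os
import tempfile
import time

def store_snapshot(archive_dir, site_name, files):
    """Store one snapshot of a site: a date-stamped directory holding
    the downloaded files (passed here as a name -> bytes mapping)."""
    stamp = time.strftime("%Y%m%d")
    snap_dir = os.path.join(archive_dir, site_name, stamp)
    os.makedirs(snap_dir, exist_ok=True)
    for name, data in files.items():
        with open(os.path.join(snap_dir, name), "wb") as f:
            f.write(data)
    return snap_dir

# Demonstration against a temporary directory standing in for the archive.
archive = tempfile.mkdtemp()
path = store_snapshot(archive, "whitehouse.gov", {"index.html": b"<html>...</html>"})
print(path)
```

Because each snapshot lives under its own date stamp, later snapshots never overwrite earlier ones.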
12
Collecting: Periodic Snapshots
[Diagram: the web site feeding Snapshot 1, Snapshot 2, and Snapshot 3 into the archive]
At selected time intervals, additional snapshots are made.
13
Very Rough Estimates
There are no good estimates of how many Web sites the Library of Congress would wish to collect and preserve.
OCLC's Web Characterization Project (February 2000)
Public web sites: 2,900,000
Annual increase: 700,000
If the Library of Congress collects 1%
Total number of sites: 30,000
Annual number new and changed: 15,000
But these numbers are very rough estimates (guesses)!
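The slide's arithmetic can be checked directly. One percent of 2,900,000 sites is 29,000 (rounded up to 30,000 on the slide), and one percent of the 700,000 new sites per year is 7,000; the slide's figure of 15,000 evidently also counts changed existing sites:

```python
# Reproducing the slide's rough arithmetic (all figures are estimates).
public_sites = 2_900_000   # OCLC Web Characterization Project, Feb 2000
annual_new = 700_000       # estimated annual increase

collected = public_sites // 100    # 1% collected -> 29,000 (slide rounds to 30,000)
new_per_year = annual_new // 100   # 7,000 new sites; with changed sites the
                                   # slide's estimate roughly doubles to 15,000
print(collected, new_per_year)
```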
14
Selection Decisions
Which sites to collect?
• Bulk -- collect all within a certain category
• Selective -- collect sites selected by a librarian
How often to make snapshots?
• Monthly, weekly, or depending on circumstances
Which content to collect?
• HTML pages only
• Text and images only
• Everything
15
Examples of Selection Decisions
                   Selection   Frequency   Content
Internet Archive   bulk        monthly     HTML + images
Pandora            selective   varies      all
Kulturarw3         bulk        sweeps      all
Minerva            selective   irregular   all
16
Selection Decisions: Recommendations
The Library needs a mixed strategy:
1. Selective collection, for known important sites
2. Bulk selection for selected categories (e.g., .gov sites)
3. Bulk collection without selection for other materials
17
3. Use of the Collections for Scholarship and Research
18
Analysis by Computer
[Diagram: Snapshot 1, Snapshot 2, and Snapshot 3 in the archive, analyzed by computer]
Computer programs can be used to analyze the snapshot files.
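A sketch of what computer analysis of snapshot files might look like: counting term frequencies across the pages of a snapshot, e.g. to track how a site's language changes between captures. The tag stripping is deliberately crude, and the sample pages are invented:

```python
import re
from collections import Counter

def term_counts(snapshot_pages):
    """Count terms across the pages of one snapshot - the kind of bulk
    analysis a program can run over archived files."""
    counts = Counter()
    for html in snapshot_pages:
        text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
        counts.update(word.lower() for word in re.findall(r"[A-Za-z]+", text))
    return counts

# Two snapshots of the same (invented) site, taken at different times.
snap1 = ["<p>Campaign news</p>", "<p>Campaign schedule</p>"]
snap2 = ["<p>Election results</p>"]

print(term_counts(snap1)["campaign"])
```

Comparing the counters for successive snapshots shows vocabulary shifts over time.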
19
Analysis by Patron
[Diagram: each snapshot in the archive is edited into an access version (Access 1, Access 2, Access 3), which the patron studies]
People can study an access version of a site.
20
Access Decisions
Style of access
• Analysis of snapshot files by computer
• Analysis of access version by patron
Editing
• No editing (use snapshot files)
• Minimal editing to make access version
• Fuller editing to maintain experience
• Automatic or by hand
Policy
• Who has access to the collections?
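Minimal automatic editing typically means rewriting links so that a patron browsing the access version stays inside the archive instead of escaping to the live web. A sketch under assumed conventions (the `/archive/site/date/` path layout is hypothetical, not the Library's actual scheme):

```python
import re

def make_access_version(html, site, snapshot_date):
    """Minimal automatic edit: rewrite absolute links into the live site
    so they point back into the archived snapshot instead."""
    archived_prefix = f"/archive/{site}/{snapshot_date}/"  # hypothetical layout
    return re.sub(rf"http://{re.escape(site)}/", archived_prefix, html)

page = '<a href="http://www.algore2000.com/issues.html">Issues</a>'
edited = make_access_version(page, "www.algore2000.com", "2000-11-01")
print(edited)
```

Fuller editing, by contrast, might also repair scripts, forms, and embedded media, which is why the slide distinguishes the two levels.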
21
Examples of Access Decisions
                   Style        Editing
Internet Archive   computer     no
Pandora            researcher   yes
Minerva            researcher   yes
22
Recommendations about the Use of the Collections for Scholarship and Research
The Library should support the use of the collection in a variety of ways.
1. Computer analysis of snapshot files
2. Automated editing to create access versions of all selected sites, without human checking.
3. Human editing of a few, very important sites.
23
4. Information Discovery
24
Options for Information Discovery
Very large numbers of Web sites will be collected and preserved. Some form of index or catalog is required.
Options
• List of sites (e.g., Internet Archive)
Access by URL + date
• Automatic index (e.g., Web search engines)
• Catalog (e.g., MARC or Dublin Core)
Catalog record for individual site or group of sites
Access through Library catalog
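The simplest option, a list of sites with access by URL plus date, amounts to a two-level index. A minimal sketch (the storage locations are illustrative):

```python
# A minimal "list of sites" index: access by URL plus snapshot date,
# in the style the Internet Archive offered. Structure is illustrative.
index = {}

def register(url, date, location):
    """Record where the snapshot of `url` taken on `date` is stored."""
    index.setdefault(url, {})[date] = location

register("http://www.whitehouse.gov/", "2000-09-01", "archive/wh/20000901")
register("http://www.whitehouse.gov/", "2000-10-01", "archive/wh/20001001")

# Look up a site: which snapshot dates are held?
dates_held = sorted(index["http://www.whitehouse.gov/"])
print(dates_held)
```

An automatic index or a catalog adds subject access on top of this; the list alone only answers "which dates do we hold for this URL?".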
25
Information Discovery: Web Preservation Project
Procedure
• MARC catalog records created using OCLC's CORC system.
• Loaded into Library of Congress's ILS.
Observations about procedure
• Cataloguing effort similar to other electronic files.
• Some similarities to serials.
• No significant workflow difficulties.
26
Cataloguing Observations
• Detailed information is continually changing.
• Difficulty in selecting title (HTML <title> is often poor).
• Problems with identifiers (multiple, changing URLs).
• Collection level records suitable for special events.
It is difficult to evaluate the cataloguing strategy because little is known about user needs.
27
Recommendations about Information Discovery
1. The Library should experiment with various approaches to indexing and cataloguing Web sites, including automated indexing, Dublin Core and MARC cataloguing.
2. The Library will probably not be able to afford individual catalog records for all Web sites that are collected.
28
5. Storage and Preservation
29
Workflow
[Diagram: Web site → Web crawler → snapshot → process → accession control → archive; catalog and external access support analysis by patron and analysis by computer]
30
Preservation Objective
Objective is to preserve the digital collections in a manner that makes them usable for scholarship and research in the future.
What is preserved?
• Preservation of bits
• Preservation of content
• Preservation of experience
How is it used?
• Analysis by computer program
• Viewed by human researcher
31
Process of Preservation
[Diagram: Version 1 at Time 0 becomes Version 2 at Time 1 and Version 3 at Time 2]
This process may be applied to either the snapshot or the access version.
32
Storage Decisions: Identification
Identification of Web site
• URL, but Web sites may change their URL
• URN (e.g., Handle or PURL)
Identification and provenance of versions
• Web site identifier
• Collection information (date, time, etc.)
• History of changes
Recommendations
1. Assign URN (e.g., Handle) to each Web site.
2. Store provenance metadata with every file.
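A sketch of the provenance metadata that might accompany every archived file, combining the URN, the URL at capture time, and collection information. The field names and the Handle value are illustrative, not a Library standard:

```python
import json

def provenance_record(urn, url, captured, tool):
    """Provenance metadata stored alongside an archived file.
    Field names are illustrative, not an established schema."""
    return {
        "urn": urn,            # persistent identifier, e.g. a Handle
        "source_url": url,     # URL at time of capture (may change later)
        "captured": captured,  # date and time of the snapshot
        "capture_tool": tool,  # provenance of the copy
    }

rec = provenance_record(
    "hdl:loc.minerva/0001",            # hypothetical Handle
    "http://www.whitehouse.gov/",
    "2000-09-01T12:00:00Z",
    "HTTrack",
)
print(json.dumps(rec))
```

Keeping the record with the file (rather than only in a central database) means a snapshot remains interpretable even if it is copied to another repository.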
33
Preservation Recommendations
1. Keep the unedited snapshot files by repeated refreshing.
2. Use automated migration of individual files as the basic technique for keeping Web sites (more or less) functional at moderate cost.
3. Use manual editing for a small number of particularly important sites.
In general, it is not possible to maintain the experience of using Web sites as technology changes, even with expensive editing.
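Automated migration of individual files can be driven by a simple mapping from obsolete formats to their current successors. A sketch of the bookkeeping only (the mapping entries are examples, and real migration would also convert the file contents, not just the name):

```python
# Format-migration table: each at-risk format maps to a successor format.
# Entries are illustrative; content conversion itself is out of scope here.
MIGRATIONS = {"gif": "png", "htm": "html"}

def migrate_name(filename):
    """Return the file name after migration; unchanged formats pass through."""
    base, _, ext = filename.rpartition(".")
    new_ext = MIGRATIONS.get(ext, ext)  # unchanged if no migration is needed
    return f"{base}.{new_ext}" if base else filename

print(migrate_name("photo.gif"))
```

Because the table is data rather than code, adding a newly obsolete format is a one-line change, which is what keeps per-file migration cheap enough to run over an entire archive.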
34
6. General Recommendations
35
General Recommendations
1. Collection and preservation of Web materials should be seen as a single program.
2. The program needs a full-time team of librarians and technical staff.
3. Some aspects can be subcontracted to specialists (e.g., the Web crawler), but the leadership must come from the Library.
4. The Library should seek partnerships with other libraries and archives.
5. Most processes will be automatic, with skilled attention given to a small number of particularly important sites.
36
Demonstration of Pilot System