1
Minerva
The Web Preservation Project
2
Team Members
Library of Congress
Roger Adkins, Cassy Ammen, Allene Hayes, Melissa Levine, Diane Kresh, Jane Mandelbaum, Barbara Tillett
Cornell University
William Arms
Internet Archive
Brewster Kahle, Scott Kirkpatrick
Main Reading Room
3
1. Open Access Materials on the Web
4
5
Approaches to Collecting and Preservation of the Web
[Diagram: a spectrum from CLOSED ACCESS to OPEN ACCESS]
• Closed access: partnership with publishers; publishers and libraries as partners
• Open access, selective: selective collection of the open access web; librarianship in a new domain
• Open access, bulk: bulk collection of the open access web; automated processes
6
Web Preservation Project Pilot
• Small number of web sites nominated by selection officers. Three chosen for close study.
http://www.whitehouse.gov/ http://www.algore2000.com/ http://www.georgewbush.com/
• Copies downloaded using the HTTrack mirroring program and inspected for errors, anomalies, etc.
• Catalog records created using OCLC's CORC software, then loaded into the Library of Congress's ILS system.
• Trial web site developed to evaluate user access.
• Discussions with Copyright Office on legal issues.
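After a site has been mirrored, the downloaded pages must be inspected for errors and anomalies. One common anomaly is a link that still points at the live site rather than the local copy, indicating an incomplete mirror. A minimal Python sketch of that inspection step (the sample page content is invented for illustration):

```python
from html.parser import HTMLParser

class LinkChecker(HTMLParser):
    """Collect href targets from a downloaded page so they can be
    inspected for anomalies (e.g., links escaping the mirror)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A page as it might appear in a mirrored copy (illustrative content).
page = '<a href="index.html">Home</a> <a href="http://www.whitehouse.gov/">Live site</a>'

checker = LinkChecker()
checker.feed(page)

# Absolute links that still point at the live site suggest an incomplete mirror.
external = [link for link in checker.links if link.startswith("http")]
print(external)
```

In practice HTTrack rewrites most links itself; a check like this is a quality-control pass over its output.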
7
Example: The Internet Archive
8
Example: National Library of Australia
9
Example: National Library of Sweden
10
2. Selection and Collection
11
Collecting: Making a Snapshot
[Diagram: Web site → Download → Snapshot → Archive]
A web site is downloaded using a mirroring program; the snapshot is stored in an archive.
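The archive's layout must keep each snapshot distinct. A minimal sketch of storing one snapshot as a dated directory (the directory scheme and the site name are illustrative, not the Library's actual layout):

```python
import os
import tempfile
import time

def store_snapshot(archive_dir, site_name, files):
    """Store one snapshot of a site: a date-stamped directory holding
    the downloaded files (passed here as a name -> bytes mapping)."""
    stamp = time.strftime("%Y%m%d")
    snap_dir = os.path.join(archive_dir, site_name, stamp)
    os.makedirs(snap_dir, exist_ok=True)
    for name, data in files.items():
        with open(os.path.join(snap_dir, name), "wb") as f:
            f.write(data)
    return snap_dir

# Demonstration against a temporary directory standing in for the archive.
archive = tempfile.mkdtemp()
path = store_snapshot(archive, "whitehouse.gov", {"index.html": b"<html>...</html>"})
print(path)
```

Because each snapshot lives under its own date stamp, later snapshots never overwrite earlier ones.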
12
Collecting: Periodic Snapshots
[Diagram: the web site feeding Snapshot 1, Snapshot 2, and Snapshot 3 into the archive]
At selected time intervals, additional snapshots are made.
13
Very Rough Estimates
There are no good estimates of how many Web sites the Library of Congress would wish to collect and preserve.
OCLC's Web Characterization Project (February 2000)
Public web sites: 2,900,000
Annual increase: 700,000
If the Library of Congress collects 1%
Total number of sites: 30,000
Annual number new and changed: 15,000
But these numbers are very rough estimates (guesses)!
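The slide's arithmetic can be checked directly. One percent of 2,900,000 sites is 29,000 (rounded up to 30,000 on the slide), and one percent of the 700,000 new sites per year is 7,000; the slide's figure of 15,000 evidently also counts changed existing sites:

```python
# Reproducing the slide's rough arithmetic (all figures are estimates).
public_sites = 2_900_000   # OCLC Web Characterization Project, Feb 2000
annual_new = 700_000       # estimated annual increase

collected = public_sites // 100    # 1% collected -> 29,000 (slide rounds to 30,000)
new_per_year = annual_new // 100   # 7,000 new sites; with changed sites the
                                   # slide's estimate roughly doubles to 15,000
print(collected, new_per_year)
```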
14
Selection Decisions
Which sites to collect?
• Bulk -- collect all within a certain category
• Selective -- collect sites selected by a librarian
How often to make snapshots?
• Monthly, weekly, or depending on circumstances
Which content to collect?
• HTML pages only
• Text and images only
• Everything
15
Examples of Selection Decisions
                   Selection   Frequency   Content
Internet Archive   bulk        monthly     HTML + images
Pandora            selective   varies      all
Kulturarw3         bulk        sweeps      all
Minerva            selective   irregular   all
16
Selection Decisions: Recommendations
The Library needs a mixed strategy:
1. Selective collection, for known important sites
2. Bulk selection for selected categories (e.g., .gov sites)
3. Bulk collection without selection for other materials
17
3. Use of the Collections for Scholarship and Research
18
Analysis by Computer
[Diagram: Snapshot 1, Snapshot 2, and Snapshot 3 in the archive, analyzed by computer]
Computer programs can be used to analyze the snapshot files.
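A sketch of what computer analysis of snapshot files might look like: counting term frequencies across the pages of a snapshot, e.g. to track how a site's language changes between captures. The tag stripping is deliberately crude, and the sample pages are invented:

```python
import re
from collections import Counter

def term_counts(snapshot_pages):
    """Count terms across the pages of one snapshot - the kind of bulk
    analysis a program can run over archived files."""
    counts = Counter()
    for html in snapshot_pages:
        text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
        counts.update(word.lower() for word in re.findall(r"[A-Za-z]+", text))
    return counts

# Two snapshots of the same (invented) site, taken at different times.
snap1 = ["<p>Campaign news</p>", "<p>Campaign schedule</p>"]
snap2 = ["<p>Election results</p>"]

print(term_counts(snap1)["campaign"])
```

Comparing the counters for successive snapshots shows vocabulary shifts over time.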
19
Analysis by Patron
[Diagram: each snapshot in the archive is edited into an access version (Access 1, Access 2, Access 3), which the patron studies]
People can study an access version of a site.
20
Access Decisions
Style of access
• Analysis of snapshot files by computer
• Analysis of access version by patron
Editing
• No editing (use snapshot files)
• Minimal editing to make access version
• Fuller editing to maintain experience
• Automatic or by hand
Policy
• Who has access to the collections?
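Minimal automatic editing typically means rewriting links so that a patron browsing the access version stays inside the archive instead of escaping to the live web. A sketch under assumed conventions (the `/archive/site/date/` path layout is hypothetical, not the Library's actual scheme):

```python
import re

def make_access_version(html, site, snapshot_date):
    """Minimal automatic edit: rewrite absolute links into the live site
    so they point back into the archived snapshot instead."""
    archived_prefix = f"/archive/{site}/{snapshot_date}/"  # hypothetical layout
    return re.sub(rf"http://{re.escape(site)}/", archived_prefix, html)

page = '<a href="http://www.algore2000.com/issues.html">Issues</a>'
edited = make_access_version(page, "www.algore2000.com", "2000-11-01")
print(edited)
```

Fuller editing, by contrast, might also repair scripts, forms, and embedded media, which is why the slide distinguishes the two levels.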
21
Examples of Access Decisions
                   Style        Editing
Internet Archive   computer     no
Pandora            researcher   yes
Minerva            researcher   yes
22
Recommendations about the Use of the Collections for Scholarship and Research
The Library should support the use of the collection in a variety of ways.
1. Computer analysis of snapshot files
2. Automated editing to create access versions of all selected sites, without human checking.
3. Human editing of a few, very important sites.
23
4. Information Discovery
24
Options for Information Discovery
Very large numbers of Web sites will be collected and preserved. Some form of index or catalog is required.
Options
• List of sites (e.g., Internet Archive)
Access by URL + date
• Automatic index (e.g., Web search engines)
• Catalog (e.g., MARC or Dublin Core)
Catalog record for individual site or group of sites
Access through Library catalog
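The simplest option, a list of sites with access by URL plus date, amounts to a two-level index. A minimal sketch (the storage locations are illustrative):

```python
# A minimal "list of sites" index: access by URL plus snapshot date,
# in the style the Internet Archive offered. Structure is illustrative.
index = {}

def register(url, date, location):
    """Record where the snapshot of `url` taken on `date` is stored."""
    index.setdefault(url, {})[date] = location

register("http://www.whitehouse.gov/", "2000-09-01", "archive/wh/20000901")
register("http://www.whitehouse.gov/", "2000-10-01", "archive/wh/20001001")

# Look up a site: which snapshot dates are held?
dates_held = sorted(index["http://www.whitehouse.gov/"])
print(dates_held)
```

An automatic index or a catalog adds subject access on top of this; the list alone only answers "which dates do we hold for this URL?".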
25
Information Discovery: Web Preservation Project
Procedure
• MARC catalog records created using OCLC's CORC system.
• Loaded into Library of Congress's ILS.
Observations about procedure
• Cataloguing effort similar to other electronic files.
• Some similarities to serials.
• No significant workflow difficulties.
26
Cataloguing Observations
• Detailed information is continually changing.
• Difficulty in selecting title (HTML <title> is often poor).
• Problems with identifiers (multiple, changing URLs).
• Collection level records suitable for special events.
It is difficult to evaluate the cataloguing strategy because little is known about user needs.
27
Recommendations about Information Discovery
1. The Library should experiment with various approaches to indexing and cataloguing Web sites, including automated indexing, Dublin Core and MARC cataloguing.
2. The Library will probably not be able to afford individual catalog records for all Web sites that are collected.
28
5. Storage and Preservation
29
Workflow
[Diagram: Web site → Web crawler → snapshot → process → accession control → archive; catalog and external access support analysis by patron and analysis by computer]
30
Preservation Objective
Objective is to preserve the digital collections in a manner that makes them usable for scholarship and research in the future.
What is preserved?
• Preservation of bits
• Preservation of content
• Preservation of experience
How is it used?
• Analysis by computer program
• Viewed by human researcher
31
Process of Preservation
[Diagram: Version 1 at Time 0 becomes Version 2 at Time 1 and Version 3 at Time 2]
This process may be applied to either the snapshot or the access version.
32
Storage Decisions: Identification
Identification of Web site
• URL, but Web sites may change their URL
• URN (e.g., Handle or PURL)
Identification and provenance of versions
• Web site identifier
• Collection information (date, time, etc.)
• History of changes
Recommendations
1. Assign URN (e.g., Handle) to each Web site.
2. Store provenance metadata with every file.
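A sketch of the provenance metadata that might accompany every archived file, combining the URN, the URL at capture time, and collection information. The field names and the Handle value are illustrative, not a Library standard:

```python
import json

def provenance_record(urn, url, captured, tool):
    """Provenance metadata stored alongside an archived file.
    Field names are illustrative, not an established schema."""
    return {
        "urn": urn,            # persistent identifier, e.g. a Handle
        "source_url": url,     # URL at time of capture (may change later)
        "captured": captured,  # date and time of the snapshot
        "capture_tool": tool,  # provenance of the copy
    }

rec = provenance_record(
    "hdl:loc.minerva/0001",            # hypothetical Handle
    "http://www.whitehouse.gov/",
    "2000-09-01T12:00:00Z",
    "HTTrack",
)
print(json.dumps(rec))
```

Keeping the record with the file (rather than only in a central database) means a snapshot remains interpretable even if it is copied to another repository.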
33
Preservation Recommendations
1. Keep the unedited snapshot files by repeated refreshing.
2. Use automated migration of individual files as the basic technique for keeping Web sites (more or less) functional at moderate cost.
3. Use manual editing for a small number of particularly important sites.
In general, it is not possible to maintain the experience of using Web sites as technology changes, even with expensive editing.
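Automated migration of individual files can be driven by a simple mapping from obsolete formats to their current successors. A sketch of the bookkeeping only (the mapping entries are examples, and real migration would also convert the file contents, not just the name):

```python
# Format-migration table: each at-risk format maps to a successor format.
# Entries are illustrative; content conversion itself is out of scope here.
MIGRATIONS = {"gif": "png", "htm": "html"}

def migrate_name(filename):
    """Return the file name after migration; unchanged formats pass through."""
    base, _, ext = filename.rpartition(".")
    new_ext = MIGRATIONS.get(ext, ext)  # unchanged if no migration is needed
    return f"{base}.{new_ext}" if base else filename

print(migrate_name("photo.gif"))
```

Because the table is data rather than code, adding a newly obsolete format is a one-line change, which is what keeps per-file migration cheap enough to run over an entire archive.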
34
6. General Recommendations
35
General Recommendations
1. Collection and preservation of Web materials should be seen as a single program.
2. The program needs a full-time team of librarians and technical staff.
3. Some aspects can be subcontracted to specialists (e.g., the Web crawler), but the leadership must come from the Library.
4. The Library should seek partnerships with other libraries and archives.
5. Most processes will be automatic, with skilled attention given to a small number of particularly important sites.
36
Demonstration of Pilot System