1
Minerva
The Web Preservation Project
2
Team Members
Library of Congress
Roger Adkins, Cassy Ammen, Allene Hayes, Melissa Levine, Diane Kresh, Jane Mandelbaum, Barbara Tillett
Cornell University
William Arms
Internet Archive
Brewster Kahle, Scott Kirkpatrick
Main Reading Room
3
1. Open Access Materials on the Web
4
5
Approaches to Collecting and Preservation of the Web
OPEN ACCESS
• Selective collection of open access web: librarianship in a new domain
• Bulk collection of open access web: automated processes
CLOSED ACCESS
• Partnership with publishers: publishers and libraries as partners
6
Web Preservation Project Pilot
• Small number of web sites nominated by selection officers. Three chosen for close study.
http://www.whitehouse.gov/ http://www.algore2000.com/ http://www.georgewbush.com/
• Copies downloaded using HTTrack mirroring program. Inspected for errors, anomalies, etc.
• Catalog records created using OCLC's CORC software and loaded into the Library of Congress's ILS.
• Trial web site developed to evaluate user access.
• Discussions with Copyright Office on legal issues.
7
Example: The Internet Archive
8
Example: National Library of Australia
9
Example: National Library of Sweden
10
2. Selection and Collection
11
Collecting: Making a Snapshot
[Diagram: Web site → Download → Snapshot → Archive]
A web site is downloaded, using a mirroring program. A snapshot is stored in an archive.
12
Collecting: Periodic Snapshots
[Diagram: Web site → Archive (Snapshot 1, Snapshot 2, Snapshot 3)]
At selected time intervals, additional snapshots are made.
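The periodic-snapshot idea can be pictured as a small data structure. The following is only a sketch under stated assumptions: the `Snapshot` and `Archive` classes, the in-memory storage, and the capture dates are hypothetical and do not describe the pilot system.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Snapshot:
    """One capture of a site: when it was taken and the files collected."""
    taken_at: datetime
    files: dict  # path within the site -> raw bytes


@dataclass
class Archive:
    """Hypothetical in-memory archive keyed by site URL."""
    snapshots: dict = field(default_factory=dict)

    def add(self, site_url, snapshot):
        # Keep each site's snapshots in chronological order.
        history = self.snapshots.setdefault(site_url, [])
        history.append(snapshot)
        history.sort(key=lambda s: s.taken_at)

    def history(self, site_url):
        # Return a copy so callers cannot reorder the stored list.
        return list(self.snapshots.get(site_url, []))


# Illustrative use with made-up capture dates.
archive = Archive()
archive.add("http://www.whitehouse.gov/",
            Snapshot(datetime(2000, 11, 7), {"index.html": b"later capture"}))
archive.add("http://www.whitehouse.gov/",
            Snapshot(datetime(2000, 9, 1), {"index.html": b"earlier capture"}))
```

Each new capture is added alongside the earlier ones rather than replacing them, which is what makes later comparison across time possible.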
13
Very Rough Estimates
There are no good estimates of how many Web sites the Library of Congress would wish to collect and preserve.
OCLC's Web Characterization Project (February 2000)
Public web sites: 2,900,000
Annual increase: 700,000
If the Library of Congress collects 1%
Total number of sites: 30,000
Annual number new and changed: 15,000
But these numbers are very rough estimates (guesses)!
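The arithmetic behind the 1% scenario can be checked in a few lines. This is a sketch only. The two OCLC figures come from the slide, while the variable names and the split of the 15,000 figure into new and changed sites are assumptions made here for illustration.

```python
PUBLIC_SITES = 2_900_000    # OCLC figure quoted on the slide
ANNUAL_INCREASE = 700_000   # OCLC figure quoted on the slide
PERCENT_COLLECTED = 1       # the slide's "1%" scenario


def share(total, percent):
    """Integer share of a total, avoiding floating-point rounding."""
    return total * percent // 100


collected_sites = share(PUBLIC_SITES, PERCENT_COLLECTED)        # 29,000
new_sites_per_year = share(ANNUAL_INCREASE, PERCENT_COLLECTED)  # 7,000

# The slide rounds 29,000 up to 30,000. Its 15,000 "new and changed" figure
# also counts changed sites, which the OCLC numbers alone cannot supply;
# the remainder below is simply what that guess implies.
implied_changed_per_year = 15_000 - new_sites_per_year          # 8,000
```

Only the first two results follow from the quoted figures; the third restates the slide's own guess.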
14
Selection Decisions
Which sites to collect?
• Bulk -- collect all within a certain category
• Selective -- collect sites selected by a librarian
How often to make snapshots?
• Monthly, weekly, or depending on circumstances
Which content to collect?
• HTML pages only
• Text and images only
• Everything
15
Examples of Selection Decisions
| Project | Selection | Frequency | Content |
|---|---|---|---|
| Internet Archive | bulk | monthly | HTML + images |
| Pandora | selective | varies | all |
| Kulturarw3 | bulk | sweeps | all |
| Minerva | selective | irregular | all |
16
Selection Decisions: Recommendations
The Library needs a mixed strategy:
1. Selective collection of known important sites
2. Bulk collection of selected categories (e.g., .gov sites)
3. Bulk collection without selection for other materials
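One way to picture the mixed strategy is as a rule that routes each site to one of the three tiers. The sketch below is illustrative only: the function name, the tier labels, and the contents of `IMPORTANT_SITES` are hypothetical, and real selection would rest on librarians' judgment rather than a lookup table.

```python
from urllib.parse import urlparse

# Hypothetical list of sites nominated individually by selection officers.
IMPORTANT_SITES = {"www.whitehouse.gov"}
# Categories collected in bulk; ".gov" is the example given on the slide.
BULK_SUFFIXES = (".gov",)


def choose_strategy(url):
    """Map a site URL to one of the three tiers of the mixed strategy."""
    host = urlparse(url).hostname or ""
    if host in IMPORTANT_SITES:
        return "selective"        # tier 1: known important site
    if host.endswith(BULK_SUFFIXES):
        return "bulk-category"    # tier 2: whole category collected
    return "bulk-unselected"      # tier 3: swept up without selection
```

The order of the checks matters: an important site inside a bulk category still receives selective treatment.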
17
3. Use of the Collections for Scholarship and Research
18
Analysis by Computer
[Diagram: Archive (Snapshot 1, Snapshot 2, Snapshot 3) → Analysis by computer]
Computer programs can be used to analyze the snapshot files.
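As a minimal illustration of analysis by computer, a program might count the hyperlinks in each snapshot of a page to see how it changed over time. This is a sketch under stated assumptions: the snapshot contents are invented, and link counting is just one example of a possible analysis, not one named in the project.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts <a> tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1


def count_links(html_text):
    parser = LinkCounter()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Invented snapshot contents, for illustration only.
snapshots = {
    "Snapshot 1": '<p><a href="/news">News</a></p>',
    "Snapshot 2": '<p><a href="/news">News</a> <a href="/issues">Issues</a></p>',
}
link_counts = {name: count_links(html) for name, html in snapshots.items()}
```

Because the program reads the unedited snapshot files directly, no access version is needed for this style of use.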
19
Analysis by Patron
[Diagram: Web site → Archive (Snapshot 1, Snapshot 2, Snapshot 3) → Access 1, Access 2, Access 3 → Analysis by patron]
People can study an access version of a site.
20
Access Decisions
Style of access
• Analysis of snapshot files by computer
• Analysis of access version by patron
Editing
• No editing (use snapshot files)
• Minimal editing to make access version
• Fuller editing to maintain experience
• Automatic or by hand
Policy
• Who has access to the collections?
21
Examples of Access Decisions
| Project | Style | Editing |
|---|---|---|
| Internet Archive | computer | no |
| Pandora | researcher | yes |
| Minerva | researcher | yes |
22
Recommendations about the Use of the Collections for Scholarship and Research
The Library should support the use of the collection in a variety of ways.
1. Computer analysis of snapshot files
2. Automated editing to create access versions of all selected sites, without human checking.
3. Human editing of a few, very important sites.
23
4. Information Discovery
24
Options for Information Discovery
Very large numbers of Web sites will be collected and preserved. Some form of index or catalog is required.
Options
• List of sites (e.g., Internet Archive)
Access by URL + date
• Automatic index (e.g., Web search engines)
• Catalog (e.g., MARC or Dublin Core)
Catalog record for individual site or group of sites
Access through Library catalog
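The simplest option, a list of sites accessed by URL + date, amounts to a lookup for the snapshot closest to a requested date. The sketch below assumes a hypothetical index structure and made-up snapshot dates; it does not describe how the Internet Archive or any other project implements access.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical index: site URL -> snapshot dates, sorted oldest first.
index = {
    "http://www.whitehouse.gov/": [
        date(2000, 8, 1),
        date(2000, 9, 1),
        date(2000, 11, 7),
    ],
}


def find_snapshot(index, url, wanted):
    """Return the latest snapshot date on or before `wanted`, or None."""
    dates = index.get(url, [])
    position = bisect_right(dates, wanted)
    return dates[position - 1] if position else None
```

A lookup like this needs no cataloguing effort, but it only helps a user who already knows the URL, which is why the other options remain relevant.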
25
Information Discovery: Web Preservation Project
Procedure
• MARC catalog records created using OCLC's CORC system.
• Loaded into Library of Congress's ILS.
Observations about procedure
• Cataloguing effort similar to other electronic files.
• Some similarities to serials.
• No significant workflow difficulties.
26
Cataloguing Observations
• Detailed information is continually changing.
• Difficulty in selecting title (HTML `<title>` is often poor).
• Problems with identifiers (multiple, changing URLs).
• Collection level records suitable for special events.
It is difficult to evaluate cataloguing strategy because of lack of knowledge of user needs.
27
Recommendations about Information Discovery
1. The Library should experiment with various approaches to indexing and cataloguing Web sites, including automated indexing, Dublin Core and MARC cataloguing.
2. The Library will probably not be able to afford individual catalog records for all Web sites that are collected.
28
5. Storage and Preservation
29
Workflow
[Diagram components: Web site, Web Crawler, snapshot, Accession, Control, Process, Archive, Catalog, External Access, Analysis by patron, Analysis by computer]
30
Preservation Objective
Objective is to preserve the digital collections in a manner that makes them usable for scholarship and research in the future.
What is preserved?
• Preservation of bits
• Preservation of content
• Preservation of experience
How is it used?
• Analysis by computer program
• Viewed by human researcher
31
Process of Preservation
[Diagram: Version 1 (Time 0) → Version 2 (Time 1) → Version 3 (Time 2)]
This process may be applied to either the snapshot or the access version.
32
Storage Decisions: Identification
Identification of Web site
• URL, but Web sites may change their URL
• URN (e.g., Handle or PURL)
Identification and provenance of versions
• Web site identifier
• Collection information (date, time, etc.)
• History of changes
Recommendations
1. Assign URN (e.g., Handle) to each Web site.
2. Store provenance metadata with every file.
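The second recommendation can be made concrete with a small provenance record stored next to each file. This is a sketch only. The field names, the use of a SHA-256 checksum, and the placeholder identifier are assumptions introduced here; the slides specify only the kinds of information to keep (site identifier, collection information, history of changes).

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Provenance:
    site_id: str            # persistent identifier, e.g. a Handle (placeholder below)
    source_url: str         # URL at the time of collection; it may change later
    collected_at: datetime  # collection date and time
    sha256: str             # checksum of the file as collected (an assumption)
    history: list = field(default_factory=list)  # history of changes


def describe(site_id, source_url, content, collected_at):
    """Build the provenance record for one collected file."""
    digest = hashlib.sha256(content).hexdigest()
    return Provenance(site_id, source_url, collected_at, digest, ["collected"])


# Illustrative use; the identifier and timestamp are made up.
record = describe(
    "hdl:0000/placeholder",
    "http://www.whitehouse.gov/index.html",
    b"<html>snapshot</html>",
    datetime(2000, 9, 1, 12, 0),
)
```

Keeping the persistent identifier separate from the URL lets the record survive a change of address by the site.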
33
Preservation Recommendations
1. Keep the unedited snapshot files by repeated refreshing.
2. Use automated migration of individual files as the basic technique for keeping Web sites (more or less) functional at moderate cost.
3. Use manual editing for a small number of particularly important sites.
In general, it is not possible to maintain the experience of using Web sites as technology changes, even with expensive editing.
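Refreshing, in the sense of the first recommendation, means copying the unedited snapshot files to new storage without altering a single bit. The sketch below shows one way to check that a copy is faithful. The function names, file names, and the choice of SHA-256 for verification are assumptions, not part of the project's stated method.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Checksum of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def refresh(source, destination):
    """Copy a snapshot file to new storage and confirm the bits are unchanged."""
    expected = sha256_of(source)
    shutil.copyfile(source, destination)
    if sha256_of(destination) != expected:
        raise IOError(f"refresh of {source} altered the file")
    return expected


# Illustrative use with temporary files standing in for old and new media.
with tempfile.TemporaryDirectory() as tmp:
    old = Path(tmp) / "old-media.html"
    new = Path(tmp) / "new-media.html"
    old.write_bytes(b"<html>snapshot</html>")
    digest = refresh(old, new)
    copied = new.read_bytes()
```

Refreshing preserves the bits only. Keeping the content usable as formats age is the job of migration, covered by the second recommendation.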
34
6. General Recommendations
35
General Recommendations
1. Collection and preservation of Web materials should be seen as a single program.
2. The program needs a full-time team of librarians and technical staff.
3. Some aspects can be subcontracted to specialists (e.g., the Web crawler), but the leadership must come from the Library.
4. The Library should seek partnerships with other libraries and archives.
5. Most processes will be automatic, with skilled attention given to a small number of particularly important sites.
36
Demonstration of Pilot System